Hash Indexes

Started by Amit Kapila, over 9 years ago · 180 messages
#1 Amit Kapila
amit.kapila16@gmail.com
1 attachment(s)

To make hash indexes usable in production systems, we need to improve their
concurrency and make them crash-safe by WAL logging them. The first problem I
would like to tackle is improving the concurrency of hash indexes. The first
advantage I see in improving the concurrency of hash indexes is that they have
the potential to outperform btree for "equal to" searches (with my WIP patch
attached to this mail, I could see a hash index outperform a btree index by 20
to 30% for the very simple cases mentioned later in this e-mail). Another
advantage, as explained by Robert [1] earlier, is that if we remove the
heavyweight locks under which we currently perform an arbitrarily large number
of operations, it becomes possible to WAL log the index sensibly. With this
patch, I would also like to make hash indexes capable of completing incomplete
splits, which can occur due to interrupts (like cancel), errors, or a crash.

I have studied the concurrency problems of hash indexes and some of the
solutions previously proposed for them, and based on that I came up with the
solution below, which builds on an idea by Robert [1], the community
discussion on thread [2], and some of my own thoughts.

Maintain a flag that can be set and cleared on the primary bucket page, call
it split-in-progress, and a flag that can optionally be set on particular
index tuples, call it moved-by-split. We will allow scans of all buckets and
insertions into all buckets while the split is in progress, but (as now) we
will not allow more than one split of a bucket to be in progress at the same
time. We start the split by updating the metapage to increment the number of
buckets and setting the split-in-progress flag on the primary bucket pages of
the old and new buckets (for the purpose of this discussion, call the old
bucket (N+1)/2 and the new bucket N+1). While the split-in-progress flag is
set, any scan of bucket N+1 will first scan that bucket, ignoring any tuples
flagged moved-by-split, and then ALSO scan bucket (N+1)/2. To ensure that
vacuum doesn't clean any tuples from the old or new bucket while such a scan
is in progress, the scan maintains a pin on both buckets (the pin on the old
bucket needs to be acquired first). The moved-by-split flag never has any
effect except when scanning the new bucket that existed at the start of that
particular scan, and then only if the split-in-progress flag was also set at
that time.
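
To make the scan-side rule concrete, here is a condensed sketch of what the
attached WIP patch does in _hash_first()/_hash_step(); it is not a drop-in
function, and the names it uses (LH_BUCKET_PAGE_SPLIT,
INDEX_MOVED_BY_SPLIT_MASK, hashso_skip_moved_tuples) are the ones the patch
introduces:

/*
 * Scan-side sketch, condensed from the attached patch: a scan that found
 * the split-in-progress flag set when it started ignores tuples flagged
 * as moved-by-split in the new bucket and later re-scans the old bucket
 * (on which it already holds a pin) to pick those tuples up there.
 */
static bool
scan_skips_tuple(HashScanOpaque so, IndexTuple itup)
{
	/* only relevant if the split was in progress when the scan started */
	if (!so->hashso_skip_moved_tuples)
		return false;

	/* tuples copied by the split carry this bit in t_info's free bits */
	return (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK) != 0;
}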

Once the split operation has set the split-in-progress flag, it will begin
scanning bucket (N+1)/2. Every time it finds a tuple that properly belongs in
bucket N+1, it will insert that tuple into bucket N+1 with the moved-by-split
flag set. Tuples inserted by anything other than a split operation will leave
this flag clear, and tuples inserted while the split is in progress will
target the same bucket that they would hit if the split were already complete.
Thus, bucket N+1 will end up with a mix of moved-by-split tuples, coming from
bucket (N+1)/2, and unflagged tuples coming from parallel insertion activity.
When the scan of bucket (N+1)/2 is complete, we know that bucket N+1 now
contains all the tuples that are supposed to be there, so we clear the
split-in-progress flag on both buckets. Future scans of both buckets can
proceed normally. The split operation needs to take a cleanup lock on the
primary bucket page to ensure that it doesn't start while any insertion is
happening in the bucket. It releases the lock on the primary bucket page, but
not the pin, as it proceeds to the next overflow page. Retaining the pin on
the primary bucket ensures that vacuum doesn't start on this bucket until the
split is finished.
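
A condensed sketch of the split's flag handling, distilled from
_hash_splitbucket() in the attached patch (buffer locking, overflow-page
chaining, and error handling are elided); LH_BUCKET_PAGE_HAS_GARBAGE and
LH_BUCKET_PAGE_SPLIT are the page flags the patch adds, and the
split-in-progress flag is cleared again once the whole old bucket chain has
been scanned:

/* Flags set at the start of the split; split_set_flags is a sketch name. */
static void
split_set_flags(HashPageOpaque oopaque, HashPageOpaque nopaque)
{
	/* old bucket keeps its copied tuples for now; vacuum removes them later */
	oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE;

	/* new bucket advertises that its split is still in progress */
	nopaque->hasho_flag |= LH_BUCKET_PAGE_SPLIT;
}

/*
 * Mark one index tuple as moved-by-split before inserting it into the new
 * bucket (taken nearly verbatim from the patch); the size bits of t_info
 * are preserved around setting the flag.
 */
static void
mark_moved_by_split(IndexTuple itup)
{
	Size		itupsize = itup->t_info & INDEX_SIZE_MASK;

	itup->t_info &= ~INDEX_SIZE_MASK;
	itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
	itup->t_info |= itupsize;
}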

Insertion will happen by scanning the appropriate bucket and needs to retain a
pin on the primary bucket page to ensure that a concurrent split doesn't
happen; otherwise the split might leave this tuple unaccounted for.
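
The pin-retention rule an inserter (or the split itself) follows when it moves
past the primary bucket page is the retain_pin handling the patch adds to
_hash_addovflpage(); roughly (excerpted, with declarations elided):

/*
 * When leaving the primary bucket page for an overflow page, drop the lock
 * but keep the pin, so a concurrent split (or vacuum, which needs a cleanup
 * lock) cannot start on this bucket while we are still working in it; plain
 * overflow pages are released completely.
 */
if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
	_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);	/* keep pin */
else
	_hash_relbuf(rel, buf);		/* release lock and pin */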

Now, for the deletion of tuples from the (N+1)/2 bucket, we need to wait for
the completion of any scans that began before we finished populating bucket
N+1, because otherwise we might remove tuples that they're still expecting to
find in bucket (N+1)/2. A scan always maintains a pin on the primary bucket
page, and vacuum can take a buffer cleanup lock on that page (a cleanup lock
is an exclusive lock on the buffer plus a wait until all other pins on it are
released). I think we can relax the requirement for vacuum to take a cleanup
lock (and instead take an exclusive lock on buckets where no split has
happened) with an additional flag, has_garbage, which is set on the primary
bucket page if any tuples have been moved out of that bucket. However, for the
squeeze phase of vacuum (in which we try to move tuples from later overflow
pages to earlier overflow pages in the bucket and then move any empty overflow
pages to a kind of free pool), I think we still need a cleanup lock, otherwise
scan results might get affected.
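
For reference, a condensed sketch of how the attached patch has vacuum enter a
bucket (from the hashbulkdelete() changes, with declarations elided);
LockBufferForCleanup() waits until no other backend holds a pin on the primary
bucket page, which is exactly the wait-for-older-scans behaviour described
above:

/* read and cleanup-lock the primary bucket page before touching the bucket */
buf = ReadBufferExtended(rel, MAIN_FORKNUM, bucket_blkno, RBM_NORMAL,
						 info->strategy);
LockBufferForCleanup(buf);		/* blocks until scans drop their pins */
_hash_checkpage(rel, buf, LH_BUCKET_PAGE);

bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(buf));

/* tuples left behind by a split may be removed only under this lock */
if (bucket_opaque->hasho_flag & LH_BUCKET_PAGE_HAS_GARBAGE)
	bucket_has_garbage = true;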

Incomplete Splits
--------------------------
Incomplete splits can be completed either by vacuum or by insert, as both need
an exclusive lock on the bucket. If vacuum finds the split-in-progress flag on
a bucket, it will complete the split operation; vacuum won't see this flag if
a split is actually in progress on that bucket, because vacuum needs a cleanup
lock and the split retains its pin until the end of the operation. To make
this work for the insert operation, one simple idea could be that if insert
finds the split-in-progress flag, it releases its current exclusive lock on
the bucket and tries to acquire a cleanup lock on it; if it gets the cleanup
lock, it can complete the split and then insert the tuple, otherwise it simply
re-acquires an exclusive lock on the bucket and performs the insertion. The
disadvantage of trying to complete the split in vacuum is that the split might
require new pages, and allocating new pages during vacuum is not advisable.
The disadvantage of doing it at insert time is that insert might skip it even
if only a scan of the bucket is going on, because the scan also retains a pin
on the bucket; but I think that is not a big deal. The actual completion of
the split can be done in two ways: (a) scan the new bucket and build a hash
table of all the TIDs you find there; when copying tuples from the old bucket,
first probe the hash table and, if you find a match, just skip that tuple
(idea suggested by Robert Haas offlist), or (b) delete all the tuples that are
marked moved-by-split in the new bucket and perform the split operation from
the beginning using the old bucket.
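
A purely hypothetical sketch of completion option (a) follows; the attached
WIP patch does not implement it yet, and tid_table_*() plus the bucket-walking
helpers are made-up names used only to show the shape of the algorithm, not
real PostgreSQL APIs:

/*
 * Hypothetical sketch of option (a): skip tuples the interrupted split
 * already copied by remembering the TIDs present in the new bucket.
 */
static void
hash_finish_split_sketch(Relation rel, Buffer obucket_buf, Buffer nbucket_buf)
{
	tid_table  *seen = tid_table_create();	/* hypothetical TID set */
	IndexTuple	itup;

	/* Pass 1: remember every heap TID already present in the new bucket. */
	while ((itup = next_tuple_in_bucket(rel, nbucket_buf)) != NULL)
		tid_table_add(seen, &itup->t_tid);

	/* Pass 2: redo the split, skipping tuples copied before the interruption. */
	while ((itup = next_tuple_in_bucket(rel, obucket_buf)) != NULL)
	{
		if (!tuple_maps_to_new_bucket(rel, itup))
			continue;		/* tuple stays in the old bucket */
		if (tid_table_lookup(seen, &itup->t_tid))
			continue;		/* already copied, don't duplicate it */
		insert_into_new_bucket(rel, nbucket_buf, itup);	/* flagged moved-by-split */
	}

	/* Finally, clear the split-in-progress flag on both buckets. */
	tid_table_destroy(seen);
}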

Although I don't think it is a very good idea to take performance data with a
WIP patch, I couldn't resist doing so, and the numbers are below. To get the
performance data, I dropped the primary key constraint on pgbench_accounts
and created a hash index on the aid column as below.

alter table pgbench_accounts drop constraint pgbench_accounts_pkey;
create index pgbench_accounts_pkey on pgbench_accounts using hash(aid);

The data below is for the read-only pgbench test and is the median of three
5-minute runs. The performance tests were executed on a POWER8 machine.

Data fits in shared buffers
scale_factor - 300
shared_buffers - 8GB

Client count             1       8      16      32      64      72      80      88      96     128
HEAD-Btree           19397  122488  194433  344524  519536  527365  597368  559381  614321  609102
HEAD-Hash            18539  141905  218635  363068  512067  522018  492103  484372  440265  393231
Patch                22504  146937  235948  419268  637871  637595  674042  669278  683704  639967

The % improvement of the patched hash index over the HEAD hash index and over
the HEAD btree index is:

Client count             1       8      16      32      64      72      80      88      96     128
Head-Hash vs. Patch  21.38     3.5     7.9   15.47   24.56   22.14   36.97   38.17   55.29   62.74
Head-Btree vs. Patch 16.01   19.96   21.35   21.69   22.77    20.9   12.83   19.64   11.29    5.06
This data shows that the patch improves the performance of the hash index by
up to 62.74% and also makes the hash index faster than the btree index by ~20%
(most client counts show an improvement in the range of 15~20%).

For the comparison with btree, I think the performance improvement of the hash
index will be larger when the data doesn't fit in shared buffers; the
performance data for that case is below:

Data doesn't fit in shared buffers
scale_factor - 3000
shared_buffers - 8GB

Client count       16      64      96
Head-Btree     170042  463721  520656
Patch-Hash     227528  603594  659287
% diff           33.8   30.16   26.62
The performance with the hash index is ~30% better than with the btree index.
Note that, for now, I have not taken the data for the HEAD hash index. I think
there will be many more cases, such as a hash index on a char(20) column,
where the performance of a hash index can be much better than that of a btree
index for "equal to" searches.

Note that this is very much a WIP patch and I am posting it mainly to
facilitate discussion. Currently, it doesn't have any code to complete
incomplete splits, the locking/pin logic during insert is yet to be done, and
several other things remain.

[1]: /messages/by-id/CA+TgmoZyMoJSrFxHXQ06G8jhjXQcsKvDiHB_8z_7nc7hj7iHYQ@mail.gmail.com
[2]: /messages/by-id/531992AF.2080306@vmware.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

concurrent_hash_index_v1.patch (application/octet-stream)
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index c1122b4..d02539a 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -400,7 +400,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
 	Buffer		buf;
 	Page		page;
 
-	_hash_getlock(rel, blkno, HASH_SHARE);
 	buf = _hash_getbuf_with_strategy(rel, blkno, HASH_READ, 0, bstrategy);
 	page = BufferGetPage(buf);
 
@@ -431,7 +430,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
 	}
 
 	_hash_relbuf(rel, buf);
-	_hash_droplock(rel, blkno, HASH_SHARE);
 }
 
 /*
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 49a6c81..f95ac00 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -407,12 +407,15 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
 	so->hashso_bucket_valid = false;
-	so->hashso_bucket_blkno = 0;
 	so->hashso_curbuf = InvalidBuffer;
+	so->hashso_bucket_buf = InvalidBuffer;
+	so->hashso_old_bucket_buf = InvalidBuffer;
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
 	ItemPointerSetInvalid(&(so->hashso_heappos));
 
+	so->hashso_skip_moved_tuples = false;
+
 	scan->opaque = so;
 
 	/* register scan in case we change pages it's using */
@@ -436,10 +439,15 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 		_hash_dropbuf(rel, so->hashso_curbuf);
 	so->hashso_curbuf = InvalidBuffer;
 
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	/* release pin we hold on old primary bucket */
+	if (BufferIsValid(so->hashso_old_bucket_buf))
+		_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+	so->hashso_old_bucket_buf = InvalidBuffer;
+
+	/* release pin we hold on primary bucket */
+	if (BufferIsValid(so->hashso_bucket_buf))
+		_hash_dropbuf(rel, so->hashso_bucket_buf);
+	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
@@ -453,6 +461,8 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 				scan->numberOfKeys * sizeof(ScanKeyData));
 		so->hashso_bucket_valid = false;
 	}
+
+	so->hashso_skip_moved_tuples = false;
 }
 
 /*
@@ -472,10 +482,15 @@ hashendscan(IndexScanDesc scan)
 		_hash_dropbuf(rel, so->hashso_curbuf);
 	so->hashso_curbuf = InvalidBuffer;
 
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	/* release pin we hold on old primary bucket */
+	if (BufferIsValid(so->hashso_old_bucket_buf))
+		_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+	so->hashso_old_bucket_buf = InvalidBuffer;
+
+	/* release pin we hold on primary bucket */
+	if (BufferIsValid(so->hashso_bucket_buf))
+		_hash_dropbuf(rel, so->hashso_bucket_buf);
+	so->hashso_bucket_buf = InvalidBuffer;
 
 	pfree(so);
 	scan->opaque = NULL;
@@ -486,6 +501,9 @@ hashendscan(IndexScanDesc scan)
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
  *
+ * This function also deletes the tuples that are moved by split to another
+ * bucket.
+ *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 IndexBulkDeleteResult *
@@ -530,35 +548,61 @@ loop_top:
 	{
 		BlockNumber bucket_blkno;
 		BlockNumber blkno;
+		Buffer		bucket_buf;
+		Buffer		buf;
+		HashPageOpaque bucket_opaque;
+		Page		page;
 		bool		bucket_dirty = false;
+		bool		bucket_has_garbage = false;
 
 		/* Get address of bucket's start page */
 		bucket_blkno = BUCKET_TO_BLKNO(&local_metapage, cur_bucket);
 
-		/* Exclusive-lock the bucket so we can shrink it */
-		_hash_getlock(rel, bucket_blkno, HASH_EXCLUSIVE);
-
 		/* Shouldn't have any active scans locally, either */
 		if (_hash_has_active_scan(rel, cur_bucket))
 			elog(ERROR, "hash index has active scan during VACUUM");
 
-		/* Scan each page in bucket */
 		blkno = bucket_blkno;
-		while (BlockNumberIsValid(blkno))
+
+		/*
+		 * Maintain a cleanup lock on primary bucket till we scan all the
+		 * pages in bucket.  This is required to ensure that we don't delete
+		 * tuples which are needed for concurrent scans on buckets where split
+		 * is in progress.  Retaining it till end of bucket scan ensures that
+		 * concurrent split can't be started on it.  In future, we might want
+		 * to relax the requirement for vacuum to take cleanup lock only for
+		 * buckets where split is in progress, however for squeeze phase we
+		 * need a cleanup lock, otherwise squeeze will move the tuples to a
+		 * different location and that can lead to change in order of results.
+		 */
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy);
+		LockBufferForCleanup(buf);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+
+		page = BufferGetPage(buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		/*
+		 * If the bucket contains tuples that are moved by split, then we need
+		 * to delete such tuples as well.
+		 */
+		if (bucket_opaque->hasho_flag & LH_BUCKET_PAGE_HAS_GARBAGE)
+			bucket_has_garbage = true;
+
+		bucket_buf = buf;
+
+		/* Scan each page in bucket */
+		for (;;)
 		{
-			Buffer		buf;
-			Page		page;
 			HashPageOpaque opaque;
 			OffsetNumber offno;
 			OffsetNumber maxoffno;
 			OffsetNumber deletable[MaxOffsetNumber];
 			int			ndeletable = 0;
+			bool		release_buf = false;
 
 			vacuum_delay_point();
 
-			buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-											 info->strategy);
 			page = BufferGetPage(buf);
 			opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 			Assert(opaque->hasho_bucket == cur_bucket);
@@ -571,6 +615,7 @@ loop_top:
 			{
 				IndexTuple	itup;
 				ItemPointer htup;
+				Bucket		bucket;
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offno));
@@ -581,32 +626,72 @@ loop_top:
 					deletable[ndeletable++] = offno;
 					tuples_removed += 1;
 				}
+				else if (bucket_has_garbage)
+				{
+					/* delete the tuples that are moved by split. */
+					bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+											  local_metapage.hashm_maxbucket,
+											   local_metapage.hashm_highmask,
+											   local_metapage.hashm_lowmask);
+					if (bucket != cur_bucket)
+					{
+						/* mark the item for deletion */
+						deletable[ndeletable++] = offno;
+						tuples_removed += 1;
+					}
+				}
 				else
 					num_index_tuples += 1;
 			}
 
 			/*
-			 * Apply deletions and write page if needed, advance to next page.
+			 * We don't release the lock on primary bucket till end of bucket
+			 * scan.
 			 */
+			if (blkno != bucket_blkno)
+				release_buf = true;
+
 			blkno = opaque->hasho_nextblkno;
 
+			/*
+			 * Apply deletions and write page if needed, advance to next page.
+			 */
 			if (ndeletable > 0)
 			{
 				PageIndexMultiDelete(page, deletable, ndeletable);
-				_hash_wrtbuf(rel, buf);
+				if (release_buf)
+					_hash_wrtbuf(rel, buf);
+				else
+					MarkBufferDirty(buf);
 				bucket_dirty = true;
 			}
-			else
+			else if (release_buf)
 				_hash_relbuf(rel, buf);
+
+			/* bail out if there are no more pages to scan. */
+			if (!BlockNumberIsValid(blkno))
+				break;
+
+			buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											 LH_OVERFLOW_PAGE,
+											 info->strategy);
 		}
 
+		/*
+		 * Clear the garbage flag from bucket after deleting the tuples that
+		 * are moved by split.  We purposefully clear the flag before squeeze
+		 * bucket, so that after restart, vacuum shouldn't again try to delete
+		 * the moved by split tuples.
+		 */
+		if (bucket_has_garbage)
+			bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+
 		/* If we deleted anything, try to compact free space */
 		if (bucket_dirty)
-			_hash_squeezebucket(rel, cur_bucket, bucket_blkno,
+			_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
 								info->strategy);
 
-		/* Release bucket lock */
-		_hash_droplock(rel, bucket_blkno, HASH_EXCLUSIVE);
+		_hash_relbuf(rel, bucket_buf);
 
 		/* Advance to next bucket */
 		cur_bucket++;
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index acd2e64..eedf6ae 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -32,8 +32,6 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	Buffer		metabuf;
 	HashMetaPage metap;
 	BlockNumber blkno;
-	BlockNumber oldblkno = InvalidBlockNumber;
-	bool		retry = false;
 	Page		page;
 	HashPageOpaque pageopaque;
 	Size		itemsz;
@@ -70,45 +68,22 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			errhint("Values larger than a buffer page cannot be indexed.")));
 
 	/*
-	 * Loop until we get a lock on the correct target bucket.
+	 * Compute the target bucket number, and convert to block number.
 	 */
-	for (;;)
-	{
-		/*
-		 * Compute the target bucket number, and convert to block number.
-		 */
-		bucket = _hash_hashkey2bucket(hashkey,
+	bucket = _hash_hashkey2bucket(hashkey,
 									  metap->hashm_maxbucket,
 									  metap->hashm_highmask,
 									  metap->hashm_lowmask);
 
-		blkno = BUCKET_TO_BLKNO(metap, bucket);
-
-		/* Release metapage lock, but keep pin. */
-		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+	blkno = BUCKET_TO_BLKNO(metap, bucket);
 
-		/*
-		 * If the previous iteration of this loop locked what is still the
-		 * correct target bucket, we are done.  Otherwise, drop any old lock
-		 * and lock what now appears to be the correct bucket.
-		 */
-		if (retry)
-		{
-			if (oldblkno == blkno)
-				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
-		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
-
-		/*
-		 * Reacquire metapage lock and check that no bucket split has taken
-		 * place while we were awaiting the bucket lock.
-		 */
-		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
-		oldblkno = blkno;
-		retry = true;
-	}
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
 
+	/*
+	 * FixMe: If the split operation happens during insertion and it
+	 * doesn't account the tuple being inserted, then it can be lost
+	 * for future searches.
+	 */
 	/* Fetch the primary bucket page for the bucket */
 	buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
 	page = BufferGetPage(buf);
@@ -141,10 +116,10 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			 */
 
 			/* release our write lock without modifying buffer */
-			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+			_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
 
 			/* chain to a new overflow page */
-			buf = _hash_addovflpage(rel, metabuf, buf);
+			buf = _hash_addovflpage(rel, metabuf, buf, false);
 			page = BufferGetPage(buf);
 
 			/* should fit now, given test above */
@@ -161,9 +136,6 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	/* write and release the modified page */
 	_hash_wrtbuf(rel, buf);
 
-	/* We can drop the bucket lock now */
-	_hash_droplock(rel, blkno, HASH_SHARE);
-
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
 	 * incrementing it, check to see if it's time for a split.
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index db3e268..184236c 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -82,23 +82,20 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *
  *	On entry, the caller must hold a pin but no lock on 'buf'.  The pin is
  *	dropped before exiting (we assume the caller is not interested in 'buf'
- *	anymore).  The returned overflow page will be pinned and write-locked;
- *	it is guaranteed to be empty.
+ *	anymore) if not asked to retain.  The pin will be retained only for the
+ *	primary bucket.  The returned overflow page will be pinned and
+ *	write-locked; it is guaranteed to be empty.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
  *
- *	The caller must hold at least share lock on the bucket, to ensure that
- *	no one else tries to compact the bucket meanwhile.  This guarantees that
- *	'buf' won't stop being part of the bucket while it's unlocked.
- *
  * NB: since this could be executed concurrently by multiple processes,
  * one should not assume that the returned overflow page will be the
  * immediate successor of the originally passed 'buf'.  Additional overflow
  * pages might have been added to the bucket chain in between.
  */
 Buffer
-_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
+_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 {
 	Buffer		ovflbuf;
 	Page		page;
@@ -131,7 +128,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 			break;
 
 		/* we assume we do not need to write the unmodified page */
-		_hash_relbuf(rel, buf);
+		if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
 
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
@@ -149,7 +149,13 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 
 	/* logically chain overflow page to previous page */
 	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	_hash_wrtbuf(rel, buf);
+	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+	{
+		MarkBufferDirty(buf);
+		_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+	}
+	else
+		_hash_wrtbuf(rel, buf);
 
 	return ovflbuf;
 }
@@ -570,7 +576,7 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
  *	required that to be true on entry as well, but it's a lot easier for
  *	callers to leave empty overflow pages and let this guy clean it up.
  *
- *	Caller must hold exclusive lock on the target bucket.  This allows
+ *	Caller must hold cleanup lock on the target bucket.  This allows
  *	us to safely lock multiple pages in the bucket.
  *
  *	Since this function is invoked in VACUUM, we provide an access strategy
@@ -580,6 +586,7 @@ void
 _hash_squeezebucket(Relation rel,
 					Bucket bucket,
 					BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy)
 {
 	BlockNumber wblkno;
@@ -591,16 +598,13 @@ _hash_squeezebucket(Relation rel,
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
 	bool		wbuf_dirty;
+	bool		release_buf = false;
 
 	/*
 	 * start squeezing into the base bucket page.
 	 */
 	wblkno = bucket_blkno;
-	wbuf = _hash_getbuf_with_strategy(rel,
-									  wblkno,
-									  HASH_WRITE,
-									  LH_BUCKET_PAGE,
-									  bstrategy);
+	wbuf = bucket_buf;
 	wpage = BufferGetPage(wbuf);
 	wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 
@@ -669,12 +673,17 @@ _hash_squeezebucket(Relation rel,
 			{
 				Assert(!PageIsEmpty(wpage));
 
+				if (wblkno != bucket_blkno)
+					release_buf = true;
+
 				wblkno = wopaque->hasho_nextblkno;
 				Assert(BlockNumberIsValid(wblkno));
 
-				if (wbuf_dirty)
+				if (wbuf_dirty && release_buf)
 					_hash_wrtbuf(rel, wbuf);
-				else
+				else if (wbuf_dirty)
+					MarkBufferDirty(wbuf);
+				else if (release_buf)
 					_hash_relbuf(rel, wbuf);
 
 				/* nothing more to do if we reached the read page */
@@ -700,6 +709,7 @@ _hash_squeezebucket(Relation rel,
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
 				wbuf_dirty = false;
+				release_buf = false;
 			}
 
 			/*
@@ -733,11 +743,17 @@ _hash_squeezebucket(Relation rel,
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
 		{
-			/* yes, so release wbuf lock first */
-			if (wbuf_dirty)
+			if (wblkno != bucket_blkno)
+				release_buf = true;
+
+			/* yes, so release wbuf lock first if needed */
+			if (wbuf_dirty && release_buf)
 				_hash_wrtbuf(rel, wbuf);
-			else
+			else if (wbuf_dirty)
+				MarkBufferDirty(wbuf);
+			else if (release_buf)
 				_hash_relbuf(rel, wbuf);
+
 			/* free this overflow page (releases rbuf) */
 			_hash_freeovflpage(rel, rbuf, bstrategy);
 			/* done */
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 178463f..1ba4d52 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -38,7 +38,7 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
 					uint32 nblocks);
 static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
@@ -55,46 +55,6 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 
 
 /*
- * _hash_getlock() -- Acquire an lmgr lock.
- *
- * 'whichlock' should the block number of a bucket's primary bucket page to
- * acquire the per-bucket lock.  (See README for details of the use of these
- * locks.)
- *
- * 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
- */
-void
-_hash_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		LockPage(rel, whichlock, access);
-}
-
-/*
- * _hash_try_getlock() -- Acquire an lmgr lock, but only if it's free.
- *
- * Same as above except we return FALSE without blocking if lock isn't free.
- */
-bool
-_hash_try_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		return ConditionalLockPage(rel, whichlock, access);
-	else
-		return true;
-}
-
-/*
- * _hash_droplock() -- Release an lmgr lock.
- */
-void
-_hash_droplock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		UnlockPage(rel, whichlock, access);
-}
-
-/*
  *	_hash_getbuf() -- Get a buffer by block number for read or write.
  *
  *		'access' must be HASH_READ, HASH_WRITE, or HASH_NOLOCK.
@@ -489,9 +449,8 @@ _hash_pageinit(Page page, Size size)
 /*
  * Attempt to expand the hash table by creating one new bucket.
  *
- * This will silently do nothing if it cannot get the needed locks.
- *
- * The caller should hold no locks on the hash index.
+ * This will silently do nothing if there are active scans of our own
+ * backend or the old bucket contains tuples from previous split.
  *
  * The caller must hold a pin, but no lock, on the metapage buffer.
  * The buffer is returned in the same state.
@@ -506,6 +465,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	BlockNumber start_oblkno;
 	BlockNumber start_nblkno;
 	Buffer		buf_nblkno;
+	Buffer		buf_oblkno;
+	Page		opage;
+	HashPageOpaque oopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
@@ -548,11 +510,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		goto fail;
 
 	/*
-	 * Determine which bucket is to be split, and attempt to lock the old
-	 * bucket.  If we can't get the lock, give up.
-	 *
-	 * The lock protects us against other backends, but not against our own
-	 * backend.  Must check for active scans separately.
+	 * Determine which bucket is to be split, and if it still contains tuples
+	 * from previous split or there is any active scan of our own backend,
+	 * then give up.
 	 */
 	new_bucket = metap->hashm_maxbucket + 1;
 
@@ -563,11 +523,18 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	if (_hash_has_active_scan(rel, old_bucket))
 		goto fail;
 
-	if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
+	buf_oblkno = _hash_getbuf(rel, start_oblkno, HASH_READ, LH_BUCKET_PAGE);
+
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+	if (oopaque->hasho_flag & LH_BUCKET_PAGE_HAS_GARBAGE)
+	{
+		_hash_relbuf(rel, buf_oblkno);
 		goto fail;
+	}
 
 	/*
-	 * Likewise lock the new bucket (should never fail).
+	 * There shouldn't be any active scan on new bucket.
 	 *
 	 * Note: it is safe to compute the new bucket's blkno here, even though we
 	 * may still need to update the BUCKET_TO_BLKNO mapping.  This is because
@@ -579,9 +546,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	if (_hash_has_active_scan(rel, new_bucket))
 		elog(ERROR, "scan in progress on supposedly new bucket");
 
-	if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
-		elog(ERROR, "could not get lock on supposedly new bucket");
-
 	/*
 	 * If the split point is increasing (hashm_maxbucket's log base 2
 	 * increases), we need to allocate a new batch of bucket pages.
@@ -600,8 +564,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
 			/* can't split due to BlockNumber overflow */
-			_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-			_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
+			_hash_relbuf(rel, buf_oblkno);
 			goto fail;
 		}
 	}
@@ -665,13 +628,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  start_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno,
 					  maxbucket, highmask, lowmask);
 
-	/* Release bucket locks, allowing others to access them */
-	_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-	_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
-
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -745,6 +704,10 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
  * The buffer is returned in the same state.  (The metapage is only
  * touched if it becomes necessary to add or remove overflow pages.)
  *
+ * Split needs to hold pin on primary bucket pages of both old and new
+ * buckets till end of operation.  This is to prevent vacuum to start
+ * when split is in progress.
+ *
  * In addition, the caller must have created the new bucket's base page,
  * which is passed in buffer nbuf, pinned and write-locked.  That lock and
  * pin are released here.  (The API is set up this way because we must do
@@ -756,37 +719,46 @@ _hash_splitbucket(Relation rel,
 				  Buffer metabuf,
 				  Bucket obucket,
 				  Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Buffer		obuf;
+	Buffer		bucket_obuf;
+	Buffer		bucket_nbuf;
 	Page		opage;
 	Page		npage;
 	HashPageOpaque oopaque;
 	HashPageOpaque nopaque;
+	HashPageOpaque bucket_nopaque;
 
-	/*
-	 * It should be okay to simultaneously write-lock pages from each bucket,
-	 * since no one else can be trying to acquire buffer lock on pages of
-	 * either bucket.
-	 */
-	obuf = _hash_getbuf(rel, start_oblkno, HASH_WRITE, LH_BUCKET_PAGE);
+	bucket_nbuf = nbuf;
+	bucket_obuf = obuf;
 	opage = BufferGetPage(obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 
+	/*
+	 * Mark the old bucket to indicate that it has deletable tuples. Vacuum
+	 * will clear this flag after deleting such tuples.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE;
+
 	npage = BufferGetPage(nbuf);
 
-	/* initialize the new bucket's primary page */
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 	nopaque->hasho_prevblkno = InvalidBlockNumber;
 	nopaque->hasho_nextblkno = InvalidBlockNumber;
 	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_PAGE_SPLIT;
 	nopaque->hasho_page_id = HASHO_PAGE_ID;
 
+	bucket_nopaque = nopaque;
+
 	/*
 	 * Partition the tuples in the old bucket between the old bucket and the
 	 * new bucket, advancing along the old bucket's overflow bucket chain and
@@ -798,8 +770,6 @@ _hash_splitbucket(Relation rel,
 		BlockNumber oblkno;
 		OffsetNumber ooffnum;
 		OffsetNumber omaxoffnum;
-		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
 
 		/* Scan each tuple in old page */
 		omaxoffnum = PageGetMaxOffsetNumber(opage);
@@ -822,6 +792,18 @@ _hash_splitbucket(Relation rel,
 
 			if (bucket == nbucket)
 			{
+				Size		itupsize = 0;
+
+				/*
+				 * mark the index tuple as moved by split, such tuples are
+				 * skipped by scan if there is split in progress for a primary
+				 * bucket.
+				 */
+				itupsize = itup->t_info & INDEX_SIZE_MASK;
+				itup->t_info &= ~INDEX_SIZE_MASK;
+				itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
+				itup->t_info |= itupsize;
+
 				/*
 				 * insert the tuple into the new bucket.  if it doesn't fit on
 				 * the current page in the new bucket, we must allocate a new
@@ -840,9 +822,10 @@ _hash_splitbucket(Relation rel,
 					/* write out nbuf and drop lock, but keep pin */
 					_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
 					/* chain to a new overflow page */
-					nbuf = _hash_addovflpage(rel, metabuf, nbuf);
+					nbuf = _hash_addovflpage(rel, metabuf, nbuf,
+						nopaque->hasho_flag & LH_BUCKET_PAGE ? true : false);
 					npage = BufferGetPage(nbuf);
-					/* we don't need nopaque within the loop */
+					nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 				}
 
 				/*
@@ -853,11 +836,6 @@ _hash_splitbucket(Relation rel,
 				 * the new page and qsort them before insertion.
 				 */
 				(void) _hash_pgaddtup(rel, nbuf, itemsz, itup);
-
-				/*
-				 * Mark tuple for deletion from old page.
-				 */
-				deletable[ndeletable++] = ooffnum;
 			}
 			else
 			{
@@ -870,15 +848,9 @@ _hash_splitbucket(Relation rel,
 
 		oblkno = oopaque->hasho_nextblkno;
 
-		/*
-		 * Done scanning this old page.  If we moved any tuples, delete them
-		 * from the old page.
-		 */
-		if (ndeletable > 0)
-		{
-			PageIndexMultiDelete(opage, deletable, ndeletable);
-			_hash_wrtbuf(rel, obuf);
-		}
+		/* retain the pin on the old primary bucket */
+		if (obuf == bucket_obuf)
+			_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
 		else
 			_hash_relbuf(rel, obuf);
 
@@ -887,18 +859,24 @@ _hash_splitbucket(Relation rel,
 			break;
 
 		/* Else, advance to next old page */
-		obuf = _hash_getbuf(rel, oblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
+		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
 		opage = BufferGetPage(obuf);
 		oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 	}
 
+	/* indicate that split is finished */
+	bucket_nopaque->hasho_flag &= ~LH_BUCKET_PAGE_SPLIT;
+
+	/* release the pin on the old primary bucket */
+	_hash_dropbuf(rel, bucket_obuf);
+
 	/*
 	 * We're at the end of the old bucket chain, so we're done partitioning
-	 * the tuples.  Before quitting, call _hash_squeezebucket to ensure the
-	 * tuples remaining in the old bucket (including the overflow pages) are
-	 * packed as tightly as possible.  The new bucket is already tight.
+	 * the tuples.
 	 */
 	_hash_wrtbuf(rel, nbuf);
 
-	_hash_squeezebucket(rel, obucket, start_oblkno, NULL);
+	/* release the pin on the new primary bucket */
+	if (!(nopaque->hasho_flag & LH_BUCKET_PAGE))
+		_hash_dropbuf(rel, bucket_nbuf);
 }
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..b73559a 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,6 +20,7 @@
 #include "pgstat.h"
 #include "utils/rel.h"
 
+static BlockNumber _hash_get_oldblk(Relation rel, HashPageOpaque opaque);
 
 /*
  *	_hash_next() -- Get the next item in a scan.
@@ -72,7 +73,23 @@ _hash_readnext(Relation rel,
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_nextblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on primary bucket page till the end of scan to ensure
+	 * that Vacuum can't delete the tuples (that are moved by split to new
+	 * bucket) which are required by the scans that are started on split
+	 * buckets before a new bucket's split in progress flag
+	 * (LH_BUCKET_PAGE_SPLIT) is cleared.  Now the requirement to retain a pin
+	 * on primary bucket can be relaxed for buckets that are not split by
+	 * maintaining a flag like has_garbage in bucket but still we need to
+	 * retain pin for squeeze phase otherwise the movement of tuples could
+	 * lead to change the ordering of scan results.
+	 */
+	if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -94,7 +111,23 @@ _hash_readprev(Relation rel,
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_prevblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on primary bucket page till the end of scan to ensure
+	 * that Vacuum can't delete the tuples (that are moved by split to new
+	 * bucket) which are required by the scans that are started on splitted
+	 * bucket) which are required by the scans that are started on split
+	 * (LH_BUCKET_PAGE_SPLIT) is cleared.  Now the requirement to retain a pin
+	 * on primary bucket can be relaxed for buckets that are not split by
+	 * maintaining a flag like has_garbage in bucket but still we need to
+	 * retain pin for squeeze phase otherwise the movement of tuples could
+	 * lead to change the ordering of scan results.
+	 */
+	if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -125,8 +158,6 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	uint32		hashkey;
 	Bucket		bucket;
 	BlockNumber blkno;
-	BlockNumber oldblkno = InvalidBuffer;
-	bool		retry = false;
 	Buffer		buf;
 	Buffer		metabuf;
 	Page		page;
@@ -192,52 +223,21 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	metap = HashPageGetMeta(page);
 
 	/*
-	 * Loop until we get a lock on the correct target bucket.
+	 * Compute the target bucket number, and convert to block number.
 	 */
-	for (;;)
-	{
-		/*
-		 * Compute the target bucket number, and convert to block number.
-		 */
-		bucket = _hash_hashkey2bucket(hashkey,
-									  metap->hashm_maxbucket,
-									  metap->hashm_highmask,
-									  metap->hashm_lowmask);
-
-		blkno = BUCKET_TO_BLKNO(metap, bucket);
-
-		/* Release metapage lock, but keep pin. */
-		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
-
-		/*
-		 * If the previous iteration of this loop locked what is still the
-		 * correct target bucket, we are done.  Otherwise, drop any old lock
-		 * and lock what now appears to be the correct bucket.
-		 */
-		if (retry)
-		{
-			if (oldblkno == blkno)
-				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
-		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+	bucket = _hash_hashkey2bucket(hashkey,
+								  metap->hashm_maxbucket,
+								  metap->hashm_highmask,
+								  metap->hashm_lowmask);
 
-		/*
-		 * Reacquire metapage lock and check that no bucket split has taken
-		 * place while we were awaiting the bucket lock.
-		 */
-		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
-		oldblkno = blkno;
-		retry = true;
-	}
+	blkno = BUCKET_TO_BLKNO(metap, bucket);
 
 	/* done with the metapage */
-	_hash_dropbuf(rel, metabuf);
+	_hash_relbuf(rel, metabuf);
 
 	/* Update scan opaque state to show we have lock on the bucket */
 	so->hashso_bucket = bucket;
 	so->hashso_bucket_valid = true;
-	so->hashso_bucket_blkno = blkno;
 
 	/* Fetch the primary bucket page for the bucket */
 	buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
@@ -245,6 +245,54 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(opaque->hasho_bucket == bucket);
 
+	so->hashso_bucket_buf = buf;
+
+	/*
+	 * If the bucket split is in progress, then we need to skip tuples that
+	 * are moved from old bucket.  To ensure that vacuum doesn't clean any
+	 * tuples from old or new buckets till this scan is in progress, maintain
+	 * a pin on both of the buckets.  Here, we have to be cautious about lock
+	 * ordering, first acquire the lock on old bucket, release the lock on old
+	 * bucket, but not pin, then acquire the lock on new bucket and again
+	 * re-verify whether the bucket split still is in progress. Acquiring lock
+	 * on old bucket first ensures that the vacuum waits for this scan to
+	 * finish.
+	 */
+	if (opaque->hasho_flag & LH_BUCKET_PAGE_SPLIT)
+	{
+		BlockNumber old_blkno;
+		Buffer		old_buf;
+
+		old_blkno = _hash_get_oldblk(rel, opaque);
+
+		/*
+		 * release the lock on new bucket and re-acquire it after acquiring
+		 * the lock on old bucket.
+		 */
+		_hash_relbuf(rel, buf);
+
+		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+
+		/*
+		 * remember the old bucket buffer so as to release it later.
+		 */
+		so->hashso_old_bucket_buf = buf;
+		_hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK);
+
+		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		Assert(opaque->hasho_bucket == bucket);
+
+		if (opaque->hasho_flag & LH_BUCKET_PAGE_SPLIT)
+			so->hashso_skip_moved_tuples = true;
+		else
+		{
+			_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+			so->hashso_old_bucket_buf = InvalidBuffer;
+		}
+	}
+
 	/* If a backwards scan is requested, move to the end of the chain */
 	if (ScanDirectionIsBackward(dir))
 	{
@@ -273,6 +321,13 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
  *		false.  Else, return true and set the hashso_curpos for the
  *		scan to the right thing.
  *
+ *		Here we also scan the old bucket if the split for current bucket
+ *		was in progress at the start of scan.  The basic idea is to
+ *		skip the tuples that are moved by split while scanning the current
+ *		bucket and then scan the old bucket to cover all such tuples.  This
+ *		is done to ensure that we don't miss any tuples in the current scan
+ *		when split was in progress.
+ *
  *		'bufP' points to the current buffer, which is pinned and read-locked.
  *		On success exit, we have pin and read-lock on whichever page
  *		contains the right item; on failure, we have released all buffers.
@@ -338,6 +393,16 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum >= FirstOffsetNumber);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip the tuples that are moved by split operation
+						 * for the scan that has started when split was in
+						 * progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+							continue;
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -353,9 +418,52 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							page = BufferGetPage(so->hashso_bucket_buf);
+							opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+							blkno = _hash_get_oldblk(rel, opaque);
+
+							Assert(BlockNumberIsValid(blkno));
+							buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
+							/*
+							 * remember the old bucket buffer so as to release
+							 * the pin at end of scan.  If this scan already
+							 * has a pin on old buffer, then release it as one
+							 * pin is sufficient to hold-off vacuum to clean
+							 * the bucket where scan is in progress.
+							 */
+							if (BufferIsValid(so->hashso_old_bucket_buf))
+							{
+								_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+								so->hashso_old_bucket_buf = InvalidBuffer;
+							}
+							so->hashso_old_bucket_buf = buf;
+
+							page = BufferGetPage(buf);
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for tuples that are
+							 * moved by split in old bucket and it also
+							 * ensures that we won't retry to scan the old
+							 * bucket once the scan for same is finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
@@ -379,6 +487,16 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum <= maxoff);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip the tuples that are moved by split operation
+						 * for the scan that has started when split was in
+						 * progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+							continue;
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -394,9 +512,64 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							page = BufferGetPage(so->hashso_bucket_buf);
+							opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+							blkno = _hash_get_oldblk(rel, opaque);
+
+							/* read the old page */
+							Assert(BlockNumberIsValid(blkno));
+							buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
+							/*
+							 * remember the old bucket buffer so as to release
+							 * the pin at end of scan.  If this scan already
+							 * has a pin on old buffer, then release it as one
+							 * pin is sufficient to hold-off vacuum to clean
+							 * the bucket where scan is in progress.
+							 */
+							if (BufferIsValid(so->hashso_old_bucket_buf))
+							{
+								_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+								so->hashso_old_bucket_buf = InvalidBuffer;
+							}
+							so->hashso_old_bucket_buf = buf;
+
+							page = BufferGetPage(buf);
+
+							opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+							/*
+							 * For backward scan, we need to start scan from
+							 * the last overflow page of old bucket till
+							 * primary bucket page.
+							 */
+							while (BlockNumberIsValid(opaque->hasho_nextblkno))
+								_hash_readnext(rel, &buf, &page, &opaque);
+
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for tuples that are
+							 * moved by split in old bucket and it also
+							 * ensures that we won't retry to scan the old
+							 * bucket once the scan for same is finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
@@ -425,3 +598,39 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	ItemPointerSet(current, blkno, offnum);
 	return true;
 }
+
+/*
+ *	_hash_get_oldblk() -- get the block number from which current bucket
+ *			is being split.
+ */
+static BlockNumber
+_hash_get_oldblk(Relation rel, HashPageOpaque opaque)
+{
+	Bucket		curr_bucket;
+	Bucket		old_bucket;
+	uint32		mask;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	/*
+	 * To get the old bucket from the current bucket, we need a mask to modulo
+	 * into lower half of table.  This mask is stored in meta page as
+	 * hashm_lowmask, but here we can't rely on the same, because we need a
+	 * value of lowmask that was prevalent at the time when bucket split was
+	 * started.  lowmask is always equal to last bucket number in lower half
+	 * of the table which can be calculate from current bucket.
+	 */
+	curr_bucket = opaque->hasho_bucket;
+	mask = (((uint32) 1) << _hash_log2((uint32) curr_bucket) / 2) - 1;
+	old_bucket = curr_bucket & mask;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	blkno = BUCKET_TO_BLKNO(metap, old_bucket);
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index fa3f9b6..cd40ed7 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -52,6 +52,8 @@ typedef uint32 Bucket;
 #define LH_BUCKET_PAGE			(1 << 1)
 #define LH_BITMAP_PAGE			(1 << 2)
 #define LH_META_PAGE			(1 << 3)
+#define LH_BUCKET_PAGE_SPLIT	(1 << 4)
+#define LH_BUCKET_PAGE_HAS_GARBAGE	(1 << 5)
 
 typedef struct HashPageOpaqueData
 {
@@ -88,12 +90,6 @@ typedef struct HashScanOpaqueData
 	bool		hashso_bucket_valid;
 
 	/*
-	 * If we have a share lock on the bucket, we record it here.  When
-	 * hashso_bucket_blkno is zero, we have no such lock.
-	 */
-	BlockNumber hashso_bucket_blkno;
-
-	/*
 	 * We also want to remember which buffer we're currently examining in the
 	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
 	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
@@ -101,11 +97,23 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_curbuf;
 
+	/* remember the buffer associated with primary bucket */
+	Buffer		hashso_bucket_buf;
+
+	/*
+	 * remember the buffer associated with old primary bucket which is
+	 * required during the scan of the bucket for which split is in progress.
+	 */
+	Buffer		hashso_old_bucket_buf;
+
 	/* Current position of the scan, as an index TID */
 	ItemPointerData hashso_curpos;
 
 	/* Current position of the scan, as a heap TID */
 	ItemPointerData hashso_heappos;
+
+	/* Whether scan needs to skip tuples that are moved by split */
+	bool		hashso_skip_moved_tuples;
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -176,6 +184,8 @@ typedef HashMetaPageData *HashMetaPage;
 				  sizeof(ItemIdData) - \
 				  MAXALIGN(sizeof(HashPageOpaqueData)))
 
+#define INDEX_MOVED_BY_SPLIT_MASK	0x2000
+
 #define HASH_MIN_FILLFACTOR			10
 #define HASH_DEFAULT_FILLFACTOR		75
 
@@ -299,19 +309,17 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
 
 /* hashovfl.c */
-extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf);
+extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
 extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
 				   BufferAccessStrategy bstrategy);
 extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
 				 BlockNumber blkno, ForkNumber forkNum);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy);
 
 /* hashpage.c */
-extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
-extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
-extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
 extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
 			 int access, int flags);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
#2 Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#1)
1 attachment(s)
Re: Hash Indexes

On Tue, May 10, 2016 at 5:39 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Incomplete Splits
--------------------------
Incomplete splits can be completed either by vacuum or by insert, as both
need an exclusive lock on the bucket. If vacuum finds the split-in-progress
flag on a bucket, it will complete the split operation; vacuum won't see
this flag if a split is actually in progress on that bucket, because vacuum
needs a cleanup lock and the split retains its pin until the end of the
operation. To make this work for the insert operation, one simple idea could
be that if insert finds the split-in-progress flag, it releases its current
exclusive lock on the bucket and tries to acquire a cleanup lock on it; if
it gets the cleanup lock, it can complete the split and then insert the
tuple, otherwise it simply re-acquires an exclusive lock on the bucket and
performs the insertion. The disadvantage of trying to complete the split in
vacuum is that the split might require new pages, and allocating new pages
during vacuum is not advisable. The disadvantage of doing it at insert time
is that insert might skip it even if only a scan of the bucket is going on,
because the scan also retains a pin on the bucket; but I think that is not a
big deal. The actual completion of the split can be done in two ways: (a)
scan the new bucket and build a hash table of all the TIDs you find there;
when copying tuples from the old bucket, first probe the hash table and, if
you find a match, just skip that tuple (idea suggested by Robert Haas
offlist), or (b) delete all the tuples that are marked moved-by-split in the
new bucket and perform the split operation from the beginning using the old
bucket.

I have completed the patch with respect to incomplete splits and delayed
cleanup of garbage tuples. For incomplete splits, I have used option (a) as
mentioned above: an incomplete split is completed when an insertion sees the
split-in-progress flag on a bucket. The second major thing this new version of
the patch achieves is cleanup of garbage tuples, i.e. the tuples that are left
behind in the old bucket during a split. Currently (in HEAD), as part of a
split operation, we clean the tuples from the old bucket after moving them to
the new bucket, because we hold heavyweight locks on both the old and new
buckets for the whole split operation. In the new design, we take a cleanup
lock on the old bucket and an exclusive lock on the new bucket to perform the
split, and we don't retain those locks until the end (each lock is released as
we move on to the overflow pages). To clean up the tuples we need a cleanup
lock on the bucket, which we might not have at split end, so I chose to
perform the cleanup of garbage tuples during vacuum and when a re-split of the
bucket happens, since during both of those operations we do hold a cleanup
lock. We can extend the cleanup of garbage to other operations as well if
required.
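
The garbage-identification test itself is small; condensed from the
hashbulkdelete() changes in the attached patch (for a bucket flagged
LH_BUCKET_PAGE_HAS_GARBAGE, with declarations elided), a tuple left behind by
a split is simply one whose hash value no longer maps to the bucket being
cleaned:

/* does this tuple's hash value still map to the bucket we are cleaning? */
bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
							  local_metapage.hashm_maxbucket,
							  local_metapage.hashm_highmask,
							  local_metapage.hashm_lowmask);
if (bucket != cur_bucket)
{
	/* no: it was copied to another bucket by a split, so remove it here */
	deletable[ndeletable++] = offno;
	tuples_removed += 1;
}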

I have done some performance tests with this new version of the patch and the
results are along the same lines as in my previous e-mail. I have done some
functional testing of the patch as well. I think more detailed testing is
required; however, it is better to do that once the design is discussed and
agreed upon.

I have improved the code comments to make the new design clear, but one
can still have questions about the locking decisions I have taken in the
patch.  I think one of the important things to verify in the patch is the
locking strategy used for the different operations: I have replaced the
heavy-weight bucket locks with light-weight read and write locks, plus a
cleanup lock for the vacuum and split operations.  A condensed summary of
that mapping is sketched below.
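
For quick reference, the little program below is my summary of the lock
each operation takes on a bucket's primary page under the new design; the
enum and the table are just a reading aid, not code from the patch.

#include <stdio.h>

typedef enum { LOCK_READ, LOCK_WRITE, LOCK_CLEANUP } LockMode;

static const char *mode_name(LockMode m)
{
	switch (m)
	{
		case LOCK_READ:
			return "read (share)";
		case LOCK_WRITE:
			return "write (exclusive)";
		case LOCK_CLEANUP:
			return "cleanup";
	}
	return "?";
}

int main(void)
{
	static const struct
	{
		const char *operation;
		LockMode	primary_lock;
		const char *note;
	} ops[] = {
		{"scan", LOCK_READ, "pin on primary bucket page kept till end of scan"},
		{"insert", LOCK_WRITE, "finishes a pending split if it sees the flag"},
		{"vacuum", LOCK_CLEANUP, "also removes garbage left behind by a split"},
		{"split", LOCK_CLEANUP, "cleanup lock on old bucket, write lock on new"},
	};
	int i;

	for (i = 0; i < (int) (sizeof(ops) / sizeof(ops[0])); i++)
		printf("%-7s %-18s %s\n", ops[i].operation,
			   mode_name(ops[i].primary_lock), ops[i].note);
	return 0;
}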

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

concurrent_hash_index_v2.patch (application/octet-stream)
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index c1122b4..d02539a 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -400,7 +400,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
 	Buffer		buf;
 	Page		page;
 
-	_hash_getlock(rel, blkno, HASH_SHARE);
 	buf = _hash_getbuf_with_strategy(rel, blkno, HASH_READ, 0, bstrategy);
 	page = BufferGetPage(buf);
 
@@ -431,7 +430,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
 	}
 
 	_hash_relbuf(rel, buf);
-	_hash_droplock(rel, blkno, HASH_SHARE);
 }
 
 /*
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 49a6c81..861dbc8 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -407,12 +407,15 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
 	so->hashso_bucket_valid = false;
-	so->hashso_bucket_blkno = 0;
 	so->hashso_curbuf = InvalidBuffer;
+	so->hashso_bucket_buf = InvalidBuffer;
+	so->hashso_old_bucket_buf = InvalidBuffer;
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
 	ItemPointerSetInvalid(&(so->hashso_heappos));
 
+	so->hashso_skip_moved_tuples = false;
+
 	scan->opaque = so;
 
 	/* register scan in case we change pages it's using */
@@ -436,10 +439,15 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 		_hash_dropbuf(rel, so->hashso_curbuf);
 	so->hashso_curbuf = InvalidBuffer;
 
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	/* release pin we hold on old primary bucket */
+	if (BufferIsValid(so->hashso_old_bucket_buf))
+		_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+	so->hashso_old_bucket_buf = InvalidBuffer;
+
+	/* release pin we hold on primary bucket */
+	if (BufferIsValid(so->hashso_bucket_buf))
+		_hash_dropbuf(rel, so->hashso_bucket_buf);
+	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
@@ -453,6 +461,8 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 				scan->numberOfKeys * sizeof(ScanKeyData));
 		so->hashso_bucket_valid = false;
 	}
+
+	so->hashso_skip_moved_tuples = false;
 }
 
 /*
@@ -472,10 +482,15 @@ hashendscan(IndexScanDesc scan)
 		_hash_dropbuf(rel, so->hashso_curbuf);
 	so->hashso_curbuf = InvalidBuffer;
 
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	/* release pin we hold on old primary bucket */
+	if (BufferIsValid(so->hashso_old_bucket_buf))
+		_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+	so->hashso_old_bucket_buf = InvalidBuffer;
+
+	/* release pin we hold on primary bucket */
+	if (BufferIsValid(so->hashso_bucket_buf))
+		_hash_dropbuf(rel, so->hashso_bucket_buf);
+	so->hashso_bucket_buf = InvalidBuffer;
 
 	pfree(so);
 	scan->opaque = NULL;
@@ -486,6 +501,9 @@ hashendscan(IndexScanDesc scan)
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
  *
+ * This function also deletes the tuples that were moved by a split to
+ * another bucket.
+ *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 IndexBulkDeleteResult *
@@ -530,83 +548,60 @@ loop_top:
 	{
 		BlockNumber bucket_blkno;
 		BlockNumber blkno;
-		bool		bucket_dirty = false;
+		Buffer		bucket_buf;
+		Buffer		buf;
+		HashPageOpaque bucket_opaque;
+		Page		page;
+		bool		bucket_has_garbage = false;
 
 		/* Get address of bucket's start page */
 		bucket_blkno = BUCKET_TO_BLKNO(&local_metapage, cur_bucket);
 
-		/* Exclusive-lock the bucket so we can shrink it */
-		_hash_getlock(rel, bucket_blkno, HASH_EXCLUSIVE);
-
 		/* Shouldn't have any active scans locally, either */
 		if (_hash_has_active_scan(rel, cur_bucket))
 			elog(ERROR, "hash index has active scan during VACUUM");
 
-		/* Scan each page in bucket */
 		blkno = bucket_blkno;
-		while (BlockNumberIsValid(blkno))
-		{
-			Buffer		buf;
-			Page		page;
-			HashPageOpaque opaque;
-			OffsetNumber offno;
-			OffsetNumber maxoffno;
-			OffsetNumber deletable[MaxOffsetNumber];
-			int			ndeletable = 0;
-
-			vacuum_delay_point();
 
-			buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-											 info->strategy);
-			page = BufferGetPage(buf);
-			opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-			Assert(opaque->hasho_bucket == cur_bucket);
-
-			/* Scan each tuple in page */
-			maxoffno = PageGetMaxOffsetNumber(page);
-			for (offno = FirstOffsetNumber;
-				 offno <= maxoffno;
-				 offno = OffsetNumberNext(offno))
-			{
-				IndexTuple	itup;
-				ItemPointer htup;
+		/*
+		 * Maintain a cleanup lock on primary bucket till we scan all the
+		 * pages in bucket.  This is required to ensure that we don't delete
+		 * tuples which are needed for concurrent scans on buckets where split
+		 * is in progress.  Retaining it till end of bucket scan ensures that
+		 * concurrent split can't be started on it.  In future, we might want
+		 * to relax the requirement for vacuum to take cleanup lock only for
+		 * buckets where split is in progress, however for squeeze phase we
+		 * need a cleanup lock, otherwise squeeze will move the tuples to a
+		 * different location and that can lead to change in order of results.
+		 */
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy);
+		LockBufferForCleanup(buf);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
 
-				itup = (IndexTuple) PageGetItem(page,
-												PageGetItemId(page, offno));
-				htup = &(itup->t_tid);
-				if (callback(htup, callback_state))
-				{
-					/* mark the item for deletion */
-					deletable[ndeletable++] = offno;
-					tuples_removed += 1;
-				}
-				else
-					num_index_tuples += 1;
-			}
+		page = BufferGetPage(buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
-			/*
-			 * Apply deletions and write page if needed, advance to next page.
-			 */
-			blkno = opaque->hasho_nextblkno;
+		/*
+		 * If the bucket contains tuples that are moved by split, then we need
+		 * to delete such tuples on completion of split.  The cleanup lock on
+		 * bucket is not sufficient to detect whether a split is complete, as
+		 * the previous split could have been interrupted by cancel request or
+		 * error.
+		 */
+		if (H_HAS_GARBAGE(bucket_opaque) &&
+			!H_INCOMPLETE_SPLIT(bucket_opaque))
+			bucket_has_garbage = true;
 
-			if (ndeletable > 0)
-			{
-				PageIndexMultiDelete(page, deletable, ndeletable);
-				_hash_wrtbuf(rel, buf);
-				bucket_dirty = true;
-			}
-			else
-				_hash_relbuf(rel, buf);
-		}
+		bucket_buf = buf;
 
-		/* If we deleted anything, try to compact free space */
-		if (bucket_dirty)
-			_hash_squeezebucket(rel, cur_bucket, bucket_blkno,
-								info->strategy);
+		hashbucketcleanup(rel, bucket_buf, blkno, info->strategy,
+						  local_metapage.hashm_maxbucket,
+						  local_metapage.hashm_highmask,
+						  local_metapage.hashm_lowmask, &tuples_removed,
+						  &num_index_tuples, bucket_has_garbage, true,
+						  callback, callback_state);
 
-		/* Release bucket lock */
-		_hash_droplock(rel, bucket_blkno, HASH_EXCLUSIVE);
+		_hash_relbuf(rel, bucket_buf);
 
 		/* Advance to next bucket */
 		cur_bucket++;
@@ -687,6 +682,155 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	return stats;
 }
 
+/*
+ * Helper function to perform deletion of index entries from a bucket.
+ *
+ * This expects that the caller has acquired a cleanup lock on the target
+ * bucket (primary page of a bucket) and it is the responsibility of the
+ * caller to release that lock.
+ */
+void
+hashbucketcleanup(Relation rel, Buffer bucket_buf,
+				  BlockNumber bucket_blkno,
+				  BufferAccessStrategy bstrategy,
+				  uint32 maxbucket,
+				  uint32 highmask, uint32 lowmask,
+				  double *tuples_removed,
+				  double *num_index_tuples,
+				  bool bucket_has_garbage,
+				  bool delay,
+				  IndexBulkDeleteCallback callback,
+				  void *callback_state)
+{
+	BlockNumber blkno;
+	Buffer		buf;
+	Bucket		cur_bucket;
+	Bucket new_bucket PG_USED_FOR_ASSERTS_ONLY;
+	Page		page;
+	bool		bucket_dirty = false;
+
+	blkno = bucket_blkno;
+	buf = bucket_buf;
+	page = BufferGetPage(buf);
+	cur_bucket = ((HashPageOpaque) PageGetSpecialPointer(page))->hasho_bucket;
+
+	if (bucket_has_garbage)
+		new_bucket = _hash_get_newbucket(rel, cur_bucket,
+										 lowmask, maxbucket);
+
+	/* Scan each page in bucket */
+	for (;;)
+	{
+		HashPageOpaque opaque;
+		OffsetNumber offno;
+		OffsetNumber maxoffno;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+		bool		release_buf = false;
+
+		if (delay)
+			vacuum_delay_point();
+
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		/* Scan each tuple in page */
+		maxoffno = PageGetMaxOffsetNumber(page);
+		for (offno = FirstOffsetNumber;
+			 offno <= maxoffno;
+			 offno = OffsetNumberNext(offno))
+		{
+			IndexTuple	itup;
+			ItemPointer htup;
+			Bucket		bucket;
+
+			itup = (IndexTuple) PageGetItem(page,
+											PageGetItemId(page, offno));
+			htup = &(itup->t_tid);
+			if (callback && callback(htup, callback_state))
+			{
+				/* mark the item for deletion */
+				deletable[ndeletable++] = offno;
+				*tuples_removed += 1;
+			}
+			else if (bucket_has_garbage)
+			{
+				/* delete the tuples that are moved by split. */
+				bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+											  maxbucket,
+											  highmask,
+											  lowmask);
+				/* mark the item for deletion */
+				if (bucket != cur_bucket)
+				{
+					/*
+					 * We expect tuples to either belong to current bucket or
+					 * new_bucket.  This is ensured because we don't allow
+					 * further splits from bucket that contains garbage. See
+					 * comments in _hash_expandtable.
+					 */
+					Assert(bucket == new_bucket);
+					deletable[ndeletable++] = offno;
+				}
+			}
+			else
+				*num_index_tuples += 1;
+		}
+
+		/*
+		 * We don't release the lock on primary bucket till end of bucket
+		 * scan.
+		 */
+		if (blkno != bucket_blkno)
+			release_buf = true;
+
+		blkno = opaque->hasho_nextblkno;
+
+		/*
+		 * Apply deletions and write page if needed, advance to next page.
+		 */
+		if (ndeletable > 0)
+		{
+			PageIndexMultiDelete(page, deletable, ndeletable);
+			if (release_buf)
+				_hash_wrtbuf(rel, buf);
+			else
+				MarkBufferDirty(buf);
+			bucket_dirty = true;
+		}
+		else if (release_buf)
+			_hash_relbuf(rel, buf);
+
+		/* bail out if there are no more pages to scan. */
+		if (!BlockNumberIsValid(blkno))
+			break;
+
+		buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+										 LH_OVERFLOW_PAGE,
+										 bstrategy);
+	}
+
+	/*
+	 * Clear the garbage flag from bucket after deleting the tuples that are
+	 * moved by split.  We purposefully clear the flag before squeeze bucket,
+	 * so that after restart, vacuum shouldn't again try to delete the moved
+	 * by split tuples.
+	 */
+	if (bucket_has_garbage)
+	{
+		HashPageOpaque bucket_opaque;
+
+		page = BufferGetPage(bucket_buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+	}
+
+	/* If we deleted anything, try to compact free space */
+	if (bucket_dirty)
+		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
+							bstrategy);
+}
 
 void
 hash_redo(XLogReaderState *record)
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index acd2e64..e7a7b51 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -18,6 +18,8 @@
 #include "access/hash.h"
 #include "utils/rel.h"
 
+static void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
+				   Buffer nbuf);
 
 /*
  *	_hash_doinsert() -- Handle insertion of a single index tuple.
@@ -28,7 +30,8 @@
 void
 _hash_doinsert(Relation rel, IndexTuple itup)
 {
-	Buffer		buf;
+	Buffer		buf = InvalidBuffer;
+	Buffer		bucket_buf;
 	Buffer		metabuf;
 	HashMetaPage metap;
 	BlockNumber blkno;
@@ -70,51 +73,136 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			errhint("Values larger than a buffer page cannot be indexed.")));
 
 	/*
-	 * Loop until we get a lock on the correct target bucket.
+	 * Conditionally get the lock on primary bucket page for insertion while
+	 * holding lock on meta page. If we have to wait, then release the meta
+	 * page lock and retry it in a hard way.
 	 */
-	for (;;)
-	{
-		/*
-		 * Compute the target bucket number, and convert to block number.
-		 */
-		bucket = _hash_hashkey2bucket(hashkey,
-									  metap->hashm_maxbucket,
-									  metap->hashm_highmask,
-									  metap->hashm_lowmask);
+	bucket = _hash_hashkey2bucket(hashkey,
+								  metap->hashm_maxbucket,
+								  metap->hashm_highmask,
+								  metap->hashm_lowmask);
 
-		blkno = BUCKET_TO_BLKNO(metap, bucket);
+	blkno = BUCKET_TO_BLKNO(metap, bucket);
 
-		/* Release metapage lock, but keep pin. */
+	/* Fetch the primary bucket page for the bucket */
+	buf = ReadBuffer(rel, blkno);
+	if (!ConditionalLockBuffer(buf))
+	{
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(buf, HASH_WRITE);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+		oldblkno = blkno;
+		retry = true;
+	}
+	else
+	{
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
 		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+	}
 
+	if (retry)
+	{
 		/*
-		 * If the previous iteration of this loop locked what is still the
-		 * correct target bucket, we are done.  Otherwise, drop any old lock
-		 * and lock what now appears to be the correct bucket.
+		 * Loop until we get a lock on the correct target bucket.  We get the
+		 * lock on primary bucket page and retain the pin on it during insert
+		 * operation to prevent the concurrent splits.  Retaining pin on a
+		 * primary bucket page ensures that split can't happen as it needs to
+		 * acquire the cleanup lock on primary bucket page.  Acquiring lock on
+		 * primary bucket and rechecking if it is a target bucket is mandatory
+		 * as otherwise a concurrent split might cause this insertion to fall
+		 * in wrong bucket.
 		 */
-		if (retry)
+		for (;;)
 		{
-			if (oldblkno == blkno)
-				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
-		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+			/*
+			 * Compute the target bucket number, and convert to block number.
+			 */
+			bucket = _hash_hashkey2bucket(hashkey,
+										  metap->hashm_maxbucket,
+										  metap->hashm_highmask,
+										  metap->hashm_lowmask);
 
-		/*
-		 * Reacquire metapage lock and check that no bucket split has taken
-		 * place while we were awaiting the bucket lock.
-		 */
-		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
-		oldblkno = blkno;
-		retry = true;
+			blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+			/* Release metapage lock, but keep pin. */
+			_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+			/*
+			 * If the previous iteration of this loop locked what is still the
+			 * correct target bucket, we are done.  Otherwise, drop any old
+			 * lock and lock what now appears to be the correct bucket.
+			 */
+			if (retry)
+			{
+				if (oldblkno == blkno)
+					break;
+				_hash_relbuf(rel, buf);
+			}
+
+			/* Fetch the primary bucket page for the bucket */
+			buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+
+			/*
+			 * Reacquire metapage lock and check that no bucket split has
+			 * taken place while we were awaiting the bucket lock.
+			 */
+			_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+			oldblkno = blkno;
+			retry = true;
+		}
 	}
 
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+	/* remember the primary bucket buffer to release the pin on it at end. */
+	bucket_buf = buf;
+
 	page = BufferGetPage(buf);
 	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(pageopaque->hasho_bucket == bucket);
 
+	/*
+	 * If there is any pending split, finish it before proceeding with the
+	 * insertion, as the insertion itself can cause a new split.  We don't
+	 * want to allow a split from a bucket that already has a pending split,
+	 * as there is no apparent benefit in doing so and it would complicate
+	 * the code to finish a split involving multiple buckets, considering
+	 * that the new split can also fail.
+	 */
+	if (H_NEW_INCOMPLETE_SPLIT(pageopaque))
+	{
+		BlockNumber oblkno;
+		Buffer		obuf;
+
+		oblkno = _hash_get_oldblk(rel, pageopaque);
+
+		/* Fetch the primary bucket page for the bucket */
+		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_BUCKET_PAGE);
+
+		_hash_finish_split(rel, metabuf, obuf, buf);
+
+		/*
+		 * release the buffer here as the insertion will happen in new bucket.
+		 */
+		_hash_relbuf(rel, obuf);
+	}
+	else if (H_OLD_INCOMPLETE_SPLIT(pageopaque))
+	{
+		BlockNumber nblkno;
+		Buffer		nbuf;
+
+		nblkno = _hash_get_newblk(rel, pageopaque);
+
+		/* Fetch the primary bucket page for the bucket */
+		nbuf = _hash_getbuf(rel, nblkno, HASH_WRITE, LH_BUCKET_PAGE);
+
+		_hash_finish_split(rel, metabuf, buf, nbuf);
+
+		/*
+		 * release the buffer here as the insertion will happen in old bucket.
+		 */
+		_hash_relbuf(rel, nbuf);
+	}
+
 	/* Do the insertion */
 	while (PageGetFreeSpace(page) < itemsz)
 	{
@@ -127,14 +215,23 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		{
 			/*
 			 * ovfl page exists; go get it.  if it doesn't have room, we'll
-			 * find out next pass through the loop test above.
+			 * find out next pass through the loop test above.  Retain the pin
+			 * if it is a primary bucket.
 			 */
-			_hash_relbuf(rel, buf);
+			if (pageopaque->hasho_flag & LH_BUCKET_PAGE)
+				_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+			else
+				_hash_relbuf(rel, buf);
 			buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 			page = BufferGetPage(buf);
 		}
 		else
 		{
+			bool		retain_pin = false;
+
+			/* page flags must be accessed before releasing lock on a page. */
+			retain_pin = pageopaque->hasho_flag & LH_BUCKET_PAGE;
+
 			/*
 			 * we're at the end of the bucket chain and we haven't found a
 			 * page with enough room.  allocate a new overflow page.
@@ -144,7 +241,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 
 			/* chain to a new overflow page */
-			buf = _hash_addovflpage(rel, metabuf, buf);
+			buf = _hash_addovflpage(rel, metabuf, buf, retain_pin);
 			page = BufferGetPage(buf);
 
 			/* should fit now, given test above */
@@ -158,11 +255,13 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	/* found page with enough space, so add the item here */
 	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
 
-	/* write and release the modified page */
+	/*
+	 * write and release the modified page and ensure to release the pin on
+	 * primary page.
+	 */
 	_hash_wrtbuf(rel, buf);
-
-	/* We can drop the bucket lock now */
-	_hash_droplock(rel, blkno, HASH_SHARE);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
 
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
@@ -188,6 +287,127 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 }
 
 /*
+ *	_hash_finish_split() -- Finish the previously interrupted split operation
+ *
+ * To complete the split operation, we form a hash table of the TIDs in the
+ * new bucket, which is then used by the split operation to skip tuples that
+ * were already moved before the split was previously interrupted.
+ *
+ * The caller must hold a pin, but no lock, on the metapage buffer.
+ * The buffer is returned in the same state.  (The metapage is only
+ * touched if it becomes necessary to add or remove overflow pages.)
+ *
+ * 'obuf' and 'nbuf' must be locked by the caller which is also responsible
+ * for unlocking them.
+ */
+static void
+_hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Buffer nbuf)
+{
+	HASHCTL		hash_ctl;
+	HTAB	   *tidhtab;
+	Buffer		bucket_nbuf;
+	Page		opage;
+	Page		npage;
+	HashPageOpaque opageopaque;
+	HashPageOpaque npageopaque;
+	HashMetaPage metap;
+	Bucket		obucket;
+	Bucket		nbucket;
+	uint32		maxbucket;
+	uint32		highmask;
+	uint32		lowmask;
+	bool		found;
+
+	/* Initialize hash tables used to track TIDs */
+	memset(&hash_ctl, 0, sizeof(hash_ctl));
+	hash_ctl.keysize = sizeof(ItemPointerData);
+	hash_ctl.entrysize = sizeof(ItemPointerData);
+	hash_ctl.hcxt = CurrentMemoryContext;
+
+	tidhtab =
+		hash_create("bucket ctids",
+					256,		/* arbitrary initial size */
+					&hash_ctl,
+					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+	/*
+	 * Scan the new bucket and build hash table of TIDs
+	 */
+	bucket_nbuf = nbuf;
+	npage = BufferGetPage(nbuf);
+	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	for (;;)
+	{
+		BlockNumber nblkno;
+		OffsetNumber noffnum;
+		OffsetNumber nmaxoffnum;
+
+		/* Scan each tuple in new page */
+		nmaxoffnum = PageGetMaxOffsetNumber(npage);
+		for (noffnum = FirstOffsetNumber;
+			 noffnum <= nmaxoffnum;
+			 noffnum = OffsetNumberNext(noffnum))
+		{
+			IndexTuple	itup;
+
+			/* Fetch the item's TID and insert it in hash table. */
+			itup = (IndexTuple) PageGetItem(npage,
+											PageGetItemId(npage, noffnum));
+
+			(void) hash_search(tidhtab, &itup->t_tid, HASH_ENTER, &found);
+
+			Assert(!found);
+		}
+
+		nblkno = npageopaque->hasho_nextblkno;
+
+		/*
+		 * release our write lock without modifying buffer and ensure to
+		 * retain the pin on primary bucket.
+		 */
+		if (nbuf == bucket_nbuf)
+			_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, nbuf);
+
+		/* Exit loop if no more overflow pages in new bucket */
+		if (!BlockNumberIsValid(nblkno))
+			break;
+
+		/* Else, advance to next page */
+		nbuf = _hash_getbuf(rel, nblkno, HASH_READ, LH_OVERFLOW_PAGE);
+		npage = BufferGetPage(nbuf);
+		npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	}
+
+	/* Get the metapage info */
+	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	maxbucket = metap->hashm_maxbucket;
+	highmask = metap->hashm_highmask;
+	lowmask = metap->hashm_lowmask;
+
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	_hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);
+
+	npage = BufferGetPage(bucket_nbuf);
+	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nbucket = npageopaque->hasho_bucket;
+
+	opage = BufferGetPage(obuf);
+	opageopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+	obucket = opageopaque->hasho_bucket;
+
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, bucket_nbuf, tidhtab,
+						   maxbucket, highmask, lowmask);
+
+	hash_destroy(tidhtab);
+}
+
+/*
  *	_hash_pgaddtup() -- add a tuple to a particular page in the index.
  *
  * This routine adds the tuple to the page as requested; it does not write out
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index db3e268..760563a 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -82,23 +82,20 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *
  *	On entry, the caller must hold a pin but no lock on 'buf'.  The pin is
  *	dropped before exiting (we assume the caller is not interested in 'buf'
- *	anymore).  The returned overflow page will be pinned and write-locked;
- *	it is guaranteed to be empty.
+ *	anymore) if not asked to retain.  The pin will be retained only for the
+ *	primary bucket.  The returned overflow page will be pinned and
+ *	write-locked; it is guaranteed to be empty.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
  *
- *	The caller must hold at least share lock on the bucket, to ensure that
- *	no one else tries to compact the bucket meanwhile.  This guarantees that
- *	'buf' won't stop being part of the bucket while it's unlocked.
- *
  * NB: since this could be executed concurrently by multiple processes,
  * one should not assume that the returned overflow page will be the
  * immediate successor of the originally passed 'buf'.  Additional overflow
  * pages might have been added to the bucket chain in between.
  */
 Buffer
-_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
+_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 {
 	Buffer		ovflbuf;
 	Page		page;
@@ -131,7 +128,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 			break;
 
 		/* we assume we do not need to write the unmodified page */
-		_hash_relbuf(rel, buf);
+		if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
 
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
@@ -149,7 +149,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 
 	/* logically chain overflow page to previous page */
 	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	_hash_wrtbuf(rel, buf);
+	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+		_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_wrtbuf(rel, buf);
 
 	return ovflbuf;
 }
@@ -370,11 +373,11 @@ _hash_firstfreebit(uint32 map)
  *	in the bucket, or InvalidBlockNumber if no following page.
  *
  *	NB: caller must not hold lock on metapage, nor on either page that's
- *	adjacent in the bucket chain.  The caller had better hold exclusive lock
- *	on the bucket, too.
+ *	adjacent in the bucket chain except from primary bucket.  The caller had
+ *	better hold cleanup lock on the primary bucket.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf,
+_hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
@@ -413,22 +416,41 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf,
 	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
-	 * deleted.  No concurrency issues since we hold exclusive lock on the
-	 * entire bucket.
+	 * deleted.  No concurrency issues since we hold the cleanup lock on
+	 * primary bucket.  We don't need to acquire buffer lock to fix the
+	 * primary bucket, as we already have that lock.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
-														 prevblkno,
-														 HASH_WRITE,
+		if (prevblkno == bucket_blkno)
+		{
+			Buffer		prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
+													 prevblkno,
+													 RBM_NORMAL,
+													 bstrategy);
+
+			Page		prevpage = BufferGetPage(prevbuf);
+			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+			Assert(prevopaque->hasho_bucket == bucket);
+			prevopaque->hasho_nextblkno = nextblkno;
+			MarkBufferDirty(prevbuf);
+			ReleaseBuffer(prevbuf);
+		}
+		else
+		{
+			Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
+															 prevblkno,
+															 HASH_WRITE,
 										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		prevpage = BufferGetPage(prevbuf);
-		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+															 bstrategy);
+			Page		prevpage = BufferGetPage(prevbuf);
+			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
 
-		Assert(prevopaque->hasho_bucket == bucket);
-		prevopaque->hasho_nextblkno = nextblkno;
-		_hash_wrtbuf(rel, prevbuf);
+			Assert(prevopaque->hasho_bucket == bucket);
+			prevopaque->hasho_nextblkno = nextblkno;
+			_hash_wrtbuf(rel, prevbuf);
+		}
 	}
 	if (BlockNumberIsValid(nextblkno))
 	{
@@ -570,7 +592,7 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
  *	required that to be true on entry as well, but it's a lot easier for
  *	callers to leave empty overflow pages and let this guy clean it up.
  *
- *	Caller must hold exclusive lock on the target bucket.  This allows
+ *	Caller must hold cleanup lock on the target bucket.  This allows
  *	us to safely lock multiple pages in the bucket.
  *
  *	Since this function is invoked in VACUUM, we provide an access strategy
@@ -580,6 +602,7 @@ void
 _hash_squeezebucket(Relation rel,
 					Bucket bucket,
 					BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy)
 {
 	BlockNumber wblkno;
@@ -591,27 +614,22 @@ _hash_squeezebucket(Relation rel,
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
 	bool		wbuf_dirty;
+	bool		release_buf = false;
 
 	/*
 	 * start squeezing into the base bucket page.
 	 */
 	wblkno = bucket_blkno;
-	wbuf = _hash_getbuf_with_strategy(rel,
-									  wblkno,
-									  HASH_WRITE,
-									  LH_BUCKET_PAGE,
-									  bstrategy);
+	wbuf = bucket_buf;
 	wpage = BufferGetPage(wbuf);
 	wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 
 	/*
-	 * if there aren't any overflow pages, there's nothing to squeeze.
+	 * if there aren't any overflow pages, there's nothing to squeeze. caller
+	 * is responsible to release the lock on primary bucket.
 	 */
 	if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
-	{
-		_hash_relbuf(rel, wbuf);
 		return;
-	}
 
 	/*
 	 * Find the last page in the bucket chain by starting at the base bucket
@@ -669,12 +687,17 @@ _hash_squeezebucket(Relation rel,
 			{
 				Assert(!PageIsEmpty(wpage));
 
+				if (wblkno != bucket_blkno)
+					release_buf = true;
+
 				wblkno = wopaque->hasho_nextblkno;
 				Assert(BlockNumberIsValid(wblkno));
 
-				if (wbuf_dirty)
+				if (wbuf_dirty && release_buf)
 					_hash_wrtbuf(rel, wbuf);
-				else
+				else if (wbuf_dirty)
+					MarkBufferDirty(wbuf);
+				else if (release_buf)
 					_hash_relbuf(rel, wbuf);
 
 				/* nothing more to do if we reached the read page */
@@ -700,6 +723,7 @@ _hash_squeezebucket(Relation rel,
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
 				wbuf_dirty = false;
+				release_buf = false;
 			}
 
 			/*
@@ -733,19 +757,25 @@ _hash_squeezebucket(Relation rel,
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
 		{
-			/* yes, so release wbuf lock first */
-			if (wbuf_dirty)
+			if (wblkno != bucket_blkno)
+				release_buf = true;
+
+			/* yes, so release wbuf lock first if needed */
+			if (wbuf_dirty && release_buf)
 				_hash_wrtbuf(rel, wbuf);
-			else
+			else if (wbuf_dirty)
+				MarkBufferDirty(wbuf);
+			else if (release_buf)
 				_hash_relbuf(rel, wbuf);
+
 			/* free this overflow page (releases rbuf) */
-			_hash_freeovflpage(rel, rbuf, bstrategy);
+			_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
 			/* done */
 			return;
 		}
 
 		/* free this overflow page, then get the previous one */
-		_hash_freeovflpage(rel, rbuf, bstrategy);
+		_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
 
 		rbuf = _hash_getbuf_with_strategy(rel,
 										  rblkno,
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 178463f..83007ac 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -38,7 +38,7 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
 					uint32 nblocks);
 static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
@@ -55,46 +55,6 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 
 
 /*
- * _hash_getlock() -- Acquire an lmgr lock.
- *
- * 'whichlock' should the block number of a bucket's primary bucket page to
- * acquire the per-bucket lock.  (See README for details of the use of these
- * locks.)
- *
- * 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
- */
-void
-_hash_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		LockPage(rel, whichlock, access);
-}
-
-/*
- * _hash_try_getlock() -- Acquire an lmgr lock, but only if it's free.
- *
- * Same as above except we return FALSE without blocking if lock isn't free.
- */
-bool
-_hash_try_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		return ConditionalLockPage(rel, whichlock, access);
-	else
-		return true;
-}
-
-/*
- * _hash_droplock() -- Release an lmgr lock.
- */
-void
-_hash_droplock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		UnlockPage(rel, whichlock, access);
-}
-
-/*
  *	_hash_getbuf() -- Get a buffer by block number for read or write.
  *
  *		'access' must be HASH_READ, HASH_WRITE, or HASH_NOLOCK.
@@ -489,9 +449,11 @@ _hash_pageinit(Page page, Size size)
 /*
  * Attempt to expand the hash table by creating one new bucket.
  *
- * This will silently do nothing if it cannot get the needed locks.
+ * This will silently do nothing if there are active scans of our own
+ * backend or if we don't get cleanup lock on old bucket.
  *
- * The caller should hold no locks on the hash index.
+ * We do remove the tuples from old bucket, if there are any left over from
+ * previous split.
  *
  * The caller must hold a pin, but no lock, on the metapage buffer.
  * The buffer is returned in the same state.
@@ -506,10 +468,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	BlockNumber start_oblkno;
 	BlockNumber start_nblkno;
 	Buffer		buf_nblkno;
+	Buffer		buf_oblkno;
+	Page		opage;
+	HashPageOpaque oopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
 
+restart_expand:
+
 	/*
 	 * Write-lock the meta page.  It used to be necessary to acquire a
 	 * heavyweight lock to begin a split, but that is no longer required.
@@ -548,11 +515,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		goto fail;
 
 	/*
-	 * Determine which bucket is to be split, and attempt to lock the old
-	 * bucket.  If we can't get the lock, give up.
+	 * Determine which bucket is to be split, and attempt to take cleanup lock
+	 * on the old bucket.  If we can't get the lock, give up.
 	 *
-	 * The lock protects us against other backends, but not against our own
-	 * backend.  Must check for active scans separately.
+	 * The cleanup lock protects us against other backends, but not against
+	 * our own backend.  Must check for active scans separately.
+	 *
+	 * The cleanup lock is mainly to protect the split from concurrent
+	 * inserts; however, if there is any pending scan, it will give up, which
+	 * is not ideal but harmless.
 	 */
 	new_bucket = metap->hashm_maxbucket + 1;
 
@@ -563,11 +534,50 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	if (_hash_has_active_scan(rel, old_bucket))
 		goto fail;
 
-	if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
+	buf_oblkno = ReadBuffer(rel, start_oblkno);
+	if (!ConditionalLockBufferForCleanup(buf_oblkno))
+	{
+		ReleaseBuffer(buf_oblkno);
 		goto fail;
+	}
+	_hash_checkpage(rel, buf_oblkno, LH_BUCKET_PAGE);
+
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	/* we don't expect any pending split at this stage. */
+	Assert(!H_INCOMPLETE_SPLIT(oopaque));
+
+	/*
+	 * Clean the tuples remaining from the previous split.  This operation
+	 * requires a cleanup lock and we already have one on the old bucket, so
+	 * let's do it.  We also don't want to allow further splits from the
+	 * bucket till the garbage of the previous split is cleaned.  This has
+	 * two advantages: first, it helps in avoiding bloat due to garbage, and
+	 * second, during cleanup of the bucket we are always sure that the
+	 * garbage tuples belong to the most recently split bucket.  On the
+	 * contrary, if we allowed cleanup of the bucket after the meta page is
+	 * updated to indicate the new split but before the actual split, the
+	 * cleanup operation wouldn't be able to decide whether a tuple had been
+	 * moved to the newly created bucket and could end up deleting it.
+	 */
+	if (H_HAS_GARBAGE(oopaque))
+	{
+		/* Release the metapage lock. */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+		hashbucketcleanup(rel, buf_oblkno, start_oblkno, NULL,
+						  metap->hashm_maxbucket, metap->hashm_highmask,
+						  metap->hashm_lowmask, NULL,
+						  NULL, true, false, NULL, NULL);
+
+		_hash_relbuf(rel, buf_oblkno);
+
+		goto restart_expand;
+	}
 
 	/*
-	 * Likewise lock the new bucket (should never fail).
+	 * There shouldn't be any active scan on new bucket.
 	 *
 	 * Note: it is safe to compute the new bucket's blkno here, even though we
 	 * may still need to update the BUCKET_TO_BLKNO mapping.  This is because
@@ -579,9 +589,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	if (_hash_has_active_scan(rel, new_bucket))
 		elog(ERROR, "scan in progress on supposedly new bucket");
 
-	if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
-		elog(ERROR, "could not get lock on supposedly new bucket");
-
 	/*
 	 * If the split point is increasing (hashm_maxbucket's log base 2
 	 * increases), we need to allocate a new batch of bucket pages.
@@ -600,8 +607,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
 			/* can't split due to BlockNumber overflow */
-			_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-			_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
+			_hash_relbuf(rel, buf_oblkno);
 			goto fail;
 		}
 	}
@@ -609,7 +615,8 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/*
 	 * Physically allocate the new bucket's primary page.  We want to do this
 	 * before changing the metapage's mapping info, in case we can't get the
-	 * disk space.
+	 * disk space.  We don't need to take cleanup lock on new bucket as no
+	 * other backend could find this bucket unless meta page is updated.
 	 */
 	buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM);
 
@@ -665,13 +672,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  start_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno,
 					  maxbucket, highmask, lowmask);
 
-	/* Release bucket locks, allowing others to access them */
-	_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-	_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
-
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -745,6 +748,10 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
  * The buffer is returned in the same state.  (The metapage is only
  * touched if it becomes necessary to add or remove overflow pages.)
  *
+ * Split needs to hold pin on primary bucket pages of both old and new
+ * buckets till the end of the operation.  This is to prevent vacuum from
+ * starting while a split is in progress.
+ *
  * In addition, the caller must have created the new bucket's base page,
  * which is passed in buffer nbuf, pinned and write-locked.  That lock and
  * pin are released here.  (The API is set up this way because we must do
@@ -756,37 +763,87 @@ _hash_splitbucket(Relation rel,
 				  Buffer metabuf,
 				  Bucket obucket,
 				  Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Buffer		obuf;
 	Page		opage;
 	Page		npage;
 	HashPageOpaque oopaque;
 	HashPageOpaque nopaque;
 
-	/*
-	 * It should be okay to simultaneously write-lock pages from each bucket,
-	 * since no one else can be trying to acquire buffer lock on pages of
-	 * either bucket.
-	 */
-	obuf = _hash_getbuf(rel, start_oblkno, HASH_WRITE, LH_BUCKET_PAGE);
 	opage = BufferGetPage(obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 
+	/*
+	 * Mark the old bucket to indicate that split is in progress and it has
+	 * deletable tuples. At operation end, we clear split in progress flag and
+	 * vacuum will clear page_has_garbage flag after deleting such tuples.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
+
 	npage = BufferGetPage(nbuf);
 
-	/* initialize the new bucket's primary page */
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 	nopaque->hasho_prevblkno = InvalidBlockNumber;
 	nopaque->hasho_nextblkno = InvalidBlockNumber;
 	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
 	nopaque->hasho_page_id = HASHO_PAGE_ID;
 
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, nbuf, NULL,
+						   maxbucket, highmask, lowmask);
+
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, obuf);
+	_hash_relbuf(rel, nbuf);
+}
+
+/*
+ * _hash_splitbucket_guts -- Helper function to perform the split operation
+ *
+ * This routine is used to partition the tuples between old and new bucket and
+ * is used to finish the incomplete split operations.  To finish the previously
+ * interrupted split operation, caller needs to fill htab.  If htab is set, then
+ * we skip the movement of tuples that exist in htab, otherwise a NULL value of
+ * htab indicates movement of all the tuples that belong to new bucket.
+ *
+ * Caller needs to lock and unlock the old and new primary buckets.
+ */
+void
+_hash_splitbucket_guts(Relation rel,
+					   Buffer metabuf,
+					   Bucket obucket,
+					   Bucket nbucket,
+					   Buffer obuf,
+					   Buffer nbuf,
+					   HTAB *htab,
+					   uint32 maxbucket,
+					   uint32 highmask,
+					   uint32 lowmask)
+{
+	Buffer		bucket_obuf;
+	Buffer		bucket_nbuf;
+	Page		opage;
+	Page		npage;
+	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
+
+	bucket_obuf = obuf;
+	opage = BufferGetPage(obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	bucket_nbuf = nbuf;
+	npage = BufferGetPage(nbuf);
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
 	/*
 	 * Partition the tuples in the old bucket between the old bucket and the
 	 * new bucket, advancing along the old bucket's overflow bucket chain and
@@ -798,8 +855,6 @@ _hash_splitbucket(Relation rel,
 		BlockNumber oblkno;
 		OffsetNumber ooffnum;
 		OffsetNumber omaxoffnum;
-		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
 
 		/* Scan each tuple in old page */
 		omaxoffnum = PageGetMaxOffsetNumber(opage);
@@ -810,18 +865,45 @@ _hash_splitbucket(Relation rel,
 			IndexTuple	itup;
 			Size		itemsz;
 			Bucket		bucket;
+			bool		found = false;
 
 			/*
-			 * Fetch the item's hash key (conveniently stored in the item) and
-			 * determine which bucket it now belongs in.
+			 * Before inserting tuple, probe the hash table containing TIDs of
+			 * tuples belonging to new bucket, if we find a match, then skip
+			 * that tuple, else fetch the item's hash key (conveniently stored
+			 * in the item) and determine which bucket it now belongs in.
 			 */
 			itup = (IndexTuple) PageGetItem(opage,
 											PageGetItemId(opage, ooffnum));
+
+			if (htab)
+				(void) hash_search(htab, &itup->t_tid, HASH_FIND, &found);
+
+			if (found)
+				continue;
+
 			bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
 										  maxbucket, highmask, lowmask);
 
 			if (bucket == nbucket)
 			{
+				Size		itupsize = 0;
+				IndexTuple	new_itup;
+
+				/*
+				 * make a copy of index tuple as we have to scribble on it.
+				 */
+				new_itup = CopyIndexTuple(itup);
+
+				/*
+				 * mark the index tuple as moved by split, such tuples are
+				 * skipped by scan if there is split in progress for a bucket.
+				 */
+				itupsize = new_itup->t_info & INDEX_SIZE_MASK;
+				new_itup->t_info &= ~INDEX_SIZE_MASK;
+				new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
+				new_itup->t_info |= itupsize;
+
 				/*
 				 * insert the tuple into the new bucket.  if it doesn't fit on
 				 * the current page in the new bucket, we must allocate a new
@@ -832,17 +914,25 @@ _hash_splitbucket(Relation rel,
 				 * only partially complete, meaning the index is corrupt,
 				 * since searches may fail to find entries they should find.
 				 */
-				itemsz = IndexTupleDSize(*itup);
+				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
 				if (PageGetFreeSpace(npage) < itemsz)
 				{
+					bool		retain_pin = false;
+
+					/*
+					 * page flags must be accessed before releasing lock on a
+					 * page.
+					 */
+					retain_pin = nopaque->hasho_flag & LH_BUCKET_PAGE;
+
 					/* write out nbuf and drop lock, but keep pin */
 					_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
 					/* chain to a new overflow page */
-					nbuf = _hash_addovflpage(rel, metabuf, nbuf);
+					nbuf = _hash_addovflpage(rel, metabuf, nbuf, retain_pin);
 					npage = BufferGetPage(nbuf);
-					/* we don't need nopaque within the loop */
+					nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 				}
 
 				/*
@@ -852,12 +942,10 @@ _hash_splitbucket(Relation rel,
 				 * Possible future improvement: accumulate all the items for
 				 * the new page and qsort them before insertion.
 				 */
-				(void) _hash_pgaddtup(rel, nbuf, itemsz, itup);
+				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
-				/*
-				 * Mark tuple for deletion from old page.
-				 */
-				deletable[ndeletable++] = ooffnum;
+				/* be tidy */
+				pfree(new_itup);
 			}
 			else
 			{
@@ -870,15 +958,9 @@ _hash_splitbucket(Relation rel,
 
 		oblkno = oopaque->hasho_nextblkno;
 
-		/*
-		 * Done scanning this old page.  If we moved any tuples, delete them
-		 * from the old page.
-		 */
-		if (ndeletable > 0)
-		{
-			PageIndexMultiDelete(opage, deletable, ndeletable);
-			_hash_wrtbuf(rel, obuf);
-		}
+		/* retain the pin on the old primary bucket */
+		if (obuf == bucket_obuf)
+			_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
 		else
 			_hash_relbuf(rel, obuf);
 
@@ -887,18 +969,42 @@ _hash_splitbucket(Relation rel,
 			break;
 
 		/* Else, advance to next old page */
-		obuf = _hash_getbuf(rel, oblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
+		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
 		opage = BufferGetPage(obuf);
 		oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 	}
 
 	/*
 	 * We're at the end of the old bucket chain, so we're done partitioning
-	 * the tuples.  Before quitting, call _hash_squeezebucket to ensure the
-	 * tuples remaining in the old bucket (including the overflow pages) are
-	 * packed as tightly as possible.  The new bucket is already tight.
+	 * the tuples.  Mark the old and new buckets to indicate split is
+	 * finished.
+	 */
+	if (!(nopaque->hasho_flag & LH_BUCKET_PAGE))
+		_hash_wrtbuf(rel, nbuf);
+
+	_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+	opage = BufferGetPage(bucket_obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	/*
+	 * need to acquire the write lock only if current bucket is not a primary
+	 * bucket, otherwise we already have a lock on it.
 	 */
-	_hash_wrtbuf(rel, nbuf);
+	if (!(nopaque->hasho_flag & LH_BUCKET_PAGE))
+	{
+		_hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);
+		npage = BufferGetPage(bucket_nbuf);
+		nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	}
 
-	_hash_squeezebucket(rel, obucket, start_oblkno, NULL);
+	/* indicate that split is finished */
+	oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
+	nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;
+
+	/*
+	 * now write the buffers; here we don't release the locks, as the caller
+	 * is responsible for releasing them.
+	 */
+	MarkBufferDirty(bucket_obuf);
+	MarkBufferDirty(bucket_nbuf);
 }
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..d87cf8b 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -72,7 +72,23 @@ _hash_readnext(Relation rel,
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_nextblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on the primary bucket page till the end of the scan to
+	 * ensure that vacuum can't delete the tuples that were moved by a split
+	 * to the new bucket.  Such tuples are required by scans started on split
+	 * buckets before the new bucket's split-in-progress flag
+	 * (LH_BUCKET_NEW_PAGE_SPLIT) is cleared.  The requirement to retain a pin
+	 * on the primary bucket could be relaxed for buckets that have not been
+	 * split, by checking the bucket's has_garbage flag, but we still need to
+	 * retain the pin for the squeeze phase, as moving tuples could otherwise
+	 * change the ordering of scan results, so let's keep it for all buckets.
+	 */
+	if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -94,7 +110,16 @@ _hash_readprev(Relation rel,
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_prevblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on primary bucket page till the end of scan. See
+	 * comments in _hash_readnext to know the reason of retaining pin.
+	 */
+	if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -192,43 +217,85 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	metap = HashPageGetMeta(page);
 
 	/*
-	 * Loop until we get a lock on the correct target bucket.
+	 * Conditionally get the lock on primary bucket page for search while
+	 * holding lock on meta page. If we have to wait, then release the meta
+	 * page lock and retry it in a hard way.
 	 */
-	for (;;)
-	{
-		/*
-		 * Compute the target bucket number, and convert to block number.
-		 */
-		bucket = _hash_hashkey2bucket(hashkey,
-									  metap->hashm_maxbucket,
-									  metap->hashm_highmask,
-									  metap->hashm_lowmask);
+	bucket = _hash_hashkey2bucket(hashkey,
+								  metap->hashm_maxbucket,
+								  metap->hashm_highmask,
+								  metap->hashm_lowmask);
 
-		blkno = BUCKET_TO_BLKNO(metap, bucket);
+	blkno = BUCKET_TO_BLKNO(metap, bucket);
 
-		/* Release metapage lock, but keep pin. */
+	/* Fetch the primary bucket page for the bucket */
+	buf = ReadBuffer(rel, blkno);
+	if (!ConditionalLockBufferShared(buf))
+	{
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(buf, HASH_READ);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+		oldblkno = blkno;
+		retry = true;
+	}
+	else
+	{
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
 		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+	}
 
+	if (retry)
+	{
 		/*
-		 * If the previous iteration of this loop locked what is still the
-		 * correct target bucket, we are done.  Otherwise, drop any old lock
-		 * and lock what now appears to be the correct bucket.
+		 * Loop until we get a lock on the correct target bucket.  We get the
+		 * lock on primary bucket page and retain the pin on it during read
+		 * operation to prevent the concurrent splits.  Retaining pin on a
+		 * primary bucket page ensures that split can't happen as it needs to
+		 * acquire the cleanup lock on primary bucket page.  Acquiring lock on
+		 * primary bucket and rechecking if it is a target bucket is mandatory
+		 * as otherwise a concurrent split followed by vacuum could remove
+		 * tuples from the selected bucket which otherwise would have been
+		 * visible.
 		 */
-		if (retry)
+		for (;;)
 		{
-			if (oldblkno == blkno)
-				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
+			/*
+			 * Compute the target bucket number, and convert to block number.
+			 */
+			bucket = _hash_hashkey2bucket(hashkey,
+										  metap->hashm_maxbucket,
+										  metap->hashm_highmask,
+										  metap->hashm_lowmask);
+
+			blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+			/* Release metapage lock, but keep pin. */
+			_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+			/*
+			 * If the previous iteration of this loop locked what is still the
+			 * correct target bucket, we are done.  Otherwise, drop any old
+			 * lock and lock what now appears to be the correct bucket.
+			 */
+			if (retry)
+			{
+				if (oldblkno == blkno)
+					break;
+				_hash_relbuf(rel, buf);
+			}
+
+			/* Fetch the primary bucket page for the bucket */
+			buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
+			/*
+			 * Reacquire metapage lock and check that no bucket split has
+			 * taken place while we were awaiting the bucket lock.
+			 */
+			_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+			oldblkno = blkno;
+			retry = true;
 		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
-
-		/*
-		 * Reacquire metapage lock and check that no bucket split has taken
-		 * place while we were awaiting the bucket lock.
-		 */
-		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
-		oldblkno = blkno;
-		retry = true;
 	}
 
 	/* done with the metapage */
@@ -237,14 +304,60 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	/* Update scan opaque state to show we have lock on the bucket */
 	so->hashso_bucket = bucket;
 	so->hashso_bucket_valid = true;
-	so->hashso_bucket_blkno = blkno;
 
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
 	page = BufferGetPage(buf);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(opaque->hasho_bucket == bucket);
 
+	so->hashso_bucket_buf = buf;
+
+	/*
+	 * If the bucket split is in progress, then we need to skip tuples that
+	 * are moved from old bucket.  To ensure that vacuum doesn't clean any
+	 * tuples from old or new buckets till this scan is in progress, maintain
+	 * a pin on both of the buckets.  Here, we have to be cautious about lock
+	 * ordering, first acquire the lock on old bucket, release the lock on old
+	 * bucket, but not the pin, then acquire the lock on the new bucket and
+	 * re-verify whether the bucket split is still in progress.  Acquiring lock
+	 * on old bucket first ensures that the vacuum waits for this scan to
+	 * finish.
+	 */
+	if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+	{
+		BlockNumber old_blkno;
+		Buffer		old_buf;
+
+		old_blkno = _hash_get_oldblk(rel, opaque);
+
+		/*
+		 * release the lock on new bucket and re-acquire it after acquiring
+		 * the lock on old bucket.
+		 */
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+
+		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+
+		/*
+		 * remember the old bucket buffer so as to use it later for scanning.
+		 */
+		so->hashso_old_bucket_buf = old_buf;
+		_hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK);
+
+		_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		Assert(opaque->hasho_bucket == bucket);
+
+		if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+			so->hashso_skip_moved_tuples = true;
+		else
+		{
+			_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+			so->hashso_old_bucket_buf = InvalidBuffer;
+		}
+	}
+
 	/* If a backwards scan is requested, move to the end of the chain */
 	if (ScanDirectionIsBackward(dir))
 	{
@@ -273,6 +386,13 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
  *		false.  Else, return true and set the hashso_curpos for the
  *		scan to the right thing.
  *
+ *		Here we also scan the old bucket if the split for current bucket
+ *		was in progress at the start of the scan.  The basic idea is to
+ *		skip the tuples that were moved by the split while scanning the current
+ *		bucket and then scan the old bucket to cover all such tuples. This
+ *		is done to ensure that we don't miss any tuples in the current scan
+ *		when split was in progress.
+ *
  *		'bufP' points to the current buffer, which is pinned and read-locked.
  *		On success exit, we have pin and read-lock on whichever page
  *		contains the right item; on failure, we have released all buffers.
@@ -338,6 +458,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum >= FirstOffsetNumber);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip tuples that were moved by a split operation
+						 * if this scan started while the split was in
+						 * progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberNext(offnum);	/* move forward */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -353,9 +486,41 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							buf = so->hashso_old_bucket_buf;
+
+							/*
+							 * the old bucket buffer must be valid as we
+							 * acquired a pin on it before the start of the
+							 * scan and retain it till the end of the scan.
+							 */
+							Assert(BufferIsValid(buf));
+
+							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+							page = BufferGetPage(buf);
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for moved-by-split
+							 * tuples in the old bucket, and also that we
+							 * won't try to scan the old bucket again once
+							 * its scan is finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
@@ -379,6 +544,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum <= maxoff);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip tuples that were moved by a split operation
+						 * if this scan started while the split was in
+						 * progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberPrev(offnum);	/* move back */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -394,9 +572,41 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							buf = so->hashso_old_bucket_buf;
+
+							/*
+							 * the old bucket buffer must be valid as we
+							 * acquired a pin on it before the start of the
+							 * scan and retain it till the end of the scan.
+							 */
+							Assert(BufferIsValid(buf));
+
+							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+							page = BufferGetPage(buf);
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for moved-by-split
+							 * tuples in the old bucket, and also that we
+							 * won't try to scan the old bucket again once
+							 * its scan is finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 456954b..bdbeb84 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -147,6 +147,23 @@ _hash_log2(uint32 num)
 }
 
 /*
+ * _hash_msb() -- returns the position of the most significant set bit.
+ */
+static uint32
+_hash_msb(uint32 num)
+{
+	uint32		i = 0;
+
+	while (num)
+	{
+		num = num >> 1;
+		++i;
+	}
+
+	return i - 1;
+}
+
+/*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
  * If flags is not zero, it is a bitwise OR of the acceptable values of
@@ -342,3 +359,123 @@ _hash_binsearch_last(Page page, uint32 hash_value)
 
 	return lower;
 }
+
+/*
+ *	_hash_get_oldblk() -- get the block number of the bucket from which
+ *			the current bucket is being split.
+ */
+BlockNumber
+_hash_get_oldblk(Relation rel, HashPageOpaque opaque)
+{
+	Bucket		curr_bucket;
+	Bucket		old_bucket;
+	uint32		mask;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	/*
+	 * To get the old bucket from the current bucket, we need a mask to modulo
+	 * into the lower half of the table.  This mask is stored in the meta page
+	 * as hashm_lowmask, but here we can't rely on it, because we need the
+	 * value of lowmask that was in effect at the time the bucket split
+	 * started.  Masking off the most significant bit of the new bucket gives
+	 * us the old bucket.
+	 */
+	curr_bucket = opaque->hasho_bucket;
+	mask = (((uint32) 1) << _hash_msb(curr_bucket)) - 1;
+	old_bucket = curr_bucket & mask;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	blkno = BUCKET_TO_BLKNO(metap, old_bucket);
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_newblk() -- get the block number of the new bucket that will
+ *			be generated after a split from the current bucket.
+ *
+ * This is used to find the new bucket from the old bucket based on the
+ * current table half.  It is mainly required to finish incomplete splits,
+ * where we are sure that no more than one split can be in progress from the
+ * old bucket.
+ */
+BlockNumber
+_hash_get_newblk(Relation rel, HashPageOpaque opaque)
+{
+	Bucket		curr_bucket;
+	Bucket		new_bucket;
+	uint32		lowmask;
+	uint32		mask;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	curr_bucket = opaque->hasho_bucket;
+
+	/*
+	 * The new bucket can be obtained by OR'ing the old bucket with the most
+	 * significant bit of the current table half.  There could be multiple
+	 * buckets that have split from the current bucket; we need the first
+	 * such bucket that exists based on the current table half.
+	 */
+	lowmask = metap->hashm_lowmask;
+
+	for (;;)
+	{
+		mask = lowmask + 1;
+		new_bucket = curr_bucket | mask;
+		if (new_bucket > metap->hashm_maxbucket)
+		{
+			lowmask = lowmask >> 1;
+			continue;
+		}
+		blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+		break;
+	}
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_newbucket() -- get the new bucket that will be generated after
+ *			a split from the current bucket.
+ *
+ * This is used to find the new bucket from the old bucket.  The new bucket
+ * can be obtained by OR'ing the old bucket with the most significant bit of
+ * the table half for the lowmask passed to this function.  There could be
+ * multiple buckets that have split from the current bucket; we need the
+ * first such bucket that exists.  The caller must ensure that no more than
+ * one split has happened from the old bucket.
+ */
+Bucket
+_hash_get_newbucket(Relation rel, Bucket curr_bucket,
+					uint32 lowmask, uint32 maxbucket)
+{
+	Bucket		new_bucket;
+	uint32		mask;
+
+	for (;;)
+	{
+		mask = lowmask + 1;
+		new_bucket = curr_bucket | mask;
+		if (new_bucket > maxbucket)
+		{
+			lowmask = lowmask >> 1;
+			continue;
+		}
+		break;
+	}
+
+	return new_bucket;
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b7ca9bf..00129ed 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3567,6 +3567,26 @@ ConditionalLockBuffer(Buffer buffer)
 }
 
 /*
+ * Acquire the content_lock for the buffer, but only if we don't have to wait.
+ *
+ * This assumes the caller wants BUFFER_LOCK_SHARED mode.
+ */
+bool
+ConditionalLockBufferShared(Buffer buffer)
+{
+	BufferDesc *buf;
+
+	Assert(BufferIsValid(buffer));
+	if (BufferIsLocal(buffer))
+		return true;			/* act as though we got it */
+
+	buf = GetBufferDescriptor(buffer - 1);
+
+	return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
+									LW_SHARED);
+}
+
+/*
  * LockBufferForCleanup - lock a buffer in preparation for deleting items
  *
  * Items may be deleted from a disk page only when the caller (a) holds an
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index fa3f9b6..3a64c9d 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -25,6 +25,7 @@
 #include "lib/stringinfo.h"
 #include "storage/bufmgr.h"
 #include "storage/lockdefs.h"
+#include "utils/hsearch.h"
 #include "utils/relcache.h"
 
 /*
@@ -52,6 +53,9 @@ typedef uint32 Bucket;
 #define LH_BUCKET_PAGE			(1 << 1)
 #define LH_BITMAP_PAGE			(1 << 2)
 #define LH_META_PAGE			(1 << 3)
+#define LH_BUCKET_NEW_PAGE_SPLIT	(1 << 4)
+#define LH_BUCKET_OLD_PAGE_SPLIT	(1 << 5)
+#define LH_BUCKET_PAGE_HAS_GARBAGE	(1 << 6)
 
 typedef struct HashPageOpaqueData
 {
@@ -64,6 +68,12 @@ typedef struct HashPageOpaqueData
 
 typedef HashPageOpaqueData *HashPageOpaque;
 
+#define H_HAS_GARBAGE(opaque)			((opaque)->hasho_flag & LH_BUCKET_PAGE_HAS_GARBAGE)
+#define H_OLD_INCOMPLETE_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
+#define H_NEW_INCOMPLETE_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+#define H_INCOMPLETE_SPLIT(opaque)		(((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT) || \
+										 ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT))
+
 /*
  * The page ID is for the convenience of pg_filedump and similar utilities,
  * which otherwise would have a hard time telling pages of different index
@@ -88,12 +98,6 @@ typedef struct HashScanOpaqueData
 	bool		hashso_bucket_valid;
 
 	/*
-	 * If we have a share lock on the bucket, we record it here.  When
-	 * hashso_bucket_blkno is zero, we have no such lock.
-	 */
-	BlockNumber hashso_bucket_blkno;
-
-	/*
 	 * We also want to remember which buffer we're currently examining in the
 	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
 	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
@@ -101,11 +105,23 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_curbuf;
 
+	/* remember the buffer associated with primary bucket */
+	Buffer		hashso_bucket_buf;
+
+	/*
+	 * remember the buffer associated with the old primary bucket, which is
+	 * required during the scan of a bucket for which a split is in progress.
+	 */
+	Buffer		hashso_old_bucket_buf;
+
 	/* Current position of the scan, as an index TID */
 	ItemPointerData hashso_curpos;
 
 	/* Current position of the scan, as a heap TID */
 	ItemPointerData hashso_heappos;
+
+	/* Whether scan needs to skip tuples that are moved by split */
+	bool		hashso_skip_moved_tuples;
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -176,6 +192,8 @@ typedef HashMetaPageData *HashMetaPage;
 				  sizeof(ItemIdData) - \
 				  MAXALIGN(sizeof(HashPageOpaqueData)))
 
+#define INDEX_MOVED_BY_SPLIT_MASK	0x2000
+
 #define HASH_MIN_FILLFACTOR			10
 #define HASH_DEFAULT_FILLFACTOR		75
 
@@ -224,9 +242,6 @@ typedef HashMetaPageData *HashMetaPage;
 #define HASH_WRITE		BUFFER_LOCK_EXCLUSIVE
 #define HASH_NOLOCK		(-1)
 
-#define HASH_SHARE		ShareLock
-#define HASH_EXCLUSIVE	ExclusiveLock
-
 /*
  *	Strategy number. There's only one valid strategy for hashing: equality.
  */
@@ -299,19 +314,17 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
 
 /* hashovfl.c */
-extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf);
+extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
 extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
-				   BufferAccessStrategy bstrategy);
+				   BlockNumber bucket_blkno, BufferAccessStrategy bstrategy);
 extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
 				 BlockNumber blkno, ForkNumber forkNum);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy);
 
 /* hashpage.c */
-extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
-extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
-extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
 extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
 			 int access, int flags);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
@@ -329,6 +342,10 @@ extern uint32 _hash_metapinit(Relation rel, double num_tuples,
 				ForkNumber forkNum);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
+extern void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
+					   Bucket obucket, Bucket nbucket, Buffer obuf,
+					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
+					   uint32 highmask, uint32 lowmask);
 
 /* hashscan.c */
 extern void _hash_regscan(IndexScanDesc scan);
@@ -363,10 +380,20 @@ extern IndexTuple _hash_form_tuple(Relation index,
 				 Datum *values, bool *isnull);
 extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
 extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
+extern BlockNumber _hash_get_oldblk(Relation rel, HashPageOpaque opaque);
+extern BlockNumber _hash_get_newblk(Relation rel, HashPageOpaque opaque);
+extern Bucket _hash_get_newbucket(Relation rel, Bucket curr_bucket,
+					uint32 lowmask, uint32 maxbucket);
 
 /* hash.c */
 extern void hash_redo(XLogReaderState *record);
 extern void hash_desc(StringInfo buf, XLogReaderState *record);
 extern const char *hash_identify(uint8 info);
+extern void hashbucketcleanup(Relation rel, Buffer bucket_buf,
+				  BlockNumber bucket_blkno, BufferAccessStrategy bstrategy,
+				  uint32 maxbucket, uint32 highmask, uint32 lowmask,
+				  double *tuples_removed, double *num_index_tuples,
+				  bool bucket_has_garbage, bool delay,
+				  IndexBulkDeleteCallback callback, void *callback_state);
 
 #endif   /* HASH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3d5dea7..4b318a8 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -226,6 +226,7 @@ extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
 extern void UnlockBuffers(void);
 extern void LockBuffer(Buffer buffer, int mode);
 extern bool ConditionalLockBuffer(Buffer buffer);
+extern bool ConditionalLockBufferShared(Buffer buffer);
 extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
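
As a minimal standalone sketch (not part of the patch), the old-bucket
computation used by _hash_get_oldblk() can be exercised like this; msb()
here mirrors the patch's _hash_msb(), and the bucket numbers are just
example values:

#include <stdio.h>
#include <stdint.h>

/* position of the most significant set bit, mirroring _hash_msb() above */
static uint32_t
msb(uint32_t num)
{
	uint32_t	i = 0;

	while (num)
	{
		num = num >> 1;
		++i;
	}

	return i - 1;
}

int
main(void)
{
	uint32_t	new_bucket;

	/* new buckets 4..7 are split from old buckets 0..3 respectively */
	for (new_bucket = 4; new_bucket < 8; new_bucket++)
	{
		uint32_t	mask = (((uint32_t) 1) << msb(new_bucket)) - 1;
		uint32_t	old_bucket = new_bucket & mask;

		printf("new bucket %u was split from old bucket %u\n",
			   new_bucket, old_bucket);
	}

	return 0;
}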
#3Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#1)
Re: Hash Indexes

On Tue, May 10, 2016 at 8:09 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

For making hash indexes usable in production systems, we need to improve its concurrency and make them crash-safe by WAL logging them. The first problem I would like to tackle is improve the concurrency of hash indexes. First advantage, I see with improving concurrency of hash indexes is that it has the potential of out performing btree for "equal to" searches (with my WIP patch attached with this mail, I could see hash index outperform btree index by 20 to 30% for very simple cases which are mentioned later in this e-mail). Another advantage as explained by Robert [1] earlier is that if we remove heavy weight locks under which we perform arbitrarily large number of operations, it can help us to sensibly WAL log it. With this patch, I would also like to make hash indexes capable of completing the incomplete_splits which can occur due to interrupts (like cancel) or errors or crash.

I have studied the concurrency problems of hash index and some of the solutions proposed for same previously and based on that came up with below solution which is based on idea by Robert [1], community discussion on thread [2] and some of my own thoughts.

Maintain a flag that can be set and cleared on the primary bucket page, call it split-in-progress, and a flag that can optionally be set on particular index tuples, call it moved-by-split. We will allow scans of all buckets and insertions into all buckets while the split is in progress, but (as now) we will not allow more than one split for a bucket to be in progress at the same time. We start the split by updating metapage to incrementing the number of buckets and set the split-in-progress flag in primary bucket pages for old and new buckets (lets number them as old bucket - N+1/2; new bucket - N + 1 for the matter of discussion). While the split-in-progress flag is set, any scans of N+1 will first scan that bucket, ignoring any tuples flagged moved-by-split, and then ALSO scan bucket N+1/2. To ensure that vacuum doesn't clean any tuples from old or new buckets till this scan is in progress, maintain a pin on both of the buckets (first pin on old bucket needs to be acquired). The moved-by-split flag never has any effect except when scanning the new bucket that existed at the start of that particular scan, and then only if the split-in-progress flag was also set at that time.

You really need parentheses in (N+1)/2. Because you are not trying to
add 1/2 to N. https://en.wikipedia.org/wiki/Order_of_operations

Once the split operation has set the split-in-progress flag, it will begin scanning bucket (N+1)/2. Every time it finds a tuple that properly belongs in bucket N+1, it will insert the tuple into bucket N+1 with the moved-by-split flag set. Tuples inserted by anything other than a split operation will leave this flag clear, and tuples inserted while the split is in progress will target the same bucket that they would hit if the split were already complete. Thus, bucket N+1 will end up with a mix of moved-by-split tuples, coming from bucket (N+1)/2, and unflagged tuples coming from parallel insertion activity. When the scan of bucket (N+1)/2 is complete, we know that bucket N+1 now contains all the tuples that are supposed to be there, so we clear the split-in-progress flag on both buckets. Future scans of both buckets can proceed normally. Split operation needs to take a cleanup lock on primary bucket to ensure that it doesn't start if there is any Insertion happening in the bucket. It will leave the lock on primary bucket, but not pin as it proceeds for next overflow page. Retaining pin on primary bucket will ensure that vacuum doesn't start on this bucket till the split is finished.

In the second-to-last sentence, I believe you have reversed the words
"lock" and "pin".

Insertion will happen by scanning the appropriate bucket and needs to retain pin on primary bucket to ensure that concurrent split doesn't happen, otherwise split might leave this tuple unaccounted.

What do you mean by "unaccounted"?

Now for deletion of tuples from (N+1/2) bucket, we need to wait for the completion of any scans that began before we finished populating bucket N+1, because otherwise we might remove tuples that they're still expecting to find in bucket (N+1)/2. The scan will always maintain a pin on primary bucket and Vacuum can take a buffer cleanup lock (cleanup lock includes Exclusive lock on bucket and wait till all the pins on buffer becomes zero) on primary bucket for the buffer. I think we can relax the requirement for vacuum to take cleanup lock (instead take Exclusive Lock on buckets where no split has happened) with the additional flag has_garbage which will be set on primary bucket, if any tuples have been moved from that bucket, however I think for squeeze phase (in this phase, we try to move the tuples from later overflow pages to earlier overflow pages in the bucket and then if there are any empty overflow pages, then we move them to kind of a free pool) of vacuum, we need a cleanup lock, otherwise scan results might get effected.

affected, not effected.

I think this is basically correct, although I don't find it to be as
clear as I think it could be. It seems very clear that any operation
which potentially changes the order of tuples in the bucket chain,
such as the squeeze phase as currently implemented, also needs to
exclude all concurrent scans. However, I think that it's OK for
vacuum to remove tuples from a given page with only an exclusive lock
on that particular page. Also, I think that when cleaning up after a
split, an exclusive lock is likewise sufficient to remove tuples from
a particular page provided that we know that every scan currently in
progress started after split-in-progress was set. If each scan holds
a pin on the primary bucket and setting the split-in-progress flag
requires a cleanup lock on that page, then this is always true.

(Plain text email is preferred to HTML on this mailing list.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#2)
Re: Hash Indexes

On Thu, Jun 16, 2016 at 3:28 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Incomplete splits can be completed either by vacuum or insert as both
needs exclusive lock on bucket. If vacuum finds split-in-progress flag on a
bucket then it will complete the split operation, vacuum won't see this flag
if actually split is in progress on that bucket as vacuum needs cleanup lock
and split retains pin till end of operation. To make it work for Insert
operation, one simple idea could be that if insert finds split-in-progress
flag, then it releases the current exclusive lock on bucket and tries to
acquire a cleanup lock on bucket, if it gets cleanup lock, then it can
complete the split and then the insertion of tuple, else it will have a
exclusive lock on bucket and just perform the insertion of tuple. The
disadvantage of trying to complete the split in vacuum is that split might
require new pages and allocating new pages at time of vacuum is not
advisable. The disadvantage of doing it at time of Insert is that Insert
might skip it even if there is some scan on the bucket is going on as scan
will also retain pin on the bucket, but I think that is not a big deal. The
actual completion of split can be done in two ways: (a) scan the new bucket
and build a hash table with all of the TIDs you find there. When copying
tuples from the old bucket, first probe the hash table; if you find a match,
just skip that tuple (idea suggested by Robert Haas offlist) (b) delete all
the tuples that are marked as moved_by_split in the new bucket and perform
the split operation from the beginning using old bucket.

I have completed the patch with respect to incomplete splits and delayed
cleanup of garbage tuples. For incomplete splits, I have used the option
(a) as mentioned above. The incomplete splits are completed if the
insertion sees split-in-progress flag in a bucket.

It seems to me that there is a potential performance problem here. If
the split is still being performed, every insert will see the
split-in-progress flag set. The in-progress split retains only a pin
on the primary bucket, so other backends could also get an exclusive
lock, which is all they need for an insert. It seems that under this
algorithm they will now take the exclusive lock, release the exclusive
lock, try to take a cleanup lock, fail, again take the exclusive lock.
That seems like a lot of extra monkeying around. Wouldn't it be
better to take the exclusive lock and then afterwards check if the pin
count is 1? If so, even though we only intended to take an exclusive
lock, it is actually a cleanup lock. If not, we can simply proceed
with the insertion. This way you avoid unlocking and relocking the
buffer repeatedly.
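
For illustration only, a minimal sketch of that suggestion, assuming a
hypothetical pin-count check and a hypothetical split-finishing routine
(neither appears in the patch as posted); H_INCOMPLETE_SPLIT() is the flag
test defined in hash.h by the patch:

	/* take the exclusive lock needed for the insertion anyway */
	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

	/*
	 * If our pin is the only one, the exclusive lock is effectively a
	 * cleanup lock, so finish the incomplete split first.  Both helpers
	 * below are hypothetical names used only for this sketch.
	 */
	if (H_INCOMPLETE_SPLIT(pageopaque) &&
		BufferIsPinnedOnlyByMe(buf))	/* hypothetical: pin count == 1? */
		_hash_finish_incomplete_split(rel, buf);	/* hypothetical */

	/* either way, proceed with the insertion under the exclusive lock */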

The second major thing
this new version of patch has achieved is cleanup of garbage tuples i.e the
tuples that are left in old bucket during split. Currently (in HEAD), as
part of a split operation, we clean the tuples from old bucket after moving
them to new bucket, as we have heavy-weight locks on both old and new bucket
till the whole split operation. In the new design, we need to take cleanup
lock on old bucket and exclusive lock on new bucket to perform the split
operation and we don't retain those locks till the end (release the lock as
we move on to overflow buckets). Now to cleanup the tuples we need a
cleanup lock on a bucket which we might not have at split-end. So I choose
to perform the cleanup of garbage tuples during vacuum and when re-split of
the bucket happens as during both the operations, we do hold cleanup lock.
We can extend the cleanup of garbage to other operations as well if
required.

I think it's OK for the squeeze phase to be deferred until vacuum or a
subsequent split, but simply removing dead tuples seems like it should
be done earlier if possible. As I noted in my last email, it seems
like any process that gets an exclusive lock can do that, and probably
should. Otherwise, the index might become quite bloated.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#3)
Re: Hash Indexes

On Tue, Jun 21, 2016 at 9:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, May 10, 2016 at 8:09 AM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

Once the split operation has set the split-in-progress flag, it will

begin scanning bucket (N+1)/2. Every time it finds a tuple that properly
belongs in bucket N+1, it will insert the tuple into bucket N+1 with the
moved-by-split flag set. Tuples inserted by anything other than a split
operation will leave this flag clear, and tuples inserted while the split
is in progress will target the same bucket that they would hit if the split
were already complete. Thus, bucket N+1 will end up with a mix of
moved-by-split tuples, coming from bucket (N+1)/2, and unflagged tuples
coming from parallel insertion activity. When the scan of bucket (N+1)/2
is complete, we know that bucket N+1 now contains all the tuples that are
supposed to be there, so we clear the split-in-progress flag on both
buckets. Future scans of both buckets can proceed normally. Split
operation needs to take a cleanup lock on primary bucket to ensure that it
doesn't start if there is any Insertion happening in the bucket. It will
leave the lock on primary bucket, but not pin as it proceeds for next
overflow page. Retaining pin on primary bucket will ensure that vacuum
doesn't start on this bucket till the split is finished.

In the second-to-last sentence, I believe you have reversed the words
"lock" and "pin".

Yes. What I mean to say is: release the lock, but retain the pin on the
primary bucket till the end of the operation.

Insertion will happen by scanning the appropriate bucket and needs to

retain pin on primary bucket to ensure that concurrent split doesn't
happen, otherwise split might leave this tuple unaccounted.

What do you mean by "unaccounted"?

It means that the split might leave this tuple in the old bucket even though
it could be moved to the new bucket. Consider a case where an insertion has
to add a tuple on some intermediate overflow page in the bucket chain; if we
allow a split while the insertion is in progress, the split might not move
this newly inserted tuple.

Now for deletion of tuples from (N+1/2) bucket, we need to wait for the

completion of any scans that began before we finished populating bucket
N+1, because otherwise we might remove tuples that they're still expecting
to find in bucket (N+1)/2. The scan will always maintain a pin on primary
bucket and Vacuum can take a buffer cleanup lock (cleanup lock includes
Exclusive lock on bucket and wait till all the pins on buffer becomes zero)
on primary bucket for the buffer. I think we can relax the requirement for
vacuum to take cleanup lock (instead take Exclusive Lock on buckets where
no split has happened) with the additional flag has_garbage which will be
set on primary bucket, if any tuples have been moved from that bucket,
however I think for squeeze phase (in this phase, we try to move the tuples
from later overflow pages to earlier overflow pages in the bucket and then
if there are any empty overflow pages, then we move them to kind of a free
pool) of vacuum, we need a cleanup lock, otherwise scan results might get
effected.

affected, not effected.

I think this is basically correct, although I don't find it to be as
clear as I think it could be. It seems very clear that any operation
which potentially changes the order of tuples in the bucket chain,
such as the squeeze phase as currently implemented, also needs to
exclude all concurrent scans. However, I think that it's OK for
vacuum to remove tuples from a given page with only an exclusive lock
on that particular page.

How can we guarantee that it doesn't remove a tuple that is required by a
scan which started after the split-in-progress flag was set?

Also, I think that when cleaning up after a
split, an exclusive lock is likewise sufficient to remove tuples from
a particular page provided that we know that every scan currently in
progress started after split-in-progress was set.

I think this could also have an issue similar to the above, unless we have
something which prevents concurrent scans.

(Plain text email is preferred to HTML on this mailing list.)

If I turn to Plain text [1], then the signature of my e-mail also changes
to Plain text, which I don't want. Is there a way I can retain signature
settings in Rich Text and mail content as Plain Text?

[1]: http://www.mail-signatures.com/articles/how-to-add-or-change-an-email-signature-in-gmailgoogle-apps/

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#6Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#4)
Re: Hash Indexes

On Tue, Jun 21, 2016 at 9:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jun 16, 2016 at 3:28 AM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

Incomplete splits can be completed either by vacuum or insert as both
needs exclusive lock on bucket. If vacuum finds split-in-progress

flag on a

bucket then it will complete the split operation, vacuum won't see

this flag

if actually split is in progress on that bucket as vacuum needs

cleanup lock

and split retains pin till end of operation. To make it work for

Insert

operation, one simple idea could be that if insert finds

split-in-progress

flag, then it releases the current exclusive lock on bucket and tries

to

acquire a cleanup lock on bucket, if it gets cleanup lock, then it can
complete the split and then the insertion of tuple, else it will have a
exclusive lock on bucket and just perform the insertion of tuple. The
disadvantage of trying to complete the split in vacuum is that split

might

require new pages and allocating new pages at time of vacuum is not
advisable. The disadvantage of doing it at time of Insert is that

Insert

might skip it even if there is some scan on the bucket is going on as

scan

will also retain pin on the bucket, but I think that is not a big

deal. The

actual completion of split can be done in two ways: (a) scan the new

bucket

and build a hash table with all of the TIDs you find there. When

copying

tuples from the old bucket, first probe the hash table; if you find a

match,

just skip that tuple (idea suggested by Robert Haas offlist) (b)

delete all

the tuples that are marked as moved_by_split in the new bucket and

perform

the split operation from the beginning using old bucket.

I have completed the patch with respect to incomplete splits and delayed
cleanup of garbage tuples. For incomplete splits, I have used the

option

(a) as mentioned above. The incomplete splits are completed if the
insertion sees split-in-progress flag in a bucket.

It seems to me that there is a potential performance problem here. If
the split is still being performed, every insert will see the
split-in-progress flag set. The in-progress split retains only a pin
on the primary bucket, so other backends could also get an exclusive
lock, which is all they need for an insert. It seems that under this
algorithm they will now take the exclusive lock, release the exclusive
lock, try to take a cleanup lock, fail, again take the exclusive lock.
That seems like a lot of extra monkeying around. Wouldn't it be
better to take the exclusive lock and then afterwards check if the pin
count is 1? If so, even though we only intended to take an exclusive
lock, it is actually a cleanup lock. If not, we can simply proceed
with the insertion. This way you avoid unlocking and relocking the
buffer repeatedly.

We can do it the way you are suggesting, but there is another thing we need
to consider here. As of now, the patch tries to finish the split if it finds
the split-in-progress flag in either the old or the new bucket. We need to
lock both the old and new buckets to finish the split, so it is quite
possible that two different backends try to lock them in opposite order,
leading to a deadlock. I think the correct way to handle this is to always
lock the old bucket first and then the new bucket. To achieve that, if an
insertion into the new bucket finds that the split-in-progress flag is set,
it needs to release its lock, acquire the lock on the old bucket first and
ensure its pin count is 1, and then lock the new bucket again and ensure
that its pin count is 1. I have already maintained this lock order in scans
(old bucket first and then new bucket; refer to the changes in
_hash_first()). Alternatively, we can try to finish the splits only when
someone tries to insert into the old bucket.
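
A minimal sketch of that lock ordering, assuming we already hold pins on
both buckets (illustration only, not the patch; variable names are assumed):

	/* give up our lock on the new bucket, but keep the pin */
	LockBuffer(new_buf, BUFFER_LOCK_UNLOCK);

	/* always lock in split order: old bucket first, then new bucket */
	LockBuffer(old_buf, BUFFER_LOCK_EXCLUSIVE);
	LockBuffer(new_buf, BUFFER_LOCK_EXCLUSIVE);

	/* both buckets are now locked in a deadlock-free order; finish the split */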

The second major thing
this new version of patch has achieved is cleanup of garbage tuples i.e

the

tuples that are left in old bucket during split. Currently (in HEAD),

as

part of a split operation, we clean the tuples from old bucket after

moving

them to new bucket, as we have heavy-weight locks on both old and new

bucket

till the whole split operation. In the new design, we need to take

cleanup

lock on old bucket and exclusive lock on new bucket to perform the split
operation and we don't retain those locks till the end (release the

lock as

we move on to overflow buckets). Now to cleanup the tuples we need a
cleanup lock on a bucket which we might not have at split-end. So I

choose

to perform the cleanup of garbage tuples during vacuum and when

re-split of

the bucket happens as during both the operations, we do hold cleanup

lock.

We can extend the cleanup of garbage to other operations as well if
required.

I think it's OK for the squeeze phase to be deferred until vacuum or a
subsequent split, but simply removing dead tuples seems like it should
be done earlier if possible.

Yes, probably we can do it at the time of insertion into a bucket, if we
don't have the concurrent-scan issue.

--

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#7Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#5)
Re: Hash Indexes

On Wed, Jun 22, 2016 at 5:10 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Insertion will happen by scanning the appropriate bucket and needs to
retain pin on primary bucket to ensure that concurrent split doesn't happen,
otherwise split might leave this tuple unaccounted.

What do you mean by "unaccounted"?

It means that split might leave this tuple in old bucket even if it can be
moved to new bucket. Consider a case where insertion has to add a tuple on
some intermediate overflow bucket in the bucket chain, if we allow split
when insertion is in progress, split might not move this newly inserted
tuple.

OK, that's a good point.

Now for deletion of tuples from (N+1/2) bucket, we need to wait for the
completion of any scans that began before we finished populating bucket N+1,
because otherwise we might remove tuples that they're still expecting to
find in bucket (N+1)/2. The scan will always maintain a pin on primary
bucket and Vacuum can take a buffer cleanup lock (cleanup lock includes
Exclusive lock on bucket and wait till all the pins on buffer becomes zero)
on primary bucket for the buffer. I think we can relax the requirement for
vacuum to take cleanup lock (instead take Exclusive Lock on buckets where no
split has happened) with the additional flag has_garbage which will be set
on primary bucket, if any tuples have been moved from that bucket, however I
think for squeeze phase (in this phase, we try to move the tuples from later
overflow pages to earlier overflow pages in the bucket and then if there are
any empty overflow pages, then we move them to kind of a free pool) of
vacuum, we need a cleanup lock, otherwise scan results might get effected.

affected, not effected.

I think this is basically correct, although I don't find it to be as
clear as I think it could be. It seems very clear that any operation
which potentially changes the order of tuples in the bucket chain,
such as the squeeze phase as currently implemented, also needs to
exclude all concurrent scans. However, I think that it's OK for
vacuum to remove tuples from a given page with only an exclusive lock
on that particular page.

How can we guarantee that it doesn't remove a tuple that is required by scan
which is started after split-in-progress flag is set?

If the tuple is being removed by VACUUM, it is dead. We can remove
dead tuples right away, because no MVCC scan will see them. In fact,
the only snapshot that will see them is SnapshotAny, and there's no
problem with removing dead tuples while a SnapshotAny scan is in
progress. It's no different than heap_page_prune() removing tuples
that a SnapshotAny sequential scan was about to see.

If the tuple is being removed because the bucket was split, it's only
a problem if the scan predates setting the split-in-progress flag.
But since your design involves out-waiting all scans currently in
progress before setting that flag, there can't be any scan in progress
that hasn't seen it. A scan that has seen the flag won't look at the
tuple in any case.

(Plain text email is preferred to HTML on this mailing list.)

If I turn to Plain text [1], then the signature of my e-mail also changes to
Plain text which don't want. Is there a way, I can retain signature
settings in Rich Text and mail content as Plain Text.

Nope, but I don't see what you are worried about. There's no HTML
content in your signature anyway except for a link, and most
mail-reading software will turn that into a hyperlink even without the
HTML.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#6)
Re: Hash Indexes

On Wed, Jun 22, 2016 at 5:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

We can do it in the way as you are suggesting, but there is another thing
which we need to consider here. As of now, the patch tries to finish the
split if it finds split-in-progress flag in either old or new bucket. We
need to lock both old and new buckets to finish the split, so it is quite
possible that two different backends try to lock them in opposite order
leading to a deadlock. I think the correct way to handle is to always try
to lock the old bucket first and then new bucket. To achieve that, if the
insertion on new bucket finds that split-in-progress flag is set on a
bucket, it needs to release the lock and then acquire the lock first on old
bucket, ensure pincount is 1 and then lock new bucket again and ensure that
pincount is 1. I have already maintained the order of locks in scan (old
bucket first and then new bucket; refer changes in _hash_first()).
Alternatively, we can try to finish the splits only when someone tries to
insert in old bucket.

Yes, I think locking buckets in increasing order is a good solution.
I also think it's fine to only try to finish the split when the insert
targets the old bucket. Finishing the split enables us to remove
tuples from the old bucket, which lets us reuse space instead of
accelerating more. So there is at least some potential benefit to the
backend inserting into the old bucket. On the other hand, a process
inserting into the new bucket derives no direct benefit from finishing
the split.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#7)
Re: Hash Indexes

On Wed, Jun 22, 2016 at 8:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 22, 2016 at 5:10 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think this is basically correct, although I don't find it to be as
clear as I think it could be. It seems very clear that any operation
which potentially changes the order of tuples in the bucket chain,
such as the squeeze phase as currently implemented, also needs to
exclude all concurrent scans. However, I think that it's OK for
vacuum to remove tuples from a given page with only an exclusive lock
on that particular page.

How can we guarantee that it doesn't remove a tuple that is required by scan
which is started after split-in-progress flag is set?

If the tuple is being removed by VACUUM, it is dead. We can remove
dead tuples right away, because no MVCC scan will see them. In fact,
the only snapshot that will see them is SnapshotAny, and there's no
problem with removing dead tuples while a SnapshotAny scan is in
progress. It's no different than heap_page_prune() removing tuples
that a SnapshotAny sequential scan was about to see.

If the tuple is being removed because the bucket was split, it's only
a problem if the scan predates setting the split-in-progress flag.
But since your design involves out-waiting all scans currently in
progress before setting that flag, there can't be any scan in progress
that hasn't seen it.

For above cases, just an exclusive lock will work.

A scan that has seen the flag won't look at the
tuple in any case.

Why so? Assume that a scan started on the new bucket while the
split-in-progress flag was set; it will not look at tuples that are
marked as moved-by-split in this bucket, as it expects to find all such
tuples in the old bucket. Now, if we allow vacuum or someone else to
remove tuples from the old bucket with just an exclusive lock, it is
quite possible that the scan misses a tuple in the old bucket which got
removed by vacuum.

(Plain text email is preferred to HTML on this mailing list.)

If I turn to Plain text [1], then the signature of my e-mail also changes to
Plain text which don't want. Is there a way, I can retain signature
settings in Rich Text and mail content as Plain Text.

Nope, but I don't see what you are worried about. There's no HTML
content in your signature anyway except for a link, and most
mail-reading software will turn that into a hyperlink even without the
HTML.

Okay, I didn't know that mail-reading software does that. Thanks for
pointing it out.

--

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#8)
Re: Hash Indexes

On Wed, Jun 22, 2016 at 8:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 22, 2016 at 5:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

We can do it in the way as you are suggesting, but there is another thing
which we need to consider here. As of now, the patch tries to finish the
split if it finds split-in-progress flag in either old or new bucket. We
need to lock both old and new buckets to finish the split, so it is quite
possible that two different backends try to lock them in opposite order
leading to a deadlock. I think the correct way to handle is to always try
to lock the old bucket first and then new bucket. To achieve that, if the
insertion on new bucket finds that split-in-progress flag is set on a
bucket, it needs to release the lock and then acquire the lock first on old
bucket, ensure pincount is 1 and then lock new bucket again and ensure that
pincount is 1. I have already maintained the order of locks in scan (old
bucket first and then new bucket; refer changes in _hash_first()).
Alternatively, we can try to finish the splits only when someone tries to
insert in old bucket.

Yes, I think locking buckets in increasing order is a good solution.

Okay.

I also think it's fine to only try to finish the split when the insert
targets the old bucket. Finishing the split enables us to remove
tuples from the old bucket, which lets us reuse space instead of
accelerating more. So there is at least some potential benefit to the
backend inserting into the old bucket. On the other hand, a process
inserting into the new bucket derives no direct benefit from finishing
the split.

Makes sense; I will change it that way and add a comment explaining why we
are doing it only for the old bucket.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#9)
Re: Hash Indexes

On Wed, Jun 22, 2016 at 10:13 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

A scan that has seen the flag won't look at the
tuple in any case.

Why so? Assume that scan started on new bucket where
split-in-progress flag was set, now it will not look at tuples that
are marked as moved-by-split in this bucket, as it will assume to find
all such tuples in old bucket. Now, if allow Vacuum or someone else
to remove tuples from old with just an Exclusive lock, it is quite
possible that scan miss the tuple in old bucket which got removed by
vacuum.

Oh, you're right. So we really need to CLEAR the split-in-progress
flag before removing any tuples from the old bucket. Does that sound
right?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#11)
Re: Hash Indexes

On Thu, Jun 23, 2016 at 10:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 22, 2016 at 10:13 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

A scan that has seen the flag won't look at the
tuple in any case.

Why so? Assume that scan started on new bucket where
split-in-progress flag was set, now it will not look at tuples that
are marked as moved-by-split in this bucket, as it will assume to find
all such tuples in old bucket. Now, if allow Vacuum or someone else
to remove tuples from old with just an Exclusive lock, it is quite
possible that scan miss the tuple in old bucket which got removed by
vacuum.

Oh, you're right. So we really need to CLEAR the split-in-progress
flag before removing any tuples from the old bucket.

I think that alone is not sufficient; we also need to out-wait any
scan that started while the flag was set and before it was cleared.
Before vacuum starts cleaning a particular bucket, we can certainly
detect whether it has to clean garbage tuples (the patch sets the
has_garbage flag in the old bucket during a split operation) and
out-wait the scans only in that case. So it could probably work like
this: during vacuum, take an exclusive lock on the bucket and check
whether the has_garbage flag is set and the split-in-progress flag is
cleared; if so, wait till the pin count on the bucket is 1; else, if
has_garbage is not set, just proceed with clearing dead tuples from
the bucket. This limits the requirement for a cleanup lock to the
cases where it is actually needed (namely when the bucket has garbage
tuples).
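
A rough sketch of that flow, using the flag-test macros from the patch and
a hypothetical helper for the pin-count wait (illustration only; variable
names are assumed):

	LockBuffer(bucket_buf, BUFFER_LOCK_EXCLUSIVE);

	if (H_HAS_GARBAGE(bucket_opaque) && !H_INCOMPLETE_SPLIT(bucket_opaque))
	{
		/*
		 * The bucket has moved-by-split garbage; out-wait any scan that
		 * started before the split finished.  WaitForPinCountOne() is a
		 * hypothetical stand-in for however that wait is implemented.
		 */
		WaitForPinCountOne(bucket_buf);		/* hypothetical */
	}

	/* now remove dead tuples (and garbage tuples, if we out-waited scans) */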

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Amit Kapila (#2)
Re: Hash Indexes

On Thu, Jun 16, 2016 at 12:58 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

I have a question regarding the code changes in _hash_first().

+        /*
+         * Conditionally get the lock on primary bucket page for search while
+         * holding lock on meta page. If we have to wait, then release the meta
+         * page lock and retry it in a hard way.
+         */
+        bucket = _hash_hashkey2bucket(hashkey,
+                                      metap->hashm_maxbucket,
+                                      metap->hashm_highmask,
+                                      metap->hashm_lowmask);
+
+        blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+        /* Fetch the primary bucket page for the bucket */
+        buf = ReadBuffer(rel, blkno);
+        if (!ConditionalLockBufferShared(buf))

Here we try to take the lock on the bucket page, but I think that, if
successful, we do not recheck whether any split happened before taking the
lock. Is this not necessary now?

Also, the below "if" is always true, as we enter here only when the outer
"if (retry)" is true.
+                        if (retry)
+                        {
+                                if (oldblkno == blkno)
+                                        break;
+                                _hash_relbuf(rel, buf);
+                        }

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

#14Amit Kapila
amit.kapila16@gmail.com
In reply to: Mithun Cy (#13)
Re: Hash Indexes

On Fri, Jun 24, 2016 at 2:38 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

On Thu, Jun 16, 2016 at 12:58 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

I have a question regarding code changes in _hash_first.

+        /*
+         * Conditionally get the lock on primary bucket page for search while
+         * holding lock on meta page. If we have to wait, then release the meta
+         * page lock and retry it in a hard way.
+         */
+        bucket = _hash_hashkey2bucket(hashkey,
+                                      metap->hashm_maxbucket,
+                                      metap->hashm_highmask,
+                                      metap->hashm_lowmask);
+
+        blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+        /* Fetch the primary bucket page for the bucket */
+        buf = ReadBuffer(rel, blkno);
+        if (!ConditionalLockBufferShared(buf))

Here we try to take lock on bucket page but I think if successful we do not
recheck whether any split happened before taking lock. Is this not necessary
now?

Yes, that is not needed now, because we do that check while holding a
read lock on the metapage, and a split requires a write lock on the
metapage. The basic idea of this optimization is that if we can get the
bucket lock immediately, we do so while holding the metapage lock;
otherwise, if we have to wait for the lock on the bucket page, we fall
back to the previous mechanism.

Also  below "if" is always true as we enter here only when outer "if
(retry)" is true.
+                        if (retry)
+                        {
+                                if (oldblkno == blkno)
+                                        break;
+                                _hash_relbuf(rel, buf);
+                        }

Good catch; I think we don't need this retry check now. We do need a
similar change in _hash_doinsert().

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#7)
Re: Hash Indexes

On Wed, Jun 22, 2016 at 8:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 22, 2016 at 5:10 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Insertion will happen by scanning the appropriate bucket and needs to
retain pin on primary bucket to ensure that concurrent split doesn't happen,
otherwise split might leave this tuple unaccounted.

What do you mean by "unaccounted"?

It means that split might leave this tuple in old bucket even if it can be
moved to new bucket. Consider a case where insertion has to add a tuple on
some intermediate overflow bucket in the bucket chain, if we allow split
when insertion is in progress, split might not move this newly inserted
tuple.

I think this is basically correct, although I don't find it to be as
clear as I think it could be. It seems very clear that any operation
which potentially changes the order of tuples in the bucket chain,
such as the squeeze phase as currently implemented, also needs to
exclude all concurrent scans. However, I think that it's OK for
vacuum to remove tuples from a given page with only an exclusive lock
on that particular page.

How can we guarantee that it doesn't remove a tuple that is required by scan
which is started after split-in-progress flag is set?

If the tuple is being removed by VACUUM, it is dead. We can remove
dead tuples right away, because no MVCC scan will see them. In fact,
the only snapshot that will see them is SnapshotAny, and there's no
problem with removing dead tuples while a SnapshotAny scan is in
progress. It's no different than heap_page_prune() removing tuples
that a SnapshotAny sequential scan was about to see.

Thinking about this case again, it seems to me that we need a cleanup
lock even for dead tuple removal. The reason is that scans that return
multiple tuples always restart from the offset number of the last
tuple they returned. Now, consider the case where the first tuple is
returned from offset number 3 on a page, and after that another
backend removes the corresponding tuple from the heap and vacuum also
removes the dead index tuple at offset 3. When the scan tries to get
the next tuple, it will start from offset 3, which can lead to
incorrect results.

Now, one way to solve the above problem could be to change scans for
hash indexes so that they work a page at a time, as we do for btree
scans (refer to BTScanPosData and the comments on top of it). This has
the additional advantage of reducing lock/unlock calls for retrieving
tuples from a page. However, I think this solution can only work for
MVCC scans. For non-MVCC scans there is still a problem, because after
fetching all the tuples from a page, when the scan checks the validity
of the tuples in the heap, we won't be able to detect that an old
tuple was deleted and a new tuple has been placed at that location in
the heap.
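
For illustration, a hypothetical hash-side analogue of BTScanPosData (not
part of the posted patch) might look like the sketch below: the scan would
copy all matching items from a page into items[] while holding the lock
once, then hand them out one by one without relocking the page.

typedef struct HashScanPosItem
{
	ItemPointerData heapTid;		/* TID of referenced heap item */
	OffsetNumber	indexOffset;	/* index item's location within page */
} HashScanPosItem;

typedef struct HashScanPosData
{
	Buffer		buf;			/* page we've copied items from, if any */
	int			firstItem;		/* first valid index in items[] */
	int			lastItem;		/* last valid index in items[] */
	int			itemIndex;		/* current index in items[] */

	HashScanPosItem items[MaxIndexTuplesPerPage];	/* matching tuples */
} HashScanPosData;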

I think what we can do to solve this for non-MVCC scans is to have
vacuum always take a cleanup lock on a bucket, while MVCC scans
release both the lock and the pin as they proceed. Non-MVCC scans and
scans that started while a split was in progress will release the
lock, but not the pin, on the primary bucket. This way, we can allow
vacuum to proceed even if there is an MVCC scan going on in a bucket,
as long as it did not start during a bucket split operation. The btree
code does something similar: vacuum always takes a cleanup lock and a
non-MVCC scan retains a pin.

Insertions should work as they currently do in the patch, that is,
they always need to retain a pin on the primary bucket to avoid the
concurrent split problem mentioned above (refer to the discussion in
the first paragraph of this mail).

Thoughts?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#16Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#8)
1 attachment(s)
Re: Hash Indexes

On Wed, Jun 22, 2016 at 8:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 22, 2016 at 5:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

We can do it the way you are suggesting, but there is another thing we
need to consider here.  As of now, the patch tries to finish the split if
it finds the split-in-progress flag set in either the old or the new
bucket.  We need to lock both the old and the new bucket to finish the
split, so it is quite possible for two different backends to try to lock
them in opposite order, leading to a deadlock.  I think the correct way to
handle this is to always lock the old bucket first and then the new
bucket.  To achieve that, if an insertion into the new bucket finds the
split-in-progress flag set, it needs to release its lock, acquire the lock
on the old bucket first, ensure its pin count is 1, and then lock the new
bucket again and ensure that its pin count is 1.  I have already maintained
this lock order in scans (old bucket first and then new bucket; refer to
the changes in _hash_first()).
Alternatively, we can try to finish the splits only when someone tries to
insert into the old bucket.

Yes, I think locking buckets in increasing order is a good solution.
I also think it's fine to only try to finish the split when the insert
targets the old bucket.  Finishing the split enables us to remove
tuples from the old bucket, which lets us reuse space instead of
accelerating further growth of the index.  So there is at least some
potential benefit to the backend inserting into the old bucket.  On
the other hand, a process inserting into the new bucket derives no
direct benefit from finishing the split.
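
A minimal sketch of that ordering rule in C (the helper name is
hypothetical; the attached patch gets the same effect by only ever
initiating the finish-split from the old bucket side and conditionally
locking the new bucket):

#include "postgres.h"
#include "access/hash.h"
#include "storage/bufmgr.h"

/*
 * Hypothetical helper: whenever both halves of a split must be locked,
 * always take the old bucket's primary page first and the new bucket's
 * page second, so two backends can never block each other in opposite
 * order.  The old bucket has the lower bucket number and hence the
 * lower block number.
 */
static void
hash_lock_split_buckets(Relation rel,
						BlockNumber old_blkno, BlockNumber new_blkno,
						Buffer *old_buf, Buffer *new_buf)
{
	Assert(old_blkno < new_blkno);

	*old_buf = _hash_getbuf(rel, old_blkno, HASH_WRITE, LH_BUCKET_PAGE);
	*new_buf = _hash_getbuf(rel, new_blkno, HASH_WRITE, LH_BUCKET_PAGE);
}

A backend that finds itself holding only the new bucket would first
release it and then reacquire both in this order, as discussed above.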

Okay, following this suggestion, I have updated the patch so that only
an insertion into the old bucket tries to finish the split.  Apart from
that, I have fixed the issue reported by Mithun upthread and updated
the README to explain the locking used in the patch.  I have also
changed the locking around vacuum, so that it can work with concurrent
scans whenever possible.  In the previous patch version, vacuum took a
cleanup lock on a bucket to remove dead tuples and moved-due-to-split
tuples and to perform the squeeze operation, and it held that lock on
the bucket until the end of cleanup.  Now it still takes a cleanup lock
on the bucket to out-wait scans, but it releases the lock as it
proceeds to clean the overflow pages.  The idea is to first lock the
next bucket page and only then release the lock on the current bucket
page.  This ensures that any concurrent scan started after we begin
cleaning the bucket will always stay behind the cleanup.  Allowing
scans to overtake vacuum would allow it to remove tuples that are
required for the sanctity of those scans.  Also, for the squeeze phase
we just check whether the buffer's pin count is one (we already hold an
exclusive lock on the bucket's buffer by that time); only then do we
proceed, otherwise we try to squeeze the next time cleanup is required
for that bucket.
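
In other words, the cleanup pass uses lock coupling along the bucket
chain.  Below is a condensed sketch of that walk, following the shape of
hashbucketcleanup() in the attached patch (the actual tuple-removal work
is elided):

#include "postgres.h"
#include "access/hash.h"
#include "storage/bufmgr.h"

/*
 * Condensed sketch: the next overflow page is locked before the current
 * page is released, so any scan that starts after cleanup has begun can
 * never overtake it.  The pin on the primary bucket page is kept
 * throughout so that no new split can start.
 */
static void
hash_cleanup_chain_walk(Relation rel, Buffer bucket_buf,
						BufferAccessStrategy bstrategy)
{
	Buffer		buf = bucket_buf;	/* caller holds a cleanup lock on this */

	for (;;)
	{
		Page		page = BufferGetPage(buf);
		HashPageOpaque opaque = (HashPageOpaque) PageGetSpecialPointer(page);
		BlockNumber nextblkno = opaque->hasho_nextblkno;
		Buffer		next_buf;

		/* ... remove dead and moved-by-split tuples from this page ... */

		if (!BlockNumberIsValid(nextblkno))
			break;

		/* lock coupling: take the lock on the next overflow page first ... */
		next_buf = _hash_getbuf_with_strategy(rel, nextblkno, HASH_WRITE,
											  LH_OVERFLOW_PAGE, bstrategy);

		/* ... then let go of the current one (keep the primary page's pin) */
		if (buf == bucket_buf)
			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
		else
			_hash_relbuf(rel, buf);

		buf = next_buf;
	}

	/*
	 * The last page in the chain is still locked here; as in the patch,
	 * the caller releases it, re-locks the primary bucket page, and then
	 * decides whether the squeeze phase can run.
	 */
}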

Thoughts/Suggestions?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

concurrent_hash_index_v3.patch (application/octet-stream)
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index c1122b4..d02539a 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -400,7 +400,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
 	Buffer		buf;
 	Page		page;
 
-	_hash_getlock(rel, blkno, HASH_SHARE);
 	buf = _hash_getbuf_with_strategy(rel, blkno, HASH_READ, 0, bstrategy);
 	page = BufferGetPage(buf);
 
@@ -431,7 +430,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
 	}
 
 	_hash_relbuf(rel, buf);
-	_hash_droplock(rel, blkno, HASH_SHARE);
 }
 
 /*
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 0a7da89..a0feb2f 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -125,49 +125,45 @@ the initially created buckets.
 
 Lock Definitions
 ----------------
-
-We use both lmgr locks ("heavyweight" locks) and buffer context locks
-(LWLocks) to control access to a hash index.  lmgr locks are needed for
-long-term locking since there is a (small) risk of deadlock, which we must
-be able to detect.  Buffer context locks are used for short-term access
-control to individual pages of the index.
-
-LockPage(rel, page), where page is the page number of a hash bucket page,
-represents the right to split or compact an individual bucket.  A process
-splitting a bucket must exclusive-lock both old and new halves of the
-bucket until it is done.  A process doing VACUUM must exclusive-lock the
-bucket it is currently purging tuples from.  Processes doing scans or
-insertions must share-lock the bucket they are scanning or inserting into.
-(It is okay to allow concurrent scans and insertions.)
-
-The lmgr lock IDs corresponding to overflow pages are currently unused.
-These are available for possible future refinements.  LockPage(rel, 0)
-is also currently undefined (it was previously used to represent the right
-to modify the hash-code-to-bucket mapping, but it is no longer needed for
-that purpose).
-
-Note that these lock definitions are conceptually distinct from any sort
-of lock on the pages whose numbers they share.  A process must also obtain
-read or write buffer lock on the metapage or bucket page before accessing
-said page.
-
-Processes performing hash index scans must hold share lock on the bucket
-they are scanning throughout the scan.  This seems to be essential, since
-there is no reasonable way for a scan to cope with its bucket being split
-underneath it.  This creates a possibility of deadlock external to the
-hash index code, since a process holding one of these locks could block
-waiting for an unrelated lock held by another process.  If that process
-then does something that requires exclusive lock on the bucket, we have
-deadlock.  Therefore the bucket locks must be lmgr locks so that deadlock
-can be detected and recovered from.
-
-Processes must obtain read (share) buffer context lock on any hash index
-page while reading it, and write (exclusive) lock while modifying it.
-To prevent deadlock we enforce these coding rules: no buffer lock may be
-held long term (across index AM calls), nor may any buffer lock be held
-while waiting for an lmgr lock, nor may more than one buffer lock
-be held at a time by any one process.  (The third restriction is probably
-stronger than necessary, but it makes the proof of no deadlock obvious.)
+We use buffer content locks (LWLocks) and buffer pins to control access to
+a hash index.
+
+A scan will take a shared content lock on the primary or overflow page it is
+reading.  An insert will acquire an exclusive lock on the page in which it has
+to insert.  Both operations release the lock on the previous page before moving
+to the next overflow page.  They retain a pin on the primary bucket page till end of operation.
+Split operation must acquire cleanup lock on both old and new halves of the
+bucket and mark split-in-progress on both the buckets.  The cleanup lock at
+the start of split ensures that a parallel insert won't get lost.  Consider a
+case where an insertion has to add a tuple to some intermediate overflow page
+in the bucket chain; if we allowed a split while the insertion is in progress,
+the split might not move this newly inserted tuple.  The split releases the lock
+on the previous page before moving to the next overflow page, both for the
+old and the new bucket.  After partitioning the tuples between the buckets, it
+again needs to acquire exclusive lock on both old and new buckets to clear
+the split-in-progress flag.  Like inserts and scans, it will also retain pins
+on both the old and new primary buckets till end of split operation, although
+we can do without that as well.
+
+Vacuum acquires a cleanup lock on the bucket to remove dead tuples and/or tuples
+that were moved due to a split.  The need for a cleanup lock to remove dead tuples
+is to ensure that scans return correct results.  A scan that returns multiple
+tuples from the same bucket page always restarts from the previous
+offset number at which it returned the last tuple.  If we allow vacuum to
+remove the dead tuples with just an exclusive lock, it could remove the tuple
+required to resume the scan.  The need for cleanup lock to remove the tuples
+that are moved by split is to ensure that there is no pending scan that has
+started after the start of split and before the finish of split on bucket.
+If we don't do that, then vacuum can remove tuples that are required by such
+a scan.  We don't need to retain this cleanup lock during whole vacuum
+operation on the bucket.  We release the lock as we move ahead in the bucket
+chain.  In the end, for the squeeze phase, we conditionally acquire a cleanup
+lock and if we don't get it, then we just abandon the squeeze phase.
+
+To avoid deadlocks, we must be consistent about the lock order in which we
+lock the buckets for operations that require locks on two different buckets.
+We use the rule "first lock the old bucket and then the new bucket, i.e.
+lock the lower-numbered bucket first".
 
 
 Pseudocode Algorithms
@@ -188,63 +184,105 @@ track of available overflow pages.
 The reader algorithm is:
 
 	pin meta page and take buffer content lock in shared mode
-	loop:
-		compute bucket number for target hash key
-		release meta page buffer content lock
-		if (correct bucket page is already locked)
-			break
-		release any existing bucket page lock (if a concurrent split happened)
-		take heavyweight bucket lock
-		retake meta page buffer content lock in shared mode
+	compute bucket number for target hash key
+	read and pin the primary bucket page
+	conditionally get the buffer content lock in shared mode on primary bucket page for search
+	if we didn't get the lock (need to wait for lock)
+		release the buffer content lock on meta page
+		acquire buffer content lock on primary bucket page in shared mode
+		acquire the buffer content lock in shared mode on meta page
+		to check for possibility of split, we need to recompute the bucket and
+		verify, if it is a correct bucket; set the retry flag
+	else if we get the lock, then we can skip the retry path
+	if (retry)
+		loop:
+			compute bucket number for target hash key
+			release meta page buffer content lock
+			if (correct bucket page is already locked)
+				break
+			release any existing content lock on bucket page (if a concurrent split happened)
+			pin primary bucket page and take shared buffer content lock
+			retake meta page buffer content lock in shared mode
 -- then, per read request:
 	release pin on metapage
-	read current page of bucket and take shared buffer content lock
-		step to next page if necessary (no chaining of locks)
+	if the split is in progress for current bucket and this is a new bucket
+		release the buffer content lock on current bucket page
+		pin and acquire the buffer content lock on old bucket in shared mode
+		release the buffer content lock on old bucket, but not pin
+		retake the buffer content lock on new bucket
+		mark the scan such that it skips the tuples that are marked as moved by split
+	step to next page if necessary (no chaining of locks)
+		if the scan indicates moved by split, then move to old bucket after the scan
+		of current bucket is finished
 	get tuple
 	release buffer content lock and pin on current page
 -- at scan shutdown:
-	release bucket share-lock
-
-We can't hold the metapage lock while acquiring a lock on the target bucket,
-because that might result in an undetected deadlock (lwlocks do not participate
-in deadlock detection).  Instead, we relock the metapage after acquiring the
-bucket page lock and check whether the bucket has been split.  If not, we're
-done.  If so, we release our previously-acquired lock and repeat the process
-using the new bucket number.  Holding the bucket sharelock for
+	release any pin we hold on current buffer, old bucket buffer, new bucket buffer
+
+We don't want to hold the meta page lock if we have to wait for acquiring the
+content lock on bucket page, because that might result in poor concurrency.
+Instead, we relock the metapage after acquiring the bucket page content lock
+and check whether the bucket has been split.  If not, we're done.  If so, we
+release our previously-acquired content lock, but not pin and repeat the
+process using the new bucket number.  Holding the buffer pin on bucket page for
 the remainder of the scan prevents the reader's current-tuple pointer from
-being invalidated by splits or compactions.  Notice that the reader's lock
+being invalidated by splits or compactions.  Notice that the reader's pin
 does not prevent other buckets from being split or compacted.
 
 To keep concurrency reasonably good, we require readers to cope with
 concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the page sharelock.  Since
-deletion is not possible while a reader holds the bucket sharelock, and
-we assume that heap tuple TIDs are unique, this can be implemented by
+their current scan position after re-acquiring the buffer content lock on
+page.  Since deletion is not possible while a reader holds the pin on bucket,
+and we assume that heap tuple TIDs are unique, this can be implemented by
 searching for the same heap tuple TID previously returned.  Insertion does
 not move index entries across pages, so the previously-returned index entry
 should always be on the same page, at the same or higher offset number,
 as it was before.
 
+To allow scans during a bucket split, if at the start of the scan the bucket is
+marked as split-in-progress, the scan reads all the tuples in that bucket except
+those that are marked as moved-by-split.  Once it finishes the scan of all the
+tuples in the current bucket, it scans the old bucket from which this bucket
+was formed by the split.  This happens only for the new half of the split.
+
 The insertion algorithm is rather similar:
 
 	pin meta page and take buffer content lock in shared mode
-	loop:
-		compute bucket number for target hash key
-		release meta page buffer content lock
-		if (correct bucket page is already locked)
-			break
-		release any existing bucket page lock (if a concurrent split happened)
-		take heavyweight bucket lock in shared mode
-		retake meta page buffer content lock in shared mode
--- (so far same as reader)
+	compute bucket number for target hash key
+	read and pin the primary bucket page
+	conditionally get the buffer content lock in exclusive mode on primary bucket page for search
+	if we didn't get the lock (need to wait for lock)
+		release the buffer content lock on meta page
+		acquire buffer content lock on primary bucket page in exclusive mode
+		acquire the buffer content lock in shared mode on meta page
+		to check for possibility of split, we need to recompute the bucket and
+		verify, if it is a correct bucket; set the retry flag
+	else if we get the lock, then we can skip the retry path
+	if (retry)
+		loop:
+			compute bucket number for target hash key
+			release meta page buffer content lock
+			if (correct bucket page is already locked)
+				break
+			release any existing content lock on bucket page (if a concurrent split happened)
+			pin primary bucket page and take exclusive buffer content lock
+			retake meta page buffer content lock in shared mode
+-- (so far same as reader, except for acquisition of buffer content lock in
+	exclusive mode on primary bucket page)
 	release pin on metapage
-	pin current page of bucket and take exclusive buffer content lock
-	if full, release, read/exclusive-lock next page; repeat as needed
+	if the split-in-progress flag is set for bucket in old half of split
+	and pin count on it is one, then finish the split
+		we already have a buffer content lock on old bucket, conditionally get the content lock on new bucket
+		if get the lock on new bucket
+			finish the split using algorithm mentioned below for split
+			release the buffer content lock and pin on new bucket
+	if full, release lock but not pin, read/exclusive-lock next page; repeat as needed
 	>> see below if no space in any page of bucket
 	insert tuple at appropriate place in page
 	mark current page dirty and release buffer content lock and pin
+	if current page is not a bucket page, release the pin on bucket page
 	release heavyweight share-lock
-	pin meta page and take buffer content lock in shared mode
+	pin meta page and take buffer content lock in exclusive mode
 	increment tuple count, decide if split needed
 	mark meta page dirty and release buffer content lock and pin
 	done if no split needed, else enter Split algorithm below
@@ -256,11 +294,13 @@ bucket that is being actively scanned, because readers can cope with this
 as explained above.  We only need the short-term buffer locks to ensure
 that readers do not see a partially-updated page.
 
-It is clearly impossible for readers and inserters to deadlock, and in
-fact this algorithm allows them a very high degree of concurrency.
-(The exclusive metapage lock taken to update the tuple count is stronger
-than necessary, since readers do not care about the tuple count, but the
-lock is held for such a short time that this is probably not an issue.)
+To avoid deadlock between readers and inserters, whenever there is a need
+to lock multiple buckets, we always take in the order suggested in Locking
+Definitions above.  This algorithm allows them a very high degree of
+concurrency.  (The exclusive metapage lock taken to update the tuple count
+is stronger than necessary, since readers do not care about the tuple count,
+but the lock is held for such a short time that this is probably not an
+issue.)
 
 When an inserter cannot find space in any existing page of a bucket, it
 must obtain an overflow page and add that page to the bucket's chain.
@@ -271,46 +311,79 @@ index is overfull (has a higher-than-wanted ratio of tuples to buckets).
 The algorithm attempts, but does not necessarily succeed, to split one
 existing bucket in two, thereby lowering the fill ratio:
 
-	pin meta page and take buffer content lock in exclusive mode
-	check split still needed
-	if split not needed anymore, drop buffer content lock and pin and exit
-	decide which bucket to split
-	Attempt to X-lock old bucket number (definitely could fail)
-	Attempt to X-lock new bucket number (shouldn't fail, but...)
-	if above fail, drop locks and pin and exit
+	expand:
+		take buffer content lock in exclusive mode on meta page
+		check split still needed
+		if split not needed anymore, drop buffer content lock and exit
+		decide which bucket to split
+		Attempt to acquire cleanup lock on old bucket number (definitely could fail)
+		if above fail, release lock and pin and exit
+		if the split-in-progress flag is set, then finish the split
+			conditionally get the content lock on new bucket which was involved in split
+			if got the lock on new bucket
+				finish the split using algorithm mentioned below for split
+				release the buffer content lock and pin on old and new buckets
+				try to expand from start
+			else
+				release the buffer content lock and pin on old bucket and exit
+		if the garbage flag (indicates that tuples are moved by split) is set on bucket
+			release the buffer content lock on meta page
+			remove the tuples that don't belong to this bucket; see bucket cleanup below
+	Attempt to acquire cleanup lock on new bucket number (shouldn't fail, but...)
 	update meta page to reflect new number of buckets
-	mark meta page dirty and release buffer content lock and pin
+	mark meta page dirty and release buffer content lock
 	-- now, accesses to all other buckets can proceed.
 	Perform actual split of bucket, moving tuples as needed
 	>> see below about acquiring needed extra space
 	Release X-locks of old and new buckets
 
+	split guts
+	mark the old and new buckets indicating split-in-progress
+	mark the old bucket indicating has-garbage
+	copy the tuples that belongs to new bucket from old bucket
+	during copy mark such tuples as move-by-split
+	release lock but not pin for primary bucket page of old bucket,
+	read/shared-lock next page; repeat as needed
+	>> see below if no space in bucket page of new bucket
+	ensure to have exclusive-lock on both old and new buckets in that order
+	clear the split-in-progress flag from both the buckets
+	mark buffers dirty and release the locks and pins on both old and new buckets
+
 Note the metapage lock is not held while the actual tuple rearrangement is
 performed, so accesses to other buckets can proceed in parallel; in fact,
 it's possible for multiple bucket splits to proceed in parallel.
 
-Split's attempt to X-lock the old bucket number could fail if another
-process holds S-lock on it.  We do not want to wait if that happens, first
-because we don't want to wait while holding the metapage exclusive-lock,
-and second because it could very easily result in deadlock.  (The other
-process might be out of the hash AM altogether, and could do something
-that blocks on another lock this process holds; so even if the hash
-algorithm itself is deadlock-free, a user-induced deadlock could occur.)
-So, this is a conditional LockAcquire operation, and if it fails we just
-abandon the attempt to split.  This is all right since the index is
-overfull but perfectly functional.  Every subsequent inserter will try to
-split, and eventually one will succeed.  If multiple inserters failed to
-split, the index might still be overfull, but eventually, the index will
+Split's attempt to acquire cleanup-lock on the old bucket number could fail
+if another process holds any lock or pin on it.  We do not want to wait if
+that happens, because we don't want to wait while holding the metapage
+exclusive-lock.  So, this is a conditional LWLockAcquire operation, and if
+it fails we just abandon the attempt to split.  This is all right since the
+index is overfull but perfectly functional.  Every subsequent inserter will
+try to split, and eventually one will succeed.  If multiple inserters failed
+to split, the index might still be overfull, but eventually, the index will
 not be overfull and split attempts will stop.  (We could make a successful
 splitter loop to see if the index is still overfull, but it seems better to
 distribute the split overhead across successive insertions.)
 
+The has_garbage flag indicates that the bucket contains tuples that were moved
+due to a split.  It is set only on the old bucket.  The reason we need it in
+addition to the split-in-progress flag is to recognize this state after the
+split is over (i.e. after the split-in-progress flag has been cleared).  It is
+used both by vacuum and by the re-split operation.  Vacuum uses it to decide
+whether it needs to remove the moved-by-split tuples from the bucket along with
+the dead tuples.  Re-split uses it to ensure that it doesn't start a new split
+from a bucket without first clearing the previously moved tuples from the old
+bucket.  The usage by re-split helps to keep bloat under control and makes the
+design somewhat simpler, as we never have to handle the situation where a
+bucket contains dead tuples from multiple splits.
+
 A problem is that if a split fails partway through (eg due to insufficient
-disk space) the index is left corrupt.  The probability of that could be
-made quite low if we grab a free page or two before we update the meta
-page, but the only real solution is to treat a split as a WAL-loggable,
+disk space or crash) the index is left corrupt.  The probability of that
+could be made quite low if we grab a free page or two before we update the
+meta page, but the only real solution is to treat a split as a WAL-loggable,
 must-complete action.  I'm not planning to teach hash about WAL in this
-go-round.
+go-round.  However, we do try to finish the incomplete splits during insert
+and split.
 
 The fourth operation is garbage collection (bulk deletion):
 
@@ -319,9 +392,13 @@ The fourth operation is garbage collection (bulk deletion):
 	fetch current max bucket number
 	release meta page buffer content lock and pin
 	while next bucket <= max bucket do
-		Acquire X lock on target bucket
-		Scan and remove tuples, compact free space as needed
-		Release X lock
+		Acquire cleanup lock on target bucket
+		Scan and remove tuples
+		For overflow buckets, first we need to lock the next bucket and then
+		release the lock on current bucket
+		Ensure to have X lock on bucket page
+		If buffer pincount is one, then compact free space as needed
+		Release lock
 		next bucket ++
 	end loop
 	pin metapage and take buffer content lock in exclusive mode
@@ -330,20 +407,23 @@ The fourth operation is garbage collection (bulk deletion):
 	else update metapage tuple count
 	mark meta page dirty and release buffer content lock and pin
 
-Note that this is designed to allow concurrent splits.  If a split occurs,
-tuples relocated into the new bucket will be visited twice by the scan,
-but that does no harm.  (We must however be careful about the statistics
-reported by the VACUUM operation.  What we can do is count the number of
-tuples scanned, and believe this in preference to the stored tuple count
-if the stored tuple count and number of buckets did *not* change at any
-time during the scan.  This provides a way of correcting the stored tuple
-count if it gets out of sync for some reason.  But if a split or insertion
-does occur concurrently, the scan count is untrustworthy; instead,
-subtract the number of tuples deleted from the stored tuple count and
-use that.)
-
-The exclusive lock request could deadlock in some strange scenarios, but
-we can just error out without any great harm being done.
+Note that this is designed to allow concurrent splits and scans.  If a
+split occurs, tuples relocated into the new bucket will be visited twice
+by the scan, but that does no harm.  Because we release the locks during the
+scan of a bucket, a concurrent scan can start on that bucket, and the protocol
+ensures that such a scan always stays behind the cleanup.  Scans must be kept
+behind cleanup, else vacuum could remove tuples that are required to
+complete the scan, as explained in the Lock Definitions section above.  This holds
+true for backward scans as well (backward scans first traverse each bucket
+starting from first bucket to last overflow bucket in the chain).
+We must be careful about the statistics reported by the VACUUM operation.
+What we can do is count the number of tuples scanned, and believe this in
+preference to the stored tuple count if the stored tuple count and number
+of buckets did *not* change at any time during the scan.  This provides a
+way of correcting the stored tuple count if it gets out of sync for some
+reason.  But if a split or insertion does occur concurrently, the scan
+count is untrustworthy; instead, subtract the number of tuples deleted
+from the stored tuple count and use that.
 
 
 Free Space Management
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 19695ee..5552f2d 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -271,10 +271,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		/*
 		 * An insertion into the current index page could have happened while
 		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold share lock on
-		 * the bucket, no deletions or splits could have occurred; therefore
-		 * we can expect that the TID still exists in the current index page,
-		 * at an offset >= where we were.)
+		 * for the TID we previously returned.  (Because we hold pin on the
+		 * bucket, no deletions or splits could have occurred; therefore we
+		 * can expect that the TID still exists in the current index page, at
+		 * an offset >= where we were.)
 		 */
 		OffsetNumber maxoffnum;
 
@@ -409,12 +409,15 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
 	so->hashso_bucket_valid = false;
-	so->hashso_bucket_blkno = 0;
 	so->hashso_curbuf = InvalidBuffer;
+	so->hashso_bucket_buf = InvalidBuffer;
+	so->hashso_old_bucket_buf = InvalidBuffer;
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
 	ItemPointerSetInvalid(&(so->hashso_heappos));
 
+	so->hashso_skip_moved_tuples = false;
+
 	scan->opaque = so;
 
 	/* register scan in case we change pages it's using */
@@ -438,10 +441,15 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 		_hash_dropbuf(rel, so->hashso_curbuf);
 	so->hashso_curbuf = InvalidBuffer;
 
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	/* release pin we hold on old primary bucket */
+	if (BufferIsValid(so->hashso_old_bucket_buf))
+		_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+	so->hashso_old_bucket_buf = InvalidBuffer;
+
+	/* release pin we hold on primary bucket */
+	if (BufferIsValid(so->hashso_bucket_buf))
+		_hash_dropbuf(rel, so->hashso_bucket_buf);
+	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
@@ -455,6 +463,8 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 				scan->numberOfKeys * sizeof(ScanKeyData));
 		so->hashso_bucket_valid = false;
 	}
+
+	so->hashso_skip_moved_tuples = false;
 }
 
 /*
@@ -474,10 +484,15 @@ hashendscan(IndexScanDesc scan)
 		_hash_dropbuf(rel, so->hashso_curbuf);
 	so->hashso_curbuf = InvalidBuffer;
 
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	/* release pin we hold on old primary bucket */
+	if (BufferIsValid(so->hashso_old_bucket_buf))
+		_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+	so->hashso_old_bucket_buf = InvalidBuffer;
+
+	/* release pin we hold on primary bucket */
+	if (BufferIsValid(so->hashso_bucket_buf))
+		_hash_dropbuf(rel, so->hashso_bucket_buf);
+	so->hashso_bucket_buf = InvalidBuffer;
 
 	pfree(so);
 	scan->opaque = NULL;
@@ -488,6 +503,9 @@ hashendscan(IndexScanDesc scan)
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
  *
+ * This function also deletes the tuples that were moved by a split to the
+ * other bucket.
+ *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 IndexBulkDeleteResult *
@@ -532,83 +550,52 @@ loop_top:
 	{
 		BlockNumber bucket_blkno;
 		BlockNumber blkno;
-		bool		bucket_dirty = false;
+		Buffer		bucket_buf;
+		Buffer		buf;
+		HashPageOpaque bucket_opaque;
+		Page		page;
+		bool		bucket_has_garbage = false;
 
 		/* Get address of bucket's start page */
 		bucket_blkno = BUCKET_TO_BLKNO(&local_metapage, cur_bucket);
 
-		/* Exclusive-lock the bucket so we can shrink it */
-		_hash_getlock(rel, bucket_blkno, HASH_EXCLUSIVE);
-
 		/* Shouldn't have any active scans locally, either */
 		if (_hash_has_active_scan(rel, cur_bucket))
 			elog(ERROR, "hash index has active scan during VACUUM");
 
-		/* Scan each page in bucket */
 		blkno = bucket_blkno;
-		while (BlockNumberIsValid(blkno))
-		{
-			Buffer		buf;
-			Page		page;
-			HashPageOpaque opaque;
-			OffsetNumber offno;
-			OffsetNumber maxoffno;
-			OffsetNumber deletable[MaxOffsetNumber];
-			int			ndeletable = 0;
-
-			vacuum_delay_point();
 
-			buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-											 info->strategy);
-			page = BufferGetPage(buf);
-			opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-			Assert(opaque->hasho_bucket == cur_bucket);
-
-			/* Scan each tuple in page */
-			maxoffno = PageGetMaxOffsetNumber(page);
-			for (offno = FirstOffsetNumber;
-				 offno <= maxoffno;
-				 offno = OffsetNumberNext(offno))
-			{
-				IndexTuple	itup;
-				ItemPointer htup;
+		/*
+		 * We need to acquire a cleanup lock on the primary bucket to out wait
+		 * concurrent scans.
+		 */
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy);
+		LockBufferForCleanup(buf);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
 
-				itup = (IndexTuple) PageGetItem(page,
-												PageGetItemId(page, offno));
-				htup = &(itup->t_tid);
-				if (callback(htup, callback_state))
-				{
-					/* mark the item for deletion */
-					deletable[ndeletable++] = offno;
-					tuples_removed += 1;
-				}
-				else
-					num_index_tuples += 1;
-			}
+		page = BufferGetPage(buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
-			/*
-			 * Apply deletions and write page if needed, advance to next page.
-			 */
-			blkno = opaque->hasho_nextblkno;
+		/*
+		 * If the bucket contains tuples that are moved by split, then we need
+		 * to delete such tuples on completion of split.  Before cleaning, we
+		 * need to out-wait the scans that have started when the split was in
+		 * progress for a bucket.
+		 */
+		if (H_HAS_GARBAGE(bucket_opaque) &&
+			!H_INCOMPLETE_SPLIT(bucket_opaque))
+			bucket_has_garbage = true;
 
-			if (ndeletable > 0)
-			{
-				PageIndexMultiDelete(page, deletable, ndeletable);
-				_hash_wrtbuf(rel, buf);
-				bucket_dirty = true;
-			}
-			else
-				_hash_relbuf(rel, buf);
-		}
+		bucket_buf = buf;
 
-		/* If we deleted anything, try to compact free space */
-		if (bucket_dirty)
-			_hash_squeezebucket(rel, cur_bucket, bucket_blkno,
-								info->strategy);
+		hashbucketcleanup(rel, bucket_buf, blkno, info->strategy,
+						  local_metapage.hashm_maxbucket,
+						  local_metapage.hashm_highmask,
+						  local_metapage.hashm_lowmask, &tuples_removed,
+						  &num_index_tuples, bucket_has_garbage, true,
+						  callback, callback_state);
 
-		/* Release bucket lock */
-		_hash_droplock(rel, bucket_blkno, HASH_EXCLUSIVE);
+		_hash_relbuf(rel, bucket_buf);
 
 		/* Advance to next bucket */
 		cur_bucket++;
@@ -689,6 +676,197 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	return stats;
 }
 
+/*
+ * Helper function to perform deletion of index entries from a bucket.
+ *
+ * This expects that the caller has acquired a cleanup lock on the target
+ * bucket (primary page of a bucket) and it is the responsibility of the caller to
+ * release that lock.
+ *
+ * During scan of overflow buckets, first we need to lock the next bucket and
+ * then release the lock on current bucket.  This ensures that any concurrent
+ * scan started after we start cleaning the bucket will always be behind the
+ * cleanup.  Allowing scans to overtake vacuum would allow it to remove tuples
+ * that are required for the sanctity of the scan.
+ *
+ * We need to retain a pin on the primary bucket to ensure that no concurrent
+ * split can start.
+ */
+void
+hashbucketcleanup(Relation rel, Buffer bucket_buf,
+				  BlockNumber bucket_blkno,
+				  BufferAccessStrategy bstrategy,
+				  uint32 maxbucket,
+				  uint32 highmask, uint32 lowmask,
+				  double *tuples_removed,
+				  double *num_index_tuples,
+				  bool bucket_has_garbage,
+				  bool delay,
+				  IndexBulkDeleteCallback callback,
+				  void *callback_state)
+{
+	BlockNumber blkno;
+	Buffer		buf;
+	Bucket		cur_bucket;
+	Bucket new_bucket PG_USED_FOR_ASSERTS_ONLY;
+	Page		page;
+	bool		bucket_dirty = false;
+
+	blkno = bucket_blkno;
+	buf = bucket_buf;
+	page = BufferGetPage(buf);
+	cur_bucket = ((HashPageOpaque) PageGetSpecialPointer(page))->hasho_bucket;
+
+	if (bucket_has_garbage)
+		new_bucket = _hash_get_newbucket(rel, cur_bucket,
+										 lowmask, maxbucket);
+
+	/* Scan each page in bucket */
+	for (;;)
+	{
+		HashPageOpaque opaque;
+		OffsetNumber offno;
+		OffsetNumber maxoffno;
+		Buffer		next_buf;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+		bool		retain_pin = false;
+		bool		curr_page_dirty = false;
+
+		if (delay)
+			vacuum_delay_point();
+
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		/* Scan each tuple in page */
+		maxoffno = PageGetMaxOffsetNumber(page);
+		for (offno = FirstOffsetNumber;
+			 offno <= maxoffno;
+			 offno = OffsetNumberNext(offno))
+		{
+			IndexTuple	itup;
+			ItemPointer htup;
+			Bucket		bucket;
+
+			itup = (IndexTuple) PageGetItem(page,
+											PageGetItemId(page, offno));
+			htup = &(itup->t_tid);
+			if (callback && callback(htup, callback_state))
+			{
+				/* mark the item for deletion */
+				deletable[ndeletable++] = offno;
+				if (tuples_removed)
+					*tuples_removed += 1;
+			}
+			else if (bucket_has_garbage)
+			{
+				/* delete the tuples that are moved by split. */
+				bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+											  maxbucket,
+											  highmask,
+											  lowmask);
+				/* mark the item for deletion */
+				if (bucket != cur_bucket)
+				{
+					/*
+					 * We expect tuples to either belong to current bucket or
+					 * new_bucket.  This is ensured because we don't allow
+					 * further splits from bucket that contains garbage. See
+					 * comments in _hash_expandtable.
+					 */
+					Assert(bucket == new_bucket);
+					deletable[ndeletable++] = offno;
+				}
+				else if (num_index_tuples)
+					*num_index_tuples += 1;
+			}
+			else if (num_index_tuples)
+				*num_index_tuples += 1;
+		}
+
+		/* retain the pin on primary bucket till end of bucket scan */
+		if (blkno == bucket_blkno)
+			retain_pin = true;
+		else
+			retain_pin = false;
+
+		blkno = opaque->hasho_nextblkno;
+
+		/*
+		 * Apply deletions and write page if needed, advance to next page.
+		 */
+		if (ndeletable > 0)
+		{
+			PageIndexMultiDelete(page, deletable, ndeletable);
+			bucket_dirty = true;
+			curr_page_dirty = true;
+		}
+
+		/* bail out if there are no more pages to scan. */
+		if (!BlockNumberIsValid(blkno))
+			break;
+
+		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											  LH_OVERFLOW_PAGE,
+											  bstrategy);
+
+		/*
+		 * release the lock on previous page after acquiring the lock on next
+		 * page
+		 */
+		if (curr_page_dirty)
+		{
+			if (retain_pin)
+				_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+			else
+				_hash_wrtbuf(rel, buf);
+			curr_page_dirty = false;
+		}
+		else if (retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
+
+		buf = next_buf;
+	}
+
+	/*
+	 * lock the bucket page to clear the garbage flag and squeeze the bucket.
+	 * if the current buffer is same as bucket buffer, then we already have
+	 * lock on bucket page.
+	 */
+	if (buf != bucket_buf)
+	{
+		_hash_relbuf(rel, buf);
+		_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+	}
+
+	/*
+	 * Clear the garbage flag from bucket after deleting the tuples that are
+	 * moved by split.  We purposefully clear the flag before squeeze bucket,
+	 * so that after restart, vacuum shouldn't again try to delete the moved
+	 * by split tuples.
+	 */
+	if (bucket_has_garbage)
+	{
+		HashPageOpaque bucket_opaque;
+
+		page = BufferGetPage(bucket_buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+	}
+
+	/*
+	 * If we deleted anything, try to compact free space.  For squeezing the
+	 * bucket, we must have a cleanup lock, else it can impact the ordering of
+	 * tuples for a scan that has started before it.
+	 */
+	if (bucket_dirty && CheckBufferForCleanup(bucket_buf))
+		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
+							bstrategy);
+}
 
 void
 hash_redo(XLogReaderState *record)
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index acd2e64..b1e79b5 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -28,7 +28,8 @@
 void
 _hash_doinsert(Relation rel, IndexTuple itup)
 {
-	Buffer		buf;
+	Buffer		buf = InvalidBuffer;
+	Buffer		bucket_buf;
 	Buffer		metabuf;
 	HashMetaPage metap;
 	BlockNumber blkno;
@@ -40,6 +41,9 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	bool		do_expand;
 	uint32		hashkey;
 	Bucket		bucket;
+	uint32		maxbucket;
+	uint32		highmask;
+	uint32		lowmask;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -70,51 +74,131 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			errhint("Values larger than a buffer page cannot be indexed.")));
 
 	/*
-	 * Loop until we get a lock on the correct target bucket.
+	 * Copy bucket mapping info now;  The comment in _hash_expandtable where
+	 * we copy this information and calls _hash_splitbucket explains why this
+	 * is OK.
 	 */
-	for (;;)
-	{
-		/*
-		 * Compute the target bucket number, and convert to block number.
-		 */
-		bucket = _hash_hashkey2bucket(hashkey,
-									  metap->hashm_maxbucket,
-									  metap->hashm_highmask,
-									  metap->hashm_lowmask);
+	maxbucket = metap->hashm_maxbucket;
+	highmask = metap->hashm_highmask;
+	lowmask = metap->hashm_lowmask;
 
-		blkno = BUCKET_TO_BLKNO(metap, bucket);
+	/*
+	 * Conditionally get the lock on primary bucket page for insertion while
+	 * holding lock on meta page. If we have to wait, then release the meta
+	 * page lock and retry it the hard way.
+	 */
+	bucket = _hash_hashkey2bucket(hashkey,
+								  maxbucket,
+								  highmask,
+								  lowmask);
+
+	blkno = BUCKET_TO_BLKNO(metap, bucket);
 
-		/* Release metapage lock, but keep pin. */
+	/* Fetch the primary bucket page for the bucket */
+	buf = ReadBuffer(rel, blkno);
+	if (!ConditionalLockBuffer(buf))
+	{
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(buf, HASH_WRITE);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+		oldblkno = blkno;
+		retry = true;
+	}
+	else
+	{
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
 		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+	}
 
+	if (retry)
+	{
 		/*
-		 * If the previous iteration of this loop locked what is still the
-		 * correct target bucket, we are done.  Otherwise, drop any old lock
-		 * and lock what now appears to be the correct bucket.
+		 * Loop until we get a lock on the correct target bucket.  We get the
+		 * lock on primary bucket page and retain the pin on it during insert
+		 * operation to prevent the concurrent splits.  Retaining pin on a
+		 * primary bucket page ensures that split can't happen as it needs to
+		 * acquire the cleanup lock on primary bucket page.  Acquiring lock on
+		 * primary bucket and rechecking if it is a target bucket is mandatory
+		 * as otherwise a concurrent split might cause this insertion to fall
+		 * in wrong bucket.
 		 */
-		if (retry)
+		for (;;)
 		{
+			/*
+			 * Compute the target bucket number, and convert to block number.
+			 */
+			bucket = _hash_hashkey2bucket(hashkey,
+										  metap->hashm_maxbucket,
+										  metap->hashm_highmask,
+										  metap->hashm_lowmask);
+
+			blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+			/* Release metapage lock, but keep pin. */
+			_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+			/*
+			 * If the previous iteration of this loop locked what is still the
+			 * correct target bucket, we are done.  Otherwise, drop any old
+			 * lock and lock what now appears to be the correct bucket.
+			 */
 			if (oldblkno == blkno)
 				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
-		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+			_hash_relbuf(rel, buf);
 
-		/*
-		 * Reacquire metapage lock and check that no bucket split has taken
-		 * place while we were awaiting the bucket lock.
-		 */
-		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
-		oldblkno = blkno;
-		retry = true;
+			/* Fetch the primary bucket page for the bucket */
+			buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+
+			/*
+			 * Reacquire metapage lock and check that no bucket split has
+			 * taken place while we were awaiting the bucket lock.
+			 */
+			_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+			oldblkno = blkno;
+		}
 	}
 
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+	/* remember the primary bucket buffer to release the pin on it at end. */
+	bucket_buf = buf;
+
 	page = BufferGetPage(buf);
 	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(pageopaque->hasho_bucket == bucket);
 
+	/*
+	 * If there is any pending split, try to finish it before proceeding for
+	 * the insertion.  We try to finish the split for the insertion in old
+	 * bucket, as that will allow us to remove the tuples from old bucket and
+	 * reuse the space.  There is no such apparent benefit from finishing the
+	 * split during insertion in new bucket.
+	 *
+	 * In future, if we want to finish the splits during insertion in new
+	 * bucket, we must ensure the locking order such that old bucket is locked
+	 * before new bucket.
+	 */
+	if (H_OLD_INCOMPLETE_SPLIT(pageopaque) && CheckBufferForCleanup(buf))
+	{
+		BlockNumber nblkno;
+		Buffer		nbuf;
+
+		nblkno = _hash_get_newblk(rel, pageopaque);
+
+		/* Fetch the primary bucket page for the new bucket */
+		nbuf = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+		if (nbuf)
+		{
+			_hash_finish_split(rel, metabuf, buf, nbuf, maxbucket,
+							   highmask, lowmask);
+
+			/*
+			 * release the buffer here as the insertion will happen in old
+			 * bucket.
+			 */
+			_hash_relbuf(rel, nbuf);
+		}
+	}
+
 	/* Do the insertion */
 	while (PageGetFreeSpace(page) < itemsz)
 	{
@@ -127,14 +211,23 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		{
 			/*
 			 * ovfl page exists; go get it.  if it doesn't have room, we'll
-			 * find out next pass through the loop test above.
+			 * find out next pass through the loop test above.  Retain the pin
+			 * if it is a primary bucket.
 			 */
-			_hash_relbuf(rel, buf);
+			if (pageopaque->hasho_flag & LH_BUCKET_PAGE)
+				_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+			else
+				_hash_relbuf(rel, buf);
 			buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 			page = BufferGetPage(buf);
 		}
 		else
 		{
+			bool		retain_pin = false;
+
+			/* page flags must be accessed before releasing lock on a page. */
+			retain_pin = pageopaque->hasho_flag & LH_BUCKET_PAGE;
+
 			/*
 			 * we're at the end of the bucket chain and we haven't found a
 			 * page with enough room.  allocate a new overflow page.
@@ -144,7 +237,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 
 			/* chain to a new overflow page */
-			buf = _hash_addovflpage(rel, metabuf, buf);
+			buf = _hash_addovflpage(rel, metabuf, buf, retain_pin);
 			page = BufferGetPage(buf);
 
 			/* should fit now, given test above */
@@ -158,11 +251,13 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	/* found page with enough space, so add the item here */
 	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
 
-	/* write and release the modified page */
+	/*
+	 * write and release the modified page and ensure to release the pin on
+	 * primary page.
+	 */
 	_hash_wrtbuf(rel, buf);
-
-	/* We can drop the bucket lock now */
-	_hash_droplock(rel, blkno, HASH_SHARE);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
 
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index db3e268..760563a 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -82,23 +82,20 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *
  *	On entry, the caller must hold a pin but no lock on 'buf'.  The pin is
  *	dropped before exiting (we assume the caller is not interested in 'buf'
- *	anymore).  The returned overflow page will be pinned and write-locked;
- *	it is guaranteed to be empty.
+ *	anymore) if not asked to retain.  The pin will be retained only for the
+ *	primary bucket.  The returned overflow page will be pinned and
+ *	write-locked; it is guaranteed to be empty.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
  *
- *	The caller must hold at least share lock on the bucket, to ensure that
- *	no one else tries to compact the bucket meanwhile.  This guarantees that
- *	'buf' won't stop being part of the bucket while it's unlocked.
- *
  * NB: since this could be executed concurrently by multiple processes,
  * one should not assume that the returned overflow page will be the
  * immediate successor of the originally passed 'buf'.  Additional overflow
  * pages might have been added to the bucket chain in between.
  */
 Buffer
-_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
+_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 {
 	Buffer		ovflbuf;
 	Page		page;
@@ -131,7 +128,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 			break;
 
 		/* we assume we do not need to write the unmodified page */
-		_hash_relbuf(rel, buf);
+		if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
 
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
@@ -149,7 +149,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 
 	/* logically chain overflow page to previous page */
 	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	_hash_wrtbuf(rel, buf);
+	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+		_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_wrtbuf(rel, buf);
 
 	return ovflbuf;
 }
@@ -370,11 +373,11 @@ _hash_firstfreebit(uint32 map)
  *	in the bucket, or InvalidBlockNumber if no following page.
  *
  *	NB: caller must not hold lock on metapage, nor on either page that's
- *	adjacent in the bucket chain.  The caller had better hold exclusive lock
- *	on the bucket, too.
+ *	adjacent in the bucket chain except from primary bucket.  The caller had
+ *	better hold cleanup lock on the primary bucket.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf,
+_hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
@@ -413,22 +416,41 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf,
 	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
-	 * deleted.  No concurrency issues since we hold exclusive lock on the
-	 * entire bucket.
+	 * deleted.  No concurrency issues since we hold the cleanup lock on
+	 * primary bucket.  We don't need to acquire a buffer lock to fix the
+	 * primary bucket, as we already have that lock.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
-														 prevblkno,
-														 HASH_WRITE,
+		if (prevblkno == bucket_blkno)
+		{
+			Buffer		prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
+													 prevblkno,
+													 RBM_NORMAL,
+													 bstrategy);
+
+			Page		prevpage = BufferGetPage(prevbuf);
+			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+			Assert(prevopaque->hasho_bucket == bucket);
+			prevopaque->hasho_nextblkno = nextblkno;
+			MarkBufferDirty(prevbuf);
+			ReleaseBuffer(prevbuf);
+		}
+		else
+		{
+			Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
+															 prevblkno,
+															 HASH_WRITE,
 										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		prevpage = BufferGetPage(prevbuf);
-		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+															 bstrategy);
+			Page		prevpage = BufferGetPage(prevbuf);
+			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
 
-		Assert(prevopaque->hasho_bucket == bucket);
-		prevopaque->hasho_nextblkno = nextblkno;
-		_hash_wrtbuf(rel, prevbuf);
+			Assert(prevopaque->hasho_bucket == bucket);
+			prevopaque->hasho_nextblkno = nextblkno;
+			_hash_wrtbuf(rel, prevbuf);
+		}
 	}
 	if (BlockNumberIsValid(nextblkno))
 	{
@@ -570,7 +592,7 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
  *	required that to be true on entry as well, but it's a lot easier for
  *	callers to leave empty overflow pages and let this guy clean it up.
  *
- *	Caller must hold exclusive lock on the target bucket.  This allows
+ *	Caller must hold cleanup lock on the target bucket.  This allows
  *	us to safely lock multiple pages in the bucket.
  *
  *	Since this function is invoked in VACUUM, we provide an access strategy
@@ -580,6 +602,7 @@ void
 _hash_squeezebucket(Relation rel,
 					Bucket bucket,
 					BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy)
 {
 	BlockNumber wblkno;
@@ -591,27 +614,22 @@ _hash_squeezebucket(Relation rel,
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
 	bool		wbuf_dirty;
+	bool		release_buf = false;
 
 	/*
 	 * start squeezing into the base bucket page.
 	 */
 	wblkno = bucket_blkno;
-	wbuf = _hash_getbuf_with_strategy(rel,
-									  wblkno,
-									  HASH_WRITE,
-									  LH_BUCKET_PAGE,
-									  bstrategy);
+	wbuf = bucket_buf;
 	wpage = BufferGetPage(wbuf);
 	wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 
 	/*
-	 * if there aren't any overflow pages, there's nothing to squeeze.
+	 * if there aren't any overflow pages, there's nothing to squeeze.  The
+	 * caller is responsible for releasing the lock on the primary bucket.
 	 */
 	if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
-	{
-		_hash_relbuf(rel, wbuf);
 		return;
-	}
 
 	/*
 	 * Find the last page in the bucket chain by starting at the base bucket
@@ -669,12 +687,17 @@ _hash_squeezebucket(Relation rel,
 			{
 				Assert(!PageIsEmpty(wpage));
 
+				if (wblkno != bucket_blkno)
+					release_buf = true;
+
 				wblkno = wopaque->hasho_nextblkno;
 				Assert(BlockNumberIsValid(wblkno));
 
-				if (wbuf_dirty)
+				if (wbuf_dirty && release_buf)
 					_hash_wrtbuf(rel, wbuf);
-				else
+				else if (wbuf_dirty)
+					MarkBufferDirty(wbuf);
+				else if (release_buf)
 					_hash_relbuf(rel, wbuf);
 
 				/* nothing more to do if we reached the read page */
@@ -700,6 +723,7 @@ _hash_squeezebucket(Relation rel,
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
 				wbuf_dirty = false;
+				release_buf = false;
 			}
 
 			/*
@@ -733,19 +757,25 @@ _hash_squeezebucket(Relation rel,
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
 		{
-			/* yes, so release wbuf lock first */
-			if (wbuf_dirty)
+			if (wblkno != bucket_blkno)
+				release_buf = true;
+
+			/* yes, so release wbuf lock first if needed */
+			if (wbuf_dirty && release_buf)
 				_hash_wrtbuf(rel, wbuf);
-			else
+			else if (wbuf_dirty)
+				MarkBufferDirty(wbuf);
+			else if (release_buf)
 				_hash_relbuf(rel, wbuf);
+
 			/* free this overflow page (releases rbuf) */
-			_hash_freeovflpage(rel, rbuf, bstrategy);
+			_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
 			/* done */
 			return;
 		}
 
 		/* free this overflow page, then get the previous one */
-		_hash_freeovflpage(rel, rbuf, bstrategy);
+		_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
 
 		rbuf = _hash_getbuf_with_strategy(rel,
 										  rblkno,
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 178463f..6dfd411 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -38,10 +38,14 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
 					uint32 nblocks);
 static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
+static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
+					   Bucket obucket, Bucket nbucket, Buffer obuf,
+					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
+					   uint32 highmask, uint32 lowmask);
 
 
 /*
@@ -55,46 +59,6 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 
 
 /*
- * _hash_getlock() -- Acquire an lmgr lock.
- *
- * 'whichlock' should the block number of a bucket's primary bucket page to
- * acquire the per-bucket lock.  (See README for details of the use of these
- * locks.)
- *
- * 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
- */
-void
-_hash_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		LockPage(rel, whichlock, access);
-}
-
-/*
- * _hash_try_getlock() -- Acquire an lmgr lock, but only if it's free.
- *
- * Same as above except we return FALSE without blocking if lock isn't free.
- */
-bool
-_hash_try_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		return ConditionalLockPage(rel, whichlock, access);
-	else
-		return true;
-}
-
-/*
- * _hash_droplock() -- Release an lmgr lock.
- */
-void
-_hash_droplock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		UnlockPage(rel, whichlock, access);
-}
-
-/*
  *	_hash_getbuf() -- Get a buffer by block number for read or write.
  *
  *		'access' must be HASH_READ, HASH_WRITE, or HASH_NOLOCK.
@@ -132,6 +96,35 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access, int flags)
 }
 
 /*
+ * _hash_getbuf_with_condlock_cleanup() -- as above, but get the buffer for write.
+ *
+ *		We try to take the cleanup lock conditionally; if we get it, we
+ *		return the buffer, else we return InvalidBuffer.
+ */
+Buffer
+_hash_getbuf_with_condlock_cleanup(Relation rel, BlockNumber blkno, int flags)
+{
+	Buffer		buf;
+
+	if (blkno == P_NEW)
+		elog(ERROR, "hash AM does not use P_NEW");
+
+	buf = ReadBuffer(rel, blkno);
+
+	if (!ConditionalLockBufferForCleanup(buf))
+	{
+		ReleaseBuffer(buf);
+		return InvalidBuffer;
+	}
+
+	/* ref count and lock type are correct */
+
+	_hash_checkpage(rel, buf, flags);
+
+	return buf;
+}
+
+/*
  *	_hash_getinitbuf() -- Get and initialize a buffer by block number.
  *
  *		This must be used only to fetch pages that are known to be before
@@ -489,9 +482,11 @@ _hash_pageinit(Page page, Size size)
 /*
  * Attempt to expand the hash table by creating one new bucket.
  *
- * This will silently do nothing if it cannot get the needed locks.
+ * This will silently do nothing if there are active scans in our own
+ * backend or if we can't get a cleanup lock on the old or new bucket.
  *
- * The caller should hold no locks on the hash index.
+ * This also completes any pending split and removes tuples left over in the
+ * old bucket from a previous split.
  *
  * The caller must hold a pin, but no lock, on the metapage buffer.
  * The buffer is returned in the same state.
@@ -506,10 +501,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	BlockNumber start_oblkno;
 	BlockNumber start_nblkno;
 	Buffer		buf_nblkno;
+	Buffer		buf_oblkno;
+	Page		opage;
+	HashPageOpaque oopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
 
+restart_expand:
+
 	/*
 	 * Write-lock the meta page.  It used to be necessary to acquire a
 	 * heavyweight lock to begin a split, but that is no longer required.
@@ -548,11 +548,16 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		goto fail;
 
 	/*
-	 * Determine which bucket is to be split, and attempt to lock the old
-	 * bucket.  If we can't get the lock, give up.
+	 * Determine which bucket is to be split, and attempt to take cleanup lock
+	 * on the old bucket.  If we can't get the lock, give up.
+	 *
+	 * The cleanup lock protects us against other backends, but not against
+	 * our own backend.  Must check for active scans separately.
 	 *
-	 * The lock protects us against other backends, but not against our own
-	 * backend.  Must check for active scans separately.
+	 * The cleanup lock is mainly to protect the split from concurrent
+	 * inserts. See src/backend/access/hash/README, Lock Definitions for
+	 * further details.  Due to this locking restriction, if there is any
+	 * pending scan, the split will give up, which is not good but is harmless.
 	 */
 	new_bucket = metap->hashm_maxbucket + 1;
 
@@ -563,11 +568,90 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	if (_hash_has_active_scan(rel, old_bucket))
 		goto fail;
 
-	if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
+	buf_oblkno = _hash_getbuf_with_condlock_cleanup(rel, start_oblkno, LH_BUCKET_PAGE);
+	if (!buf_oblkno)
 		goto fail;
 
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	/*
+	 * We want to finish any pending split from this bucket, as there is no
+	 * apparent benefit in deferring it, and finishing splits that involve
+	 * multiple buckets (considering the case where the new split also
+	 * fails) would complicate the code.  We don't need to consider the new
+	 * bucket for completing the split here, as a re-split of the new bucket
+	 * cannot start while there is still a pending split from the old bucket.
+	 */
+	if (H_OLD_INCOMPLETE_SPLIT(oopaque))
+	{
+		BlockNumber nblkno;
+		Buffer		buf_nblkno;
+
+		/*
+		 * Copy bucket mapping info now; the comment below, where we copy
+		 * this information before calling _hash_splitbucket, explains why
+		 * this is OK.
+		 */
+		maxbucket = metap->hashm_maxbucket;
+		highmask = metap->hashm_highmask;
+		lowmask = metap->hashm_lowmask;
+
+		/* Release the metapage lock, before completing the split. */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+		nblkno = _hash_get_newblk(rel, oopaque);
+
+		/* Fetch the primary bucket page for the new bucket */
+		buf_nblkno = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+		if (!buf_nblkno)
+		{
+			_hash_relbuf(rel, buf_oblkno);
+			goto fail;
+		}
+
+		_hash_finish_split(rel, metabuf, buf_oblkno, buf_nblkno, maxbucket,
+						   highmask, lowmask);
+
+		/*
+		 * release the buffers and retry for expand.
+		 */
+		_hash_relbuf(rel, buf_oblkno);
+		_hash_relbuf(rel, buf_nblkno);
+
+		goto restart_expand;
+	}
+
 	/*
-	 * Likewise lock the new bucket (should never fail).
+	 * Clean up tuples remaining from a previous split.  This operation
+	 * requires a cleanup lock and we already have one on the old bucket, so
+	 * let's do it.  We also don't want to allow further splits from the
+	 * bucket until the garbage of the previous split is cleaned.  This has
+	 * two advantages: first, it helps avoid bloat due to garbage; second,
+	 * during cleanup of the bucket we can always be sure that the garbage
+	 * tuples belong to the most recently split bucket.  By contrast, if we
+	 * allowed cleanup of the bucket after the meta page has been updated to
+	 * indicate the new split but before the actual split, the cleanup
+	 * operation would not be able to decide whether a tuple has been moved
+	 * to the newly created bucket and could end up deleting such tuples.
+	 */
+	if (H_HAS_GARBAGE(oopaque))
+	{
+		/* Release the metapage lock. */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+		hashbucketcleanup(rel, buf_oblkno, start_oblkno, NULL,
+						  metap->hashm_maxbucket, metap->hashm_highmask,
+						  metap->hashm_lowmask, NULL,
+						  NULL, true, false, NULL, NULL);
+
+		_hash_relbuf(rel, buf_oblkno);
+
+		goto restart_expand;
+	}
+
+	/*
+	 * There shouldn't be any active scan on new bucket.
 	 *
 	 * Note: it is safe to compute the new bucket's blkno here, even though we
 	 * may still need to update the BUCKET_TO_BLKNO mapping.  This is because
@@ -579,9 +663,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	if (_hash_has_active_scan(rel, new_bucket))
 		elog(ERROR, "scan in progress on supposedly new bucket");
 
-	if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
-		elog(ERROR, "could not get lock on supposedly new bucket");
-
 	/*
 	 * If the split point is increasing (hashm_maxbucket's log base 2
 	 * increases), we need to allocate a new batch of bucket pages.
@@ -600,8 +681,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
 			/* can't split due to BlockNumber overflow */
-			_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-			_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
+			_hash_relbuf(rel, buf_oblkno);
 			goto fail;
 		}
 	}
@@ -609,9 +689,18 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/*
 	 * Physically allocate the new bucket's primary page.  We want to do this
 	 * before changing the metapage's mapping info, in case we can't get the
-	 * disk space.
+	 * disk space.  Ideally, we needn't check for a cleanup lock on the new
+	 * bucket, as no other backend can find it until the meta page is
+	 * updated; however, it is good to be consistent with old bucket locking.
 	 */
 	buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM);
+	if (!CheckBufferForCleanup(buf_nblkno))
+	{
+		_hash_relbuf(rel, buf_oblkno);
+		_hash_relbuf(rel, buf_nblkno);
+		goto fail;
+	}
+
 
 	/*
 	 * Okay to proceed with split.  Update the metapage bucket mapping info.
@@ -665,13 +754,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  start_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno,
 					  maxbucket, highmask, lowmask);
 
-	/* Release bucket locks, allowing others to access them */
-	_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-	_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
-
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -745,6 +830,10 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
  * The buffer is returned in the same state.  (The metapage is only
  * touched if it becomes necessary to add or remove overflow pages.)
  *
+ * The split needs to hold pins on the primary bucket pages of both the old
+ * and new buckets until the end of the operation.  This prevents vacuum
+ * from starting while the split is in progress.
+ *
  * In addition, the caller must have created the new bucket's base page,
  * which is passed in buffer nbuf, pinned and write-locked.  That lock and
  * pin are released here.  (The API is set up this way because we must do
@@ -756,37 +845,87 @@ _hash_splitbucket(Relation rel,
 				  Buffer metabuf,
 				  Bucket obucket,
 				  Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Buffer		obuf;
 	Page		opage;
 	Page		npage;
 	HashPageOpaque oopaque;
 	HashPageOpaque nopaque;
 
-	/*
-	 * It should be okay to simultaneously write-lock pages from each bucket,
-	 * since no one else can be trying to acquire buffer lock on pages of
-	 * either bucket.
-	 */
-	obuf = _hash_getbuf(rel, start_oblkno, HASH_WRITE, LH_BUCKET_PAGE);
 	opage = BufferGetPage(obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 
+	/*
+	 * Mark the old bucket to indicate that a split is in progress and that
+	 * it has deletable tuples. At operation end we clear the split-in-progress
+	 * flag, and vacuum clears the page-has-garbage flag after deleting such tuples.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
+
 	npage = BufferGetPage(nbuf);
 
-	/* initialize the new bucket's primary page */
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 	nopaque->hasho_prevblkno = InvalidBlockNumber;
 	nopaque->hasho_nextblkno = InvalidBlockNumber;
 	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
 	nopaque->hasho_page_id = HASHO_PAGE_ID;
 
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, nbuf, NULL,
+						   maxbucket, highmask, lowmask);
+
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, obuf);
+	_hash_relbuf(rel, nbuf);
+}
+
+/*
+ * _hash_splitbucket_guts -- Helper function to perform the split operation
+ *
+ * This routine partitions the tuples between the old and new buckets, and is
+ * also used to finish incomplete split operations.  To finish a previously
+ * interrupted split, the caller needs to fill htab.  If htab is set, we skip
+ * moving tuples that already exist in htab; a NULL htab means that all tuples
+ * belonging to the new bucket are moved.
+ *
+ * Caller needs to lock and unlock the old and new primary buckets.
+ */
+static void
+_hash_splitbucket_guts(Relation rel,
+					   Buffer metabuf,
+					   Bucket obucket,
+					   Bucket nbucket,
+					   Buffer obuf,
+					   Buffer nbuf,
+					   HTAB *htab,
+					   uint32 maxbucket,
+					   uint32 highmask,
+					   uint32 lowmask)
+{
+	Buffer		bucket_obuf;
+	Buffer		bucket_nbuf;
+	Page		opage;
+	Page		npage;
+	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
+
+	bucket_obuf = obuf;
+	opage = BufferGetPage(obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	bucket_nbuf = nbuf;
+	npage = BufferGetPage(nbuf);
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
 	/*
 	 * Partition the tuples in the old bucket between the old bucket and the
 	 * new bucket, advancing along the old bucket's overflow bucket chain and
@@ -798,8 +937,6 @@ _hash_splitbucket(Relation rel,
 		BlockNumber oblkno;
 		OffsetNumber ooffnum;
 		OffsetNumber omaxoffnum;
-		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
 
 		/* Scan each tuple in old page */
 		omaxoffnum = PageGetMaxOffsetNumber(opage);
@@ -810,18 +947,45 @@ _hash_splitbucket(Relation rel,
 			IndexTuple	itup;
 			Size		itemsz;
 			Bucket		bucket;
+			bool		found = false;
 
 			/*
-			 * Fetch the item's hash key (conveniently stored in the item) and
-			 * determine which bucket it now belongs in.
+			 * Before inserting the tuple, probe the hash table containing the
+			 * TIDs of tuples belonging to the new bucket; if we find a match,
+			 * skip that tuple, else fetch the item's hash key (conveniently
+			 * stored in the item) and determine which bucket it now belongs in.
 			 */
 			itup = (IndexTuple) PageGetItem(opage,
 											PageGetItemId(opage, ooffnum));
+
+			if (htab)
+				(void) hash_search(htab, &itup->t_tid, HASH_FIND, &found);
+
+			if (found)
+				continue;
+
 			bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
 										  maxbucket, highmask, lowmask);
 
 			if (bucket == nbucket)
 			{
+				Size		itupsize = 0;
+				IndexTuple	new_itup;
+
+				/*
+				 * make a copy of index tuple as we have to scribble on it.
+				 */
+				new_itup = CopyIndexTuple(itup);
+
+				/*
+				 * mark the index tuple as moved by split; such tuples are
+				 * skipped by scans while a split is in progress for the bucket.
+				 */
+				itupsize = new_itup->t_info & INDEX_SIZE_MASK;
+				new_itup->t_info &= ~INDEX_SIZE_MASK;
+				new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
+				new_itup->t_info |= itupsize;
+
 				/*
 				 * insert the tuple into the new bucket.  if it doesn't fit on
 				 * the current page in the new bucket, we must allocate a new
@@ -832,17 +996,25 @@ _hash_splitbucket(Relation rel,
 				 * only partially complete, meaning the index is corrupt,
 				 * since searches may fail to find entries they should find.
 				 */
-				itemsz = IndexTupleDSize(*itup);
+				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
 				if (PageGetFreeSpace(npage) < itemsz)
 				{
+					bool		retain_pin = false;
+
+					/*
+					 * page flags must be accessed before releasing lock on a
+					 * page.
+					 */
+					retain_pin = nopaque->hasho_flag & LH_BUCKET_PAGE;
+
 					/* write out nbuf and drop lock, but keep pin */
 					_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
 					/* chain to a new overflow page */
-					nbuf = _hash_addovflpage(rel, metabuf, nbuf);
+					nbuf = _hash_addovflpage(rel, metabuf, nbuf, retain_pin);
 					npage = BufferGetPage(nbuf);
-					/* we don't need nopaque within the loop */
+					nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 				}
 
 				/*
@@ -852,12 +1024,10 @@ _hash_splitbucket(Relation rel,
 				 * Possible future improvement: accumulate all the items for
 				 * the new page and qsort them before insertion.
 				 */
-				(void) _hash_pgaddtup(rel, nbuf, itemsz, itup);
+				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
-				/*
-				 * Mark tuple for deletion from old page.
-				 */
-				deletable[ndeletable++] = ooffnum;
+				/* be tidy */
+				pfree(new_itup);
 			}
 			else
 			{
@@ -870,15 +1040,9 @@ _hash_splitbucket(Relation rel,
 
 		oblkno = oopaque->hasho_nextblkno;
 
-		/*
-		 * Done scanning this old page.  If we moved any tuples, delete them
-		 * from the old page.
-		 */
-		if (ndeletable > 0)
-		{
-			PageIndexMultiDelete(opage, deletable, ndeletable);
-			_hash_wrtbuf(rel, obuf);
-		}
+		/* retain the pin on the old primary bucket */
+		if (obuf == bucket_obuf)
+			_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
 		else
 			_hash_relbuf(rel, obuf);
 
@@ -887,18 +1051,153 @@ _hash_splitbucket(Relation rel,
 			break;
 
 		/* Else, advance to next old page */
-		obuf = _hash_getbuf(rel, oblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
+		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
 		opage = BufferGetPage(obuf);
 		oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 	}
 
 	/*
 	 * We're at the end of the old bucket chain, so we're done partitioning
-	 * the tuples.  Before quitting, call _hash_squeezebucket to ensure the
-	 * tuples remaining in the old bucket (including the overflow pages) are
-	 * packed as tightly as possible.  The new bucket is already tight.
+	 * the tuples.  Mark the old and new buckets to indicate that the split
+	 * is finished.
+	 *
+	 * To avoid deadlocks arising from the bucket locking order, first lock
+	 * the old bucket and then the new bucket.
+	 */
+	if (nopaque->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_wrtbuf(rel, nbuf);
+
+	/*
+	 * Acquiring cleanup lock to clear the split-in-progress flag ensures that
+	 * there is no pending scan that has seen the flag after it is cleared.
+	 */
+	_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+	opage = BufferGetPage(bucket_obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	_hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);
+	npage = BufferGetPage(bucket_nbuf);
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
+	/* indicate that split is finished */
+	oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
+	nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;
+
+	/*
+	 * now mark the buffers dirty; we don't release the locks here, as the
+	 * caller is responsible for releasing them.
 	 */
-	_hash_wrtbuf(rel, nbuf);
+	MarkBufferDirty(bucket_obuf);
+	MarkBufferDirty(bucket_nbuf);
+}
+
+/*
+ *	_hash_finish_split() -- Finish the previously interrupted split operation
+ *
+ * To complete the split operation, we build a hash table of the TIDs already
+ * present in the new bucket; the split operation then uses it to skip tuples
+ * that were moved before the split was previously interrupted.
+ *
+ * The caller must hold a pin, but no lock, on the metapage buffer.
+ * The buffer is returned in the same state.  (The metapage is only
+ * touched if it becomes necessary to add or remove overflow pages.)
+ *
+ * 'obuf' and 'nbuf' must be locked by the caller, which is also responsible
+ * for unlocking them.
+ */
+void
+_hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Buffer nbuf,
+				   uint32 maxbucket, uint32 highmask, uint32 lowmask)
+{
+	HASHCTL		hash_ctl;
+	HTAB	   *tidhtab;
+	Buffer		bucket_nbuf;
+	Page		opage;
+	Page		npage;
+	HashPageOpaque opageopaque;
+	HashPageOpaque npageopaque;
+	Bucket		obucket;
+	Bucket		nbucket;
+	bool		found;
+
+	/* Initialize hash tables used to track TIDs */
+	memset(&hash_ctl, 0, sizeof(hash_ctl));
+	hash_ctl.keysize = sizeof(ItemPointerData);
+	hash_ctl.entrysize = sizeof(ItemPointerData);
+	hash_ctl.hcxt = CurrentMemoryContext;
+
+	tidhtab =
+		hash_create("bucket ctids",
+					256,		/* arbitrary initial size */
+					&hash_ctl,
+					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+	/*
+	 * Scan the new bucket and build hash table of TIDs
+	 */
+	bucket_nbuf = nbuf;
+	npage = BufferGetPage(nbuf);
+	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	for (;;)
+	{
+		BlockNumber nblkno;
+		OffsetNumber noffnum;
+		OffsetNumber nmaxoffnum;
+
+		/* Scan each tuple in new page */
+		nmaxoffnum = PageGetMaxOffsetNumber(npage);
+		for (noffnum = FirstOffsetNumber;
+			 noffnum <= nmaxoffnum;
+			 noffnum = OffsetNumberNext(noffnum))
+		{
+			IndexTuple	itup;
+
+			/* Fetch the item's TID and insert it in hash table. */
+			itup = (IndexTuple) PageGetItem(npage,
+											PageGetItemId(npage, noffnum));
+
+			(void) hash_search(tidhtab, &itup->t_tid, HASH_ENTER, &found);
+
+			Assert(!found);
+		}
+
+		nblkno = npageopaque->hasho_nextblkno;
+
+		/*
+		 * release our lock without modifying the buffer, and retain the
+		 * pin on the primary bucket.
+		 */
+		if (nbuf == bucket_nbuf)
+			_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, nbuf);
+
+		/* Exit loop if no more overflow pages in new bucket */
+		if (!BlockNumberIsValid(nblkno))
+			break;
+
+		/* Else, advance to next page */
+		nbuf = _hash_getbuf(rel, nblkno, HASH_READ, LH_OVERFLOW_PAGE);
+		npage = BufferGetPage(nbuf);
+		npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	}
+
+	/* Need a cleanup lock to perform split operation. */
+	LockBufferForCleanup(bucket_nbuf);
+
+	npage = BufferGetPage(bucket_nbuf);
+	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nbucket = npageopaque->hasho_bucket;
+
+	opage = BufferGetPage(obuf);
+	opageopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+	obucket = opageopaque->hasho_bucket;
+
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, bucket_nbuf, tidhtab,
+						   maxbucket, highmask, lowmask);
 
-	_hash_squeezebucket(rel, obucket, start_oblkno, NULL);
+	hash_destroy(tidhtab);
 }
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..b0cb638 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -72,7 +72,19 @@ _hash_readnext(Relation rel,
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_nextblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on the primary bucket page till the end of the scan, to
+	 * ensure that vacuum can't delete tuples that were moved by a split to
+	 * the new bucket. Such tuples are required by scans that started on the
+	 * split bucket before the new bucket's split-in-progress flag
+	 * (LH_BUCKET_NEW_PAGE_SPLIT) was cleared.
+	 */
+	if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -94,7 +106,16 @@ _hash_readprev(Relation rel,
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_prevblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on the primary bucket page till the end of the scan.
+	 * See the comments in _hash_readnext for why the pin is retained.
+	 */
+	if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -192,43 +213,81 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	metap = HashPageGetMeta(page);
 
 	/*
-	 * Loop until we get a lock on the correct target bucket.
+	 * Conditionally acquire the lock on the primary bucket page for the
+	 * search while holding the lock on the meta page. If we would have to
+	 * wait, release the meta page lock and retry the hard way.
 	 */
-	for (;;)
-	{
-		/*
-		 * Compute the target bucket number, and convert to block number.
-		 */
-		bucket = _hash_hashkey2bucket(hashkey,
-									  metap->hashm_maxbucket,
-									  metap->hashm_highmask,
-									  metap->hashm_lowmask);
+	bucket = _hash_hashkey2bucket(hashkey,
+								  metap->hashm_maxbucket,
+								  metap->hashm_highmask,
+								  metap->hashm_lowmask);
 
-		blkno = BUCKET_TO_BLKNO(metap, bucket);
+	blkno = BUCKET_TO_BLKNO(metap, bucket);
 
-		/* Release metapage lock, but keep pin. */
+	/* Fetch the primary bucket page for the bucket */
+	buf = ReadBuffer(rel, blkno);
+	if (!ConditionalLockBufferShared(buf))
+	{
 		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(buf, HASH_READ);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+		oldblkno = blkno;
+		retry = true;
+	}
+	else
+	{
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+	}
 
+	if (retry)
+	{
 		/*
-		 * If the previous iteration of this loop locked what is still the
-		 * correct target bucket, we are done.  Otherwise, drop any old lock
-		 * and lock what now appears to be the correct bucket.
+		 * Loop until we get a lock on the correct target bucket.  We take
+		 * the lock on the primary bucket page and retain the pin on it for
+		 * the duration of the read operation to prevent concurrent splits.
+		 * Retaining a pin on the primary bucket page ensures that a split
+		 * can't happen, since the split needs a cleanup lock on that page.
+		 * Locking the primary bucket and rechecking that it is the target
+		 * bucket is mandatory; otherwise a concurrent split followed by a
+		 * vacuum could remove tuples from the selected bucket that would
+		 * otherwise have been visible.
 		 */
-		if (retry)
+		for (;;)
 		{
+			/*
+			 * Compute the target bucket number, and convert to block number.
+			 */
+			bucket = _hash_hashkey2bucket(hashkey,
+										  metap->hashm_maxbucket,
+										  metap->hashm_highmask,
+										  metap->hashm_lowmask);
+
+			blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+			/* Release metapage lock, but keep pin. */
+			_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+			/*
+			 * If the previous iteration of this loop locked what is still the
+			 * correct target bucket, we are done.  Otherwise, drop any old
+			 * lock and lock what now appears to be the correct bucket.
+			 */
 			if (oldblkno == blkno)
 				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
-		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+			_hash_relbuf(rel, buf);
 
-		/*
-		 * Reacquire metapage lock and check that no bucket split has taken
-		 * place while we were awaiting the bucket lock.
-		 */
-		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
-		oldblkno = blkno;
-		retry = true;
+			/* Fetch the primary bucket page for the bucket */
+			buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
+			/*
+			 * Reacquire metapage lock and check that no bucket split has
+			 * taken place while we were awaiting the bucket lock.
+			 */
+			_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+			oldblkno = blkno;
+		}
 	}
 
 	/* done with the metapage */
@@ -237,14 +296,60 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	/* Update scan opaque state to show we have lock on the bucket */
 	so->hashso_bucket = bucket;
 	so->hashso_bucket_valid = true;
-	so->hashso_bucket_blkno = blkno;
 
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
 	page = BufferGetPage(buf);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(opaque->hasho_bucket == bucket);
 
+	so->hashso_bucket_buf = buf;
+
+	/*
+	 * If a bucket split is in progress, then we need to skip tuples that
+	 * were moved from the old bucket.  To ensure that vacuum doesn't clean
+	 * any tuples from the old or new bucket while this scan is in progress,
+	 * maintain a pin on both buckets.  Here we have to be careful about lock
+	 * ordering: first acquire the lock on the old bucket, then release that
+	 * lock (but not the pin), then acquire the lock on the new bucket and
+	 * re-verify whether the bucket split is still in progress.  Acquiring
+	 * the lock on the old bucket first ensures that vacuum waits for this
+	 * scan to finish.
+	 */
+	if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+	{
+		BlockNumber old_blkno;
+		Buffer		old_buf;
+
+		old_blkno = _hash_get_oldblk(rel, opaque);
+
+		/*
+		 * release the lock on new bucket and re-acquire it after acquiring
+		 * the lock on old bucket.
+		 */
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+
+		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+
+		/*
+		 * remember the old bucket buffer so as to use it later for scanning.
+		 */
+		so->hashso_old_bucket_buf = old_buf;
+		_hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK);
+
+		_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		Assert(opaque->hasho_bucket == bucket);
+
+		if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+			so->hashso_skip_moved_tuples = true;
+		else
+		{
+			_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+			so->hashso_old_bucket_buf = InvalidBuffer;
+		}
+	}
+
 	/* If a backwards scan is requested, move to the end of the chain */
 	if (ScanDirectionIsBackward(dir))
 	{
@@ -273,6 +378,13 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
  *		false.  Else, return true and set the hashso_curpos for the
  *		scan to the right thing.
  *
+ *		Here we also scan the old bucket if a split of the current bucket
+ *		was in progress at the start of the scan.  The basic idea is to
+ *		skip the tuples that were moved by the split while scanning the
+ *		current bucket and then scan the old bucket to cover all such
+ *		tuples.  This ensures that we don't miss any tuples in scans that
+ *		started during the split.
+ *
  *		'bufP' points to the current buffer, which is pinned and read-locked.
  *		On success exit, we have pin and read-lock on whichever page
  *		contains the right item; on failure, we have released all buffers.
@@ -338,6 +450,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum >= FirstOffsetNumber);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip tuples that were moved by the split operation
+						 * if this scan started while the split was in
+						 * progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberNext(offnum);	/* move forward */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -353,9 +478,41 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							buf = so->hashso_old_bucket_buf;
+
+							/*
+							 * old bucket buffer must be valid as we acquire
+							 * the pin on it before the start of the scan and
+							 * retain it till the end of the scan.
+							 */
+							Assert(BufferIsValid(buf));
+
+							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+							page = BufferGetPage(buf);
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for moved-by-split
+							 * tuples in the old bucket, and also that we
+							 * won't try to scan the old bucket again once
+							 * its scan is finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
@@ -379,6 +536,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum <= maxoff);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip tuples that were moved by the split operation
+						 * if this scan started while the split was in
+						 * progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberPrev(offnum);	/* move back */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -394,9 +564,41 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							buf = so->hashso_old_bucket_buf;
+
+							/*
+							 * old bucket buffer must be valid as we acquire
+							 * the pin on it before the start of the scan and
+							 * retain it till the end of the scan.
+							 */
+							Assert(BufferIsValid(buf));
+
+							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+							page = BufferGetPage(buf);
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for moved-by-split
+							 * tuples in the old bucket, and also that we
+							 * won't try to scan the old bucket again once
+							 * its scan is finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 822862d..1648581 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -147,6 +147,23 @@ _hash_log2(uint32 num)
 }
 
 /*
+ * _hash_msb() -- returns the most significant bit position.
+ */
+static uint32
+_hash_msb(uint32 num)
+{
+	uint32		i = 0;
+
+	while (num)
+	{
+		num = num >> 1;
+		++i;
+	}
+
+	return i - 1;
+}
+
+/*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
  * If flags is not zero, it is a bitwise OR of the acceptable values of
@@ -352,3 +369,123 @@ _hash_binsearch_last(Page page, uint32 hash_value)
 
 	return lower;
 }
+
+/*
+ *	_hash_get_oldblk() -- get the block number of the bucket from which
+ *			the current bucket is being split.
+ */
+BlockNumber
+_hash_get_oldblk(Relation rel, HashPageOpaque opaque)
+{
+	Bucket		curr_bucket;
+	Bucket		old_bucket;
+	uint32		mask;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	/*
+	 * To get the old bucket from the current bucket, we need a mask to
+	 * modulo into the lower half of the table.  This mask is stored in the
+	 * meta page as hashm_lowmask, but we can't rely on that here, because
+	 * we need the value of lowmask that was in effect when this bucket's
+	 * split started.  Masking off the most significant bit of the new
+	 * bucket gives us the old bucket.
+	 */
+	curr_bucket = opaque->hasho_bucket;
+	mask = (((uint32) 1) << _hash_msb(curr_bucket)) - 1;
+	old_bucket = curr_bucket & mask;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	blkno = BUCKET_TO_BLKNO(metap, old_bucket);
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_newblk() -- get the block number of the new bucket that will
+ *			be generated by splitting the current bucket.
+ *
+ * This is used to find the new bucket from the old bucket based on the
+ * current table half.  It is mainly required to finish incomplete splits,
+ * where we are sure that no more than one split can be in progress from the
+ * old bucket.
+ */
+BlockNumber
+_hash_get_newblk(Relation rel, HashPageOpaque opaque)
+{
+	Bucket		curr_bucket;
+	Bucket		new_bucket;
+	uint32		lowmask;
+	uint32		mask;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	curr_bucket = opaque->hasho_bucket;
+
+	/*
+	 * The new bucket can be obtained by OR'ing the old bucket with the most
+	 * significant bit of the current table half.  There could be multiple
+	 * buckets that have split from the current bucket; we need the first
+	 * such bucket that exists based on the current table half.
+	 */
+	lowmask = metap->hashm_lowmask;
+
+	for (;;)
+	{
+		mask = lowmask + 1;
+		new_bucket = curr_bucket | mask;
+		if (new_bucket > metap->hashm_maxbucket)
+		{
+			lowmask = lowmask >> 1;
+			continue;
+		}
+		blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+		break;
+	}
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_newbucket() -- get the new bucket that will be generated by
+ *			splitting the current bucket.
+ *
+ * This is used to find the new bucket from the old bucket.  The new bucket
+ * can be obtained by OR'ing the old bucket with the most significant bit of
+ * the table half for the lowmask passed to this function.  There could be
+ * multiple buckets that have split from the current bucket; we need the
+ * first such bucket that exists.  The caller must ensure that no more than
+ * one split has happened from the old bucket.
+ */
+Bucket
+_hash_get_newbucket(Relation rel, Bucket curr_bucket,
+					uint32 lowmask, uint32 maxbucket)
+{
+	Bucket		new_bucket;
+	uint32		mask;
+
+	for (;;)
+	{
+		mask = lowmask + 1;
+		new_bucket = curr_bucket | mask;
+		if (new_bucket > maxbucket)
+		{
+			lowmask = lowmask >> 1;
+			continue;
+		}
+		break;
+	}
+
+	return new_bucket;
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 76ade37..1c9be40 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3567,6 +3567,26 @@ ConditionalLockBuffer(Buffer buffer)
 }
 
 /*
+ * Acquire the content_lock for the buffer, but only if we don't have to wait.
+ *
+ * This assumes the caller wants BUFFER_LOCK_SHARED mode.
+ */
+bool
+ConditionalLockBufferShared(Buffer buffer)
+{
+	BufferDesc *buf;
+
+	Assert(BufferIsValid(buffer));
+	if (BufferIsLocal(buffer))
+		return true;			/* act as though we got it */
+
+	buf = GetBufferDescriptor(buffer - 1);
+
+	return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
+									LW_SHARED);
+}
+
+/*
  * LockBufferForCleanup - lock a buffer in preparation for deleting items
  *
  * Items may be deleted from a disk page only when the caller (a) holds an
@@ -3750,6 +3770,49 @@ ConditionalLockBufferForCleanup(Buffer buffer)
 	return false;
 }
 
+/*
+ * CheckBufferForCleanup - as above, but don't attempt to take lock
+ *
+ * We won't loop, but just check once to see if the pin count is OK.  If
+ * not, return FALSE.
+ */
+bool
+CheckBufferForCleanup(Buffer buffer)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(BufferIsValid(buffer));
+
+	if (BufferIsLocal(buffer))
+	{
+		/* There should be exactly one pin */
+		if (LocalRefCount[-buffer - 1] != 1)
+			return false;
+		/* Nobody else to wait for */
+		return true;
+	}
+
+	/* There should be exactly one local pin */
+	if (GetPrivateRefCount(buffer) != 1)
+		return false;
+
+	bufHdr = GetBufferDescriptor(buffer - 1);
+
+	buf_state = LockBufHdr(bufHdr);
+
+	Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+	if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+	{
+		/* pincount is OK. */
+		UnlockBufHdr(bufHdr, buf_state);
+		return true;
+	}
+
+	UnlockBufHdr(bufHdr, buf_state);
+	return false;
+}
+
 
 /*
  *	Functions for buffer I/O handling
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index ce31418..0b41563 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -25,6 +25,7 @@
 #include "lib/stringinfo.h"
 #include "storage/bufmgr.h"
 #include "storage/lockdefs.h"
+#include "utils/hsearch.h"
 #include "utils/relcache.h"
 
 /*
@@ -52,6 +53,9 @@ typedef uint32 Bucket;
 #define LH_BUCKET_PAGE			(1 << 1)
 #define LH_BITMAP_PAGE			(1 << 2)
 #define LH_META_PAGE			(1 << 3)
+#define LH_BUCKET_NEW_PAGE_SPLIT	(1 << 4)
+#define LH_BUCKET_OLD_PAGE_SPLIT	(1 << 5)
+#define LH_BUCKET_PAGE_HAS_GARBAGE	(1 << 6)
 
 typedef struct HashPageOpaqueData
 {
@@ -64,6 +68,12 @@ typedef struct HashPageOpaqueData
 
 typedef HashPageOpaqueData *HashPageOpaque;
 
+#define H_HAS_GARBAGE(opaque)			((opaque)->hasho_flag & LH_BUCKET_PAGE_HAS_GARBAGE)
+#define H_OLD_INCOMPLETE_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
+#define H_NEW_INCOMPLETE_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+#define H_INCOMPLETE_SPLIT(opaque)		(((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT) || \
+										 ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT))
+
 /*
  * The page ID is for the convenience of pg_filedump and similar utilities,
  * which otherwise would have a hard time telling pages of different index
@@ -88,12 +98,6 @@ typedef struct HashScanOpaqueData
 	bool		hashso_bucket_valid;
 
 	/*
-	 * If we have a share lock on the bucket, we record it here.  When
-	 * hashso_bucket_blkno is zero, we have no such lock.
-	 */
-	BlockNumber hashso_bucket_blkno;
-
-	/*
 	 * We also want to remember which buffer we're currently examining in the
 	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
 	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
@@ -101,11 +105,23 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_curbuf;
 
+	/* remember the buffer associated with primary bucket */
+	Buffer		hashso_bucket_buf;
+
+	/*
+	 * remember the buffer associated with old primary bucket which is
+	 * required during the scan of the bucket for which split is in progress.
+	 */
+	Buffer		hashso_old_bucket_buf;
+
 	/* Current position of the scan, as an index TID */
 	ItemPointerData hashso_curpos;
 
 	/* Current position of the scan, as a heap TID */
 	ItemPointerData hashso_heappos;
+
+	/* Whether scan needs to skip tuples that are moved by split */
+	bool		hashso_skip_moved_tuples;
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -176,6 +192,8 @@ typedef HashMetaPageData *HashMetaPage;
 				  sizeof(ItemIdData) - \
 				  MAXALIGN(sizeof(HashPageOpaqueData)))
 
+#define INDEX_MOVED_BY_SPLIT_MASK	0x2000
+
 #define HASH_MIN_FILLFACTOR			10
 #define HASH_DEFAULT_FILLFACTOR		75
 
@@ -224,9 +242,6 @@ typedef HashMetaPageData *HashMetaPage;
 #define HASH_WRITE		BUFFER_LOCK_EXCLUSIVE
 #define HASH_NOLOCK		(-1)
 
-#define HASH_SHARE		ShareLock
-#define HASH_EXCLUSIVE	ExclusiveLock
-
 /*
  *	Strategy number. There's only one valid strategy for hashing: equality.
  */
@@ -299,21 +314,21 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
 
 /* hashovfl.c */
-extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf);
+extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
 extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
-				   BufferAccessStrategy bstrategy);
+				   BlockNumber bucket_blkno, BufferAccessStrategy bstrategy);
 extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
 				 BlockNumber blkno, ForkNumber forkNum);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy);
 
 /* hashpage.c */
-extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
-extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
-extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
 extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
 			 int access, int flags);
+extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
+								   BlockNumber blkno, int flags);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
@@ -329,6 +344,9 @@ extern uint32 _hash_metapinit(Relation rel, double num_tuples,
 				ForkNumber forkNum);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
+extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
+				   Buffer nbuf, uint32 maxbucket, uint32 highmask,
+				   uint32 lowmask);
 
 /* hashscan.c */
 extern void _hash_regscan(IndexScanDesc scan);
@@ -364,10 +382,20 @@ extern bool _hash_convert_tuple(Relation index,
 					Datum *index_values, bool *index_isnull);
 extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
 extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
+extern BlockNumber _hash_get_oldblk(Relation rel, HashPageOpaque opaque);
+extern BlockNumber _hash_get_newblk(Relation rel, HashPageOpaque opaque);
+extern Bucket _hash_get_newbucket(Relation rel, Bucket curr_bucket,
+					uint32 lowmask, uint32 maxbucket);
 
 /* hash.c */
 extern void hash_redo(XLogReaderState *record);
 extern void hash_desc(StringInfo buf, XLogReaderState *record);
 extern const char *hash_identify(uint8 info);
+extern void hashbucketcleanup(Relation rel, Buffer bucket_buf,
+				  BlockNumber bucket_blkno, BufferAccessStrategy bstrategy,
+				  uint32 maxbucket, uint32 highmask, uint32 lowmask,
+				  double *tuples_removed, double *num_index_tuples,
+				  bool bucket_has_garbage, bool delay,
+				  IndexBulkDeleteCallback callback, void *callback_state);
 
 #endif   /* HASH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3d5dea7..6d0a29c 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -226,8 +226,10 @@ extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
 extern void UnlockBuffers(void);
 extern void LockBuffer(Buffer buffer, int mode);
 extern bool ConditionalLockBuffer(Buffer buffer);
+extern bool ConditionalLockBufferShared(Buffer buffer);
 extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
+extern bool CheckBufferForCleanup(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern void AbortBufferIO(void);
#17Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Amit Kapila (#16)
Re: Hash Indexes

I did some basic testing of the same. In that, I found one issue with cursors.

+BEGIN;
+SET enable_seqscan = OFF;
+SET enable_bitmapscan = OFF;
+CREATE FUNCTION declares_cursor(int)
+ RETURNS void
+ AS 'DECLARE c CURSOR FOR SELECT * from con_hash_index_table WHERE keycol = $1;'
+LANGUAGE SQL;
+
+SELECT declares_cursor(1);
+MOVE FORWARD ALL FROM c;
+MOVE BACKWARD 10000 FROM c;
+ CLOSE c;
+ WARNING: buffer refcount leak: [5835] (rel=base/16384/30537, blockNum=327, flags=0x93800000, refcount=1 1)
ROLLBACK;

Closing the cursor produces a warning which says we forgot to unpin the
buffer.

I have also added tests [1] for coverage improvements.

[1] Some tests to cover hash_index: /messages/by-id/CAD__OugeoQuu3mP09erV3gBdF-nX7o844kW7hAnwCF_rdzr6Qw@mail.gmail.com

On Thu, Jul 14, 2016 at 4:33 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Wed, Jun 22, 2016 at 8:48 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Wed, Jun 22, 2016 at 5:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

We can do it in the way as you are suggesting, but there is another
thing which we need to consider here. As of now, the patch tries to
finish the split if it finds split-in-progress flag in either old or
new bucket. We need to lock both old and new buckets to finish the
split, so it is quite possible that two different backends try to lock
them in opposite order leading to a deadlock. I think the correct way
to handle is to always try to lock the old bucket first and then new
bucket. To achieve that, if the insertion on new bucket finds that
split-in-progress flag is set on a bucket, it needs to release the
lock and then acquire the lock first on old bucket, ensure pincount is
1 and then lock new bucket again and ensure that pincount is 1. I have
already maintained the order of locks in scan (old bucket first and
then new bucket; refer changes in _hash_first()). Alternatively, we
can try to finish the splits only when someone tries to insert in old
bucket.

Yes, I think locking buckets in increasing order is a good solution.
I also think it's fine to only try to finish the split when the insert
targets the old bucket. Finishing the split enables us to remove
tuples from the old bucket, which lets us reuse space instead of
accelerating more. So there is at least some potential benefit to the
backend inserting into the old bucket. On the other hand, a process
inserting into the new bucket derives no direct benefit from finishing
the split.

Okay, following this suggestion, I have updated the patch so that only
insertion into the old bucket can try to finish the splits. Apart from
that, I have fixed the issue reported by Mithun upthread. I have
updated the README to explain the locking used in the patch. Also, I
have changed the locking around vacuum, so that it can work with
concurrent scans whenever possible. In the previous patch version,
vacuum used to take a cleanup lock on a bucket to remove the dead
tuples and the moved-due-to-split tuples and to perform the squeeze
operation, and it held the lock on the bucket till the end of cleanup.
Now it still takes a cleanup lock on the bucket to out-wait scans, but
it releases the lock as it proceeds to clean the overflow pages. The
idea is that we first lock the next bucket page and only then release
the lock on the current bucket page. This ensures that any concurrent
scan started after we start cleaning the bucket always stays behind
the cleanup; allowing scans to overtake vacuum would let vacuum remove
tuples that are still required for the sanctity of those scans. For
the squeeze phase, we just check whether the pin count of the buffer
is one (we already have an exclusive lock on the bucket's buffer by
that time) and only then proceed; otherwise we will try to squeeze the
next time cleanup is required for that bucket.
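
To make the lock-chaining idea concrete, here is a rough illustrative
sketch in C (not an exact hunk from the attached patch; it glosses over
the fact that we keep the pin on the primary bucket page, and 'rel',
'buf', 'blkno' and 'bstrategy' are assumed to already be set up for the
bucket being cleaned):

	/*
	 * Illustrative only: walk the bucket chain, locking the next page
	 * before releasing the current one, so a scan that starts after the
	 * cleanup begins can never overtake it.
	 */
	while (BlockNumberIsValid(blkno))
	{
		Buffer		next_buf = InvalidBuffer;
		BlockNumber next_blkno;
		Page		page = BufferGetPage(buf);
		HashPageOpaque opaque = (HashPageOpaque) PageGetSpecialPointer(page);

		next_blkno = opaque->hasho_nextblkno;

		/* ... delete dead and moved-by-split tuples from 'page' here ... */

		/* lock the next page in the chain before letting go of this one */
		if (BlockNumberIsValid(next_blkno))
			next_buf = _hash_getbuf_with_strategy(rel, next_blkno, HASH_WRITE,
												  LH_OVERFLOW_PAGE, bstrategy);

		_hash_relbuf(rel, buf);		/* releases lock and pin on current page */

		buf = next_buf;
		blkno = next_blkno;
	}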

Thoughts/Suggestions?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

#18Amit Kapila
amit.kapila16@gmail.com
In reply to: Mithun Cy (#17)
1 attachment(s)
Re: Hash Indexes

On Thu, Aug 4, 2016 at 8:02 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

I did some basic testing of the same. In that, I found one issue with cursors.

Thanks for the testing. The reason for the failure was that the patch
didn't take into account the fact that, for scrollable cursors, a scan
can reacquire the lock and pin on the bucket buffer multiple times. I
have fixed it such that we release the pin on the bucket buffers after
we scan the last overflow page in the bucket. The attached patch fixes
the issue for me; let me know if you still see it.
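
For clarity, the shape of the fix is roughly the following (an
illustrative sketch only, not the exact hunk from
concurrent_hash_index_v4.patch; 'so' is the scan's HashScanOpaque and
'rel' the index relation):

	/*
	 * Illustrative only: once the scan has stepped past the last overflow
	 * page of the bucket (and of the old bucket, when a split was in
	 * progress at the start of the scan), drop the pins retained on the
	 * primary bucket pages so that repeated cursor repositioning cannot
	 * accumulate extra pins.
	 */
	if (BufferIsValid(so->hashso_bucket_buf))
	{
		_hash_dropbuf(rel, so->hashso_bucket_buf);
		so->hashso_bucket_buf = InvalidBuffer;
	}
	if (BufferIsValid(so->hashso_old_bucket_buf))
	{
		_hash_dropbuf(rel, so->hashso_old_bucket_buf);
		so->hashso_old_bucket_buf = InvalidBuffer;
	}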

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

concurrent_hash_index_v4.patch (application/octet-stream)
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index c1122b4..d02539a 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -400,7 +400,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
 	Buffer		buf;
 	Page		page;
 
-	_hash_getlock(rel, blkno, HASH_SHARE);
 	buf = _hash_getbuf_with_strategy(rel, blkno, HASH_READ, 0, bstrategy);
 	page = BufferGetPage(buf);
 
@@ -431,7 +430,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
 	}
 
 	_hash_relbuf(rel, buf);
-	_hash_droplock(rel, blkno, HASH_SHARE);
 }
 
 /*
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 0a7da89..a0feb2f 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -125,49 +125,45 @@ the initially created buckets.
 
 Lock Definitions
 ----------------
-
-We use both lmgr locks ("heavyweight" locks) and buffer context locks
-(LWLocks) to control access to a hash index.  lmgr locks are needed for
-long-term locking since there is a (small) risk of deadlock, which we must
-be able to detect.  Buffer context locks are used for short-term access
-control to individual pages of the index.
-
-LockPage(rel, page), where page is the page number of a hash bucket page,
-represents the right to split or compact an individual bucket.  A process
-splitting a bucket must exclusive-lock both old and new halves of the
-bucket until it is done.  A process doing VACUUM must exclusive-lock the
-bucket it is currently purging tuples from.  Processes doing scans or
-insertions must share-lock the bucket they are scanning or inserting into.
-(It is okay to allow concurrent scans and insertions.)
-
-The lmgr lock IDs corresponding to overflow pages are currently unused.
-These are available for possible future refinements.  LockPage(rel, 0)
-is also currently undefined (it was previously used to represent the right
-to modify the hash-code-to-bucket mapping, but it is no longer needed for
-that purpose).
-
-Note that these lock definitions are conceptually distinct from any sort
-of lock on the pages whose numbers they share.  A process must also obtain
-read or write buffer lock on the metapage or bucket page before accessing
-said page.
-
-Processes performing hash index scans must hold share lock on the bucket
-they are scanning throughout the scan.  This seems to be essential, since
-there is no reasonable way for a scan to cope with its bucket being split
-underneath it.  This creates a possibility of deadlock external to the
-hash index code, since a process holding one of these locks could block
-waiting for an unrelated lock held by another process.  If that process
-then does something that requires exclusive lock on the bucket, we have
-deadlock.  Therefore the bucket locks must be lmgr locks so that deadlock
-can be detected and recovered from.
-
-Processes must obtain read (share) buffer context lock on any hash index
-page while reading it, and write (exclusive) lock while modifying it.
-To prevent deadlock we enforce these coding rules: no buffer lock may be
-held long term (across index AM calls), nor may any buffer lock be held
-while waiting for an lmgr lock, nor may more than one buffer lock
-be held at a time by any one process.  (The third restriction is probably
-stronger than necessary, but it makes the proof of no deadlock obvious.)
+We use buffer content locks (LWLocks) and buffer pins to control access to
+a hash index.
+
+A scan takes a shared-mode lock on the primary or overflow bucket page it is
+reading.  An insert acquires an exclusive lock on the bucket page into which
+it has to insert.  Both operations release the lock on the previous bucket
+page before moving to the next overflow page, and both retain a pin on the
+primary bucket page till the end of the operation.
+A split operation must acquire cleanup locks on both the old and new halves
+of the bucket and mark split-in-progress on both buckets.  The cleanup lock
+at the start of the split ensures that a parallel insert won't get lost.
+Consider a case where an insertion has to add a tuple to some intermediate
+overflow page in the bucket chain; if we allowed a split while that insertion
+is in progress, the split might not move the newly inserted tuple.  The split
+releases the lock on the previous page before moving to the next overflow
+page, for both the old and the new bucket.  After partitioning the tuples
+between the old and new buckets, it again needs to acquire exclusive locks on
+both the old and new buckets to clear the split-in-progress flag.  Like
+inserts and scans, it also retains pins on both the old and new primary
+buckets till the end of the split operation, although we could do without
+that as well.
+
+Vacuum acquires a cleanup lock on the bucket to remove dead tuples and/or
+tuples that were moved due to a split.  The cleanup lock is needed to remove
+dead tuples so that scans return correct results: a scan that returns multiple
+tuples from the same bucket page always resumes from the offset number at
+which it returned the last tuple.  If vacuum were allowed to remove dead
+tuples with just an exclusive lock, it could remove the tuple required to
+resume the scan.  The cleanup lock is needed to remove moved-by-split tuples
+to ensure that no scan is pending that started after the split began and
+before it finished on the bucket.  If we didn't do that, vacuum could remove
+tuples that such a scan still requires.  We need not retain this cleanup lock
+for the whole vacuum operation on the bucket; we release the lock as we move
+ahead in the bucket chain.  At the end, for the squeeze phase, we
+conditionally acquire the cleanup lock again and, if we don't get it, simply
+abandon the squeeze phase.
+
+To avoid deadlocks, we must be consistent about the order in which we lock
+buckets for operations that require locks on two different buckets.  The rule
+is: first lock the old bucket and then the new bucket; in other words, always
+lock the lower-numbered bucket first.
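
For illustration only, here is a minimal sketch of that ordering rule expressed
with the buffer-manager calls used elsewhere in this patch; the helper name is
invented for the example and is not part of the patch:

	#include "postgres.h"
	#include "access/hash.h"
	#include "storage/bufmgr.h"

	/*
	 * Illustrative only: lock two bucket buffers in block-number order, so
	 * that every code path that needs both buckets agrees on the order and
	 * deadlock between them is impossible.
	 */
	static void
	lock_two_buckets_in_order(Buffer obuf, Buffer nbuf)
	{
		if (BufferGetBlockNumber(obuf) < BufferGetBlockNumber(nbuf))
		{
			LockBuffer(obuf, BUFFER_LOCK_EXCLUSIVE);
			LockBuffer(nbuf, BUFFER_LOCK_EXCLUSIVE);
		}
		else
		{
			LockBuffer(nbuf, BUFFER_LOCK_EXCLUSIVE);
			LockBuffer(obuf, BUFFER_LOCK_EXCLUSIVE);
		}
	}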
 
 
 Pseudocode Algorithms
@@ -188,63 +184,105 @@ track of available overflow pages.
 The reader algorithm is:
 
 	pin meta page and take buffer content lock in shared mode
-	loop:
-		compute bucket number for target hash key
-		release meta page buffer content lock
-		if (correct bucket page is already locked)
-			break
-		release any existing bucket page lock (if a concurrent split happened)
-		take heavyweight bucket lock
-		retake meta page buffer content lock in shared mode
+	compute bucket number for target hash key
+	read and pin the primary bucket page
+	conditionally get the buffer content lock in shared mode on the primary bucket page
+	if we didn't get the lock (i.e. we would have to wait for it)
+		release the buffer content lock on meta page
+		acquire buffer content lock on primary bucket page in shared mode
+		reacquire the buffer content lock in shared mode on meta page
+		to check for the possibility of a split, recompute the bucket number
+		and verify that it is still the correct bucket; set the retry flag
+	else we got the lock, so we can skip the retry path
+	if (retry)
+		loop:
+			compute bucket number for target hash key
+			release meta page buffer content lock
+			if (correct bucket page is already locked)
+				break
+			release any existing content lock on bucket page (if a concurrent split happened)
+			pin primary bucket page and take shared buffer content lock
+			retake meta page buffer content lock in shared mode
 -- then, per read request:
 	release pin on metapage
-	read current page of bucket and take shared buffer content lock
-		step to next page if necessary (no chaining of locks)
+	if a split is in progress for the current bucket and this is the new half
+		release the buffer content lock on current bucket page
+		pin and acquire the buffer content lock on the old bucket in shared mode
+		release the buffer content lock on the old bucket, but not the pin
+		retake the buffer content lock on the new bucket
+		mark the scan so that it skips tuples that are marked moved-by-split
+	step to next page if necessary (no chaining of locks)
+		if the scan is marked to skip moved-by-split tuples, move to the old bucket
+		after the scan of the current bucket is finished
 	get tuple
 	release buffer content lock and pin on current page
 -- at scan shutdown:
-	release bucket share-lock
-
-We can't hold the metapage lock while acquiring a lock on the target bucket,
-because that might result in an undetected deadlock (lwlocks do not participate
-in deadlock detection).  Instead, we relock the metapage after acquiring the
-bucket page lock and check whether the bucket has been split.  If not, we're
-done.  If so, we release our previously-acquired lock and repeat the process
-using the new bucket number.  Holding the bucket sharelock for
+	release any pins we hold on the current buffer, old bucket buffer, and new bucket buffer
+
+We don't want to hold the meta page lock while waiting to acquire the content
+lock on a bucket page, because that might result in poor concurrency.
+Instead, we relock the metapage after acquiring the bucket page content lock
+and check whether the bucket has been split.  If not, we're done.  If so, we
+release our previously-acquired content lock (but not the pin) and repeat the
+process using the new bucket number.  Holding the buffer pin on the bucket page for
 the remainder of the scan prevents the reader's current-tuple pointer from
-being invalidated by splits or compactions.  Notice that the reader's lock
+being invalidated by splits or compactions.  Notice that the reader's pin
 does not prevent other buckets from being split or compacted.
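
As a rough illustration, the conditional-lock-then-recheck dance looks like the
sketch below.  It is condensed from the insertion changes later in this patch
and adapted for a reader; the function name is invented for the example and the
caller is still expected to do the split recheck.

	#include "postgres.h"
	#include "access/hash.h"
	#include "storage/bufmgr.h"

	/* Illustrative only: take the bucket lock without sleeping under the meta lock. */
	static Buffer
	reader_lock_bucket_sketch(Relation rel, Buffer metabuf, HashMetaPage metap,
							  uint32 hashkey)
	{
		Bucket		bucket;
		BlockNumber blkno;
		Buffer		buf;

		bucket = _hash_hashkey2bucket(hashkey, metap->hashm_maxbucket,
									  metap->hashm_highmask,
									  metap->hashm_lowmask);
		blkno = BUCKET_TO_BLKNO(metap, bucket);
		buf = ReadBuffer(rel, blkno);

		if (!ConditionalLockBuffer(buf))
		{
			/* drop the meta page lock (keep the pin) before sleeping */
			_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
			LockBuffer(buf, HASH_READ);
			/* relock the meta page; caller must recheck for a concurrent split */
			_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
		}
		return buf;
	}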
 
 To keep concurrency reasonably good, we require readers to cope with
 concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the page sharelock.  Since
-deletion is not possible while a reader holds the bucket sharelock, and
-we assume that heap tuple TIDs are unique, this can be implemented by
+their current scan position after re-acquiring the buffer content lock on the
+page.  Since deletion is not possible while a reader holds a pin on the bucket,
+and we assume that heap tuple TIDs are unique, this can be implemented by
 searching for the same heap tuple TID previously returned.  Insertion does
 not move index entries across pages, so the previously-returned index entry
 should always be on the same page, at the same or higher offset number,
 as it was before.
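
A sketch of that re-find step is shown below; it is roughly what hashgettuple
does after re-locking the page, but the helper name is invented for the
example.

	#include "postgres.h"
	#include "access/itup.h"
	#include "storage/bufpage.h"
	#include "storage/itemptr.h"

	/*
	 * Illustrative only: insertions never move entries to another page and
	 * never push them to a lower offset, so the previously returned TID can
	 * only be at the same or a higher offset on the same page.
	 */
	static OffsetNumber
	refind_scan_position_sketch(Page page, ItemPointer lastheaptid,
								OffsetNumber lastoffnum)
	{
		OffsetNumber maxoffnum = PageGetMaxOffsetNumber(page);
		OffsetNumber offnum;

		for (offnum = lastoffnum;
			 offnum <= maxoffnum;
			 offnum = OffsetNumberNext(offnum))
		{
			IndexTuple	itup;

			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
			if (ItemPointerEquals(&itup->t_tid, lastheaptid))
				return offnum;	/* resume just after this entry */
		}
		return InvalidOffsetNumber;		/* shouldn't happen under these rules */
	}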
 
+To allow scans during a bucket split: if at the start of the scan the bucket
+is marked split-in-progress, the scan reads all the tuples in that bucket
+except those marked moved-by-split.  Once it finishes scanning all the tuples
+in the current bucket, it scans the old bucket from which this bucket was
+formed by the split.  This applies only to the new half of the split.
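
To make the skip concrete, a per-tuple filter might look like the sketch below.
The bit test shown (a hypothetical INDEX_MOVED_BY_SPLIT_MASK in the index
tuple's t_info) is only an assumed representation of the moved-by-split
marker; hashso_skip_moved_tuples is the scan-opaque field added by this patch,
and the helper name is invented for the example.

	#include "postgres.h"
	#include "access/hash.h"
	#include "access/itup.h"

	/* Hypothetical marker bit; the patch may represent moved-by-split differently. */
	#define INDEX_MOVED_BY_SPLIT_MASK	0x2000

	/* Illustrative only: should this tuple be returned by the current scan? */
	static bool
	tuple_visible_to_scan_sketch(IndexTuple itup, HashScanOpaque so)
	{
		if (so->hashso_skip_moved_tuples &&
			(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK) != 0)
			return false;		/* copied here by the still-in-progress split */
		return true;
	}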
+
 The insertion algorithm is rather similar:
 
 	pin meta page and take buffer content lock in shared mode
-	loop:
-		compute bucket number for target hash key
-		release meta page buffer content lock
-		if (correct bucket page is already locked)
-			break
-		release any existing bucket page lock (if a concurrent split happened)
-		take heavyweight bucket lock in shared mode
-		retake meta page buffer content lock in shared mode
--- (so far same as reader)
+	compute bucket number for target hash key
+	read and pin the primary bucket page
+	conditionally get the buffer content lock in exclusive mode on the primary bucket page
+	if we didn't get the lock (i.e. we would have to wait for it)
+		release the buffer content lock on meta page
+		acquire buffer content lock on primary bucket page in exclusive mode
+		reacquire the buffer content lock in shared mode on meta page
+		to check for the possibility of a split, recompute the bucket number
+		and verify that it is still the correct bucket; set the retry flag
+	else we got the lock, so we can skip the retry path
+	if (retry)
+		loop:
+			compute bucket number for target hash key
+			release meta page buffer content lock
+			if (correct bucket page is already locked)
+				break
+			release any existing content lock on bucket page (if a concurrent split happened)
+			pin primary bucket page and take exclusive buffer content lock
+			retake meta page buffer content lock in shared mode
+-- (so far same as reader, except that the buffer content lock on the primary
+	bucket page is acquired in exclusive mode)
 	release pin on metapage
-	pin current page of bucket and take exclusive buffer content lock
-	if full, release, read/exclusive-lock next page; repeat as needed
+	if the split-in-progress flag is set on the bucket (as the old half of a split)
+	and the pin count on it is one, then finish the split
+		we already have a buffer content lock on the old bucket; conditionally get the content lock on the new bucket
+		if we get the lock on the new bucket
+			finish the split using algorithm mentioned below for split
+			release the buffer content lock and pin on new bucket
+	if full, release lock but not pin, read/exclusive-lock next page; repeat as needed
 	>> see below if no space in any page of bucket
 	insert tuple at appropriate place in page
 	mark current page dirty and release buffer content lock and pin
+	if current page is not a bucket page, release the pin on bucket page
-	release heavyweight share-lock
-	pin meta page and take buffer content lock in shared mode
+	pin meta page and take buffer content lock in exclusive mode
 	increment tuple count, decide if split needed
 	mark meta page dirty and release buffer content lock and pin
 	done if no split needed, else enter Split algorithm below
@@ -256,11 +294,13 @@ bucket that is being actively scanned, because readers can cope with this
 as explained above.  We only need the short-term buffer locks to ensure
 that readers do not see a partially-updated page.
 
-It is clearly impossible for readers and inserters to deadlock, and in
-fact this algorithm allows them a very high degree of concurrency.
-(The exclusive metapage lock taken to update the tuple count is stronger
-than necessary, since readers do not care about the tuple count, but the
-lock is held for such a short time that this is probably not an issue.)
+To avoid deadlock between readers and inserters, whenever there is a need
+to lock multiple buckets, we always take the locks in the order given in the
+Lock Definitions section above.  This algorithm allows them a very high degree of
+concurrency.  (The exclusive metapage lock taken to update the tuple count
+is stronger than necessary, since readers do not care about the tuple count,
+but the lock is held for such a short time that this is probably not an
+issue.)
 
 When an inserter cannot find space in any existing page of a bucket, it
 must obtain an overflow page and add that page to the bucket's chain.
@@ -271,46 +311,79 @@ index is overfull (has a higher-than-wanted ratio of tuples to buckets).
 The algorithm attempts, but does not necessarily succeed, to split one
 existing bucket in two, thereby lowering the fill ratio:
 
-	pin meta page and take buffer content lock in exclusive mode
-	check split still needed
-	if split not needed anymore, drop buffer content lock and pin and exit
-	decide which bucket to split
-	Attempt to X-lock old bucket number (definitely could fail)
-	Attempt to X-lock new bucket number (shouldn't fail, but...)
-	if above fail, drop locks and pin and exit
+	expand:
+		take buffer content lock in exclusive mode on meta page
+		check split still needed
+		if split not needed anymore, drop buffer content lock and exit
+		decide which bucket to split
+		Attempt to acquire cleanup lock on old bucket number (definitely could fail)
+		if above fail, release lock and pin and exit
+		if the split-in-progress flag is set, then finish the split
+			conditionally get the content lock on new bucket which was involved in split
+			if got the lock on new bucket
+				finish the split using algorithm mentioned below for split
+				release the buffer content lock and pin on old and new buckets
+				try to expand from start
+			else
+				release the buffer content lock and pin on old bucket and exit
+		if the garbage flag (indicates that tuples are moved by split) is set on bucket
+			release the buffer content lock on meta page
+			remove the tuples that don't belong to this bucket; see bucket cleanup below
+	Attempt to acquire cleanup lock on new bucket number (shouldn't fail, but...)
 	update meta page to reflect new number of buckets
-	mark meta page dirty and release buffer content lock and pin
+	mark meta page dirty and release buffer content lock
 	-- now, accesses to all other buckets can proceed.
 	Perform actual split of bucket, moving tuples as needed
 	>> see below about acquiring needed extra space
 	Release X-locks of old and new buckets
 
+	split guts:
+	mark the old and new buckets indicating split-in-progress
+	mark the old bucket indicating has-garbage
+	copy the tuples that belong to the new bucket from the old bucket
+	during the copy, mark such tuples as moved-by-split
+	release lock but not pin for primary bucket page of old bucket,
+	read/shared-lock next page; repeat as needed
+	>> see below if no space in bucket page of new bucket
+	ensure to have exclusive-lock on both old and new buckets in that order
+	clear the split-in-progress flag from both the buckets
+	mark buffers dirty and release the locks and pins on both old and new buckets
+
 Note the metapage lock is not held while the actual tuple rearrangement is
 performed, so accesses to other buckets can proceed in parallel; in fact,
 it's possible for multiple bucket splits to proceed in parallel.
 
-Split's attempt to X-lock the old bucket number could fail if another
-process holds S-lock on it.  We do not want to wait if that happens, first
-because we don't want to wait while holding the metapage exclusive-lock,
-and second because it could very easily result in deadlock.  (The other
-process might be out of the hash AM altogether, and could do something
-that blocks on another lock this process holds; so even if the hash
-algorithm itself is deadlock-free, a user-induced deadlock could occur.)
-So, this is a conditional LockAcquire operation, and if it fails we just
-abandon the attempt to split.  This is all right since the index is
-overfull but perfectly functional.  Every subsequent inserter will try to
-split, and eventually one will succeed.  If multiple inserters failed to
-split, the index might still be overfull, but eventually, the index will
+Split's attempt to acquire cleanup-lock on the old bucket number could fail
+if another process holds any lock or pin on it.  We do not want to wait if
+that happens, because we don't want to wait while holding the metapage
+exclusive-lock.  So, this is a conditional LWLockAcquire operation, and if
+it fails we just abandon the attempt to split.  This is all right since the
+index is overfull but perfectly functional.  Every subsequent inserter will
+try to split, and eventually one will succeed.  If multiple inserters failed
+to split, the index might still be overfull, but eventually, the index will
 not be overfull and split attempts will stop.  (We could make a successful
 splitter loop to see if the index is still overfull, but it seems better to
 distribute the split overhead across successive insertions.)
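
The conditional acquisition looks roughly like the sketch below; it is
essentially the body of _hash_getbuf_with_condlock_cleanup added later in this
patch, with the wrapper name invented for the example.

	#include "postgres.h"
	#include "access/hash.h"
	#include "storage/bufmgr.h"

	/* Illustrative only: try for a cleanup lock, but never wait for it. */
	static Buffer
	try_cleanup_lock_bucket_sketch(Relation rel, BlockNumber blkno)
	{
		Buffer		buf = ReadBuffer(rel, blkno);

		if (!ConditionalLockBufferForCleanup(buf))
		{
			/* someone else holds a pin or lock; give up and let them proceed */
			ReleaseBuffer(buf);
			return InvalidBuffer;
		}
		return buf;
	}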
 
+The has-garbage flag indicates that the bucket contains tuples that were moved
+due to a split.  It is set only on the old bucket.  The reason we need it in
+addition to the split-in-progress flag is to identify leftover moved-by-split
+tuples once the split is over (i.e. once the split-in-progress flag has been
+cleared).  It is used both by vacuum and by the re-split operation.  Vacuum
+uses it to decide whether it needs to remove moved-by-split tuples from the
+bucket along with dead tuples.  A re-split uses it to ensure that it doesn't
+start a new split from a bucket before the previous split's tuples have been
+cleared from the old bucket.  That rule keeps bloat under control and makes
+the design somewhat simpler, as we never have to handle a bucket that contains
+dead tuples from multiple splits.
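
For clarity, a sketch of how these page flags combine, using the macros this
patch introduces; the two helper names are invented for the example.

	#include "postgres.h"
	#include "access/hash.h"
	#include "storage/bufpage.h"

	/* Illustrative only: vacuum removes split garbage only once the split is over. */
	static bool
	bucket_needs_garbage_removal_sketch(Page page)
	{
		HashPageOpaque opaque = (HashPageOpaque) PageGetSpecialPointer(page);

		return H_HAS_GARBAGE(opaque) && !H_INCOMPLETE_SPLIT(opaque);
	}

	/* Illustrative only: clearing the flag re-enables future splits of this bucket. */
	static void
	clear_garbage_flag_sketch(Page page)
	{
		HashPageOpaque opaque = (HashPageOpaque) PageGetSpecialPointer(page);

		opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
	}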
+
 A problem is that if a split fails partway through (eg due to insufficient
-disk space) the index is left corrupt.  The probability of that could be
-made quite low if we grab a free page or two before we update the meta
-page, but the only real solution is to treat a split as a WAL-loggable,
+disk space or crash) the index is left corrupt.  The probability of that
+could be made quite low if we grab a free page or two before we update the
+meta page, but the only real solution is to treat a split as a WAL-loggable,
 must-complete action.  I'm not planning to teach hash about WAL in this
-go-round.
+go-round.  However, we do try to finish the incomplete splits during insert
+and split.
 
 The fourth operation is garbage collection (bulk deletion):
 
@@ -319,9 +392,13 @@ The fourth operation is garbage collection (bulk deletion):
 	fetch current max bucket number
 	release meta page buffer content lock and pin
 	while next bucket <= max bucket do
-		Acquire X lock on target bucket
-		Scan and remove tuples, compact free space as needed
-		Release X lock
+		Acquire cleanup lock on target bucket
+		Scan and remove tuples
+		For overflow pages, first lock the next page and only then
+		release the lock on the current page
+		Ensure we hold an exclusive lock on the primary bucket page
+		If the buffer pin count is one, compact free space as needed
+		Release lock
 		next bucket ++
 	end loop
 	pin metapage and take buffer content lock in exclusive mode
@@ -330,20 +407,23 @@ The fourth operation is garbage collection (bulk deletion):
 	else update metapage tuple count
 	mark meta page dirty and release buffer content lock and pin
 
-Note that this is designed to allow concurrent splits.  If a split occurs,
-tuples relocated into the new bucket will be visited twice by the scan,
-but that does no harm.  (We must however be careful about the statistics
-reported by the VACUUM operation.  What we can do is count the number of
-tuples scanned, and believe this in preference to the stored tuple count
-if the stored tuple count and number of buckets did *not* change at any
-time during the scan.  This provides a way of correcting the stored tuple
-count if it gets out of sync for some reason.  But if a split or insertion
-does occur concurrently, the scan count is untrustworthy; instead,
-subtract the number of tuples deleted from the stored tuple count and
-use that.)
-
-The exclusive lock request could deadlock in some strange scenarios, but
-we can just error out without any great harm being done.
+Note that this is designed to allow concurrent splits and scans.  If a
+split occurs, tuples relocated into the new bucket will be visited twice
+by the scan, but that does no harm.  Because we release locks as we move
+through a bucket, a concurrent scan can start on the bucket, but it will
+always remain behind the cleanup.  Keeping scans behind the cleanup is a
+must, else vacuum could remove tuples that are still required to complete
+the scan, as explained in the Lock Definitions section above.  This holds
+true for backward scans as well (backward scans first traverse each bucket
+from the primary bucket page to the last overflow page in the chain).
+We must be careful about the statistics reported by the VACUUM operation.
+What we can do is count the number of tuples scanned, and believe this in
+preference to the stored tuple count if the stored tuple count and number
+of buckets did *not* change at any time during the scan.  This provides a
+way of correcting the stored tuple count if it gets out of sync for some
+reason.  But if a split or insertion does occur concurrently, the scan
+count is untrustworthy; instead, subtract the number of tuples deleted
+from the stored tuple count and use that.
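
A sketch of the lock-coupling step that keeps scans behind the cleanup while
walking the overflow chain; it mirrors the loop in hashbucketcleanup below,
with the helper name invented for the example.

	#include "postgres.h"
	#include "access/hash.h"
	#include "storage/bufmgr.h"

	/* Illustrative only: lock the next overflow page before letting go of this one. */
	static Buffer
	cleanup_step_to_next_page_sketch(Relation rel, Buffer buf,
									 BlockNumber nextblkno,
									 BufferAccessStrategy bstrategy)
	{
		Buffer		next_buf;

		next_buf = _hash_getbuf_with_strategy(rel, nextblkno, HASH_WRITE,
											  LH_OVERFLOW_PAGE, bstrategy);
		_hash_relbuf(rel, buf);		/* now a new scan cannot overtake us */

		return next_buf;
	}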
 
 
 Free Space Management
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 30c82e1..190c394 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -285,10 +285,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		/*
 		 * An insertion into the current index page could have happened while
 		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold share lock on
-		 * the bucket, no deletions or splits could have occurred; therefore
-		 * we can expect that the TID still exists in the current index page,
-		 * at an offset >= where we were.)
+		 * for the TID we previously returned.  (Because we hold pin on the
+		 * bucket, no deletions or splits could have occurred; therefore we
+		 * can expect that the TID still exists in the current index page, at
+		 * an offset >= where we were.)
 		 */
 		OffsetNumber maxoffnum;
 
@@ -423,12 +423,15 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
 	so->hashso_bucket_valid = false;
-	so->hashso_bucket_blkno = 0;
 	so->hashso_curbuf = InvalidBuffer;
+	so->hashso_bucket_buf = InvalidBuffer;
+	so->hashso_old_bucket_buf = InvalidBuffer;
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
 	ItemPointerSetInvalid(&(so->hashso_heappos));
 
+	so->hashso_skip_moved_tuples = false;
+
 	scan->opaque = so;
 
 	/* register scan in case we change pages it's using */
@@ -447,15 +450,7 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
-
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	_hash_dropscanbuf(rel, so);
 
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
@@ -469,6 +464,8 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 				scan->numberOfKeys * sizeof(ScanKeyData));
 		so->hashso_bucket_valid = false;
 	}
+
+	so->hashso_skip_moved_tuples = false;
 }
 
 /*
@@ -482,16 +479,7 @@ hashendscan(IndexScanDesc scan)
 
 	/* don't need scan registered anymore */
 	_hash_dropscan(scan);
-
-	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
-
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	_hash_dropscanbuf(rel, so);
 
 	pfree(so);
 	scan->opaque = NULL;
@@ -502,6 +490,9 @@ hashendscan(IndexScanDesc scan)
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
  *
+ * This function also deletes the tuples that were moved by a split to
+ * another bucket.
+ *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 IndexBulkDeleteResult *
@@ -546,83 +537,52 @@ loop_top:
 	{
 		BlockNumber bucket_blkno;
 		BlockNumber blkno;
-		bool		bucket_dirty = false;
+		Buffer		bucket_buf;
+		Buffer		buf;
+		HashPageOpaque bucket_opaque;
+		Page		page;
+		bool		bucket_has_garbage = false;
 
 		/* Get address of bucket's start page */
 		bucket_blkno = BUCKET_TO_BLKNO(&local_metapage, cur_bucket);
 
-		/* Exclusive-lock the bucket so we can shrink it */
-		_hash_getlock(rel, bucket_blkno, HASH_EXCLUSIVE);
-
 		/* Shouldn't have any active scans locally, either */
 		if (_hash_has_active_scan(rel, cur_bucket))
 			elog(ERROR, "hash index has active scan during VACUUM");
 
-		/* Scan each page in bucket */
 		blkno = bucket_blkno;
-		while (BlockNumberIsValid(blkno))
-		{
-			Buffer		buf;
-			Page		page;
-			HashPageOpaque opaque;
-			OffsetNumber offno;
-			OffsetNumber maxoffno;
-			OffsetNumber deletable[MaxOffsetNumber];
-			int			ndeletable = 0;
 
-			vacuum_delay_point();
-
-			buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-											 info->strategy);
-			page = BufferGetPage(buf);
-			opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-			Assert(opaque->hasho_bucket == cur_bucket);
-
-			/* Scan each tuple in page */
-			maxoffno = PageGetMaxOffsetNumber(page);
-			for (offno = FirstOffsetNumber;
-				 offno <= maxoffno;
-				 offno = OffsetNumberNext(offno))
-			{
-				IndexTuple	itup;
-				ItemPointer htup;
+		/*
+		 * We need to acquire a cleanup lock on the primary bucket to wait out
+		 * concurrent scans.
+		 */
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy);
+		LockBufferForCleanup(buf);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
 
-				itup = (IndexTuple) PageGetItem(page,
-												PageGetItemId(page, offno));
-				htup = &(itup->t_tid);
-				if (callback(htup, callback_state))
-				{
-					/* mark the item for deletion */
-					deletable[ndeletable++] = offno;
-					tuples_removed += 1;
-				}
-				else
-					num_index_tuples += 1;
-			}
+		page = BufferGetPage(buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
-			/*
-			 * Apply deletions and write page if needed, advance to next page.
-			 */
-			blkno = opaque->hasho_nextblkno;
+		/*
+		 * If the bucket contains tuples that are moved by split, then we need
+		 * to delete such tuples on completion of the split.  Before cleaning,
+		 * we need to wait out the scans that started while the split was in
+		 * progress for the bucket.
+		 */
+		if (H_HAS_GARBAGE(bucket_opaque) &&
+			!H_INCOMPLETE_SPLIT(bucket_opaque))
+			bucket_has_garbage = true;
 
-			if (ndeletable > 0)
-			{
-				PageIndexMultiDelete(page, deletable, ndeletable);
-				_hash_wrtbuf(rel, buf);
-				bucket_dirty = true;
-			}
-			else
-				_hash_relbuf(rel, buf);
-		}
+		bucket_buf = buf;
 
-		/* If we deleted anything, try to compact free space */
-		if (bucket_dirty)
-			_hash_squeezebucket(rel, cur_bucket, bucket_blkno,
-								info->strategy);
+		hashbucketcleanup(rel, bucket_buf, blkno, info->strategy,
+						  local_metapage.hashm_maxbucket,
+						  local_metapage.hashm_highmask,
+						  local_metapage.hashm_lowmask, &tuples_removed,
+						  &num_index_tuples, bucket_has_garbage, true,
+						  callback, callback_state);
 
-		/* Release bucket lock */
-		_hash_droplock(rel, bucket_blkno, HASH_EXCLUSIVE);
+		_hash_relbuf(rel, bucket_buf);
 
 		/* Advance to next bucket */
 		cur_bucket++;
@@ -703,6 +663,197 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	return stats;
 }
 
+/*
+ * Helper function to perform deletion of index entries from a bucket.
+ *
+ * This expects that the caller has acquired a cleanup lock on the target
+ * bucket (primary page of a bucket) and it is the caller's responsibility to
+ * release that lock.
+ *
+ * During the scan of overflow pages, we first lock the next page and only
+ * then release the lock on the current page.  This ensures that any
+ * concurrent scan started after we begin cleaning the bucket will always
+ * stay behind the cleanup.  Allowing a scan to get ahead of the cleanup
+ * would let the cleanup remove tuples that the scan still needs.
+ *
+ * We need to retain a pin on the primary bucket to ensure that no concurrent
+ * split can start.
+ */
+void
+hashbucketcleanup(Relation rel, Buffer bucket_buf,
+				  BlockNumber bucket_blkno,
+				  BufferAccessStrategy bstrategy,
+				  uint32 maxbucket,
+				  uint32 highmask, uint32 lowmask,
+				  double *tuples_removed,
+				  double *num_index_tuples,
+				  bool bucket_has_garbage,
+				  bool delay,
+				  IndexBulkDeleteCallback callback,
+				  void *callback_state)
+{
+	BlockNumber blkno;
+	Buffer		buf;
+	Bucket		cur_bucket;
+	Bucket new_bucket PG_USED_FOR_ASSERTS_ONLY;
+	Page		page;
+	bool		bucket_dirty = false;
+
+	blkno = bucket_blkno;
+	buf = bucket_buf;
+	page = BufferGetPage(buf);
+	cur_bucket = ((HashPageOpaque) PageGetSpecialPointer(page))->hasho_bucket;
+
+	if (bucket_has_garbage)
+		new_bucket = _hash_get_newbucket(rel, cur_bucket,
+										 lowmask, maxbucket);
+
+	/* Scan each page in bucket */
+	for (;;)
+	{
+		HashPageOpaque opaque;
+		OffsetNumber offno;
+		OffsetNumber maxoffno;
+		Buffer		next_buf;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+		bool		retain_pin = false;
+		bool		curr_page_dirty = false;
+
+		if (delay)
+			vacuum_delay_point();
+
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		/* Scan each tuple in page */
+		maxoffno = PageGetMaxOffsetNumber(page);
+		for (offno = FirstOffsetNumber;
+			 offno <= maxoffno;
+			 offno = OffsetNumberNext(offno))
+		{
+			IndexTuple	itup;
+			ItemPointer htup;
+			Bucket		bucket;
+
+			itup = (IndexTuple) PageGetItem(page,
+											PageGetItemId(page, offno));
+			htup = &(itup->t_tid);
+			if (callback && callback(htup, callback_state))
+			{
+				/* mark the item for deletion */
+				deletable[ndeletable++] = offno;
+				if (tuples_removed)
+					*tuples_removed += 1;
+			}
+			else if (bucket_has_garbage)
+			{
+				/* delete the tuples that are moved by split. */
+				bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+											  maxbucket,
+											  highmask,
+											  lowmask);
+				/* mark the item for deletion */
+				if (bucket != cur_bucket)
+				{
+					/*
+					 * We expect tuples to belong either to the current bucket
+					 * or to new_bucket.  This is ensured because we don't
+					 * allow further splits from a bucket that contains
+					 * garbage.  See comments in _hash_expandtable.
+					 */
+					Assert(bucket == new_bucket);
+					deletable[ndeletable++] = offno;
+				}
+				else if (num_index_tuples)
+					*num_index_tuples += 1;
+			}
+			else if (num_index_tuples)
+				*num_index_tuples += 1;
+		}
+
+		/* retain the pin on primary bucket till end of bucket scan */
+		if (blkno == bucket_blkno)
+			retain_pin = true;
+		else
+			retain_pin = false;
+
+		blkno = opaque->hasho_nextblkno;
+
+		/*
+		 * Apply deletions and write page if needed, advance to next page.
+		 */
+		if (ndeletable > 0)
+		{
+			PageIndexMultiDelete(page, deletable, ndeletable);
+			bucket_dirty = true;
+			curr_page_dirty = true;
+		}
+
+		/* bail out if there are no more pages to scan. */
+		if (!BlockNumberIsValid(blkno))
+			break;
+
+		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											  LH_OVERFLOW_PAGE,
+											  bstrategy);
+
+		/*
+		 * release the lock on previous page after acquiring the lock on next
+		 * page
+		 */
+		if (curr_page_dirty)
+		{
+			if (retain_pin)
+				_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+			else
+				_hash_wrtbuf(rel, buf);
+			curr_page_dirty = false;
+		}
+		else if (retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
+
+		buf = next_buf;
+	}
+
+	/*
+	 * Lock the bucket page to clear the garbage flag and squeeze the bucket.
+	 * If the current buffer is the same as the bucket buffer, then we already
+	 * have a lock on the bucket page.
+	 */
+	if (buf != bucket_buf)
+	{
+		_hash_relbuf(rel, buf);
+		_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+	}
+
+	/*
+	 * Clear the garbage flag from the bucket after deleting the tuples that
+	 * were moved by split.  We purposely clear the flag before squeezing the
+	 * bucket, so that after a restart, vacuum doesn't again try to delete the
+	 * moved-by-split tuples.
+	 */
+	if (bucket_has_garbage)
+	{
+		HashPageOpaque bucket_opaque;
+
+		page = BufferGetPage(bucket_buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+	}
+
+	/*
+	 * If we deleted anything, try to compact free space.  For squeezing the
+	 * bucket, we must have a cleanup lock, else it could disturb the ordering
+	 * of tuples for a scan that started before it.
+	 */
+	if (bucket_dirty && CheckBufferForCleanup(bucket_buf))
+		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
+							bstrategy);
+}
 
 void
 hash_redo(XLogReaderState *record)
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index acd2e64..b1e79b5 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -28,7 +28,8 @@
 void
 _hash_doinsert(Relation rel, IndexTuple itup)
 {
-	Buffer		buf;
+	Buffer		buf = InvalidBuffer;
+	Buffer		bucket_buf;
 	Buffer		metabuf;
 	HashMetaPage metap;
 	BlockNumber blkno;
@@ -40,6 +41,9 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	bool		do_expand;
 	uint32		hashkey;
 	Bucket		bucket;
+	uint32		maxbucket;
+	uint32		highmask;
+	uint32		lowmask;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -70,51 +74,131 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			errhint("Values larger than a buffer page cannot be indexed.")));
 
 	/*
-	 * Loop until we get a lock on the correct target bucket.
+	 * Copy bucket mapping info now;  the comment in _hash_expandtable where
+	 * we copy this information and call _hash_splitbucket explains why this
+	 * is OK.
 	 */
-	for (;;)
-	{
-		/*
-		 * Compute the target bucket number, and convert to block number.
-		 */
-		bucket = _hash_hashkey2bucket(hashkey,
-									  metap->hashm_maxbucket,
-									  metap->hashm_highmask,
-									  metap->hashm_lowmask);
+	maxbucket = metap->hashm_maxbucket;
+	highmask = metap->hashm_highmask;
+	lowmask = metap->hashm_lowmask;
 
-		blkno = BUCKET_TO_BLKNO(metap, bucket);
+	/*
+	 * Conditionally get the lock on the primary bucket page for insertion
+	 * while holding the lock on the meta page.  If we have to wait, release
+	 * the meta page lock and retry the hard way.
+	 */
+	bucket = _hash_hashkey2bucket(hashkey,
+								  maxbucket,
+								  highmask,
+								  lowmask);
+
+	blkno = BUCKET_TO_BLKNO(metap, bucket);
 
-		/* Release metapage lock, but keep pin. */
+	/* Fetch the primary bucket page for the bucket */
+	buf = ReadBuffer(rel, blkno);
+	if (!ConditionalLockBuffer(buf))
+	{
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(buf, HASH_WRITE);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+		oldblkno = blkno;
+		retry = true;
+	}
+	else
+	{
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
 		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+	}
 
+	if (retry)
+	{
 		/*
-		 * If the previous iteration of this loop locked what is still the
-		 * correct target bucket, we are done.  Otherwise, drop any old lock
-		 * and lock what now appears to be the correct bucket.
+		 * Loop until we get a lock on the correct target bucket.  We get the
+		 * lock on the primary bucket page and retain the pin on it during the
+		 * insert operation to prevent concurrent splits.  Retaining a pin on
+		 * the primary bucket page ensures that a split can't happen, as the
+		 * split needs to acquire a cleanup lock on the primary bucket page.
+		 * Acquiring the lock on the primary bucket and rechecking that it is
+		 * the target bucket is mandatory, as otherwise a concurrent split
+		 * might cause this insertion to land in the wrong bucket.
 		 */
-		if (retry)
+		for (;;)
 		{
+			/*
+			 * Compute the target bucket number, and convert to block number.
+			 */
+			bucket = _hash_hashkey2bucket(hashkey,
+										  metap->hashm_maxbucket,
+										  metap->hashm_highmask,
+										  metap->hashm_lowmask);
+
+			blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+			/* Release metapage lock, but keep pin. */
+			_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+			/*
+			 * If the previous iteration of this loop locked what is still the
+			 * correct target bucket, we are done.  Otherwise, drop any old
+			 * lock and lock what now appears to be the correct bucket.
+			 */
 			if (oldblkno == blkno)
 				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
-		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+			_hash_relbuf(rel, buf);
 
-		/*
-		 * Reacquire metapage lock and check that no bucket split has taken
-		 * place while we were awaiting the bucket lock.
-		 */
-		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
-		oldblkno = blkno;
-		retry = true;
+			/* Fetch the primary bucket page for the bucket */
+			buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+
+			/*
+			 * Reacquire metapage lock and check that no bucket split has
+			 * taken place while we were awaiting the bucket lock.
+			 */
+			_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+			oldblkno = blkno;
+		}
 	}
 
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+	/* remember the primary bucket buffer to release the pin on it at end. */
+	bucket_buf = buf;
+
 	page = BufferGetPage(buf);
 	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(pageopaque->hasho_bucket == bucket);
 
+	/*
+	 * If there is any pending split, try to finish it before proceeding with
+	 * the insertion.  We try to finish the split for an insertion into the
+	 * old bucket, as that will allow us to remove the tuples from the old
+	 * bucket and reuse the space.  There is no such apparent benefit from
+	 * finishing the split during insertion into the new bucket.
+	 *
+	 * In future, if we want to finish the splits during insertion into the
+	 * new bucket, we must ensure the locking order such that the old bucket
+	 * is locked before the new bucket.
+	 */
+	if (H_OLD_INCOMPLETE_SPLIT(pageopaque) && CheckBufferForCleanup(buf))
+	{
+		BlockNumber nblkno;
+		Buffer		nbuf;
+
+		nblkno = _hash_get_newblk(rel, pageopaque);
+
+		/* Fetch the primary bucket page for the new bucket */
+		nbuf = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+		if (nbuf)
+		{
+			_hash_finish_split(rel, metabuf, buf, nbuf, maxbucket,
+							   highmask, lowmask);
+
+			/*
+			 * release the buffer here as the insertion will happen in old
+			 * bucket.
+			 */
+			_hash_relbuf(rel, nbuf);
+		}
+	}
+
 	/* Do the insertion */
 	while (PageGetFreeSpace(page) < itemsz)
 	{
@@ -127,14 +211,23 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		{
 			/*
 			 * ovfl page exists; go get it.  if it doesn't have room, we'll
-			 * find out next pass through the loop test above.
+			 * find out next pass through the loop test above.  Retain the pin
+			 * if it is a primary bucket.
 			 */
-			_hash_relbuf(rel, buf);
+			if (pageopaque->hasho_flag & LH_BUCKET_PAGE)
+				_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+			else
+				_hash_relbuf(rel, buf);
 			buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 			page = BufferGetPage(buf);
 		}
 		else
 		{
+			bool		retain_pin = false;
+
+			/* page flags must be accessed before releasing lock on a page. */
+			retain_pin = pageopaque->hasho_flag & LH_BUCKET_PAGE;
+
 			/*
 			 * we're at the end of the bucket chain and we haven't found a
 			 * page with enough room.  allocate a new overflow page.
@@ -144,7 +237,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 
 			/* chain to a new overflow page */
-			buf = _hash_addovflpage(rel, metabuf, buf);
+			buf = _hash_addovflpage(rel, metabuf, buf, retain_pin);
 			page = BufferGetPage(buf);
 
 			/* should fit now, given test above */
@@ -158,11 +251,13 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	/* found page with enough space, so add the item here */
 	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
 
-	/* write and release the modified page */
+	/*
+	 * Write and release the modified page, and also release the pin on the
+	 * primary bucket page if it is different from the page just written.
+	 */
 	_hash_wrtbuf(rel, buf);
-
-	/* We can drop the bucket lock now */
-	_hash_droplock(rel, blkno, HASH_SHARE);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
 
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index db3e268..760563a 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -82,23 +82,20 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *
  *	On entry, the caller must hold a pin but no lock on 'buf'.  The pin is
  *	dropped before exiting (we assume the caller is not interested in 'buf'
- *	anymore).  The returned overflow page will be pinned and write-locked;
- *	it is guaranteed to be empty.
+ *	anymore) if not asked to retain.  The pin will be retained only for the
+ *	primary bucket.  The returned overflow page will be pinned and
+ *	write-locked; it is guaranteed to be empty.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
  *
- *	The caller must hold at least share lock on the bucket, to ensure that
- *	no one else tries to compact the bucket meanwhile.  This guarantees that
- *	'buf' won't stop being part of the bucket while it's unlocked.
- *
  * NB: since this could be executed concurrently by multiple processes,
  * one should not assume that the returned overflow page will be the
  * immediate successor of the originally passed 'buf'.  Additional overflow
  * pages might have been added to the bucket chain in between.
  */
 Buffer
-_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
+_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 {
 	Buffer		ovflbuf;
 	Page		page;
@@ -131,7 +128,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 			break;
 
 		/* we assume we do not need to write the unmodified page */
-		_hash_relbuf(rel, buf);
+		if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
 
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
@@ -149,7 +149,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 
 	/* logically chain overflow page to previous page */
 	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	_hash_wrtbuf(rel, buf);
+	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+		_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_wrtbuf(rel, buf);
 
 	return ovflbuf;
 }
@@ -370,11 +373,11 @@ _hash_firstfreebit(uint32 map)
  *	in the bucket, or InvalidBlockNumber if no following page.
  *
  *	NB: caller must not hold lock on metapage, nor on either page that's
- *	adjacent in the bucket chain.  The caller had better hold exclusive lock
- *	on the bucket, too.
+ *	adjacent in the bucket chain except from primary bucket.  The caller had
+ *	better hold cleanup lock on the primary bucket.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf,
+_hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
@@ -413,22 +416,41 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf,
 	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
-	 * deleted.  No concurrency issues since we hold exclusive lock on the
-	 * entire bucket.
+	 * deleted.  No concurrency issues since we hold the cleanup lock on the
+	 * primary bucket.  We don't need to acquire a buffer lock to fix the
+	 * primary bucket, as we already have that lock.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
-														 prevblkno,
-														 HASH_WRITE,
+		if (prevblkno == bucket_blkno)
+		{
+			Buffer		prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
+													 prevblkno,
+													 RBM_NORMAL,
+													 bstrategy);
+
+			Page		prevpage = BufferGetPage(prevbuf);
+			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+			Assert(prevopaque->hasho_bucket == bucket);
+			prevopaque->hasho_nextblkno = nextblkno;
+			MarkBufferDirty(prevbuf);
+			ReleaseBuffer(prevbuf);
+		}
+		else
+		{
+			Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
+															 prevblkno,
+															 HASH_WRITE,
 										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		prevpage = BufferGetPage(prevbuf);
-		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+															 bstrategy);
+			Page		prevpage = BufferGetPage(prevbuf);
+			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
 
-		Assert(prevopaque->hasho_bucket == bucket);
-		prevopaque->hasho_nextblkno = nextblkno;
-		_hash_wrtbuf(rel, prevbuf);
+			Assert(prevopaque->hasho_bucket == bucket);
+			prevopaque->hasho_nextblkno = nextblkno;
+			_hash_wrtbuf(rel, prevbuf);
+		}
 	}
 	if (BlockNumberIsValid(nextblkno))
 	{
@@ -570,7 +592,7 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
  *	required that to be true on entry as well, but it's a lot easier for
  *	callers to leave empty overflow pages and let this guy clean it up.
  *
- *	Caller must hold exclusive lock on the target bucket.  This allows
+ *	Caller must hold cleanup lock on the target bucket.  This allows
  *	us to safely lock multiple pages in the bucket.
  *
  *	Since this function is invoked in VACUUM, we provide an access strategy
@@ -580,6 +602,7 @@ void
 _hash_squeezebucket(Relation rel,
 					Bucket bucket,
 					BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy)
 {
 	BlockNumber wblkno;
@@ -591,27 +614,22 @@ _hash_squeezebucket(Relation rel,
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
 	bool		wbuf_dirty;
+	bool		release_buf = false;
 
 	/*
 	 * start squeezing into the base bucket page.
 	 */
 	wblkno = bucket_blkno;
-	wbuf = _hash_getbuf_with_strategy(rel,
-									  wblkno,
-									  HASH_WRITE,
-									  LH_BUCKET_PAGE,
-									  bstrategy);
+	wbuf = bucket_buf;
 	wpage = BufferGetPage(wbuf);
 	wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 
 	/*
-	 * if there aren't any overflow pages, there's nothing to squeeze.
+	 * if there aren't any overflow pages, there's nothing to squeeze.  The
+	 * caller is responsible for releasing the lock on the primary bucket.
 	 */
 	if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
-	{
-		_hash_relbuf(rel, wbuf);
 		return;
-	}
 
 	/*
 	 * Find the last page in the bucket chain by starting at the base bucket
@@ -669,12 +687,17 @@ _hash_squeezebucket(Relation rel,
 			{
 				Assert(!PageIsEmpty(wpage));
 
+				if (wblkno != bucket_blkno)
+					release_buf = true;
+
 				wblkno = wopaque->hasho_nextblkno;
 				Assert(BlockNumberIsValid(wblkno));
 
-				if (wbuf_dirty)
+				if (wbuf_dirty && release_buf)
 					_hash_wrtbuf(rel, wbuf);
-				else
+				else if (wbuf_dirty)
+					MarkBufferDirty(wbuf);
+				else if (release_buf)
 					_hash_relbuf(rel, wbuf);
 
 				/* nothing more to do if we reached the read page */
@@ -700,6 +723,7 @@ _hash_squeezebucket(Relation rel,
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
 				wbuf_dirty = false;
+				release_buf = false;
 			}
 
 			/*
@@ -733,19 +757,25 @@ _hash_squeezebucket(Relation rel,
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
 		{
-			/* yes, so release wbuf lock first */
-			if (wbuf_dirty)
+			if (wblkno != bucket_blkno)
+				release_buf = true;
+
+			/* yes, so release wbuf lock first if needed */
+			if (wbuf_dirty && release_buf)
 				_hash_wrtbuf(rel, wbuf);
-			else
+			else if (wbuf_dirty)
+				MarkBufferDirty(wbuf);
+			else if (release_buf)
 				_hash_relbuf(rel, wbuf);
+
 			/* free this overflow page (releases rbuf) */
-			_hash_freeovflpage(rel, rbuf, bstrategy);
+			_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
 			/* done */
 			return;
 		}
 
 		/* free this overflow page, then get the previous one */
-		_hash_freeovflpage(rel, rbuf, bstrategy);
+		_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
 
 		rbuf = _hash_getbuf_with_strategy(rel,
 										  rblkno,
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 178463f..bb43aaa 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -38,10 +38,14 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
 					uint32 nblocks);
 static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
+static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
+					   Bucket obucket, Bucket nbucket, Buffer obuf,
+					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
+					   uint32 highmask, uint32 lowmask);
 
 
 /*
@@ -55,46 +59,6 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 
 
 /*
- * _hash_getlock() -- Acquire an lmgr lock.
- *
- * 'whichlock' should the block number of a bucket's primary bucket page to
- * acquire the per-bucket lock.  (See README for details of the use of these
- * locks.)
- *
- * 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
- */
-void
-_hash_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		LockPage(rel, whichlock, access);
-}
-
-/*
- * _hash_try_getlock() -- Acquire an lmgr lock, but only if it's free.
- *
- * Same as above except we return FALSE without blocking if lock isn't free.
- */
-bool
-_hash_try_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		return ConditionalLockPage(rel, whichlock, access);
-	else
-		return true;
-}
-
-/*
- * _hash_droplock() -- Release an lmgr lock.
- */
-void
-_hash_droplock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		UnlockPage(rel, whichlock, access);
-}
-
-/*
  *	_hash_getbuf() -- Get a buffer by block number for read or write.
  *
  *		'access' must be HASH_READ, HASH_WRITE, or HASH_NOLOCK.
@@ -132,6 +96,35 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access, int flags)
 }
 
 /*
+ * _hash_getbuf_with_condlock_cleanup() -- as above, but get the buffer for write.
+ *
+ *		We try to take the conditional cleanup lock and if we get it then
+ *		return the buffer, else return InvalidBuffer.
+ */
+Buffer
+_hash_getbuf_with_condlock_cleanup(Relation rel, BlockNumber blkno, int flags)
+{
+	Buffer		buf;
+
+	if (blkno == P_NEW)
+		elog(ERROR, "hash AM does not use P_NEW");
+
+	buf = ReadBuffer(rel, blkno);
+
+	if (!ConditionalLockBufferForCleanup(buf))
+	{
+		ReleaseBuffer(buf);
+		return InvalidBuffer;
+	}
+
+	/* ref count and lock type are correct */
+
+	_hash_checkpage(rel, buf, flags);
+
+	return buf;
+}
+
+/*
  *	_hash_getinitbuf() -- Get and initialize a buffer by block number.
  *
  *		This must be used only to fetch pages that are known to be before
@@ -266,6 +259,33 @@ _hash_dropbuf(Relation rel, Buffer buf)
 }
 
 /*
+ *	_hash_dropscanbuf() -- release buffers used in scan.
+ *
+ * This routine unpins the buffers used during scan on which we
+ * hold no lock.
+ */
+void
+_hash_dropscanbuf(Relation rel, HashScanOpaque so)
+{
+	/* release pin we hold on primary bucket */
+	if (BufferIsValid(so->hashso_bucket_buf) &&
+		so->hashso_bucket_buf != so->hashso_curbuf)
+		_hash_dropbuf(rel, so->hashso_bucket_buf);
+	so->hashso_bucket_buf = InvalidBuffer;
+
+	/* release pin we hold on old primary bucket */
+	if (BufferIsValid(so->hashso_old_bucket_buf) &&
+		so->hashso_old_bucket_buf != so->hashso_curbuf)
+		_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+	so->hashso_old_bucket_buf = InvalidBuffer;
+
+	/* release any pin we still hold */
+	if (BufferIsValid(so->hashso_curbuf))
+		_hash_dropbuf(rel, so->hashso_curbuf);
+	so->hashso_curbuf = InvalidBuffer;
+}
+
+/*
  *	_hash_wrtbuf() -- write a hash page to disk.
  *
  *		This routine releases the lock held on the buffer and our refcount
@@ -489,9 +509,11 @@ _hash_pageinit(Page page, Size size)
 /*
  * Attempt to expand the hash table by creating one new bucket.
  *
- * This will silently do nothing if it cannot get the needed locks.
+ * This will silently do nothing if there are active scans in our own
+ * backend or if we can't get a cleanup lock on the old or new bucket.
  *
- * The caller should hold no locks on the hash index.
+ * Complete any pending split, and remove tuples from the old bucket if any
+ * are left over from a previous split.
  *
  * The caller must hold a pin, but no lock, on the metapage buffer.
  * The buffer is returned in the same state.
@@ -506,10 +528,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	BlockNumber start_oblkno;
 	BlockNumber start_nblkno;
 	Buffer		buf_nblkno;
+	Buffer		buf_oblkno;
+	Page		opage;
+	HashPageOpaque oopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
 
+restart_expand:
+
 	/*
 	 * Write-lock the meta page.  It used to be necessary to acquire a
 	 * heavyweight lock to begin a split, but that is no longer required.
@@ -548,11 +575,16 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		goto fail;
 
 	/*
-	 * Determine which bucket is to be split, and attempt to lock the old
-	 * bucket.  If we can't get the lock, give up.
+	 * Determine which bucket is to be split, and attempt to take cleanup lock
+	 * on the old bucket.  If we can't get the lock, give up.
 	 *
-	 * The lock protects us against other backends, but not against our own
-	 * backend.  Must check for active scans separately.
+	 * The cleanup lock protects us against other backends, but not against
+	 * our own backend.  Must check for active scans separately.
+	 *
+	 * The cleanup lock is mainly to protect the split from concurrent
+	 * inserts. See src/backend/access/hash/README, Lock Definitions for
+	 * further details.  Due to this locking restriction, if there is any
+	 * pending scan, the split will give up, which is not good but harmless.
 	 */
 	new_bucket = metap->hashm_maxbucket + 1;
 
@@ -563,11 +595,90 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	if (_hash_has_active_scan(rel, old_bucket))
 		goto fail;
 
-	if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
+	buf_oblkno = _hash_getbuf_with_condlock_cleanup(rel, start_oblkno, LH_BUCKET_PAGE);
+	if (!buf_oblkno)
 		goto fail;
 
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
 	/*
-	 * Likewise lock the new bucket (should never fail).
+	 * We want to finish any pending split from this bucket, as there is no
+	 * apparent benefit in deferring it, and handling splits that involve
+	 * multiple buckets (in case the new split also fails) would complicate
+	 * the code.  We don't need to consider the new bucket for completing the
+	 * split here, as a re-split of the new bucket cannot start while there is
+	 * still a pending split from the old bucket.
+	 */
+	if (H_OLD_INCOMPLETE_SPLIT(oopaque))
+	{
+		BlockNumber nblkno;
+		Buffer		buf_nblkno;
+
+		/*
+		 * Copy bucket mapping info now;  the comment in the code below where
+		 * we copy this information and call _hash_splitbucket explains why this
+		 * is OK.
+		 */
+		maxbucket = metap->hashm_maxbucket;
+		highmask = metap->hashm_highmask;
+		lowmask = metap->hashm_lowmask;
+
+		/* Release the metapage lock, before completing the split. */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+		nblkno = _hash_get_newblk(rel, oopaque);
+
+		/* Fetch the primary bucket page for the new bucket */
+		buf_nblkno = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+		if (!buf_nblkno)
+		{
+			_hash_relbuf(rel, buf_oblkno);
+			goto fail;
+		}
+
+		_hash_finish_split(rel, metabuf, buf_oblkno, buf_nblkno, maxbucket,
+						   highmask, lowmask);
+
+		/*
+		 * release the buffers and retry for expand.
+		 */
+		_hash_relbuf(rel, buf_oblkno);
+		_hash_relbuf(rel, buf_nblkno);
+
+		goto restart_expand;
+	}
+
+	/*
+	 * Clean up tuples remaining from a previous split.  This operation
+	 * requires a cleanup lock, and we already have one on the old bucket, so
+	 * let's do it.  We also don't want to allow further splits from the
+	 * bucket until the garbage of the previous split is cleaned.  This has
+	 * two advantages: first, it helps to avoid bloat due to garbage; and
+	 * second, during cleanup of the bucket, we are always sure that the
+	 * garbage tuples belong to the most recently split bucket.  On the
+	 * contrary, if we allowed cleanup of the bucket after the meta page is
+	 * updated to indicate the new split and before the actual split, the
+	 * cleanup operation wouldn't be able to decide whether a tuple has been
+	 * moved to the newly created bucket and could end up deleting such tuples.
+	 */
+	if (H_HAS_GARBAGE(oopaque))
+	{
+		/* Release the metapage lock. */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+		hashbucketcleanup(rel, buf_oblkno, start_oblkno, NULL,
+						  metap->hashm_maxbucket, metap->hashm_highmask,
+						  metap->hashm_lowmask, NULL,
+						  NULL, true, false, NULL, NULL);
+
+		_hash_relbuf(rel, buf_oblkno);
+
+		goto restart_expand;
+	}
+
+	/*
+	 * There shouldn't be any active scan on new bucket.
 	 *
 	 * Note: it is safe to compute the new bucket's blkno here, even though we
 	 * may still need to update the BUCKET_TO_BLKNO mapping.  This is because
@@ -579,9 +690,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	if (_hash_has_active_scan(rel, new_bucket))
 		elog(ERROR, "scan in progress on supposedly new bucket");
 
-	if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
-		elog(ERROR, "could not get lock on supposedly new bucket");
-
 	/*
 	 * If the split point is increasing (hashm_maxbucket's log base 2
 	 * increases), we need to allocate a new batch of bucket pages.
@@ -600,8 +708,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
 			/* can't split due to BlockNumber overflow */
-			_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-			_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
+			_hash_relbuf(rel, buf_oblkno);
 			goto fail;
 		}
 	}
@@ -609,9 +716,18 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/*
 	 * Physically allocate the new bucket's primary page.  We want to do this
 	 * before changing the metapage's mapping info, in case we can't get the
-	 * disk space.
+	 * disk space.  Ideally, we don't need to check for a cleanup lock on the
+	 * new bucket, as no other backend can find this bucket until the meta
+	 * page is updated.  However, it is good to be consistent with the old
+	 * bucket locking.
 	 */
 	buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM);
+	if (!CheckBufferForCleanup(buf_nblkno))
+	{
+		_hash_relbuf(rel, buf_oblkno);
+		_hash_relbuf(rel, buf_nblkno);
+		goto fail;
+	}
+
 
 	/*
 	 * Okay to proceed with split.  Update the metapage bucket mapping info.
@@ -665,13 +781,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  start_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno,
 					  maxbucket, highmask, lowmask);
 
-	/* Release bucket locks, allowing others to access them */
-	_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-	_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
-
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -745,6 +857,10 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
  * The buffer is returned in the same state.  (The metapage is only
  * touched if it becomes necessary to add or remove overflow pages.)
  *
+ * Split needs to hold a pin on the primary bucket pages of both old and new
+ * buckets till the end of the operation.  This is to prevent vacuum from
+ * starting while a split is in progress.
+ *
  * In addition, the caller must have created the new bucket's base page,
  * which is passed in buffer nbuf, pinned and write-locked.  That lock and
  * pin are released here.  (The API is set up this way because we must do
@@ -756,37 +872,87 @@ _hash_splitbucket(Relation rel,
 				  Buffer metabuf,
 				  Bucket obucket,
 				  Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Buffer		obuf;
 	Page		opage;
 	Page		npage;
 	HashPageOpaque oopaque;
 	HashPageOpaque nopaque;
 
-	/*
-	 * It should be okay to simultaneously write-lock pages from each bucket,
-	 * since no one else can be trying to acquire buffer lock on pages of
-	 * either bucket.
-	 */
-	obuf = _hash_getbuf(rel, start_oblkno, HASH_WRITE, LH_BUCKET_PAGE);
 	opage = BufferGetPage(obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 
+	/*
+	 * Mark the old bucket to indicate that a split is in progress and that it
+	 * has deletable tuples.  At the end of the operation we clear the
+	 * split-in-progress flag, and vacuum will clear the page_has_garbage flag
+	 * after deleting such tuples.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
+
 	npage = BufferGetPage(nbuf);
 
-	/* initialize the new bucket's primary page */
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 	nopaque->hasho_prevblkno = InvalidBlockNumber;
 	nopaque->hasho_nextblkno = InvalidBlockNumber;
 	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
 	nopaque->hasho_page_id = HASHO_PAGE_ID;
 
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, nbuf, NULL,
+						   maxbucket, highmask, lowmask);
+
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, obuf);
+	_hash_relbuf(rel, nbuf);
+}
+
+/*
+ * _hash_splitbucket_guts -- Helper function to perform the split operation
+ *
+ * This routine is used to partition the tuples between the old and new bucket
+ * and to finish incomplete split operations.  To finish a previously
+ * interrupted split operation, the caller needs to fill htab.  If htab is set,
+ * we skip moving tuples that exist in htab; otherwise, a NULL value of htab
+ * indicates movement of all the tuples that belong to the new bucket.
+ *
+ * Caller needs to lock and unlock the old and new primary buckets.
+ */
+static void
+_hash_splitbucket_guts(Relation rel,
+					   Buffer metabuf,
+					   Bucket obucket,
+					   Bucket nbucket,
+					   Buffer obuf,
+					   Buffer nbuf,
+					   HTAB *htab,
+					   uint32 maxbucket,
+					   uint32 highmask,
+					   uint32 lowmask)
+{
+	Buffer		bucket_obuf;
+	Buffer		bucket_nbuf;
+	Page		opage;
+	Page		npage;
+	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
+
+	bucket_obuf = obuf;
+	opage = BufferGetPage(obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	bucket_nbuf = nbuf;
+	npage = BufferGetPage(nbuf);
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
 	/*
 	 * Partition the tuples in the old bucket between the old bucket and the
 	 * new bucket, advancing along the old bucket's overflow bucket chain and
@@ -798,8 +964,6 @@ _hash_splitbucket(Relation rel,
 		BlockNumber oblkno;
 		OffsetNumber ooffnum;
 		OffsetNumber omaxoffnum;
-		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
 
 		/* Scan each tuple in old page */
 		omaxoffnum = PageGetMaxOffsetNumber(opage);
@@ -810,18 +974,45 @@ _hash_splitbucket(Relation rel,
 			IndexTuple	itup;
 			Size		itemsz;
 			Bucket		bucket;
+			bool		found = false;
 
 			/*
-			 * Fetch the item's hash key (conveniently stored in the item) and
-			 * determine which bucket it now belongs in.
+			 * Before inserting the tuple, probe the hash table containing TIDs
+			 * of tuples belonging to the new bucket; if we find a match, skip
+			 * that tuple, else fetch the item's hash key (conveniently stored
+			 * in the item) and determine which bucket it now belongs in.
 			 */
 			itup = (IndexTuple) PageGetItem(opage,
 											PageGetItemId(opage, ooffnum));
+
+			if (htab)
+				(void) hash_search(htab, &itup->t_tid, HASH_FIND, &found);
+
+			if (found)
+				continue;
+
 			bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
 										  maxbucket, highmask, lowmask);
 
 			if (bucket == nbucket)
 			{
+				Size		itupsize = 0;
+				IndexTuple	new_itup;
+
+				/*
+				 * make a copy of index tuple as we have to scribble on it.
+				 */
+				new_itup = CopyIndexTuple(itup);
+
+				/*
+				 * mark the index tuple as moved by split; such tuples are
+				 * skipped by scans while a split is in progress for the bucket.
+				 */
+				itupsize = new_itup->t_info & INDEX_SIZE_MASK;
+				new_itup->t_info &= ~INDEX_SIZE_MASK;
+				new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
+				new_itup->t_info |= itupsize;
+
 				/*
 				 * insert the tuple into the new bucket.  if it doesn't fit on
 				 * the current page in the new bucket, we must allocate a new
@@ -832,17 +1023,25 @@ _hash_splitbucket(Relation rel,
 				 * only partially complete, meaning the index is corrupt,
 				 * since searches may fail to find entries they should find.
 				 */
-				itemsz = IndexTupleDSize(*itup);
+				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
 				if (PageGetFreeSpace(npage) < itemsz)
 				{
+					bool		retain_pin = false;
+
+					/*
+					 * page flags must be accessed before releasing lock on a
+					 * page.
+					 */
+					retain_pin = nopaque->hasho_flag & LH_BUCKET_PAGE;
+
 					/* write out nbuf and drop lock, but keep pin */
 					_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
 					/* chain to a new overflow page */
-					nbuf = _hash_addovflpage(rel, metabuf, nbuf);
+					nbuf = _hash_addovflpage(rel, metabuf, nbuf, retain_pin);
 					npage = BufferGetPage(nbuf);
-					/* we don't need nopaque within the loop */
+					nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 				}
 
 				/*
@@ -852,12 +1051,10 @@ _hash_splitbucket(Relation rel,
 				 * Possible future improvement: accumulate all the items for
 				 * the new page and qsort them before insertion.
 				 */
-				(void) _hash_pgaddtup(rel, nbuf, itemsz, itup);
+				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
-				/*
-				 * Mark tuple for deletion from old page.
-				 */
-				deletable[ndeletable++] = ooffnum;
+				/* be tidy */
+				pfree(new_itup);
 			}
 			else
 			{
@@ -870,15 +1067,9 @@ _hash_splitbucket(Relation rel,
 
 		oblkno = oopaque->hasho_nextblkno;
 
-		/*
-		 * Done scanning this old page.  If we moved any tuples, delete them
-		 * from the old page.
-		 */
-		if (ndeletable > 0)
-		{
-			PageIndexMultiDelete(opage, deletable, ndeletable);
-			_hash_wrtbuf(rel, obuf);
-		}
+		/* retain the pin on the old primary bucket */
+		if (obuf == bucket_obuf)
+			_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
 		else
 			_hash_relbuf(rel, obuf);
 
@@ -887,18 +1078,153 @@ _hash_splitbucket(Relation rel,
 			break;
 
 		/* Else, advance to next old page */
-		obuf = _hash_getbuf(rel, oblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
+		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
 		opage = BufferGetPage(obuf);
 		oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 	}
 
 	/*
 	 * We're at the end of the old bucket chain, so we're done partitioning
-	 * the tuples.  Before quitting, call _hash_squeezebucket to ensure the
-	 * tuples remaining in the old bucket (including the overflow pages) are
-	 * packed as tightly as possible.  The new bucket is already tight.
+	 * the tuples.  Mark the old and new buckets to indicate split is
+	 * finished.
+	 *
+	 * To avoid deadlocks due to locking order of buckets, first lock the old
+	 * bucket and then the new bucket.
 	 */
-	_hash_wrtbuf(rel, nbuf);
+	if (nopaque->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_wrtbuf(rel, nbuf);
+
+	/*
+	 * Acquiring cleanup lock to clear the split-in-progress flag ensures that
+	 * there is no pending scan that has seen the flag after it is cleared.
+	 */
+	_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+	opage = BufferGetPage(bucket_obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	_hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);
+	npage = BufferGetPage(bucket_nbuf);
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
+	/* indicate that split is finished */
+	oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
+	nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;
+
+	/*
+	 * now write the buffers; here we don't release the locks, as the caller
+	 * is responsible for releasing them.
+	 */
+	MarkBufferDirty(bucket_obuf);
+	MarkBufferDirty(bucket_nbuf);
+}
+
+/*
+ *	_hash_finish_split() -- Finish the previously interrupted split operation
+ *
+ * To complete the split operation, we form a hash table of TIDs in the new
+ * bucket, which is then used by the split operation to skip tuples that were
+ * already moved before the split operation was previously interrupted.
+ *
+ * The caller must hold a pin, but no lock, on the metapage buffer.
+ * The buffer is returned in the same state.  (The metapage is only
+ * touched if it becomes necessary to add or remove overflow pages.)
+ *
+ * 'obuf' and 'nbuf' must be locked by the caller which is also responsible
+ * for unlocking it.
+ */
+void
+_hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Buffer nbuf,
+				   uint32 maxbucket, uint32 highmask, uint32 lowmask)
+{
+	HASHCTL		hash_ctl;
+	HTAB	   *tidhtab;
+	Buffer		bucket_nbuf;
+	Page		opage;
+	Page		npage;
+	HashPageOpaque opageopaque;
+	HashPageOpaque npageopaque;
+	Bucket		obucket;
+	Bucket		nbucket;
+	bool		found;
+
+	/* Initialize hash tables used to track TIDs */
+	memset(&hash_ctl, 0, sizeof(hash_ctl));
+	hash_ctl.keysize = sizeof(ItemPointerData);
+	hash_ctl.entrysize = sizeof(ItemPointerData);
+	hash_ctl.hcxt = CurrentMemoryContext;
+
+	tidhtab =
+		hash_create("bucket ctids",
+					256,		/* arbitrary initial size */
+					&hash_ctl,
+					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+	/*
+	 * Scan the new bucket and build hash table of TIDs
+	 */
+	bucket_nbuf = nbuf;
+	npage = BufferGetPage(nbuf);
+	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	for (;;)
+	{
+		BlockNumber nblkno;
+		OffsetNumber noffnum;
+		OffsetNumber nmaxoffnum;
+
+		/* Scan each tuple in new page */
+		nmaxoffnum = PageGetMaxOffsetNumber(npage);
+		for (noffnum = FirstOffsetNumber;
+			 noffnum <= nmaxoffnum;
+			 noffnum = OffsetNumberNext(noffnum))
+		{
+			IndexTuple	itup;
+
+			/* Fetch the item's TID and insert it in hash table. */
+			itup = (IndexTuple) PageGetItem(npage,
+											PageGetItemId(npage, noffnum));
+
+			(void) hash_search(tidhtab, &itup->t_tid, HASH_ENTER, &found);
+
+			Assert(!found);
+		}
+
+		nblkno = npageopaque->hasho_nextblkno;
+
+		/*
+		 * release our write lock without modifying the buffer, and retain the
+		 * pin on the primary bucket.
+		 */
+		if (nbuf == bucket_nbuf)
+			_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, nbuf);
+
+		/* Exit loop if no more overflow pages in new bucket */
+		if (!BlockNumberIsValid(nblkno))
+			break;
+
+		/* Else, advance to next page */
+		nbuf = _hash_getbuf(rel, nblkno, HASH_READ, LH_OVERFLOW_PAGE);
+		npage = BufferGetPage(nbuf);
+		npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	}
+
+	/* Need a cleanup lock to perform split operation. */
+	LockBufferForCleanup(bucket_nbuf);
+
+	npage = BufferGetPage(bucket_nbuf);
+	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nbucket = npageopaque->hasho_bucket;
+
+	opage = BufferGetPage(obuf);
+	opageopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+	obucket = opageopaque->hasho_bucket;
+
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, bucket_nbuf, tidhtab,
+						   maxbucket, highmask, lowmask);
 
-	_hash_squeezebucket(rel, obucket, start_oblkno, NULL);
+	hash_destroy(tidhtab);
 }
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..512dabd 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -72,7 +72,19 @@ _hash_readnext(Relation rel,
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_nextblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on the primary bucket page till the end of scan to
+	 * ensure that vacuum can't delete the tuples that were moved by split to
+	 * the new bucket.  Such tuples are required by scans that started on the
+	 * split buckets before the new bucket's split-in-progress flag
+	 * (LH_BUCKET_NEW_PAGE_SPLIT) was cleared.
+	 */
+	if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -94,7 +106,16 @@ _hash_readprev(Relation rel,
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_prevblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on the primary bucket page till the end of scan.  See
+	 * the comments in _hash_readnext for why we retain the pin.
+	 */
+	if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -104,6 +125,13 @@ _hash_readprev(Relation rel,
 							 LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 		*pagep = BufferGetPage(*bufp);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
+
+		/*
+		 * We always maintain the pin on the bucket page for the whole scan
+		 * operation, so release the additional pin we have acquired here.
+		 */
+		if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+			_hash_dropbuf(rel, *bufp);
 	}
 }
 
@@ -192,43 +220,81 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	metap = HashPageGetMeta(page);
 
 	/*
-	 * Loop until we get a lock on the correct target bucket.
+	 * Conditionally get the lock on the primary bucket page for search while
+	 * holding the lock on the meta page.  If we have to wait, release the
+	 * meta page lock and retry the hard way.
 	 */
-	for (;;)
-	{
-		/*
-		 * Compute the target bucket number, and convert to block number.
-		 */
-		bucket = _hash_hashkey2bucket(hashkey,
-									  metap->hashm_maxbucket,
-									  metap->hashm_highmask,
-									  metap->hashm_lowmask);
+	bucket = _hash_hashkey2bucket(hashkey,
+								  metap->hashm_maxbucket,
+								  metap->hashm_highmask,
+								  metap->hashm_lowmask);
 
-		blkno = BUCKET_TO_BLKNO(metap, bucket);
+	blkno = BUCKET_TO_BLKNO(metap, bucket);
 
-		/* Release metapage lock, but keep pin. */
+	/* Fetch the primary bucket page for the bucket */
+	buf = ReadBuffer(rel, blkno);
+	if (!ConditionalLockBufferShared(buf))
+	{
 		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(buf, HASH_READ);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+		oldblkno = blkno;
+		retry = true;
+	}
+	else
+	{
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+	}
 
+	if (retry)
+	{
 		/*
-		 * If the previous iteration of this loop locked what is still the
-		 * correct target bucket, we are done.  Otherwise, drop any old lock
-		 * and lock what now appears to be the correct bucket.
+		 * Loop until we get a lock on the correct target bucket.  We take the
+		 * lock on the primary bucket page and retain the pin on it during the
+		 * read operation to prevent concurrent splits.  Retaining a pin on
+		 * the primary bucket page ensures that a split can't happen, as the
+		 * split needs to acquire a cleanup lock on that page.  Acquiring the
+		 * lock on the primary bucket and rechecking that it is the target
+		 * bucket is mandatory, as otherwise a concurrent split followed by
+		 * vacuum could remove tuples from the selected bucket which would
+		 * otherwise have been visible.
 		 */
-		if (retry)
+		for (;;)
 		{
+			/*
+			 * Compute the target bucket number, and convert to block number.
+			 */
+			bucket = _hash_hashkey2bucket(hashkey,
+										  metap->hashm_maxbucket,
+										  metap->hashm_highmask,
+										  metap->hashm_lowmask);
+
+			blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+			/* Release metapage lock, but keep pin. */
+			_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+			/*
+			 * If the previous iteration of this loop locked what is still the
+			 * correct target bucket, we are done.  Otherwise, drop any old
+			 * lock and lock what now appears to be the correct bucket.
+			 */
 			if (oldblkno == blkno)
 				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
-		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+			_hash_relbuf(rel, buf);
 
-		/*
-		 * Reacquire metapage lock and check that no bucket split has taken
-		 * place while we were awaiting the bucket lock.
-		 */
-		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
-		oldblkno = blkno;
-		retry = true;
+			/* Fetch the primary bucket page for the bucket */
+			buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
+			/*
+			 * Reacquire metapage lock and check that no bucket split has
+			 * taken place while we were awaiting the bucket lock.
+			 */
+			_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+			oldblkno = blkno;
+		}
 	}
 
 	/* done with the metapage */
@@ -237,14 +303,60 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	/* Update scan opaque state to show we have lock on the bucket */
 	so->hashso_bucket = bucket;
 	so->hashso_bucket_valid = true;
-	so->hashso_bucket_blkno = blkno;
 
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
 	page = BufferGetPage(buf);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(opaque->hasho_bucket == bucket);
 
+	so->hashso_bucket_buf = buf;
+
+	/*
+	 * If the bucket split is in progress, then we need to skip tuples that
+	 * are moved from old bucket.  To ensure that vacuum doesn't clean any
+	 * tuples from old or new buckets till this scan is in progress, maintain
+	 * a pin on both of the buckets.  Here, we have to be cautious about lock
+	 * ordering, first acquire the lock on old bucket, release the lock on old
+	 * bucket, but not pin, then acuire the lock on new bucket and again
+	 * re-verify whether the bucket split still is in progress. Acquiring lock
+	 * on old bucket first ensures that the vacuum waits for this scan to
+	 * finish.
+	 */
+	if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+	{
+		BlockNumber old_blkno;
+		Buffer		old_buf;
+
+		old_blkno = _hash_get_oldblk(rel, opaque);
+
+		/*
+		 * release the lock on new bucket and re-acquire it after acquiring
+		 * the lock on old bucket.
+		 */
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+
+		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+
+		/*
+		 * remember the old bucket buffer so as to use it later for scanning.
+		 */
+		so->hashso_old_bucket_buf = old_buf;
+		_hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK);
+
+		_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		Assert(opaque->hasho_bucket == bucket);
+
+		if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+			so->hashso_skip_moved_tuples = true;
+		else
+		{
+			_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+			so->hashso_old_bucket_buf = InvalidBuffer;
+		}
+	}
+
 	/* If a backwards scan is requested, move to the end of the chain */
 	if (ScanDirectionIsBackward(dir))
 	{
@@ -273,6 +385,13 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
  *		false.  Else, return true and set the hashso_curpos for the
  *		scan to the right thing.
  *
+ *		Here we also scan the old bucket if a split for the current bucket
+ *		was in progress at the start of the scan.  The basic idea is to
+ *		skip the tuples that were moved by the split while scanning the
+ *		current bucket, and then scan the old bucket to cover all such
+ *		tuples.  This ensures that we don't miss any tuples in scans that
+ *		started during the split.
+ *
  *		'bufP' points to the current buffer, which is pinned and read-locked.
  *		On success exit, we have pin and read-lock on whichever page
  *		contains the right item; on failure, we have released all buffers.
@@ -338,6 +457,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum >= FirstOffsetNumber);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip the tuples that were moved by the split
+						 * operation, for a scan that started while the split
+						 * was in progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberNext(offnum);	/* move forward */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -353,9 +485,41 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							buf = so->hashso_old_bucket_buf;
+
+							/*
+							 * old bucket buffer must be valid, as we acquire
+							 * the pin on it before the start of the scan and
+							 * retain it till the end of the scan.
+							 */
+							Assert(BufferIsValid(buf));
+
+							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+							page = BufferGetPage(buf);
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for moved-by-split
+							 * tuples in the old bucket, and also that we
+							 * won't try to scan the old bucket again once its
+							 * scan is finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
@@ -379,6 +543,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum <= maxoff);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip the tuples that were moved by the split
+						 * operation, for a scan that started while the split
+						 * was in progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberPrev(offnum);	/* move back */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -394,9 +571,41 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							buf = so->hashso_old_bucket_buf;
+
+							/*
+							 * old bucket buffer must be valid, as we acquire
+							 * the pin on it before the start of the scan and
+							 * retain it till the end of the scan.
+							 */
+							Assert(BufferIsValid(buf));
+
+							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+							page = BufferGetPage(buf);
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for moved-by-split
+							 * tuples in the old bucket, and also that we
+							 * won't try to scan the old bucket again once its
+							 * scan is finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
@@ -410,9 +619,16 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 
 		if (itup == NULL)
 		{
-			/* we ran off the end of the bucket without finding a match */
+			/*
+			 * We ran off the end of the bucket without finding a match.
+			 * Release the pin on bucket buffers.  Normally, such pins are
+			 * released at the end of the scan; however, scrolling cursors can
+			 * reacquire the bucket lock and pin multiple times within the
+			 * same scan.
+			 */
 			*bufP = so->hashso_curbuf = InvalidBuffer;
 			ItemPointerSetInvalid(current);
+			_hash_dropscanbuf(rel, so);
 			return false;
 		}
 
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 822862d..1648581 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -147,6 +147,23 @@ _hash_log2(uint32 num)
 }
 
 /*
+ * _hash_msb-- returns most significant bit position.
+ */
+static uint32
+_hash_msb(uint32 num)
+{
+	uint32		i = 0;
+
+	while (num)
+	{
+		num = num >> 1;
+		++i;
+	}
+
+	return i - 1;
+}
+
+/*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
  * If flags is not zero, it is a bitwise OR of the acceptable values of
@@ -352,3 +369,123 @@ _hash_binsearch_last(Page page, uint32 hash_value)
 
 	return lower;
 }
+
+/*
+ *	_hash_get_oldblk() -- get the block number from which current bucket
+ *			is being split.
+ */
+BlockNumber
+_hash_get_oldblk(Relation rel, HashPageOpaque opaque)
+{
+	Bucket		curr_bucket;
+	Bucket		old_bucket;
+	uint32		mask;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	/*
+	 * To get the old bucket from the current bucket, we need a mask to modulo
+	 * into the lower half of the table.  This mask is stored in the meta page
+	 * as hashm_lowmask, but here we can't rely on it, because we need the
+	 * value of lowmask that was in effect at the time the bucket split
+	 * started.  Masking off the most significant bit of the new bucket gives
+	 * us the old bucket.
+	 */
+	curr_bucket = opaque->hasho_bucket;
+	mask = (((uint32) 1) << _hash_msb(curr_bucket)) - 1;
+	old_bucket = curr_bucket & mask;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	blkno = BUCKET_TO_BLKNO(metap, old_bucket);
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_newblk() -- get the block number of the new bucket that will
+ *			be generated after a split of the current bucket.
+ *
+ * This is used to find the new bucket from old bucket based on current table
+ * half.  It is mainly required to finsh the incomplete splits where we are
+ * sure that not more than one bucket could have split in progress from old
+ * bucket.
+ */
+BlockNumber
+_hash_get_newblk(Relation rel, HashPageOpaque opaque)
+{
+	Bucket		curr_bucket;
+	Bucket		new_bucket;
+	uint32		lowmask;
+	uint32		mask;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	curr_bucket = opaque->hasho_bucket;
+
+	/*
+	 * new bucket can be obtained by OR'ing the old bucket with the most
+	 * significant bit of the current table half.  There could be multiple
+	 * buckets that have split from the current bucket.  We need the first
+	 * such bucket that exists, based on the current table half.
+	 */
+	lowmask = metap->hashm_lowmask;
+
+	for (;;)
+	{
+		mask = lowmask + 1;
+		new_bucket = curr_bucket | mask;
+		if (new_bucket > metap->hashm_maxbucket)
+		{
+			lowmask = lowmask >> 1;
+			continue;
+		}
+		blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+		break;
+	}
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_newbucket() -- get the new bucket that will be generated after
+ *			split from current bucket.
+ *
+ * This is used to find the new bucket from the old bucket.  The new bucket
+ * can be obtained by OR'ing the old bucket with the most significant bit of
+ * the table half for the lowmask passed to this function.  There could be
+ * multiple buckets that have split from the current bucket.  We need the
+ * first such bucket that exists.  The caller must ensure that no more than
+ * one split has happened from the old bucket.
+ */
+Bucket
+_hash_get_newbucket(Relation rel, Bucket curr_bucket,
+					uint32 lowmask, uint32 maxbucket)
+{
+	Bucket		new_bucket;
+	uint32		mask;
+
+	for (;;)
+	{
+		mask = lowmask + 1;
+		new_bucket = curr_bucket | mask;
+		if (new_bucket > maxbucket)
+		{
+			lowmask = lowmask >> 1;
+			continue;
+		}
+		break;
+	}
+
+	return new_bucket;
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 76ade37..1c9be40 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3567,6 +3567,26 @@ ConditionalLockBuffer(Buffer buffer)
 }
 
 /*
+ * Acquire the content_lock for the buffer, but only if we don't have to wait.
+ *
+ * This assumes the caller wants BUFFER_LOCK_SHARED mode.
+ */
+bool
+ConditionalLockBufferShared(Buffer buffer)
+{
+	BufferDesc *buf;
+
+	Assert(BufferIsValid(buffer));
+	if (BufferIsLocal(buffer))
+		return true;			/* act as though we got it */
+
+	buf = GetBufferDescriptor(buffer - 1);
+
+	return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
+									LW_SHARED);
+}
+
+/*
  * LockBufferForCleanup - lock a buffer in preparation for deleting items
  *
  * Items may be deleted from a disk page only when the caller (a) holds an
@@ -3750,6 +3770,49 @@ ConditionalLockBufferForCleanup(Buffer buffer)
 	return false;
 }
 
+/*
+ * CheckBufferForCleanup - as above, but don't attempt to take lock
+ *
+ * We won't loop, but just check once to see if the pin count is OK.  If
+ * not, return FALSE.
+ */
+bool
+CheckBufferForCleanup(Buffer buffer)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(BufferIsValid(buffer));
+
+	if (BufferIsLocal(buffer))
+	{
+		/* There should be exactly one pin */
+		if (LocalRefCount[-buffer - 1] != 1)
+			return false;
+		/* Nobody else to wait for */
+		return true;
+	}
+
+	/* There should be exactly one local pin */
+	if (GetPrivateRefCount(buffer) != 1)
+		return false;
+
+	bufHdr = GetBufferDescriptor(buffer - 1);
+
+	buf_state = LockBufHdr(bufHdr);
+
+	Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+	if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+	{
+		/* pincount is OK. */
+		UnlockBufHdr(bufHdr, buf_state);
+		return true;
+	}
+
+	UnlockBufHdr(bufHdr, buf_state);
+	return false;
+}
+
 
 /*
  *	Functions for buffer I/O handling
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index ce31418..6e8fc4c 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -25,6 +25,7 @@
 #include "lib/stringinfo.h"
 #include "storage/bufmgr.h"
 #include "storage/lockdefs.h"
+#include "utils/hsearch.h"
 #include "utils/relcache.h"
 
 /*
@@ -52,6 +53,9 @@ typedef uint32 Bucket;
 #define LH_BUCKET_PAGE			(1 << 1)
 #define LH_BITMAP_PAGE			(1 << 2)
 #define LH_META_PAGE			(1 << 3)
+#define LH_BUCKET_NEW_PAGE_SPLIT	(1 << 4)
+#define LH_BUCKET_OLD_PAGE_SPLIT	(1 << 5)
+#define LH_BUCKET_PAGE_HAS_GARBAGE	(1 << 6)
 
 typedef struct HashPageOpaqueData
 {
@@ -64,6 +68,12 @@ typedef struct HashPageOpaqueData
 
 typedef HashPageOpaqueData *HashPageOpaque;
 
+#define H_HAS_GARBAGE(opaque)			((opaque)->hasho_flag & LH_BUCKET_PAGE_HAS_GARBAGE)
+#define H_OLD_INCOMPLETE_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
+#define H_NEW_INCOMPLETE_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+#define H_INCOMPLETE_SPLIT(opaque)		(((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT) || \
+										 ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT))
+
 /*
  * The page ID is for the convenience of pg_filedump and similar utilities,
  * which otherwise would have a hard time telling pages of different index
@@ -88,12 +98,6 @@ typedef struct HashScanOpaqueData
 	bool		hashso_bucket_valid;
 
 	/*
-	 * If we have a share lock on the bucket, we record it here.  When
-	 * hashso_bucket_blkno is zero, we have no such lock.
-	 */
-	BlockNumber hashso_bucket_blkno;
-
-	/*
 	 * We also want to remember which buffer we're currently examining in the
 	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
 	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
@@ -101,11 +105,23 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_curbuf;
 
+	/* remember the buffer associated with primary bucket */
+	Buffer		hashso_bucket_buf;
+
+	/*
+	 * remember the buffer associated with old primary bucket which is
+	 * required during the scan of the bucket for which split is in progress.
+	 */
+	Buffer		hashso_old_bucket_buf;
+
 	/* Current position of the scan, as an index TID */
 	ItemPointerData hashso_curpos;
 
 	/* Current position of the scan, as a heap TID */
 	ItemPointerData hashso_heappos;
+
+	/* Whether scan needs to skip tuples that are moved by split */
+	bool		hashso_skip_moved_tuples;
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -176,6 +192,8 @@ typedef HashMetaPageData *HashMetaPage;
 				  sizeof(ItemIdData) - \
 				  MAXALIGN(sizeof(HashPageOpaqueData)))
 
+#define INDEX_MOVED_BY_SPLIT_MASK	0x2000
+
 #define HASH_MIN_FILLFACTOR			10
 #define HASH_DEFAULT_FILLFACTOR		75
 
@@ -224,9 +242,6 @@ typedef HashMetaPageData *HashMetaPage;
 #define HASH_WRITE		BUFFER_LOCK_EXCLUSIVE
 #define HASH_NOLOCK		(-1)
 
-#define HASH_SHARE		ShareLock
-#define HASH_EXCLUSIVE	ExclusiveLock
-
 /*
  *	Strategy number. There's only one valid strategy for hashing: equality.
  */
@@ -299,21 +314,21 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
 
 /* hashovfl.c */
-extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf);
+extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
 extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
-				   BufferAccessStrategy bstrategy);
+				   BlockNumber bucket_blkno, BufferAccessStrategy bstrategy);
 extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
 				 BlockNumber blkno, ForkNumber forkNum);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy);
 
 /* hashpage.c */
-extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
-extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
-extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
 extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
 			 int access, int flags);
+extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
+								   BlockNumber blkno, int flags);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
@@ -322,6 +337,7 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 						   BufferAccessStrategy bstrategy);
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
+extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
 extern void _hash_wrtbuf(Relation rel, Buffer buf);
 extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
 				   int to_access);
@@ -329,6 +345,9 @@ extern uint32 _hash_metapinit(Relation rel, double num_tuples,
 				ForkNumber forkNum);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
+extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
+				   Buffer nbuf, uint32 maxbucket, uint32 highmask,
+				   uint32 lowmask);
 
 /* hashscan.c */
 extern void _hash_regscan(IndexScanDesc scan);
@@ -364,10 +383,20 @@ extern bool _hash_convert_tuple(Relation index,
 					Datum *index_values, bool *index_isnull);
 extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
 extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
+extern BlockNumber _hash_get_oldblk(Relation rel, HashPageOpaque opaque);
+extern BlockNumber _hash_get_newblk(Relation rel, HashPageOpaque opaque);
+extern Bucket _hash_get_newbucket(Relation rel, Bucket curr_bucket,
+					uint32 lowmask, uint32 maxbucket);
 
 /* hash.c */
 extern void hash_redo(XLogReaderState *record);
 extern void hash_desc(StringInfo buf, XLogReaderState *record);
 extern const char *hash_identify(uint8 info);
+extern void hashbucketcleanup(Relation rel, Buffer bucket_buf,
+				  BlockNumber bucket_blkno, BufferAccessStrategy bstrategy,
+				  uint32 maxbucket, uint32 highmask, uint32 lowmask,
+				  double *tuples_removed, double *num_index_tuples,
+				  bool bucket_has_garbage, bool delay,
+				  IndexBulkDeleteCallback callback, void *callback_state);
 
 #endif   /* HASH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3d5dea7..6d0a29c 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -226,8 +226,10 @@ extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
 extern void UnlockBuffers(void);
 extern void LockBuffer(Buffer buffer, int mode);
 extern bool ConditionalLockBuffer(Buffer buffer);
+extern bool ConditionalLockBufferShared(Buffer buffer);
 extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
+extern bool CheckBufferForCleanup(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern void AbortBufferIO(void);
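
For reference, the bucket arithmetic used by _hash_msb(), _hash_get_oldblk()
and _hash_get_newbucket() above can be exercised outside the backend.  The
standalone sketch below (not part of the patch; the function names and the
example values are purely illustrative) mirrors that logic: clearing the most
significant bit of a new bucket number yields the bucket it was split from,
and OR'ing the old bucket with (lowmask + 1), halving lowmask until the
result is within maxbucket, yields the first candidate new bucket.

#include <stdint.h>
#include <stdio.h>

/* mirrors _hash_msb() in the patch: position of the most significant set bit */
static uint32_t
hash_msb(uint32_t num)
{
	uint32_t	i = 0;

	while (num)
	{
		num >>= 1;
		++i;
	}
	return i - 1;
}

/* old bucket = new bucket with its most significant bit cleared */
static uint32_t
old_bucket_of(uint32_t new_bucket)
{
	uint32_t	mask = (((uint32_t) 1) << hash_msb(new_bucket)) - 1;

	return new_bucket & mask;
}

/* first bucket <= maxbucket that could have split from curr_bucket */
static uint32_t
new_bucket_of(uint32_t curr_bucket, uint32_t lowmask, uint32_t maxbucket)
{
	for (;;)
	{
		uint32_t	new_bucket = curr_bucket | (lowmask + 1);

		if (new_bucket <= maxbucket)
			return new_bucket;
		lowmask >>= 1;
	}
}

int
main(void)
{
	/* bucket 5 (binary 101) was created by splitting bucket 1 (binary 001) */
	printf("old bucket of 5: %u\n", old_bucket_of(5));

	/* with lowmask = 7 and maxbucket = 11, bucket 3 next splits into bucket 11 */
	printf("new bucket of 3: %u\n", new_bucket_of(3, 7, 11));
	return 0;
}

Compiled with any C99 compiler, the two printf calls print 1 and 11, matching
what the corresponding backend functions compute for those inputs.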
#19Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Amit Kapila (#18)
Re: Hash Indexes

On 08/05/2016 07:36 AM, Amit Kapila wrote:

On Thu, Aug 4, 2016 at 8:02 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

I did some basic testing of same. In that I found one issue with cursor.

Thanks for the testing. The reason for the failure was that the patch
didn't take into account the fact that, for scrolling cursors, a scan can
reacquire the lock and pin on the bucket buffer multiple times. I have
fixed it such that we release the pin on bucket buffers after we scan
the last overflow page in the bucket. The attached patch fixes the issue
for me; let me know if you still see the issue.
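
To show the shape of that fix: the scan-opaque now carries pins on the
primary (and, during a split, the old) bucket pages, and _hash_step() drops
them through the new _hash_dropscanbuf() helper once it runs off the end of
the bucket.  The helper's body is not quoted in the hunks above, so the
following is only an illustrative sketch built from the fields and helpers
the patch declares (hashso_bucket_buf, hashso_old_bucket_buf, hashso_curbuf,
_hash_dropbuf), not the patch's actual implementation:

/*
 * Sketch only: release whatever pins the scan still holds, so that a
 * scrollable cursor which re-enters the bucket starts from a clean state.
 */
#include "postgres.h"
#include "access/hash.h"

void
_hash_dropscanbuf(Relation rel, HashScanOpaque so)
{
	/* release the pin we hold on the primary bucket page */
	if (BufferIsValid(so->hashso_bucket_buf) &&
		so->hashso_bucket_buf != so->hashso_curbuf)
		_hash_dropbuf(rel, so->hashso_bucket_buf);
	so->hashso_bucket_buf = InvalidBuffer;

	/* release the pin on the old bucket page, if a split was in progress */
	if (BufferIsValid(so->hashso_old_bucket_buf) &&
		so->hashso_old_bucket_buf != so->hashso_curbuf)
		_hash_dropbuf(rel, so->hashso_old_bucket_buf);
	so->hashso_old_bucket_buf = InvalidBuffer;

	/* release any pin left on the page the scan stopped on */
	if (BufferIsValid(so->hashso_curbuf))
		_hash_dropbuf(rel, so->hashso_curbuf);
	so->hashso_curbuf = InvalidBuffer;

	so->hashso_skip_moved_tuples = false;
}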

Needs a rebase.

hashinsert.c

+ * reuse the space. There is no such apparent benefit from finsihing the

-> finishing

hashpage.c

+ * retrun the buffer, else return InvalidBuffer.

-> return

+	if (blkno == P_NEW)
+		elog(ERROR, "hash AM does not use P_NEW");

Left over ?

+ * for unlocking it.

-> for unlocking them.

hashsearch.c

+ * bucket, but not pin, then acuire the lock on new bucket and again

-> acquire

hashutil.c

+ * half. It is mainly required to finsh the incomplete splits where we are

-> finish

Ran some tests on a CHAR() based column which showed good results. Will
have to compare with a run with the WAL patch applied.

make check-world passes.

Best regards,
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Amit Kapila
amit.kapila16@gmail.com
In reply to: Jesper Pedersen (#19)
1 attachment(s)
Re: Hash Indexes

On Thu, Sep 1, 2016 at 11:33 PM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

On 08/05/2016 07:36 AM, Amit Kapila wrote:

Needs a rebase.

Done.

+       if (blkno == P_NEW)
+               elog(ERROR, "hash AM does not use P_NEW");

Left over ?

No. We need this check, as in all the other _hash_*buf APIs, since we
never expect callers of those APIs to pass P_NEW. The new buckets
(blocks) are created during a split, which uses a different mechanism to
allocate blocks in bulk.

I have fixed all other issues you have raised. Updated patch is
attached with this mail.

Ran some tests on a CHAR() based column which showed good results. Will have
to compare with a run with the WAL patch applied.

Okay, thanks for testing. I think the WAL patch is still not ready for
performance testing; I am fixing a few issues in it, but you can do a
design or code-level review of that patch at this stage. I think it is
fine even if you share the performance numbers with this and/or
Mithun's patch [1]https://commitfest.postgresql.org/10/715/.

[1]: https://commitfest.postgresql.org/10/715/

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

concurrent_hash_index_v5.patch (application/octet-stream)
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index c1122b4..d02539a 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -400,7 +400,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
 	Buffer		buf;
 	Page		page;
 
-	_hash_getlock(rel, blkno, HASH_SHARE);
 	buf = _hash_getbuf_with_strategy(rel, blkno, HASH_READ, 0, bstrategy);
 	page = BufferGetPage(buf);
 
@@ -431,7 +430,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
 	}
 
 	_hash_relbuf(rel, buf);
-	_hash_droplock(rel, blkno, HASH_SHARE);
 }
 
 /*
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 0a7da89..a0feb2f 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -125,49 +125,45 @@ the initially created buckets.
 
 Lock Definitions
 ----------------
-
-We use both lmgr locks ("heavyweight" locks) and buffer context locks
-(LWLocks) to control access to a hash index.  lmgr locks are needed for
-long-term locking since there is a (small) risk of deadlock, which we must
-be able to detect.  Buffer context locks are used for short-term access
-control to individual pages of the index.
-
-LockPage(rel, page), where page is the page number of a hash bucket page,
-represents the right to split or compact an individual bucket.  A process
-splitting a bucket must exclusive-lock both old and new halves of the
-bucket until it is done.  A process doing VACUUM must exclusive-lock the
-bucket it is currently purging tuples from.  Processes doing scans or
-insertions must share-lock the bucket they are scanning or inserting into.
-(It is okay to allow concurrent scans and insertions.)
-
-The lmgr lock IDs corresponding to overflow pages are currently unused.
-These are available for possible future refinements.  LockPage(rel, 0)
-is also currently undefined (it was previously used to represent the right
-to modify the hash-code-to-bucket mapping, but it is no longer needed for
-that purpose).
-
-Note that these lock definitions are conceptually distinct from any sort
-of lock on the pages whose numbers they share.  A process must also obtain
-read or write buffer lock on the metapage or bucket page before accessing
-said page.
-
-Processes performing hash index scans must hold share lock on the bucket
-they are scanning throughout the scan.  This seems to be essential, since
-there is no reasonable way for a scan to cope with its bucket being split
-underneath it.  This creates a possibility of deadlock external to the
-hash index code, since a process holding one of these locks could block
-waiting for an unrelated lock held by another process.  If that process
-then does something that requires exclusive lock on the bucket, we have
-deadlock.  Therefore the bucket locks must be lmgr locks so that deadlock
-can be detected and recovered from.
-
-Processes must obtain read (share) buffer context lock on any hash index
-page while reading it, and write (exclusive) lock while modifying it.
-To prevent deadlock we enforce these coding rules: no buffer lock may be
-held long term (across index AM calls), nor may any buffer lock be held
-while waiting for an lmgr lock, nor may more than one buffer lock
-be held at a time by any one process.  (The third restriction is probably
-stronger than necessary, but it makes the proof of no deadlock obvious.)
+We use buffer content locks (LWLocks) and buffer pins to control access to
+a hash index.
+
+A scan takes a shared-mode lock on the primary or overflow bucket it is
+reading.  An insert acquires an exclusive lock on the bucket into which it
+has to insert.  Both operations release the lock on the previous bucket page
+before moving to the next overflow page, and retain a pin on the primary
+bucket till the end of the operation.  The split operation must acquire a
+cleanup lock on both the old and new halves of the bucket and mark
+split-in-progress on both of them.  The cleanup lock at the start of the
+split ensures that a parallel insert won't get lost: consider a case where
+an insertion has to add a tuple to some intermediate overflow page in the
+bucket chain; if we allowed a split while that insertion is in progress, the
+split might not move the newly inserted tuple.  The split releases the lock
+on the previous page before moving to the next overflow page, for both the
+old and the new bucket.  After partitioning the tuples between the old and
+new buckets, it again needs to acquire an exclusive lock on both old and new
+buckets to clear the split-in-progress flag.  Like inserts and scans, it also
+retains pins on both the old and new primary buckets till the end of the
+split operation, although we could do without that as well.
+
+Vacuum acquires a cleanup lock on the bucket to remove dead tuples and/or
+tuples that were moved due to a split.  The cleanup lock is needed for
+removing dead tuples to ensure that scans return correct results: a scan that
+returns multiple tuples from the same bucket page always restarts from the
+offset number at which it returned the last tuple, and if we allowed vacuum
+to remove dead tuples with just an exclusive lock, it could remove the tuple
+required to resume the scan.  The cleanup lock is needed for removing tuples
+moved by a split to ensure that there is no pending scan that started after
+the start of the split and before the split finished on the bucket; without
+it, vacuum could remove tuples that are required by such a scan.  We don't
+need to retain this cleanup lock for the whole vacuum operation on the
+bucket; we release the lock as we move ahead in the bucket chain.  At the
+end, for the squeeze phase, we conditionally acquire the cleanup lock, and if
+we don't get it, we just abandon the squeeze phase.
+
+To avoid deadlocks, we must be consistent about the order in which we lock
+the buckets for operations that require locks on two different buckets.  The
+rule is "first lock the old bucket and then the new bucket", i.e. lock the
+lower-numbered bucket first.
 
 
 Pseudocode Algorithms
@@ -188,63 +184,105 @@ track of available overflow pages.
 The reader algorithm is:
 
 	pin meta page and take buffer content lock in shared mode
-	loop:
-		compute bucket number for target hash key
-		release meta page buffer content lock
-		if (correct bucket page is already locked)
-			break
-		release any existing bucket page lock (if a concurrent split happened)
-		take heavyweight bucket lock
-		retake meta page buffer content lock in shared mode
+	compute bucket number for target hash key
+	read and pin the primary bucket page
+	conditionally get the buffer content lock in shared mode on primary bucket page for search
+	if we didn't get the lock (need to wait for lock)
+		release the buffer content lock on meta page
+		acquire buffer content lock on primary bucket page in shared mode
+		acquire the buffer content lock in shared mode on meta page
+		to check for the possibility of a split, we need to recompute the bucket
+		and verify whether it is the correct bucket; set the retry flag
+	else if we get the lock, then we can skip the retry path
+	if (retry)
+		loop:
+			compute bucket number for target hash key
+			release meta page buffer content lock
+			if (correct bucket page is already locked)
+				break
+			release any existing content lock on bucket page (if a concurrent split happened)
+			pin primary bucket page and take shared buffer content lock
+			retake meta page buffer content lock in shared mode
 -- then, per read request:
 	release pin on metapage
-	read current page of bucket and take shared buffer content lock
-		step to next page if necessary (no chaining of locks)
+	if the split is in progress for current bucket and this is a new bucket
+		release the buffer content lock on current bucket page
+		pin and acquire the buffer content lock on old bucket in shared mode
+		release the buffer content lock on old bucket, but not pin
+		retake the buffer content lock on new bucket
+		mark the scan such that it skips the tuples that are marked as moved by split
+	step to next page if necessary (no chaining of locks)
+		if the scan indicates moved by split, then move to old bucket after the scan
+		of current bucket is finished
 	get tuple
 	release buffer content lock and pin on current page
 -- at scan shutdown:
-	release bucket share-lock
-
-We can't hold the metapage lock while acquiring a lock on the target bucket,
-because that might result in an undetected deadlock (lwlocks do not participate
-in deadlock detection).  Instead, we relock the metapage after acquiring the
-bucket page lock and check whether the bucket has been split.  If not, we're
-done.  If so, we release our previously-acquired lock and repeat the process
-using the new bucket number.  Holding the bucket sharelock for
+	release any pin we hold on current buffer, old bucket buffer, new bucket buffer
+
+We don't want to hold the meta page lock while waiting to acquire the content
+lock on the bucket page, because that might result in poor concurrency.
+Instead, we relock the metapage after acquiring the bucket page content lock
+and check whether the bucket has been split.  If not, we're done.  If so, we
+release our previously-acquired content lock, but not the pin, and repeat the
+process using the new bucket number.  Holding the buffer pin on bucket page for
 the remainder of the scan prevents the reader's current-tuple pointer from
-being invalidated by splits or compactions.  Notice that the reader's lock
+being invalidated by splits or compactions.  Notice that the reader's pin
 does not prevent other buckets from being split or compacted.
 
 To keep concurrency reasonably good, we require readers to cope with
 concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the page sharelock.  Since
-deletion is not possible while a reader holds the bucket sharelock, and
-we assume that heap tuple TIDs are unique, this can be implemented by
+their current scan position after re-acquiring the buffer content lock on
+page.  Since deletion is not possible while a reader holds the pin on bucket,
+and we assume that heap tuple TIDs are unique, this can be implemented by
 searching for the same heap tuple TID previously returned.  Insertion does
 not move index entries across pages, so the previously-returned index entry
 should always be on the same page, at the same or higher offset number,
 as it was before.
 
+To allow scans during a bucket split: if at the start of the scan the bucket
+is marked as split-in-progress, the scan reads all the tuples in that bucket
+except for those marked as moved-by-split.  Once it finishes scanning all the
+tuples in the current bucket, it scans the old bucket from which this bucket
+was formed by the split.  This happens only for the new half of the split.
+
 The insertion algorithm is rather similar:
 
 	pin meta page and take buffer content lock in shared mode
-	loop:
-		compute bucket number for target hash key
-		release meta page buffer content lock
-		if (correct bucket page is already locked)
-			break
-		release any existing bucket page lock (if a concurrent split happened)
-		take heavyweight bucket lock in shared mode
-		retake meta page buffer content lock in shared mode
--- (so far same as reader)
+	compute bucket number for target hash key
+	read and pin the primary bucket page
+	conditionally get the buffer content lock in exclusive mode on primary bucket page for search
+	if we didn't get the lock (need to wait for lock)
+		release the buffer content lock on meta page
+		acquire buffer content lock on primary bucket page in exclusive mode
+		acquire the buffer content lock in shared mode on meta page
+		to check for the possibility of a split, we need to recompute the bucket
+		and verify whether it is the correct bucket; set the retry flag
+	else if we get the lock, then we can skip the retry path
+	if (retry)
+		loop:
+			compute bucket number for target hash key
+			release meta page buffer content lock
+			if (correct bucket page is already locked)
+				break
+			release any existing content lock on bucket page (if a concurrent split happened)
+			pin primary bucket page and take exclusive buffer content lock
+			retake meta page buffer content lock in shared mode
+-- (so far same as reader, except for acquisition of buffer content lock in
+	exclusive mode on primary bucket page)
 	release pin on metapage
-	pin current page of bucket and take exclusive buffer content lock
-	if full, release, read/exclusive-lock next page; repeat as needed
+	if the split-in-progress flag is set for a bucket in the old half of a split
+	and the pin count on it is one, then finish the split
+		we already have a buffer content lock on the old bucket; conditionally get the content lock on the new bucket
+		if we got the lock on the new bucket
+			finish the split using the split algorithm mentioned below
+			release the buffer content lock and pin on the new bucket
+	if full, release lock but not pin, read/exclusive-lock next page; repeat as needed
 	>> see below if no space in any page of bucket
 	insert tuple at appropriate place in page
 	mark current page dirty and release buffer content lock and pin
+	if current page is not a bucket page, release the pin on bucket page
 	release heavyweight share-lock
-	pin meta page and take buffer content lock in shared mode
+	pin meta page and take buffer content lock in exclusive mode
 	increment tuple count, decide if split needed
 	mark meta page dirty and release buffer content lock and pin
 	done if no split needed, else enter Split algorithm below
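
The conditional-lock-then-retry dance above can be modeled in ordinary C with
pthread mutexes standing in for buffer content locks; this is only an
illustrative sketch under those assumptions, not the patch's code, and all
names are invented:

	#include <pthread.h>
	#include <stdint.h>

	#define SKETCH_NBUCKETS 16		/* fixed-size table, for illustration only */

	typedef struct
	{
		pthread_mutex_t meta_lock;	/* stands in for the metapage content lock */
		pthread_mutex_t bucket_lock[SKETCH_NBUCKETS];	/* per-bucket "content locks" */
		uint32_t	maxbucket;		/* protected by meta_lock */
	} sketch_index;

	/* map a hash key to a bucket, as the metapage masks would */
	static uint32_t
	sketch_key2bucket(const sketch_index *idx, uint32_t hashkey)
	{
		return hashkey % (idx->maxbucket + 1);
	}

	/* Returns the bucket whose lock is held on exit; meta_lock is not held. */
	static uint32_t
	sketch_lock_target_bucket(sketch_index *idx, uint32_t hashkey)
	{
		uint32_t	bucket;

		pthread_mutex_lock(&idx->meta_lock);
		bucket = sketch_key2bucket(idx, hashkey);

		/* fast path: try to take the bucket lock without waiting */
		if (pthread_mutex_trylock(&idx->bucket_lock[bucket]) == 0)
		{
			pthread_mutex_unlock(&idx->meta_lock);
			return bucket;
		}

		/* slow path: drop the meta lock, wait for the bucket, recheck mapping */
		for (;;)
		{
			uint32_t	recheck;

			pthread_mutex_unlock(&idx->meta_lock);
			pthread_mutex_lock(&idx->bucket_lock[bucket]);

			pthread_mutex_lock(&idx->meta_lock);
			recheck = sketch_key2bucket(idx, hashkey);
			if (recheck == bucket)	/* no concurrent split moved our key */
			{
				pthread_mutex_unlock(&idx->meta_lock);
				return bucket;
			}
			pthread_mutex_unlock(&idx->bucket_lock[bucket]);
			bucket = recheck;
		}
	}

The essential property mirrored here is that we never block on a bucket lock
while holding the meta lock.
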
@@ -256,11 +294,13 @@ bucket that is being actively scanned, because readers can cope with this
 as explained above.  We only need the short-term buffer locks to ensure
 that readers do not see a partially-updated page.
 
-It is clearly impossible for readers and inserters to deadlock, and in
-fact this algorithm allows them a very high degree of concurrency.
-(The exclusive metapage lock taken to update the tuple count is stronger
-than necessary, since readers do not care about the tuple count, but the
-lock is held for such a short time that this is probably not an issue.)
+To avoid deadlock between readers and inserters, whenever there is a need
+to lock multiple buckets, we always take the locks in the order suggested in
+the Lock Definitions section above.  This algorithm allows them a very high degree of
+concurrency.  (The exclusive metapage lock taken to update the tuple count
+is stronger than necessary, since readers do not care about the tuple count,
+but the lock is held for such a short time that this is probably not an
+issue.)
 
 When an inserter cannot find space in any existing page of a bucket, it
 must obtain an overflow page and add that page to the bucket's chain.
@@ -271,46 +311,79 @@ index is overfull (has a higher-than-wanted ratio of tuples to buckets).
 The algorithm attempts, but does not necessarily succeed, to split one
 existing bucket in two, thereby lowering the fill ratio:
 
-	pin meta page and take buffer content lock in exclusive mode
-	check split still needed
-	if split not needed anymore, drop buffer content lock and pin and exit
-	decide which bucket to split
-	Attempt to X-lock old bucket number (definitely could fail)
-	Attempt to X-lock new bucket number (shouldn't fail, but...)
-	if above fail, drop locks and pin and exit
+	expand:
+		take buffer content lock in exclusive mode on meta page
+		check split still needed
+		if split not needed anymore, drop buffer content lock and exit
+		decide which bucket to split
+		Attempt to acquire cleanup lock on old bucket number (definitely could fail)
+		if the above fails, release the lock and pin and exit
+		if the split-in-progress flag is set, then finish the split
+			conditionally get the content lock on the new bucket that was involved in the split
+			if we got the lock on the new bucket
+				finish the split using the split algorithm mentioned below
+				release the buffer content lock and pin on the old and new buckets
+				try to expand again from the start
+			else
+				release the buffer content lock and pin on the old bucket and exit
+		if the garbage flag (indicating that tuples were moved by a split) is set on the bucket
+			release the buffer content lock on the meta page
+			remove the tuples that don't belong to this bucket; see bucket cleanup below
+	Attempt to acquire cleanup lock on new bucket number (shouldn't fail, but...)
 	update meta page to reflect new number of buckets
-	mark meta page dirty and release buffer content lock and pin
+	mark meta page dirty and release buffer content lock
 	-- now, accesses to all other buckets can proceed.
 	Perform actual split of bucket, moving tuples as needed
 	>> see below about acquiring needed extra space
 	Release X-locks of old and new buckets
 
+	split guts
+	mark the old and new buckets indicating split-in-progress
+	mark the old bucket indicating has-garbage
+	copy the tuples that belong to the new bucket from the old bucket
+	while copying, mark such tuples as moved-by-split
+	release lock but not pin for primary bucket page of old bucket,
+	read/shared-lock next page; repeat as needed
+	>> see below if no space in bucket page of new bucket
+	take exclusive locks on both the old and new buckets, in that order
+	clear the split-in-progress flag from both the buckets
+	mark buffers dirty and release the locks and pins on both old and new buckets
+
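As a small standalone sketch (not the patch itself) of the moved-by-split
marking mentioned in the steps above, the bit manipulation below flags a
copied tuple while leaving its stored size untouched; the mask values are
illustrative stand-ins:

	#include <stdbool.h>
	#include <stdint.h>

	#define SKETCH_INDEX_SIZE_MASK		0x1FFF	/* low bits hold the tuple size */
	#define SKETCH_MOVED_BY_SPLIT_MASK	0x2000	/* spare bit claimed for the flag */

	/* Flag a copied tuple's t_info word; the size bits are left untouched. */
	static uint16_t
	sketch_mark_moved_by_split(uint16_t t_info)
	{
		return (uint16_t) (t_info | SKETCH_MOVED_BY_SPLIT_MASK);
	}

	/* Scans test the same bit to decide whether to skip the tuple. */
	static bool
	sketch_tuple_was_moved_by_split(uint16_t t_info)
	{
		return (t_info & SKETCH_MOVED_BY_SPLIT_MASK) != 0;
	}
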
 Note the metapage lock is not held while the actual tuple rearrangement is
 performed, so accesses to other buckets can proceed in parallel; in fact,
 it's possible for multiple bucket splits to proceed in parallel.
 
-Split's attempt to X-lock the old bucket number could fail if another
-process holds S-lock on it.  We do not want to wait if that happens, first
-because we don't want to wait while holding the metapage exclusive-lock,
-and second because it could very easily result in deadlock.  (The other
-process might be out of the hash AM altogether, and could do something
-that blocks on another lock this process holds; so even if the hash
-algorithm itself is deadlock-free, a user-induced deadlock could occur.)
-So, this is a conditional LockAcquire operation, and if it fails we just
-abandon the attempt to split.  This is all right since the index is
-overfull but perfectly functional.  Every subsequent inserter will try to
-split, and eventually one will succeed.  If multiple inserters failed to
-split, the index might still be overfull, but eventually, the index will
+Split's attempt to acquire cleanup-lock on the old bucket number could fail
+if another process holds any lock or pin on it.  We do not want to wait if
+that happens, because we don't want to wait while holding the metapage
+exclusive-lock.  So, this is a conditional LWLockAcquire operation, and if
+it fails we just abandon the attempt to split.  This is all right since the
+index is overfull but perfectly functional.  Every subsequent inserter will
+try to split, and eventually one will succeed.  If multiple inserters failed
+to split, the index might still be overfull, but eventually, the index will
 not be overfull and split attempts will stop.  (We could make a successful
 splitter loop to see if the index is still overfull, but it seems better to
 distribute the split overhead across successive insertions.)
 
+The has_garbage flag indicates that the bucket contains tuples that were
+moved by a split.  It is set only on the old bucket.  We need it in addition
+to the split-in-progress flag to distinguish the case where the split is
+already over (i.e. the split-in-progress flag has been cleared).  It is used
+both by vacuum and by the re-split operation.  Vacuum uses it to decide
+whether it needs to remove the moved-by-split tuples from the bucket along
+with dead tuples.  A re-split of the bucket uses it to ensure that it doesn't
+start a new split before the tuples left over from the previous split have
+been cleared from the old bucket.  This usage by re-split helps keep bloat
+under control and makes the design somewhat simpler, as we never have to
+handle a bucket that contains dead tuples from multiple splits.
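
A minimal sketch, assuming illustrative flag bits, of the two decisions this
flag drives (when vacuum may also remove moved-by-split tuples, and when a new
split may begin); it is not part of the patch:

	#include <stdbool.h>
	#include <stdint.h>

	#define SKETCH_OLD_SPLIT_IN_PROGRESS	0x0001	/* split-in-progress, old half */
	#define SKETCH_HAS_GARBAGE				0x0002	/* has_garbage */

	/* May vacuum also remove moved-by-split tuples from this old bucket? */
	static bool
	sketch_vacuum_may_remove_moved_tuples(uint16_t flags)
	{
		return (flags & SKETCH_HAS_GARBAGE) != 0 &&
			(flags & SKETCH_OLD_SPLIT_IN_PROGRESS) == 0;
	}

	/* May a new split start from this bucket without cleaning it first? */
	static bool
	sketch_may_start_new_split(uint16_t flags)
	{
		return (flags & (SKETCH_HAS_GARBAGE | SKETCH_OLD_SPLIT_IN_PROGRESS)) == 0;
	}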
+
 A problem is that if a split fails partway through (eg due to insufficient
-disk space) the index is left corrupt.  The probability of that could be
-made quite low if we grab a free page or two before we update the meta
-page, but the only real solution is to treat a split as a WAL-loggable,
+disk space or crash) the index is left corrupt.  The probability of that
+could be made quite low if we grab a free page or two before we update the
+meta page, but the only real solution is to treat a split as a WAL-loggable,
 must-complete action.  I'm not planning to teach hash about WAL in this
-go-round.
+go-round.  However, we do try to finish the incomplete splits during insert
+and split.
 
 The fourth operation is garbage collection (bulk deletion):
 
@@ -319,9 +392,13 @@ The fourth operation is garbage collection (bulk deletion):
 	fetch current max bucket number
 	release meta page buffer content lock and pin
 	while next bucket <= max bucket do
-		Acquire X lock on target bucket
-		Scan and remove tuples, compact free space as needed
-		Release X lock
+		Acquire cleanup lock on target bucket
+		Scan and remove tuples
+		When moving to an overflow page, first lock the next page and then
+		release the lock on the current page
+		Ensure we hold an exclusive lock on the primary bucket page
+		If buffer pincount is one, then compact free space as needed
+		Release lock
 		next bucket ++
 	end loop
 	pin metapage and take buffer content lock in exclusive mode
@@ -330,20 +407,23 @@ The fourth operation is garbage collection (bulk deletion):
 	else update metapage tuple count
 	mark meta page dirty and release buffer content lock and pin
 
-Note that this is designed to allow concurrent splits.  If a split occurs,
-tuples relocated into the new bucket will be visited twice by the scan,
-but that does no harm.  (We must however be careful about the statistics
-reported by the VACUUM operation.  What we can do is count the number of
-tuples scanned, and believe this in preference to the stored tuple count
-if the stored tuple count and number of buckets did *not* change at any
-time during the scan.  This provides a way of correcting the stored tuple
-count if it gets out of sync for some reason.  But if a split or insertion
-does occur concurrently, the scan count is untrustworthy; instead,
-subtract the number of tuples deleted from the stored tuple count and
-use that.)
-
-The exclusive lock request could deadlock in some strange scenarios, but
-we can just error out without any great harm being done.
+Note that this is designed to allow concurrent splits and scans.  If a
+split occurs, tuples relocated into the new bucket will be visited twice
+by the scan, but that does no harm.  Because the cleanup releases its locks
+as it moves through a bucket, a concurrent scan can start on that bucket,
+but such a scan always stays behind the cleanup.  It is essential to keep
+scans behind cleanup; otherwise vacuum could remove tuples that are still
+required to complete the scan, as explained in the Lock Definitions section
+above.  This holds true for backward scans as well (backward scans first
+traverse each bucket starting from the first bucket page to the last
+overflow page in the chain).
+We must be careful about the statistics reported by the VACUUM operation.
+What we can do is count the number of tuples scanned, and believe this in
+preference to the stored tuple count if the stored tuple count and number
+of buckets did *not* change at any time during the scan.  This provides a
+way of correcting the stored tuple count if it gets out of sync for some
+reason.  But if a split or insertion does occur concurrently, the scan
+count is untrustworthy; instead, subtract the number of tuples deleted
+from the stored tuple count and use that.
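
A tiny standalone sketch (not part of the patch) of the statistics rule just
described, with invented names:

	#include <stdbool.h>

	/*
	 * Trust the scanned count only if neither the stored tuple count nor the
	 * number of buckets changed during the scan; otherwise fall back to the
	 * stored count minus the tuples we deleted.
	 */
	static double
	sketch_new_tuple_count(double stored_count, double scanned_count,
						   double tuples_deleted, bool meta_changed_during_scan)
	{
		if (!meta_changed_during_scan)
			return scanned_count;	/* also repairs a stale stored count */
		return stored_count - tuples_deleted;
	}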
 
 
 Free Space Management
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..a12a830 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -287,10 +287,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		/*
 		 * An insertion into the current index page could have happened while
 		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold share lock on
-		 * the bucket, no deletions or splits could have occurred; therefore
-		 * we can expect that the TID still exists in the current index page,
-		 * at an offset >= where we were.)
+		 * for the TID we previously returned.  (Because we hold pin on the
+		 * bucket, no deletions or splits could have occurred; therefore we
+		 * can expect that the TID still exists in the current index page, at
+		 * an offset >= where we were.)
 		 */
 		OffsetNumber maxoffnum;
 
@@ -425,12 +425,15 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
 	so->hashso_bucket_valid = false;
-	so->hashso_bucket_blkno = 0;
 	so->hashso_curbuf = InvalidBuffer;
+	so->hashso_bucket_buf = InvalidBuffer;
+	so->hashso_old_bucket_buf = InvalidBuffer;
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
 	ItemPointerSetInvalid(&(so->hashso_heappos));
 
+	so->hashso_skip_moved_tuples = false;
+
 	scan->opaque = so;
 
 	/* register scan in case we change pages it's using */
@@ -449,15 +452,7 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
-
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	_hash_dropscanbuf(rel, so);
 
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
@@ -471,6 +466,8 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 				scan->numberOfKeys * sizeof(ScanKeyData));
 		so->hashso_bucket_valid = false;
 	}
+
+	so->hashso_skip_moved_tuples = false;
 }
 
 /*
@@ -484,16 +481,7 @@ hashendscan(IndexScanDesc scan)
 
 	/* don't need scan registered anymore */
 	_hash_dropscan(scan);
-
-	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
-
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	_hash_dropscanbuf(rel, so);
 
 	pfree(so);
 	scan->opaque = NULL;
@@ -504,6 +492,9 @@ hashendscan(IndexScanDesc scan)
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
  *
+ * This function also deletes tuples that were moved by a split to another
+ * bucket.
+ *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 IndexBulkDeleteResult *
@@ -548,83 +539,52 @@ loop_top:
 	{
 		BlockNumber bucket_blkno;
 		BlockNumber blkno;
-		bool		bucket_dirty = false;
+		Buffer		bucket_buf;
+		Buffer		buf;
+		HashPageOpaque bucket_opaque;
+		Page		page;
+		bool		bucket_has_garbage = false;
 
 		/* Get address of bucket's start page */
 		bucket_blkno = BUCKET_TO_BLKNO(&local_metapage, cur_bucket);
 
-		/* Exclusive-lock the bucket so we can shrink it */
-		_hash_getlock(rel, bucket_blkno, HASH_EXCLUSIVE);
-
 		/* Shouldn't have any active scans locally, either */
 		if (_hash_has_active_scan(rel, cur_bucket))
 			elog(ERROR, "hash index has active scan during VACUUM");
 
-		/* Scan each page in bucket */
 		blkno = bucket_blkno;
-		while (BlockNumberIsValid(blkno))
-		{
-			Buffer		buf;
-			Page		page;
-			HashPageOpaque opaque;
-			OffsetNumber offno;
-			OffsetNumber maxoffno;
-			OffsetNumber deletable[MaxOffsetNumber];
-			int			ndeletable = 0;
 
-			vacuum_delay_point();
-
-			buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-											 info->strategy);
-			page = BufferGetPage(buf);
-			opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-			Assert(opaque->hasho_bucket == cur_bucket);
-
-			/* Scan each tuple in page */
-			maxoffno = PageGetMaxOffsetNumber(page);
-			for (offno = FirstOffsetNumber;
-				 offno <= maxoffno;
-				 offno = OffsetNumberNext(offno))
-			{
-				IndexTuple	itup;
-				ItemPointer htup;
+		/*
+		 * We need to acquire a cleanup lock on the primary bucket page to
+		 * wait out concurrent scans.
+		 */
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy);
+		LockBufferForCleanup(buf);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
 
-				itup = (IndexTuple) PageGetItem(page,
-												PageGetItemId(page, offno));
-				htup = &(itup->t_tid);
-				if (callback(htup, callback_state))
-				{
-					/* mark the item for deletion */
-					deletable[ndeletable++] = offno;
-					tuples_removed += 1;
-				}
-				else
-					num_index_tuples += 1;
-			}
+		page = BufferGetPage(buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
-			/*
-			 * Apply deletions and write page if needed, advance to next page.
-			 */
-			blkno = opaque->hasho_nextblkno;
+		/*
+		 * If the bucket contains tuples that were moved by a split, then we
+		 * need to delete such tuples once the split is complete.  Before
+		 * cleaning, we need to wait out any scans that started while the
+		 * split was in progress for this bucket.
+		 */
+		if (H_HAS_GARBAGE(bucket_opaque) &&
+			!H_INCOMPLETE_SPLIT(bucket_opaque))
+			bucket_has_garbage = true;
 
-			if (ndeletable > 0)
-			{
-				PageIndexMultiDelete(page, deletable, ndeletable);
-				_hash_wrtbuf(rel, buf);
-				bucket_dirty = true;
-			}
-			else
-				_hash_relbuf(rel, buf);
-		}
+		bucket_buf = buf;
 
-		/* If we deleted anything, try to compact free space */
-		if (bucket_dirty)
-			_hash_squeezebucket(rel, cur_bucket, bucket_blkno,
-								info->strategy);
+		hashbucketcleanup(rel, bucket_buf, blkno, info->strategy,
+						  local_metapage.hashm_maxbucket,
+						  local_metapage.hashm_highmask,
+						  local_metapage.hashm_lowmask, &tuples_removed,
+						  &num_index_tuples, bucket_has_garbage, true,
+						  callback, callback_state);
 
-		/* Release bucket lock */
-		_hash_droplock(rel, bucket_blkno, HASH_EXCLUSIVE);
+		_hash_relbuf(rel, bucket_buf);
 
 		/* Advance to next bucket */
 		cur_bucket++;
@@ -705,6 +665,197 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	return stats;
 }
 
+/*
+ * Helper function to perform deletion of index entries from a bucket.
+ *
+ * This expects that the caller has acquired a cleanup lock on the target
+ * bucket (the primary page of a bucket); it is the responsibility of the
+ * caller to release that lock.
+ *
+ * During the scan of overflow pages, we first lock the next page and then
+ * release the lock on the current page.  This ensures that any concurrent
+ * scan started after we begin cleaning the bucket will always stay behind
+ * the cleanup.  Allowing scans to overtake vacuum would let vacuum remove
+ * tuples that the scan still requires.
+ *
+ * We need to retain a pin on the primary bucket to ensure that no concurrent
+ * split can start.
+ */
+void
+hashbucketcleanup(Relation rel, Buffer bucket_buf,
+				  BlockNumber bucket_blkno,
+				  BufferAccessStrategy bstrategy,
+				  uint32 maxbucket,
+				  uint32 highmask, uint32 lowmask,
+				  double *tuples_removed,
+				  double *num_index_tuples,
+				  bool bucket_has_garbage,
+				  bool delay,
+				  IndexBulkDeleteCallback callback,
+				  void *callback_state)
+{
+	BlockNumber blkno;
+	Buffer		buf;
+	Bucket		cur_bucket;
+	Bucket new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
+	Page		page;
+	bool		bucket_dirty = false;
+
+	blkno = bucket_blkno;
+	buf = bucket_buf;
+	page = BufferGetPage(buf);
+	cur_bucket = ((HashPageOpaque) PageGetSpecialPointer(page))->hasho_bucket;
+
+	if (bucket_has_garbage)
+		new_bucket = _hash_get_newbucket(rel, cur_bucket,
+										 lowmask, maxbucket);
+
+	/* Scan each page in bucket */
+	for (;;)
+	{
+		HashPageOpaque opaque;
+		OffsetNumber offno;
+		OffsetNumber maxoffno;
+		Buffer		next_buf;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+		bool		retain_pin = false;
+		bool		curr_page_dirty = false;
+
+		if (delay)
+			vacuum_delay_point();
+
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		/* Scan each tuple in page */
+		maxoffno = PageGetMaxOffsetNumber(page);
+		for (offno = FirstOffsetNumber;
+			 offno <= maxoffno;
+			 offno = OffsetNumberNext(offno))
+		{
+			IndexTuple	itup;
+			ItemPointer htup;
+			Bucket		bucket;
+
+			itup = (IndexTuple) PageGetItem(page,
+											PageGetItemId(page, offno));
+			htup = &(itup->t_tid);
+			if (callback && callback(htup, callback_state))
+			{
+				/* mark the item for deletion */
+				deletable[ndeletable++] = offno;
+				if (tuples_removed)
+					*tuples_removed += 1;
+			}
+			else if (bucket_has_garbage)
+			{
+				/* delete the tuples that are moved by split. */
+				bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+											  maxbucket,
+											  highmask,
+											  lowmask);
+				/* mark the item for deletion */
+				if (bucket != cur_bucket)
+				{
+					/*
+					 * We expect tuples to belong to either the current bucket
+					 * or new_bucket.  This is ensured because we don't allow
+					 * further splits from a bucket that contains garbage. See
+					 * comments in _hash_expandtable.
+					 */
+					Assert(bucket == new_bucket);
+					deletable[ndeletable++] = offno;
+				}
+				else if (num_index_tuples)
+					*num_index_tuples += 1;
+			}
+			else if (num_index_tuples)
+				*num_index_tuples += 1;
+		}
+
+		/* retain the pin on primary bucket till end of bucket scan */
+		if (blkno == bucket_blkno)
+			retain_pin = true;
+		else
+			retain_pin = false;
+
+		blkno = opaque->hasho_nextblkno;
+
+		/*
+		 * Apply deletions and write page if needed, advance to next page.
+		 */
+		if (ndeletable > 0)
+		{
+			PageIndexMultiDelete(page, deletable, ndeletable);
+			bucket_dirty = true;
+			curr_page_dirty = true;
+		}
+
+		/* bail out if there are no more pages to scan. */
+		if (!BlockNumberIsValid(blkno))
+			break;
+
+		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											  LH_OVERFLOW_PAGE,
+											  bstrategy);
+
+		/*
+		 * release the lock on previous page after acquiring the lock on next
+		 * page
+		 */
+		if (curr_page_dirty)
+		{
+			if (retain_pin)
+				_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+			else
+				_hash_wrtbuf(rel, buf);
+			curr_page_dirty = false;
+		}
+		else if (retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
+
+		buf = next_buf;
+	}
+
+	/*
+	 * Lock the bucket page to clear the garbage flag and squeeze the bucket.
+	 * If the current buffer is the same as the bucket buffer, then we already
+	 * hold a lock on the bucket page.
+	 */
+	if (buf != bucket_buf)
+	{
+		_hash_relbuf(rel, buf);
+		_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+	}
+
+	/*
+	 * Clear the garbage flag from the bucket after deleting the tuples that
+	 * were moved by a split.  We purposefully clear the flag before squeezing
+	 * the bucket, so that after a restart vacuum doesn't again try to delete
+	 * the moved-by-split tuples.
+	 */
+	if (bucket_has_garbage)
+	{
+		HashPageOpaque bucket_opaque;
+
+		page = BufferGetPage(bucket_buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+	}
+
+	/*
+	 * If we deleted anything, try to compact free space.  For squeezing the
+	 * bucket, we must have a cleanup lock, else the squeeze could change the
+	 * ordering of tuples seen by a scan that started before it.
+	 */
+	if (bucket_dirty && CheckBufferForCleanup(bucket_buf))
+		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
+							bstrategy);
+}
 
 void
 hash_redo(XLogReaderState *record)
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index acd2e64..5cfd0aa 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -28,7 +28,8 @@
 void
 _hash_doinsert(Relation rel, IndexTuple itup)
 {
-	Buffer		buf;
+	Buffer		buf = InvalidBuffer;
+	Buffer		bucket_buf;
 	Buffer		metabuf;
 	HashMetaPage metap;
 	BlockNumber blkno;
@@ -40,6 +41,9 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	bool		do_expand;
 	uint32		hashkey;
 	Bucket		bucket;
+	uint32		maxbucket;
+	uint32		highmask;
+	uint32		lowmask;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -70,51 +74,131 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			errhint("Values larger than a buffer page cannot be indexed.")));
 
 	/*
-	 * Loop until we get a lock on the correct target bucket.
+	 * Copy bucket mapping info now; the comment in _hash_expandtable where
+	 * we copy this information and call _hash_splitbucket explains why this
+	 * is OK.
 	 */
-	for (;;)
-	{
-		/*
-		 * Compute the target bucket number, and convert to block number.
-		 */
-		bucket = _hash_hashkey2bucket(hashkey,
-									  metap->hashm_maxbucket,
-									  metap->hashm_highmask,
-									  metap->hashm_lowmask);
+	maxbucket = metap->hashm_maxbucket;
+	highmask = metap->hashm_highmask;
+	lowmask = metap->hashm_lowmask;
 
-		blkno = BUCKET_TO_BLKNO(metap, bucket);
+	/*
+	 * Conditionally get the lock on the primary bucket page for insertion
+	 * while holding the lock on the meta page.  If we have to wait, then
+	 * release the meta page lock and retry the hard way.
+	 */
+	bucket = _hash_hashkey2bucket(hashkey,
+								  maxbucket,
+								  highmask,
+								  lowmask);
+
+	blkno = BUCKET_TO_BLKNO(metap, bucket);
 
-		/* Release metapage lock, but keep pin. */
+	/* Fetch the primary bucket page for the bucket */
+	buf = ReadBuffer(rel, blkno);
+	if (!ConditionalLockBuffer(buf))
+	{
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(buf, HASH_WRITE);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+		oldblkno = blkno;
+		retry = true;
+	}
+	else
+	{
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
 		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+	}
 
+	if (retry)
+	{
 		/*
-		 * If the previous iteration of this loop locked what is still the
-		 * correct target bucket, we are done.  Otherwise, drop any old lock
-		 * and lock what now appears to be the correct bucket.
+		 * Loop until we get a lock on the correct target bucket.  We take the
+		 * lock on the primary bucket page and retain the pin on it during the
+		 * insert operation to prevent concurrent splits.  Retaining the pin
+		 * on the primary bucket page ensures that a split can't happen, as a
+		 * split needs to acquire the cleanup lock on the primary bucket page.
+		 * Locking the primary bucket and rechecking that it is the target
+		 * bucket is mandatory, as otherwise a concurrent split might cause
+		 * this insertion to land in the wrong bucket.
 		 */
-		if (retry)
+		for (;;)
 		{
+			/*
+			 * Compute the target bucket number, and convert to block number.
+			 */
+			bucket = _hash_hashkey2bucket(hashkey,
+										  metap->hashm_maxbucket,
+										  metap->hashm_highmask,
+										  metap->hashm_lowmask);
+
+			blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+			/* Release metapage lock, but keep pin. */
+			_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+			/*
+			 * If the previous iteration of this loop locked what is still the
+			 * correct target bucket, we are done.  Otherwise, drop any old
+			 * lock and lock what now appears to be the correct bucket.
+			 */
 			if (oldblkno == blkno)
 				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
-		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+			_hash_relbuf(rel, buf);
 
-		/*
-		 * Reacquire metapage lock and check that no bucket split has taken
-		 * place while we were awaiting the bucket lock.
-		 */
-		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
-		oldblkno = blkno;
-		retry = true;
+			/* Fetch the primary bucket page for the bucket */
+			buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+
+			/*
+			 * Reacquire metapage lock and check that no bucket split has
+			 * taken place while we were awaiting the bucket lock.
+			 */
+			_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+			oldblkno = blkno;
+		}
 	}
 
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+	/* remember the primary bucket buffer to release the pin on it at end. */
+	bucket_buf = buf;
+
 	page = BufferGetPage(buf);
 	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(pageopaque->hasho_bucket == bucket);
 
+	/*
+	 * If there is any pending split, try to finish it before proceeding with
+	 * the insertion.  We only do this when inserting into the old bucket, as
+	 * that allows us to remove the tuples moved out of the old bucket and
+	 * reuse the space.  There is no comparable benefit to finishing the split
+	 * when inserting into the new bucket.
+	 *
+	 * In future, if we want to finish the splits during insertion in new
+	 * bucket, we must ensure the locking order such that old bucket is locked
+	 * before new bucket.
+	 */
+	if (H_OLD_INCOMPLETE_SPLIT(pageopaque) && CheckBufferForCleanup(buf))
+	{
+		BlockNumber nblkno;
+		Buffer		nbuf;
+
+		nblkno = _hash_get_newblk(rel, pageopaque);
+
+		/* Fetch the primary bucket page for the new bucket */
+		nbuf = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+		if (nbuf)
+		{
+			_hash_finish_split(rel, metabuf, buf, nbuf, maxbucket,
+							   highmask, lowmask);
+
+			/*
+			 * Release the buffer here, as the insertion will happen in the
+			 * old bucket.
+			 */
+			_hash_relbuf(rel, nbuf);
+		}
+	}
+
 	/* Do the insertion */
 	while (PageGetFreeSpace(page) < itemsz)
 	{
@@ -127,14 +211,23 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		{
 			/*
 			 * ovfl page exists; go get it.  if it doesn't have room, we'll
-			 * find out next pass through the loop test above.
+			 * find out next pass through the loop test above.  Retain the pin
+			 * if it is a primary bucket.
 			 */
-			_hash_relbuf(rel, buf);
+			if (pageopaque->hasho_flag & LH_BUCKET_PAGE)
+				_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+			else
+				_hash_relbuf(rel, buf);
 			buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 			page = BufferGetPage(buf);
 		}
 		else
 		{
+			bool		retain_pin = false;
+
+			/* page flags must be accessed before releasing lock on a page. */
+			retain_pin = pageopaque->hasho_flag & LH_BUCKET_PAGE;
+
 			/*
 			 * we're at the end of the bucket chain and we haven't found a
 			 * page with enough room.  allocate a new overflow page.
@@ -144,7 +237,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 
 			/* chain to a new overflow page */
-			buf = _hash_addovflpage(rel, metabuf, buf);
+			buf = _hash_addovflpage(rel, metabuf, buf, retain_pin);
 			page = BufferGetPage(buf);
 
 			/* should fit now, given test above */
@@ -158,11 +251,13 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	/* found page with enough space, so add the item here */
 	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
 
-	/* write and release the modified page */
+	/*
+	 * Write and release the modified page, and make sure to release the pin
+	 * on the primary bucket page.
+	 */
 	_hash_wrtbuf(rel, buf);
-
-	/* We can drop the bucket lock now */
-	_hash_droplock(rel, blkno, HASH_SHARE);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
 
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index db3e268..760563a 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -82,23 +82,20 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *
  *	On entry, the caller must hold a pin but no lock on 'buf'.  The pin is
  *	dropped before exiting (we assume the caller is not interested in 'buf'
- *	anymore).  The returned overflow page will be pinned and write-locked;
- *	it is guaranteed to be empty.
+ *	anymore) if not asked to retain.  The pin will be retained only for the
+ *	primary bucket.  The returned overflow page will be pinned and
+ *	write-locked; it is guaranteed to be empty.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
  *
- *	The caller must hold at least share lock on the bucket, to ensure that
- *	no one else tries to compact the bucket meanwhile.  This guarantees that
- *	'buf' won't stop being part of the bucket while it's unlocked.
- *
  * NB: since this could be executed concurrently by multiple processes,
  * one should not assume that the returned overflow page will be the
  * immediate successor of the originally passed 'buf'.  Additional overflow
  * pages might have been added to the bucket chain in between.
  */
 Buffer
-_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
+_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 {
 	Buffer		ovflbuf;
 	Page		page;
@@ -131,7 +128,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 			break;
 
 		/* we assume we do not need to write the unmodified page */
-		_hash_relbuf(rel, buf);
+		if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
 
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
@@ -149,7 +149,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 
 	/* logically chain overflow page to previous page */
 	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	_hash_wrtbuf(rel, buf);
+	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+		_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_wrtbuf(rel, buf);
 
 	return ovflbuf;
 }
@@ -370,11 +373,11 @@ _hash_firstfreebit(uint32 map)
  *	in the bucket, or InvalidBlockNumber if no following page.
  *
  *	NB: caller must not hold lock on metapage, nor on either page that's
- *	adjacent in the bucket chain.  The caller had better hold exclusive lock
- *	on the bucket, too.
+ *	adjacent in the bucket chain, except for the primary bucket page.  The
+ *	caller had better hold a cleanup lock on the primary bucket.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf,
+_hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
@@ -413,22 +416,41 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf,
 	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
-	 * deleted.  No concurrency issues since we hold exclusive lock on the
-	 * entire bucket.
+	 * deleted.  No concurrency issues since we hold the cleanup lock on
+	 * the primary bucket.  We don't need to acquire a buffer lock to fix the
+	 * primary bucket page, as we already have that lock.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
-														 prevblkno,
-														 HASH_WRITE,
+		if (prevblkno == bucket_blkno)
+		{
+			Buffer		prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
+													 prevblkno,
+													 RBM_NORMAL,
+													 bstrategy);
+
+			Page		prevpage = BufferGetPage(prevbuf);
+			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+			Assert(prevopaque->hasho_bucket == bucket);
+			prevopaque->hasho_nextblkno = nextblkno;
+			MarkBufferDirty(prevbuf);
+			ReleaseBuffer(prevbuf);
+		}
+		else
+		{
+			Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
+															 prevblkno,
+															 HASH_WRITE,
 										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		prevpage = BufferGetPage(prevbuf);
-		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+															 bstrategy);
+			Page		prevpage = BufferGetPage(prevbuf);
+			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
 
-		Assert(prevopaque->hasho_bucket == bucket);
-		prevopaque->hasho_nextblkno = nextblkno;
-		_hash_wrtbuf(rel, prevbuf);
+			Assert(prevopaque->hasho_bucket == bucket);
+			prevopaque->hasho_nextblkno = nextblkno;
+			_hash_wrtbuf(rel, prevbuf);
+		}
 	}
 	if (BlockNumberIsValid(nextblkno))
 	{
@@ -570,7 +592,7 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
  *	required that to be true on entry as well, but it's a lot easier for
  *	callers to leave empty overflow pages and let this guy clean it up.
  *
- *	Caller must hold exclusive lock on the target bucket.  This allows
+ *	Caller must hold cleanup lock on the target bucket.  This allows
  *	us to safely lock multiple pages in the bucket.
  *
  *	Since this function is invoked in VACUUM, we provide an access strategy
@@ -580,6 +602,7 @@ void
 _hash_squeezebucket(Relation rel,
 					Bucket bucket,
 					BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy)
 {
 	BlockNumber wblkno;
@@ -591,27 +614,22 @@ _hash_squeezebucket(Relation rel,
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
 	bool		wbuf_dirty;
+	bool		release_buf = false;
 
 	/*
 	 * start squeezing into the base bucket page.
 	 */
 	wblkno = bucket_blkno;
-	wbuf = _hash_getbuf_with_strategy(rel,
-									  wblkno,
-									  HASH_WRITE,
-									  LH_BUCKET_PAGE,
-									  bstrategy);
+	wbuf = bucket_buf;
 	wpage = BufferGetPage(wbuf);
 	wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 
 	/*
-	 * if there aren't any overflow pages, there's nothing to squeeze.
+	 * if there aren't any overflow pages, there's nothing to squeeze.  The
+	 * caller is responsible for releasing the lock on the primary bucket.
 	 */
 	if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
-	{
-		_hash_relbuf(rel, wbuf);
 		return;
-	}
 
 	/*
 	 * Find the last page in the bucket chain by starting at the base bucket
@@ -669,12 +687,17 @@ _hash_squeezebucket(Relation rel,
 			{
 				Assert(!PageIsEmpty(wpage));
 
+				if (wblkno != bucket_blkno)
+					release_buf = true;
+
 				wblkno = wopaque->hasho_nextblkno;
 				Assert(BlockNumberIsValid(wblkno));
 
-				if (wbuf_dirty)
+				if (wbuf_dirty && release_buf)
 					_hash_wrtbuf(rel, wbuf);
-				else
+				else if (wbuf_dirty)
+					MarkBufferDirty(wbuf);
+				else if (release_buf)
 					_hash_relbuf(rel, wbuf);
 
 				/* nothing more to do if we reached the read page */
@@ -700,6 +723,7 @@ _hash_squeezebucket(Relation rel,
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
 				wbuf_dirty = false;
+				release_buf = false;
 			}
 
 			/*
@@ -733,19 +757,25 @@ _hash_squeezebucket(Relation rel,
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
 		{
-			/* yes, so release wbuf lock first */
-			if (wbuf_dirty)
+			if (wblkno != bucket_blkno)
+				release_buf = true;
+
+			/* yes, so release wbuf lock first if needed */
+			if (wbuf_dirty && release_buf)
 				_hash_wrtbuf(rel, wbuf);
-			else
+			else if (wbuf_dirty)
+				MarkBufferDirty(wbuf);
+			else if (release_buf)
 				_hash_relbuf(rel, wbuf);
+
 			/* free this overflow page (releases rbuf) */
-			_hash_freeovflpage(rel, rbuf, bstrategy);
+			_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
 			/* done */
 			return;
 		}
 
 		/* free this overflow page, then get the previous one */
-		_hash_freeovflpage(rel, rbuf, bstrategy);
+		_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
 
 		rbuf = _hash_getbuf_with_strategy(rel,
 										  rblkno,
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 178463f..f51c313 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -38,10 +38,14 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
 					uint32 nblocks);
 static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
+static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
+					   Bucket obucket, Bucket nbucket, Buffer obuf,
+					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
+					   uint32 highmask, uint32 lowmask);
 
 
 /*
@@ -55,46 +59,6 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 
 
 /*
- * _hash_getlock() -- Acquire an lmgr lock.
- *
- * 'whichlock' should the block number of a bucket's primary bucket page to
- * acquire the per-bucket lock.  (See README for details of the use of these
- * locks.)
- *
- * 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
- */
-void
-_hash_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		LockPage(rel, whichlock, access);
-}
-
-/*
- * _hash_try_getlock() -- Acquire an lmgr lock, but only if it's free.
- *
- * Same as above except we return FALSE without blocking if lock isn't free.
- */
-bool
-_hash_try_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		return ConditionalLockPage(rel, whichlock, access);
-	else
-		return true;
-}
-
-/*
- * _hash_droplock() -- Release an lmgr lock.
- */
-void
-_hash_droplock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		UnlockPage(rel, whichlock, access);
-}
-
-/*
  *	_hash_getbuf() -- Get a buffer by block number for read or write.
  *
  *		'access' must be HASH_READ, HASH_WRITE, or HASH_NOLOCK.
@@ -132,6 +96,35 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access, int flags)
 }
 
 /*
+ * _hash_getbuf_with_condlock_cleanup() -- as above, but get the buffer for write.
+ *
+ *		We try to take the conditional cleanup lock and if we get it then
+ *		return the buffer, else return InvalidBuffer.
+ */
+Buffer
+_hash_getbuf_with_condlock_cleanup(Relation rel, BlockNumber blkno, int flags)
+{
+	Buffer		buf;
+
+	if (blkno == P_NEW)
+		elog(ERROR, "hash AM does not use P_NEW");
+
+	buf = ReadBuffer(rel, blkno);
+
+	if (!ConditionalLockBufferForCleanup(buf))
+	{
+		ReleaseBuffer(buf);
+		return InvalidBuffer;
+	}
+
+	/* ref count and lock type are correct */
+
+	_hash_checkpage(rel, buf, flags);
+
+	return buf;
+}
+
+/*
  *	_hash_getinitbuf() -- Get and initialize a buffer by block number.
  *
  *		This must be used only to fetch pages that are known to be before
@@ -266,6 +259,33 @@ _hash_dropbuf(Relation rel, Buffer buf)
 }
 
 /*
+ *	_hash_dropscanbuf() -- release buffers used in scan.
+ *
+ * This routine unpins the buffers used during scan on which we
+ * hold no lock.
+ */
+void
+_hash_dropscanbuf(Relation rel, HashScanOpaque so)
+{
+	/* release pin we hold on primary bucket */
+	if (BufferIsValid(so->hashso_bucket_buf) &&
+		so->hashso_bucket_buf != so->hashso_curbuf)
+		_hash_dropbuf(rel, so->hashso_bucket_buf);
+	so->hashso_bucket_buf = InvalidBuffer;
+
+	/* release pin we hold on old primary bucket */
+	if (BufferIsValid(so->hashso_old_bucket_buf) &&
+		so->hashso_old_bucket_buf != so->hashso_curbuf)
+		_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+	so->hashso_old_bucket_buf = InvalidBuffer;
+
+	/* release any pin we still hold */
+	if (BufferIsValid(so->hashso_curbuf))
+		_hash_dropbuf(rel, so->hashso_curbuf);
+	so->hashso_curbuf = InvalidBuffer;
+}
+
+/*
  *	_hash_wrtbuf() -- write a hash page to disk.
  *
  *		This routine releases the lock held on the buffer and our refcount
@@ -489,9 +509,11 @@ _hash_pageinit(Page page, Size size)
 /*
  * Attempt to expand the hash table by creating one new bucket.
  *
- * This will silently do nothing if it cannot get the needed locks.
+ * This will silently do nothing if our own backend has active scans or if
+ * we cannot get a cleanup lock on the old or new bucket.
  *
- * The caller should hold no locks on the hash index.
+ * It also completes any pending split and removes tuples left over in the
+ * old bucket from a previous split.
  *
  * The caller must hold a pin, but no lock, on the metapage buffer.
  * The buffer is returned in the same state.
@@ -506,10 +528,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	BlockNumber start_oblkno;
 	BlockNumber start_nblkno;
 	Buffer		buf_nblkno;
+	Buffer		buf_oblkno;
+	Page		opage;
+	HashPageOpaque oopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
 
+restart_expand:
+
 	/*
 	 * Write-lock the meta page.  It used to be necessary to acquire a
 	 * heavyweight lock to begin a split, but that is no longer required.
@@ -548,11 +575,16 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		goto fail;
 
 	/*
-	 * Determine which bucket is to be split, and attempt to lock the old
-	 * bucket.  If we can't get the lock, give up.
+	 * Determine which bucket is to be split, and attempt to take cleanup lock
+	 * on the old bucket.  If we can't get the lock, give up.
 	 *
-	 * The lock protects us against other backends, but not against our own
-	 * backend.  Must check for active scans separately.
+	 * The cleanup lock protects us against other backends, but not against
+	 * our own backend.  Must check for active scans separately.
+	 *
+	 * The cleanup lock is mainly to protect the split from concurrent
+	 * inserts. See src/backend/access/hash/README, Lock Definitions for
+	 * further details.  Due to this locking restriction, if there is any
+	 * pending scan, the split will give up, which is not ideal but harmless.
 	 */
 	new_bucket = metap->hashm_maxbucket + 1;
 
@@ -563,11 +595,90 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	if (_hash_has_active_scan(rel, old_bucket))
 		goto fail;
 
-	if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
+	buf_oblkno = _hash_getbuf_with_condlock_cleanup(rel, start_oblkno, LH_BUCKET_PAGE);
+	if (!buf_oblkno)
 		goto fail;
 
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
 	/*
-	 * Likewise lock the new bucket (should never fail).
+	 * We want to finish any pending split from this bucket before starting a
+	 * new one: there is no apparent benefit to deferring it, and the code to
+	 * finish a split involving multiple buckets (in case the new split also
+	 * failed) would be complicated.  We don't need to consider the new bucket
+	 * for completing the split here, because a re-split of the new bucket
+	 * cannot start while a split from the old bucket is still pending.
+	 */
+	if (H_OLD_INCOMPLETE_SPLIT(oopaque))
+	{
+		BlockNumber nblkno;
+		Buffer		buf_nblkno;
+
+		/*
+		 * Copy bucket mapping info now; the comment in the code below where
+		 * we copy this information and call _hash_splitbucket explains why this
+		 * is OK.
+		 */
+		maxbucket = metap->hashm_maxbucket;
+		highmask = metap->hashm_highmask;
+		lowmask = metap->hashm_lowmask;
+
+		/* Release the metapage lock, before completing the split. */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+		nblkno = _hash_get_newblk(rel, oopaque);
+
+		/* Fetch the primary bucket page for the new bucket */
+		buf_nblkno = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+		if (!buf_nblkno)
+		{
+			_hash_relbuf(rel, buf_oblkno);
+			goto fail;
+		}
+
+		_hash_finish_split(rel, metabuf, buf_oblkno, buf_nblkno, maxbucket,
+						   highmask, lowmask);
+
+		/*
+		 * release the buffers and retry for expand.
+		 */
+		_hash_relbuf(rel, buf_oblkno);
+		_hash_relbuf(rel, buf_nblkno);
+
+		goto restart_expand;
+	}
+
+	/*
+	 * Clean up the tuples left over from the previous split.  This operation
+	 * requires a cleanup lock, and we already have one on the old bucket, so
+	 * do it now.  We also don't want to allow further splits from the bucket
+	 * until the garbage of the previous split has been cleaned.  This has two
+	 * advantages: first, it helps keep the bloat due to garbage under
+	 * control; second, during cleanup of a bucket we can always be sure that
+	 * the garbage tuples belong to the most recently split bucket.  By
+	 * contrast, if we allowed cleanup of a bucket after the meta page had
+	 * been updated to indicate the new split but before the actual split, the
+	 * cleanup operation would be unable to decide whether a tuple had been
+	 * moved to the newly created bucket, and could end up deleting it.
+	 */
+	if (H_HAS_GARBAGE(oopaque))
+	{
+		/* Release the metapage lock. */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+		hashbucketcleanup(rel, buf_oblkno, start_oblkno, NULL,
+						  metap->hashm_maxbucket, metap->hashm_highmask,
+						  metap->hashm_lowmask, NULL,
+						  NULL, true, false, NULL, NULL);
+
+		_hash_relbuf(rel, buf_oblkno);
+
+		goto restart_expand;
+	}
+
+	/*
+	 * There shouldn't be any active scan on new bucket.
 	 *
 	 * Note: it is safe to compute the new bucket's blkno here, even though we
 	 * may still need to update the BUCKET_TO_BLKNO mapping.  This is because
@@ -579,9 +690,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	if (_hash_has_active_scan(rel, new_bucket))
 		elog(ERROR, "scan in progress on supposedly new bucket");
 
-	if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
-		elog(ERROR, "could not get lock on supposedly new bucket");
-
 	/*
 	 * If the split point is increasing (hashm_maxbucket's log base 2
 	 * increases), we need to allocate a new batch of bucket pages.
@@ -600,8 +708,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
 			/* can't split due to BlockNumber overflow */
-			_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-			_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
+			_hash_relbuf(rel, buf_oblkno);
 			goto fail;
 		}
 	}
@@ -609,9 +716,18 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/*
 	 * Physically allocate the new bucket's primary page.  We want to do this
 	 * before changing the metapage's mapping info, in case we can't get the
-	 * disk space.
+	 * disk space.  Ideally, we wouldn't need to check for a cleanup lock on
+	 * the new bucket, as no other backend can find this bucket until the meta
+	 * page is updated.  However, it is good to be consistent with the old
+	 * bucket's locking.
 	 */
 	buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM);
+	if (!CheckBufferForCleanup(buf_nblkno))
+	{
+		_hash_relbuf(rel, buf_oblkno);
+		_hash_relbuf(rel, buf_nblkno);
+		goto fail;
+	}
+
 
 	/*
 	 * Okay to proceed with split.  Update the metapage bucket mapping info.
@@ -665,13 +781,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  start_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno,
 					  maxbucket, highmask, lowmask);
 
-	/* Release bucket locks, allowing others to access them */
-	_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-	_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
-
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -745,6 +857,10 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
  * The buffer is returned in the same state.  (The metapage is only
  * touched if it becomes necessary to add or remove overflow pages.)
  *
+ * The split needs to hold pins on the primary bucket pages of both the old
+ * and new buckets until the end of the operation.  This prevents vacuum from
+ * starting while the split is in progress.
+ *
  * In addition, the caller must have created the new bucket's base page,
  * which is passed in buffer nbuf, pinned and write-locked.  That lock and
  * pin are released here.  (The API is set up this way because we must do
@@ -756,37 +872,87 @@ _hash_splitbucket(Relation rel,
 				  Buffer metabuf,
 				  Bucket obucket,
 				  Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Buffer		obuf;
 	Page		opage;
 	Page		npage;
 	HashPageOpaque oopaque;
 	HashPageOpaque nopaque;
 
-	/*
-	 * It should be okay to simultaneously write-lock pages from each bucket,
-	 * since no one else can be trying to acquire buffer lock on pages of
-	 * either bucket.
-	 */
-	obuf = _hash_getbuf(rel, start_oblkno, HASH_WRITE, LH_BUCKET_PAGE);
 	opage = BufferGetPage(obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 
+	/*
+	 * Mark the old bucket to indicate that a split is in progress and that it
+	 * has deletable tuples.  At the end of the operation we clear the
+	 * split-in-progress flag; vacuum will clear the has-garbage flag after
+	 * deleting such tuples.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
+
 	npage = BufferGetPage(nbuf);
 
-	/* initialize the new bucket's primary page */
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 	nopaque->hasho_prevblkno = InvalidBlockNumber;
 	nopaque->hasho_nextblkno = InvalidBlockNumber;
 	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
 	nopaque->hasho_page_id = HASHO_PAGE_ID;
 
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, nbuf, NULL,
+						   maxbucket, highmask, lowmask);
+
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, obuf);
+	_hash_relbuf(rel, nbuf);
+}
+
+/*
+ * _hash_splitbucket_guts -- Helper function to perform the split operation
+ *
+ * This routine is used to partition the tuples between old and new bucket and
+ * is used to finish the incomplete split operations.  To finish the previously
+ * interrupted split operation, caller needs to fill htab.  If htab is set, then
+ * we skip the movement of tuples that exists in htab, otherwise NULL value of
+ * htab indicates movement of all the tuples that belong to new bucket.
+ *
+ * Caller needs to lock and unlock the old and new primary buckets.
+ */
+static void
+_hash_splitbucket_guts(Relation rel,
+					   Buffer metabuf,
+					   Bucket obucket,
+					   Bucket nbucket,
+					   Buffer obuf,
+					   Buffer nbuf,
+					   HTAB *htab,
+					   uint32 maxbucket,
+					   uint32 highmask,
+					   uint32 lowmask)
+{
+	Buffer		bucket_obuf;
+	Buffer		bucket_nbuf;
+	Page		opage;
+	Page		npage;
+	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
+
+	bucket_obuf = obuf;
+	opage = BufferGetPage(obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	bucket_nbuf = nbuf;
+	npage = BufferGetPage(nbuf);
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
 	/*
 	 * Partition the tuples in the old bucket between the old bucket and the
 	 * new bucket, advancing along the old bucket's overflow bucket chain and
@@ -798,8 +964,6 @@ _hash_splitbucket(Relation rel,
 		BlockNumber oblkno;
 		OffsetNumber ooffnum;
 		OffsetNumber omaxoffnum;
-		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
 
 		/* Scan each tuple in old page */
 		omaxoffnum = PageGetMaxOffsetNumber(opage);
@@ -810,18 +974,45 @@ _hash_splitbucket(Relation rel,
 			IndexTuple	itup;
 			Size		itemsz;
 			Bucket		bucket;
+			bool		found = false;
 
 			/*
-			 * Fetch the item's hash key (conveniently stored in the item) and
-			 * determine which bucket it now belongs in.
+			 * Before inserting the tuple, probe the hash table containing the
+			 * TIDs of tuples belonging to the new bucket; if we find a match,
+			 * skip that tuple.  Otherwise, fetch the item's hash key
+			 * (conveniently stored in the item) and determine which bucket it
+			 * now belongs in.
 			 */
 			itup = (IndexTuple) PageGetItem(opage,
 											PageGetItemId(opage, ooffnum));
+
+			if (htab)
+				(void) hash_search(htab, &itup->t_tid, HASH_FIND, &found);
+
+			if (found)
+				continue;
+
 			bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
 										  maxbucket, highmask, lowmask);
 
 			if (bucket == nbucket)
 			{
+				Size		itupsize = 0;
+				IndexTuple	new_itup;
+
+				/*
+				 * make a copy of index tuple as we have to scribble on it.
+				 */
+				new_itup = CopyIndexTuple(itup);
+
+				/*
+				 * Mark the index tuple as moved-by-split; such tuples are skipped
+				 * by scans while a split is in progress for the bucket.
+				 */
+				itupsize = new_itup->t_info & INDEX_SIZE_MASK;
+				new_itup->t_info &= ~INDEX_SIZE_MASK;
+				new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
+				new_itup->t_info |= itupsize;
+
 				/*
 				 * insert the tuple into the new bucket.  if it doesn't fit on
 				 * the current page in the new bucket, we must allocate a new
@@ -832,17 +1023,25 @@ _hash_splitbucket(Relation rel,
 				 * only partially complete, meaning the index is corrupt,
 				 * since searches may fail to find entries they should find.
 				 */
-				itemsz = IndexTupleDSize(*itup);
+				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
 				if (PageGetFreeSpace(npage) < itemsz)
 				{
+					bool		retain_pin = false;
+
+					/*
+					 * page flags must be accessed before releasing lock on a
+					 * page.
+					 */
+					retain_pin = nopaque->hasho_flag & LH_BUCKET_PAGE;
+
 					/* write out nbuf and drop lock, but keep pin */
 					_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
 					/* chain to a new overflow page */
-					nbuf = _hash_addovflpage(rel, metabuf, nbuf);
+					nbuf = _hash_addovflpage(rel, metabuf, nbuf, retain_pin);
 					npage = BufferGetPage(nbuf);
-					/* we don't need nopaque within the loop */
+					nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 				}
 
 				/*
@@ -852,12 +1051,10 @@ _hash_splitbucket(Relation rel,
 				 * Possible future improvement: accumulate all the items for
 				 * the new page and qsort them before insertion.
 				 */
-				(void) _hash_pgaddtup(rel, nbuf, itemsz, itup);
+				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
-				/*
-				 * Mark tuple for deletion from old page.
-				 */
-				deletable[ndeletable++] = ooffnum;
+				/* be tidy */
+				pfree(new_itup);
 			}
 			else
 			{
@@ -870,15 +1067,9 @@ _hash_splitbucket(Relation rel,
 
 		oblkno = oopaque->hasho_nextblkno;
 
-		/*
-		 * Done scanning this old page.  If we moved any tuples, delete them
-		 * from the old page.
-		 */
-		if (ndeletable > 0)
-		{
-			PageIndexMultiDelete(opage, deletable, ndeletable);
-			_hash_wrtbuf(rel, obuf);
-		}
+		/* retain the pin on the old primary bucket */
+		if (obuf == bucket_obuf)
+			_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
 		else
 			_hash_relbuf(rel, obuf);
 
@@ -887,18 +1078,153 @@ _hash_splitbucket(Relation rel,
 			break;
 
 		/* Else, advance to next old page */
-		obuf = _hash_getbuf(rel, oblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
+		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
 		opage = BufferGetPage(obuf);
 		oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 	}
 
 	/*
 	 * We're at the end of the old bucket chain, so we're done partitioning
-	 * the tuples.  Before quitting, call _hash_squeezebucket to ensure the
-	 * tuples remaining in the old bucket (including the overflow pages) are
-	 * packed as tightly as possible.  The new bucket is already tight.
+	 * the tuples.  Mark the old and new buckets to indicate split is
+	 * finished.
+	 *
+	 * To avoid deadlocks due to locking order of buckets, first lock the old
+	 * bucket and then the new bucket.
 	 */
-	_hash_wrtbuf(rel, nbuf);
+	if (nopaque->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_wrtbuf(rel, nbuf);
+
+	/*
+	 * Acquiring cleanup lock to clear the split-in-progress flag ensures that
+	 * there is no pending scan that has seen the flag after it is cleared.
+	 */
+	_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+	opage = BufferGetPage(bucket_obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	_hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);
+	npage = BufferGetPage(bucket_nbuf);
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
+	/* indicate that split is finished */
+	oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
+	nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;
+
+	/*
+	 * Now mark the buffers dirty; we don't release the locks here as the
+	 * caller is responsible for releasing them.
+	 */
+	MarkBufferDirty(bucket_obuf);
+	MarkBufferDirty(bucket_nbuf);
+}
+
+/*
+ *	_hash_finish_split() -- Finish the previously interrupted split operation
+ *
+ * To complete the split operation, we build a hash table of the TIDs in the
+ * new bucket, which the split operation then uses to skip tuples that were
+ * already moved before the split was previously interrupted.
+ *
+ * The caller must hold a pin, but no lock, on the metapage buffer.
+ * The buffer is returned in the same state.  (The metapage is only
+ * touched if it becomes necessary to add or remove overflow pages.)
+ *
+ * 'obuf' and 'nbuf' must be locked by the caller which is also responsible
+ * for unlocking them.
+ */
+void
+_hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Buffer nbuf,
+				   uint32 maxbucket, uint32 highmask, uint32 lowmask)
+{
+	HASHCTL		hash_ctl;
+	HTAB	   *tidhtab;
+	Buffer		bucket_nbuf;
+	Page		opage;
+	Page		npage;
+	HashPageOpaque opageopaque;
+	HashPageOpaque npageopaque;
+	Bucket		obucket;
+	Bucket		nbucket;
+	bool		found;
+
+	/* Initialize hash tables used to track TIDs */
+	memset(&hash_ctl, 0, sizeof(hash_ctl));
+	hash_ctl.keysize = sizeof(ItemPointerData);
+	hash_ctl.entrysize = sizeof(ItemPointerData);
+	hash_ctl.hcxt = CurrentMemoryContext;
+
+	tidhtab =
+		hash_create("bucket ctids",
+					256,		/* arbitrary initial size */
+					&hash_ctl,
+					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+	/*
+	 * Scan the new bucket and build hash table of TIDs
+	 */
+	bucket_nbuf = nbuf;
+	npage = BufferGetPage(nbuf);
+	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	for (;;)
+	{
+		BlockNumber nblkno;
+		OffsetNumber noffnum;
+		OffsetNumber nmaxoffnum;
+
+		/* Scan each tuple in new page */
+		nmaxoffnum = PageGetMaxOffsetNumber(npage);
+		for (noffnum = FirstOffsetNumber;
+			 noffnum <= nmaxoffnum;
+			 noffnum = OffsetNumberNext(noffnum))
+		{
+			IndexTuple	itup;
+
+			/* Fetch the item's TID and insert it in hash table. */
+			itup = (IndexTuple) PageGetItem(npage,
+											PageGetItemId(npage, noffnum));
+
+			(void) hash_search(tidhtab, &itup->t_tid, HASH_ENTER, &found);
+
+			Assert(!found);
+		}
+
+		nblkno = npageopaque->hasho_nextblkno;
+
+		/*
+		 * release our write lock without modifying the buffer, making sure to
+		 * retain the pin on the primary bucket.
+		 */
+		if (nbuf == bucket_nbuf)
+			_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, nbuf);
+
+		/* Exit loop if no more overflow pages in new bucket */
+		if (!BlockNumberIsValid(nblkno))
+			break;
+
+		/* Else, advance to next page */
+		nbuf = _hash_getbuf(rel, nblkno, HASH_READ, LH_OVERFLOW_PAGE);
+		npage = BufferGetPage(nbuf);
+		npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	}
+
+	/* Need a cleanup lock to perform split operation. */
+	LockBufferForCleanup(bucket_nbuf);
+
+	npage = BufferGetPage(bucket_nbuf);
+	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nbucket = npageopaque->hasho_bucket;
+
+	opage = BufferGetPage(obuf);
+	opageopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+	obucket = opageopaque->hasho_bucket;
+
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, bucket_nbuf, tidhtab,
+						   maxbucket, highmask, lowmask);
 
-	_hash_squeezebucket(rel, obucket, start_oblkno, NULL);
+	hash_destroy(tidhtab);
 }
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..e3a99cf 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -72,7 +72,19 @@ _hash_readnext(Relation rel,
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_nextblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on the primary bucket page till the end of scan to ensure
+	 * that vacuum can't delete the tuples that were moved by split to the new
+	 * bucket.  Such tuples are required by scans that started on the split
+	 * buckets before the new bucket's split-in-progress flag
+	 * (LH_BUCKET_NEW_PAGE_SPLIT) is cleared.
+	 */
+	if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -94,7 +106,16 @@ _hash_readprev(Relation rel,
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_prevblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on the primary bucket page till the end of scan.  See
+	 * the comments in _hash_readnext for the reason we retain the pin.
+	 */
+	if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -104,6 +125,13 @@ _hash_readprev(Relation rel,
 							 LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 		*pagep = BufferGetPage(*bufp);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
+
+		/*
+		 * We always maintain the pin on the bucket page for the whole scan
+		 * operation, so release the additional pin we acquired here.
+		 */
+		if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+			_hash_dropbuf(rel, *bufp);
 	}
 }
 
@@ -192,43 +220,81 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	metap = HashPageGetMeta(page);
 
 	/*
-	 * Loop until we get a lock on the correct target bucket.
+	 * Conditionally get the lock on the primary bucket page for search while
+	 * holding the lock on the meta page.  If we have to wait, release the meta
+	 * page lock and retry the hard way.
 	 */
-	for (;;)
-	{
-		/*
-		 * Compute the target bucket number, and convert to block number.
-		 */
-		bucket = _hash_hashkey2bucket(hashkey,
-									  metap->hashm_maxbucket,
-									  metap->hashm_highmask,
-									  metap->hashm_lowmask);
+	bucket = _hash_hashkey2bucket(hashkey,
+								  metap->hashm_maxbucket,
+								  metap->hashm_highmask,
+								  metap->hashm_lowmask);
 
-		blkno = BUCKET_TO_BLKNO(metap, bucket);
+	blkno = BUCKET_TO_BLKNO(metap, bucket);
 
-		/* Release metapage lock, but keep pin. */
+	/* Fetch the primary bucket page for the bucket */
+	buf = ReadBuffer(rel, blkno);
+	if (!ConditionalLockBufferShared(buf))
+	{
 		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(buf, HASH_READ);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+		oldblkno = blkno;
+		retry = true;
+	}
+	else
+	{
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+	}
 
+	if (retry)
+	{
 		/*
-		 * If the previous iteration of this loop locked what is still the
-		 * correct target bucket, we are done.  Otherwise, drop any old lock
-		 * and lock what now appears to be the correct bucket.
+		 * Loop until we get a lock on the correct target bucket.  We take the
+		 * lock on the primary bucket page and retain the pin on it during the
+		 * read operation to prevent concurrent splits.  Retaining a pin on the
+		 * primary bucket page ensures that a split can't happen, as the split
+		 * needs to acquire the cleanup lock on the primary bucket page.
+		 * Acquiring the lock on the primary bucket and rechecking that it is
+		 * the target bucket is mandatory, as otherwise a concurrent split
+		 * followed by vacuum could remove tuples from the selected bucket
+		 * that would otherwise have been visible.
 		 */
-		if (retry)
+		for (;;)
 		{
+			/*
+			 * Compute the target bucket number, and convert to block number.
+			 */
+			bucket = _hash_hashkey2bucket(hashkey,
+										  metap->hashm_maxbucket,
+										  metap->hashm_highmask,
+										  metap->hashm_lowmask);
+
+			blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+			/* Release metapage lock, but keep pin. */
+			_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+			/*
+			 * If the previous iteration of this loop locked what is still the
+			 * correct target bucket, we are done.  Otherwise, drop any old
+			 * lock and lock what now appears to be the correct bucket.
+			 */
 			if (oldblkno == blkno)
 				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
-		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+			_hash_relbuf(rel, buf);
 
-		/*
-		 * Reacquire metapage lock and check that no bucket split has taken
-		 * place while we were awaiting the bucket lock.
-		 */
-		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
-		oldblkno = blkno;
-		retry = true;
+			/* Fetch the primary bucket page for the bucket */
+			buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
+			/*
+			 * Reacquire metapage lock and check that no bucket split has
+			 * taken place while we were awaiting the bucket lock.
+			 */
+			_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+			oldblkno = blkno;
+		}
 	}
 
 	/* done with the metapage */
@@ -237,14 +303,60 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	/* Update scan opaque state to show we have lock on the bucket */
 	so->hashso_bucket = bucket;
 	so->hashso_bucket_valid = true;
-	so->hashso_bucket_blkno = blkno;
 
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
 	page = BufferGetPage(buf);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(opaque->hasho_bucket == bucket);
 
+	so->hashso_bucket_buf = buf;
+
+	/*
+	 * If a bucket split is in progress, then we need to skip tuples that
+	 * were moved from the old bucket.  To ensure that vacuum doesn't clean
+	 * any tuples from the old or new bucket while this scan is in progress,
+	 * maintain a pin on both buckets.  Here, we have to be careful about lock
+	 * ordering: first acquire the lock on the old bucket, release that lock
+	 * (but not the pin), then acquire the lock on the new bucket and
+	 * re-verify whether the bucket split is still in progress.  Acquiring the
+	 * lock on the old bucket first ensures that vacuum waits for this scan to
+	 * finish.
+	 */
+	if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+	{
+		BlockNumber old_blkno;
+		Buffer		old_buf;
+
+		old_blkno = _hash_get_oldblk(rel, opaque);
+
+		/*
+		 * release the lock on new bucket and re-acquire it after acquiring
+		 * the lock on old bucket.
+		 */
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+
+		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+
+		/*
+		 * remember the old bucket buffer so as to use it later for scanning.
+		 */
+		so->hashso_old_bucket_buf = old_buf;
+		_hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK);
+
+		_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		Assert(opaque->hasho_bucket == bucket);
+
+		if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+			so->hashso_skip_moved_tuples = true;
+		else
+		{
+			_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+			so->hashso_old_bucket_buf = InvalidBuffer;
+		}
+	}
+
 	/* If a backwards scan is requested, move to the end of the chain */
 	if (ScanDirectionIsBackward(dir))
 	{
@@ -273,6 +385,13 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
  *		false.  Else, return true and set the hashso_curpos for the
  *		scan to the right thing.
  *
+ *		Here we also scan the old bucket if a split for the current bucket
+ *		was in progress at the start of the scan.  The basic idea is to
+ *		skip the tuples that were moved by the split while scanning the
+ *		current bucket and then scan the old bucket to cover all such
+ *		tuples.  This ensures that we don't miss any tuples in scans that
+ *		started during the split.
+ *
  *		'bufP' points to the current buffer, which is pinned and read-locked.
  *		On success exit, we have pin and read-lock on whichever page
  *		contains the right item; on failure, we have released all buffers.
@@ -338,6 +457,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum >= FirstOffsetNumber);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip the tuples that are moved by split operation
+						 * for the scan that has started when split was in
+						 * progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberNext(offnum);	/* move forward */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -353,9 +485,41 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							buf = so->hashso_old_bucket_buf;
+
+							/*
+							 * old bucket buffer must be valid as we acquire
+							 * the pin on it before the start of scan and
+							 * retain it till end of scan.
+							 */
+							Assert(BufferIsValid(buf));
+
+							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+							page = BufferGetPage(buf);
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for tuples that are
+							 * moved by split in old bucket and it also
+							 * ensures that we won't retry to scan the old
+							 * bucket once the scan for same is finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
@@ -379,6 +543,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum <= maxoff);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip the tuples that are moved by split operation
+						 * for the scan that has started when split was in
+						 * progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberPrev(offnum);	/* move back */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -394,9 +571,41 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							buf = so->hashso_old_bucket_buf;
+
+							/*
+							 * old bucket buffer must be valid as we acquire
+							 * the pin on it before the start of scan and
+							 * retain it till end of scan.
+							 */
+							Assert(BufferIsValid(buf));
+
+							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+							page = BufferGetPage(buf);
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for tuples that are
+							 * moved by split in old bucket and it also
+							 * ensures that we won't retry to scan the old
+							 * bucket once the scan for same is finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
@@ -410,9 +619,16 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 
 		if (itup == NULL)
 		{
-			/* we ran off the end of the bucket without finding a match */
+			/*
+			 * We ran off the end of the bucket without finding a match.
+			 * Release the pin on bucket buffers.  Normally, such pins are
+			 * released at the end of the scan; however, scrolling cursors can
+			 * reacquire the bucket lock and pin multiple times within the
+			 * same scan.
+			 */
 			*bufP = so->hashso_curbuf = InvalidBuffer;
 			ItemPointerSetInvalid(current);
+			_hash_dropscanbuf(rel, so);
 			return false;
 		}
 
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 822862d..b5164d7 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -147,6 +147,23 @@ _hash_log2(uint32 num)
 }
 
 /*
+ * _hash_msb -- returns most significant bit position.
+ */
+static uint32
+_hash_msb(uint32 num)
+{
+	uint32		i = 0;
+
+	while (num)
+	{
+		num = num >> 1;
+		++i;
+	}
+
+	return i - 1;
+}
+
+/*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
  * If flags is not zero, it is a bitwise OR of the acceptable values of
@@ -352,3 +369,123 @@ _hash_binsearch_last(Page page, uint32 hash_value)
 
 	return lower;
 }
+
+/*
+ *	_hash_get_oldblk() -- get the block number from which current bucket
+ *			is being splitted.
+ */
+BlockNumber
+_hash_get_oldblk(Relation rel, HashPageOpaque opaque)
+{
+	Bucket		curr_bucket;
+	Bucket		old_bucket;
+	uint32		mask;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	/*
+	 * To get the old bucket from the current bucket, we need a mask to modulo
+	 * into lower half of table.  This mask is stored in meta page as
+	 * hashm_lowmask, but here we can't rely on the same, because we need a
+	 * value of lowmask that was prevalent at the time when bucket split was
+	 * started.  Masking the most significant bit of new bucket would give us
+	 * old bucket.
+	 */
+	curr_bucket = opaque->hasho_bucket;
+	mask = (((uint32) 1) << _hash_msb(curr_bucket)) - 1;
+	old_bucket = curr_bucket & mask;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	blkno = BUCKET_TO_BLKNO(metap, old_bucket);
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_newblk() -- get the block number of the new bucket that will be
+ *			generated after a split of the current bucket.
+ *
+ * This is used to find the new bucket from the old bucket based on the current
+ * table half.  It is mainly required to finish incomplete splits, where we are
+ * sure that no more than one split from the old bucket can be in progress.
+ */
+BlockNumber
+_hash_get_newblk(Relation rel, HashPageOpaque opaque)
+{
+	Bucket		curr_bucket;
+	Bucket		new_bucket;
+	uint32		lowmask;
+	uint32		mask;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	curr_bucket = opaque->hasho_bucket;
+
+	/*
+	 * The new bucket can be obtained by OR'ing the old bucket with the most
+	 * significant bit of the current table half.  There could be multiple
+	 * buckets that have split from the current bucket.  We need the first
+	 * such bucket that exists based on the current table half.
+	 */
+	lowmask = metap->hashm_lowmask;
+
+	for (;;)
+	{
+		mask = lowmask + 1;
+		new_bucket = curr_bucket | mask;
+		if (new_bucket > metap->hashm_maxbucket)
+		{
+			lowmask = lowmask >> 1;
+			continue;
+		}
+		blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+		break;
+	}
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_newbucket() -- get the new bucket that will be generated after
+ *			split from current bucket.
+ *
+ * This is used to find the new bucket from the old bucket.  The new bucket can
+ * be obtained by OR'ing the old bucket with the most significant bit of the
+ * table half for the lowmask passed to this function.  There could be multiple
+ * buckets that have split from the current bucket.  We need the first such
+ * bucket that exists.  The caller must ensure that no more than one split has
+ * happened from the old bucket.
+ */
+Bucket
+_hash_get_newbucket(Relation rel, Bucket curr_bucket,
+					uint32 lowmask, uint32 maxbucket)
+{
+	Bucket		new_bucket;
+	uint32		mask;
+
+	for (;;)
+	{
+		mask = lowmask + 1;
+		new_bucket = curr_bucket | mask;
+		if (new_bucket > maxbucket)
+		{
+			lowmask = lowmask >> 1;
+			continue;
+		}
+		break;
+	}
+
+	return new_bucket;
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 76ade37..1c9be40 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3567,6 +3567,26 @@ ConditionalLockBuffer(Buffer buffer)
 }
 
 /*
+ * Acquire the content_lock for the buffer, but only if we don't have to wait.
+ *
+ * This assumes the caller wants BUFFER_LOCK_SHARED mode.
+ */
+bool
+ConditionalLockBufferShared(Buffer buffer)
+{
+	BufferDesc *buf;
+
+	Assert(BufferIsValid(buffer));
+	if (BufferIsLocal(buffer))
+		return true;			/* act as though we got it */
+
+	buf = GetBufferDescriptor(buffer - 1);
+
+	return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
+									LW_SHARED);
+}
+
+/*
  * LockBufferForCleanup - lock a buffer in preparation for deleting items
  *
  * Items may be deleted from a disk page only when the caller (a) holds an
@@ -3750,6 +3770,49 @@ ConditionalLockBufferForCleanup(Buffer buffer)
 	return false;
 }
 
+/*
+ * CheckBufferForCleanup - as above, but don't attempt to take lock
+ *
+ * We won't loop, but just check once to see if the pin count is OK.  If
+ * not, return FALSE.
+ */
+bool
+CheckBufferForCleanup(Buffer buffer)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(BufferIsValid(buffer));
+
+	if (BufferIsLocal(buffer))
+	{
+		/* There should be exactly one pin */
+		if (LocalRefCount[-buffer - 1] != 1)
+			return false;
+		/* Nobody else to wait for */
+		return true;
+	}
+
+	/* There should be exactly one local pin */
+	if (GetPrivateRefCount(buffer) != 1)
+		return false;
+
+	bufHdr = GetBufferDescriptor(buffer - 1);
+
+	buf_state = LockBufHdr(bufHdr);
+
+	Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+	if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+	{
+		/* pincount is OK. */
+		UnlockBufHdr(bufHdr, buf_state);
+		return true;
+	}
+
+	UnlockBufHdr(bufHdr, buf_state);
+	return false;
+}
+
 
 /*
  *	Functions for buffer I/O handling
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index d9df904..bbf822b 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -24,6 +24,7 @@
 #include "lib/stringinfo.h"
 #include "storage/bufmgr.h"
 #include "storage/lockdefs.h"
+#include "utils/hsearch.h"
 #include "utils/relcache.h"
 
 /*
@@ -32,6 +33,8 @@
  */
 typedef uint32 Bucket;
 
+#define InvalidBucket	((Bucket) 0xFFFFFFFF)
+
 #define BUCKET_TO_BLKNO(metap,B) \
 		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
 
@@ -51,6 +54,9 @@ typedef uint32 Bucket;
 #define LH_BUCKET_PAGE			(1 << 1)
 #define LH_BITMAP_PAGE			(1 << 2)
 #define LH_META_PAGE			(1 << 3)
+#define LH_BUCKET_NEW_PAGE_SPLIT	(1 << 4)
+#define LH_BUCKET_OLD_PAGE_SPLIT	(1 << 5)
+#define LH_BUCKET_PAGE_HAS_GARBAGE	(1 << 6)
 
 typedef struct HashPageOpaqueData
 {
@@ -63,6 +69,12 @@ typedef struct HashPageOpaqueData
 
 typedef HashPageOpaqueData *HashPageOpaque;
 
+#define H_HAS_GARBAGE(opaque)			((opaque)->hasho_flag & LH_BUCKET_PAGE_HAS_GARBAGE)
+#define H_OLD_INCOMPLETE_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
+#define H_NEW_INCOMPLETE_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+#define H_INCOMPLETE_SPLIT(opaque)		(((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT) || \
+										 ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT))
+
 /*
  * The page ID is for the convenience of pg_filedump and similar utilities,
  * which otherwise would have a hard time telling pages of different index
@@ -87,12 +99,6 @@ typedef struct HashScanOpaqueData
 	bool		hashso_bucket_valid;
 
 	/*
-	 * If we have a share lock on the bucket, we record it here.  When
-	 * hashso_bucket_blkno is zero, we have no such lock.
-	 */
-	BlockNumber hashso_bucket_blkno;
-
-	/*
 	 * We also want to remember which buffer we're currently examining in the
 	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
 	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
@@ -100,11 +106,23 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_curbuf;
 
+	/* remember the buffer associated with primary bucket */
+	Buffer		hashso_bucket_buf;
+
+	/*
+	 * remember the buffer associated with old primary bucket which is
+	 * required during the scan of the bucket for which split is in progress.
+	 */
+	Buffer		hashso_old_bucket_buf;
+
 	/* Current position of the scan, as an index TID */
 	ItemPointerData hashso_curpos;
 
 	/* Current position of the scan, as a heap TID */
 	ItemPointerData hashso_heappos;
+
+	/* Whether scan needs to skip tuples that are moved by split */
+	bool		hashso_skip_moved_tuples;
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -175,6 +193,8 @@ typedef HashMetaPageData *HashMetaPage;
 				  sizeof(ItemIdData) - \
 				  MAXALIGN(sizeof(HashPageOpaqueData)))
 
+#define INDEX_MOVED_BY_SPLIT_MASK	0x2000
+
 #define HASH_MIN_FILLFACTOR			10
 #define HASH_DEFAULT_FILLFACTOR		75
 
@@ -223,9 +243,6 @@ typedef HashMetaPageData *HashMetaPage;
 #define HASH_WRITE		BUFFER_LOCK_EXCLUSIVE
 #define HASH_NOLOCK		(-1)
 
-#define HASH_SHARE		ShareLock
-#define HASH_EXCLUSIVE	ExclusiveLock
-
 /*
  *	Strategy number. There's only one valid strategy for hashing: equality.
  */
@@ -298,21 +315,21 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
 
 /* hashovfl.c */
-extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf);
+extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
 extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
-				   BufferAccessStrategy bstrategy);
+				   BlockNumber bucket_blkno, BufferAccessStrategy bstrategy);
 extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
 				 BlockNumber blkno, ForkNumber forkNum);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy);
 
 /* hashpage.c */
-extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
-extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
-extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
 extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
 			 int access, int flags);
+extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
+								   BlockNumber blkno, int flags);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
@@ -321,6 +338,7 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 						   BufferAccessStrategy bstrategy);
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
+extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
 extern void _hash_wrtbuf(Relation rel, Buffer buf);
 extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
 				   int to_access);
@@ -328,6 +346,9 @@ extern uint32 _hash_metapinit(Relation rel, double num_tuples,
 				ForkNumber forkNum);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
+extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
+				   Buffer nbuf, uint32 maxbucket, uint32 highmask,
+				   uint32 lowmask);
 
 /* hashscan.c */
 extern void _hash_regscan(IndexScanDesc scan);
@@ -363,5 +384,17 @@ extern bool _hash_convert_tuple(Relation index,
 					Datum *index_values, bool *index_isnull);
 extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
 extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
+extern BlockNumber _hash_get_oldblk(Relation rel, HashPageOpaque opaque);
+extern BlockNumber _hash_get_newblk(Relation rel, HashPageOpaque opaque);
+extern Bucket _hash_get_newbucket(Relation rel, Bucket curr_bucket,
+					uint32 lowmask, uint32 maxbucket);
+
+/* hash.c */
+extern void hashbucketcleanup(Relation rel, Buffer bucket_buf,
+				  BlockNumber bucket_blkno, BufferAccessStrategy bstrategy,
+				  uint32 maxbucket, uint32 highmask, uint32 lowmask,
+				  double *tuples_removed, double *num_index_tuples,
+				  bool bucket_has_garbage, bool delay,
+				  IndexBulkDeleteCallback callback, void *callback_state);
 
 #endif   /* HASH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7b6ba96..accbb88 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -225,8 +225,10 @@ extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
 extern void UnlockBuffers(void);
 extern void LockBuffer(Buffer buffer, int mode);
 extern bool ConditionalLockBuffer(Buffer buffer);
+extern bool ConditionalLockBufferShared(Buffer buffer);
 extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
+extern bool CheckBufferForCleanup(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern void AbortBufferIO(void);
#21Jeff Janes
jeff.janes@gmail.com
In reply to: Amit Kapila (#20)
Re: Hash Indexes

On Thu, Sep 1, 2016 at 8:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have fixed all other issues you have raised. Updated patch is
attached with this mail.

I am finding the comments (particularly README) quite hard to follow.
There are many references to an "overflow bucket", or similar phrases. I
think these should be "overflow pages". A bucket is a conceptual thing
consisting of a primary page for that bucket and zero or more overflow
pages for the same bucket. There are no overflow buckets, unless you are
referring to the new bucket to which things are being moved.

Was maintaining on-disk compatibility a major concern for this patch?
Would you do things differently if that were not a concern? If we would
benefit from a break in format, I think it would be better to do that now
while hash indexes are still discouraged, rather than in a future release.

In particular, I am thinking about the need for every insert to
exclusive-content-lock the meta page to increment the index-wide tuple
count. I think that this is going to be a huge bottleneck on update
intensive workloads (which I don't believe have been performance tested as
of yet). I was wondering if we might not want to change that so that each
bucket keeps a local count, and sweeps that up to the meta page only when
it exceeds a threshold. But this would require the bucket page to have an
area to hold such a count. Another idea would be to keep not a count of
tuples, but of buckets with at least one overflow page, and split when
there are too many of those. I bring it up now because it would be a shame
to ignore it until 10.0 is out the door, and then need to break things in
11.0.
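
A toy standalone sketch of that thresholded local-count idea (plain C,
hypothetical names only, not an API proposal for the patch) would look
something like this:

#include <stdio.h>
#include <stdint.h>

#define SWEEP_THRESHOLD 100		/* hypothetical per-bucket threshold */

typedef struct
{
	uint64_t	ntuples;		/* stand-in for hashm_ntuples on the metapage */
} ToyMeta;

typedef struct
{
	uint32_t	local_count;	/* per-bucket count of not-yet-swept inserts */
} ToyBucket;

/*
 * On each insert, bump only the bucket-local counter; the "metapage"
 * (which would need its exclusive content lock) is touched only when
 * the threshold is crossed.
 */
static void
toy_insert(ToyMeta *meta, ToyBucket *bucket)
{
	if (++bucket->local_count >= SWEEP_THRESHOLD)
	{
		meta->ntuples += bucket->local_count;	/* metapage lock would go here */
		bucket->local_count = 0;
	}
}

int
main(void)
{
	ToyMeta		meta = {0};
	ToyBucket	bucket = {0};

	for (int i = 0; i < 1050; i++)
		toy_insert(&meta, &bucket);

	printf("swept to metapage: %llu, still local: %u\n",
		   (unsigned long long) meta.ntuples, bucket.local_count);
	return 0;
}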

Cheers,

Jeff

#22Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Janes (#21)
Re: Hash Indexes

On Wed, Sep 7, 2016 at 11:49 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Thu, Sep 1, 2016 at 8:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have fixed all other issues you have raised. Updated patch is
attached with this mail.

I am finding the comments (particularly README) quite hard to follow. There
are many references to an "overflow bucket", or similar phrases. I think
these should be "overflow pages". A bucket is a conceptual thing consisting
of a primary page for that bucket and zero or more overflow pages for the
same bucket. There are no overflow buckets, unless you are referring to the
new bucket to which things are being moved.

Hmm. I think page or block is a concept of database systems and
bucket is a general concept used in hashing technology. I think the
difference is that there are primary buckets and overflow buckets. I
have checked how they are referred to in one of the wiki pages [1]https://en.wikipedia.org/wiki/Linear_hashing,
search for overflow on that wiki page. Now, I think we shouldn't be
inconsistent in using them. I will change them to be consistent if I
find any inconsistency, based on what you or other people think is the
better way to refer to the overflow space.

Was maintaining on-disk compatibility a major concern for this patch? Would
you do things differently if that were not a concern?

I would not have done much differently from what it is now; however,
one thing I considered during development was to change the hash
index tuple structure as below to mark the index tuples as
moved-by-split:

typedef struct
{
	IndexTuple	entry;			/* tuple to insert */
	bool		moved_by_split;
} HashEntryData;

The other alternative was to use the (unused) bit in IndexTupleData->t_info.

I have chosen the latter approach. Now, one could definitely argue that
it is the last available bit in IndexTuple and using it for hash
indexes might or might not be the best thing to do. However, I think it
is also not advisable to break compatibility if we can use some
existing bit. In any case, the same question can arise whenever
anyone wants to use it for some other purpose.
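
For reference, a tiny standalone sketch (plain C, not the patch itself) of
how that bit shares the t_info word with the tuple size; it assumes the low
13 bits hold the size, as INDEX_SIZE_MASK does in access/itup.h, and uses
the 0x2000 value the patch picks for INDEX_MOVED_BY_SPLIT_MASK:

#include <stdio.h>
#include <stdint.h>

#define INDEX_SIZE_MASK				0x1FFF	/* low 13 bits hold the tuple size */
#define INDEX_MOVED_BY_SPLIT_MASK	0x2000	/* the otherwise-unused bit */

int
main(void)
{
	uint16_t	t_info = 48;	/* pretend index tuple: 48 bytes, no flags */
	uint16_t	size;

	/* set the moved-by-split flag while preserving the stored size */
	size = t_info & INDEX_SIZE_MASK;
	t_info &= ~INDEX_SIZE_MASK;
	t_info |= INDEX_MOVED_BY_SPLIT_MASK;
	t_info |= size;

	printf("size = %d, moved_by_split = %d\n",
		   t_info & INDEX_SIZE_MASK,
		   (t_info & INDEX_MOVED_BY_SPLIT_MASK) != 0);
	return 0;
}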

In particular, I am thinking about the need for every insert to
exclusive-content-lock the meta page to increment the index-wide tuple
count.

This is not what this patch has changed. The main purpose of this
patch is to change heavy-weight locking to light-weight locking and
to provide a way to handle incomplete splits, both of which are
required to sensibly write WAL for hash indexes. Having said that, I
agree with your point that we can improve the insertion logic so that
we don't need to write-lock the meta page at each insert. I have
noticed some other possible improvements in hash indexes during this
work as well: caching the meta page; reducing lock/unlock calls for
retrieving tuples from a page by making hash index scans work a page
at a time as we do for btree scans; the kill_prior_tuple mechanism,
which is currently quite naive and needs improvement; and, the biggest
one, the create index logic, where we insert tuple-by-tuple
whereas btree operates at the page level and also bypasses the shared
buffers. One such improvement (caching the meta page) is already
being worked on by my colleague, and the patch [2]https://commitfest.postgresql.org/10/715/ for it is in the CF.
The main point I want to highlight is that, apart from what this patch
does, there are a number of other potential areas which need
improvement in hash indexes, and I think it is better to do those as
separate enhancements rather than as a single patch.

I think that this is going to be a huge bottleneck on update
intensive workloads (which I don't believe have been performance tested as
of yet).

I have done some performance testing with this patch and I find there
was a significant improvement compared to what we have now in hash
indexes, even for a read-write workload. I think the better idea is to
compare it with btree, but in any case, even if this proves to be a
bottleneck, we should try to improve it in a separate patch rather
than as a part of this patch.

I was wondering if we might not want to change that so that each
bucket keeps a local count, and sweeps that up to the meta page only when it
exceeds a threshold. But this would require the bucket page to have an area
to hold such a count. Another idea would to keep not a count of tuples, but
of buckets with at least one overflow page, and split when there are too
many of those.

I think both of these ideas could change the point (tuple
count) at which we currently split. This might impact search speed
and space usage. Yet another alternative could be to change
hashm_ntuples to 64-bit and use 64-bit atomics to operate on it, or
maybe use a separate spinlock to protect it. However, whatever we
decide to do with it, I think it is a matter for a separate patch.
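
A rough standalone sketch of the atomics alternative (C11 atomics here just
for illustration; the server would presumably go through its own
pg_atomic_uint64 wrappers, and this says nothing about crash-safety or
where such a counter would live):

#include <stdio.h>
#include <stdatomic.h>
#include <stdint.h>

/* stand-in for a widened hashm_ntuples */
static _Atomic uint64_t ntuples;

static void
count_insert(void)
{
	/* no exclusive content lock on the metapage needed just to count */
	atomic_fetch_add_explicit(&ntuples, 1, memory_order_relaxed);
}

int
main(void)
{
	for (int i = 0; i < 1000; i++)
		count_insert();

	printf("ntuples = %llu\n", (unsigned long long) atomic_load(&ntuples));
	return 0;
}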

Thanks for looking into the patch.

[1]: https://en.wikipedia.org/wiki/Linear_hashing
[2]: https://commitfest.postgresql.org/10/715/

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#23Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Amit Kapila (#20)
1 attachment(s)
Re: Hash Indexes

On 09/01/2016 11:55 PM, Amit Kapila wrote:

I have fixed all other issues you have raised. Updated patch is
attached with this mail.

The following script hangs on idx_val creation - just with v5, WAL patch
not applied.

Best regards,
Jesper

Attachments:

zero.sql (application/sql)
#24Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Jesper Pedersen (#23)
Re: Hash Indexes

On 13/09/16 01:20, Jesper Pedersen wrote:

On 09/01/2016 11:55 PM, Amit Kapila wrote:

I have fixed all other issues you have raised. Updated patch is
attached with this mail.

The following script hangs on idx_val creation - just with v5, WAL patch
not applied.

Are you sure it is actually hanging? I see 100% cpu for a few minutes
but the index eventually completes ok for me (v5 patch applied to
today's master).

Cheers

Mark


#25Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Kirkwood (#24)
Re: Hash Indexes

On Tue, Sep 13, 2016 at 3:58 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:

On 13/09/16 01:20, Jesper Pedersen wrote:

On 09/01/2016 11:55 PM, Amit Kapila wrote:

I have fixed all other issues you have raised. Updated patch is
attached with this mail.

The following script hangs on idx_val creation - just with v5, WAL patch
not applied.

Are you sure it is actually hanging? I see 100% cpu for a few minutes but
the index eventually completes ok for me (v5 patch applied to today's
master).

It completed for me as well. The second index creation takes more
time and CPU, because it is just inserting duplicate values, which need
a lot of overflow pages.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#26Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#22)
1 attachment(s)
Re: Hash Indexes

Attached, new version of patch which contains the fix for problem
reported on write-ahead-log of hash index thread [1]/messages/by-id/CAA4eK1JuKt=-=Y0FheiFL-i8Z5_5660=3n8JUA8s3zG53t_ArQ@mail.gmail.com.

[1]: /messages/by-id/CAA4eK1JuKt=-=Y0FheiFL-i8Z5_5660=3n8JUA8s3zG53t_ArQ@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

concurrent_hash_index_v6.patch (application/octet-stream)
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index c1122b4..d02539a 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -400,7 +400,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
 	Buffer		buf;
 	Page		page;
 
-	_hash_getlock(rel, blkno, HASH_SHARE);
 	buf = _hash_getbuf_with_strategy(rel, blkno, HASH_READ, 0, bstrategy);
 	page = BufferGetPage(buf);
 
@@ -431,7 +430,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
 	}
 
 	_hash_relbuf(rel, buf);
-	_hash_droplock(rel, blkno, HASH_SHARE);
 }
 
 /*
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 0a7da89..a0feb2f 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -125,49 +125,45 @@ the initially created buckets.
 
 Lock Definitions
 ----------------
-
-We use both lmgr locks ("heavyweight" locks) and buffer context locks
-(LWLocks) to control access to a hash index.  lmgr locks are needed for
-long-term locking since there is a (small) risk of deadlock, which we must
-be able to detect.  Buffer context locks are used for short-term access
-control to individual pages of the index.
-
-LockPage(rel, page), where page is the page number of a hash bucket page,
-represents the right to split or compact an individual bucket.  A process
-splitting a bucket must exclusive-lock both old and new halves of the
-bucket until it is done.  A process doing VACUUM must exclusive-lock the
-bucket it is currently purging tuples from.  Processes doing scans or
-insertions must share-lock the bucket they are scanning or inserting into.
-(It is okay to allow concurrent scans and insertions.)
-
-The lmgr lock IDs corresponding to overflow pages are currently unused.
-These are available for possible future refinements.  LockPage(rel, 0)
-is also currently undefined (it was previously used to represent the right
-to modify the hash-code-to-bucket mapping, but it is no longer needed for
-that purpose).
-
-Note that these lock definitions are conceptually distinct from any sort
-of lock on the pages whose numbers they share.  A process must also obtain
-read or write buffer lock on the metapage or bucket page before accessing
-said page.
-
-Processes performing hash index scans must hold share lock on the bucket
-they are scanning throughout the scan.  This seems to be essential, since
-there is no reasonable way for a scan to cope with its bucket being split
-underneath it.  This creates a possibility of deadlock external to the
-hash index code, since a process holding one of these locks could block
-waiting for an unrelated lock held by another process.  If that process
-then does something that requires exclusive lock on the bucket, we have
-deadlock.  Therefore the bucket locks must be lmgr locks so that deadlock
-can be detected and recovered from.
-
-Processes must obtain read (share) buffer context lock on any hash index
-page while reading it, and write (exclusive) lock while modifying it.
-To prevent deadlock we enforce these coding rules: no buffer lock may be
-held long term (across index AM calls), nor may any buffer lock be held
-while waiting for an lmgr lock, nor may more than one buffer lock
-be held at a time by any one process.  (The third restriction is probably
-stronger than necessary, but it makes the proof of no deadlock obvious.)
+We use buffer content locks (LWLocks) and buffer pins to control access to
+a hash index.
+
+A scan takes a shared-mode lock on the primary or overflow buckets.  An insert
+acquires an exclusive lock on the bucket into which it has to insert.  Both
+operations release the lock on the previous bucket before moving to the next
+overflow bucket.  They retain a pin on the primary bucket till the end of the operation.
+Split operation must acquire cleanup lock on both old and new halves of the
+bucket and mark split-in-progress on both the buckets.  The cleanup lock at
+the start of split ensures that a parallel insert won't get lost.  Consider a
+case where an insertion has to add a tuple on some intermediate overflow bucket
+in the bucket chain: if we allowed the split while the insertion is in progress,
+the split might not move this newly inserted tuple.  The split releases the lock on the previous
+bucket before moving to the next overflow bucket either for old bucket or for
+new bucket.  After partitioning the tuples between old and new buckets, it
+again needs to acquire exclusive lock on both old and new buckets to clear
+the split-in-progress flag.  Like inserts and scans, it will also retain pins
+on both the old and new primary buckets till end of split operation, although
+we can do without that as well.
+
+Vacuum acquires a cleanup lock on the bucket to remove dead tuples and/or tuples
+that were moved due to a split.  The cleanup lock is needed to remove dead tuples
+in order to ensure that scans return correct results.  A scan that returns multiple
+tuples from the same bucket page always restarts the scan from the previous
+offset number from which it returned the last tuple.  If we allowed vacuum to
+remove the dead tuples with just an exclusive lock, it could remove the tuple
+required to resume the scan.  The cleanup lock is needed to remove the tuples
+that were moved by a split in order to ensure that there is no pending scan that
+started after the start of the split and before the finish of the split on the bucket.
+If we don't do that, then vacuum can remove tuples that are required by such
+a scan.  We don't need to retain this cleanup lock during the whole vacuum
+operation on the bucket.  We release the lock as we move ahead in the bucket
+chain.  In the end, for the squeeze phase, we conditionally acquire the cleanup lock
+and if we don't get it, we just abandon the squeeze phase.
+
+To avoid deadlocks, we must be consistent about the lock order in which we
+lock the buckets for operations that require locks on two different buckets.
+We use the rule "first lock the old bucket and then the new bucket"; basically,
+lock the lower-numbered bucket first.
 
 
 Pseudocode Algorithms
@@ -188,63 +184,105 @@ track of available overflow pages.
 The reader algorithm is:
 
 	pin meta page and take buffer content lock in shared mode
-	loop:
-		compute bucket number for target hash key
-		release meta page buffer content lock
-		if (correct bucket page is already locked)
-			break
-		release any existing bucket page lock (if a concurrent split happened)
-		take heavyweight bucket lock
-		retake meta page buffer content lock in shared mode
+	compute bucket number for target hash key
+	read and pin the primary bucket page
+	conditionally get the buffer content lock in shared mode on primary bucket page for search
+	if we didn't get the lock (need to wait for lock)
+		release the buffer content lock on meta page
+		acquire buffer content lock on primary bucket page in shared mode
+		acquire the buffer content lock in shared mode on meta page
+		to check for the possibility of a split, we need to recompute the bucket and
+		verify that it is the correct bucket; set the retry flag
+	else if we get the lock, then we can skip the retry path
+	if (retry)
+		loop:
+			compute bucket number for target hash key
+			release meta page buffer content lock
+			if (correct bucket page is already locked)
+				break
+			release any existing content lock on bucket page (if a concurrent split happened)
+			pin primary bucket page and take shared buffer content lock
+			retake meta page buffer content lock in shared mode
 -- then, per read request:
 	release pin on metapage
-	read current page of bucket and take shared buffer content lock
-		step to next page if necessary (no chaining of locks)
+	if the split is in progress for current bucket and this is a new bucket
+		release the buffer content lock on current bucket page
+		pin and acquire the buffer content lock on old bucket in shared mode
+		release the buffer content lock on old bucket, but not pin
+		retake the buffer content lock on new bucket
+		mark the scan such that it skips the tuples that are marked as moved by split
+	step to next page if necessary (no chaining of locks)
+		if the scan indicates moved by split, then move to old bucket after the scan
+		of current bucket is finished
 	get tuple
 	release buffer content lock and pin on current page
 -- at scan shutdown:
-	release bucket share-lock
-
-We can't hold the metapage lock while acquiring a lock on the target bucket,
-because that might result in an undetected deadlock (lwlocks do not participate
-in deadlock detection).  Instead, we relock the metapage after acquiring the
-bucket page lock and check whether the bucket has been split.  If not, we're
-done.  If so, we release our previously-acquired lock and repeat the process
-using the new bucket number.  Holding the bucket sharelock for
+	release any pin we hold on current buffer, old bucket buffer, new bucket buffer
+
+We don't want to hold the meta page lock if we have to wait for acquiring the
+content lock on bucket page, because that might result in poor concurrency.
+Instead, we relock the metapage after acquiring the bucket page content lock
+and check whether the bucket has been split.  If not, we're done.  If so, we
+release our previously-acquired content lock, but not the pin, and repeat the
+process using the new bucket number.  Holding the buffer pin on the bucket page for
 the remainder of the scan prevents the reader's current-tuple pointer from
-being invalidated by splits or compactions.  Notice that the reader's lock
+being invalidated by splits or compactions.  Notice that the reader's pin
 does not prevent other buckets from being split or compacted.
 
 To keep concurrency reasonably good, we require readers to cope with
 concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the page sharelock.  Since
-deletion is not possible while a reader holds the bucket sharelock, and
-we assume that heap tuple TIDs are unique, this can be implemented by
+their current scan position after re-acquiring the buffer content lock on
+page.  Since deletion is not possible while a reader holds the pin on bucket,
+and we assume that heap tuple TIDs are unique, this can be implemented by
 searching for the same heap tuple TID previously returned.  Insertion does
 not move index entries across pages, so the previously-returned index entry
 should always be on the same page, at the same or higher offset number,
 as it was before.
 
+To allow scans during a bucket split, if at the start of the scan the bucket is
+marked as split-in-progress, the scan reads all the tuples in that bucket except
+those that are marked as moved-by-split.  Once it finishes the scan of all the
+tuples in the current bucket, it scans the old bucket from which this bucket
+was formed by the split.  This happens only for the new half of the split.
+
 The insertion algorithm is rather similar:
 
 	pin meta page and take buffer content lock in shared mode
-	loop:
-		compute bucket number for target hash key
-		release meta page buffer content lock
-		if (correct bucket page is already locked)
-			break
-		release any existing bucket page lock (if a concurrent split happened)
-		take heavyweight bucket lock in shared mode
-		retake meta page buffer content lock in shared mode
--- (so far same as reader)
+	compute bucket number for target hash key
+	read and pin the primary bucket page
+	conditionally get the buffer content lock in exclusive mode on primary bucket page for search
+	if we didn't get the lock (need to wait for lock)
+		release the buffer content lock on meta page
+		acquire buffer content lock on primary bucket page in exclusive mode
+		acquire the buffer content lock in shared mode on meta page
+		to check for the possibility of a split, we need to recompute the bucket and
+		verify that it is the correct bucket; set the retry flag
+	else if we get the lock, then we can skip the retry path
+	if (retry)
+		loop:
+			compute bucket number for target hash key
+			release meta page buffer content lock
+			if (correct bucket page is already locked)
+				break
+			release any existing content lock on bucket page (if a concurrent split happened)
+			pin primary bucket page and take exclusive buffer content lock
+			retake meta page buffer content lock in shared mode
+-- (so far same as reader, except for acquisition of the buffer content lock in
+	exclusive mode on primary bucket page)
 	release pin on metapage
-	pin current page of bucket and take exclusive buffer content lock
-	if full, release, read/exclusive-lock next page; repeat as needed
+	if the split-in-progress flag is set for bucket in old half of split
+	and pin count on it is one, then finish the split
+		we already have a buffer content lock on old bucket, conditionally get the content lock on new bucket
+		if we get the lock on new bucket
+			finish the split using algorithm mentioned below for split
+			release the buffer content lock and pin on new bucket
+	if full, release lock but not pin, read/exclusive-lock next page; repeat as needed
 	>> see below if no space in any page of bucket
 	insert tuple at appropriate place in page
 	mark current page dirty and release buffer content lock and pin
+	if current page is not a bucket page, release the pin on bucket page
 	release heavyweight share-lock
-	pin meta page and take buffer content lock in shared mode
+	pin meta page and take buffer content lock in exclusive mode
 	increment tuple count, decide if split needed
 	mark meta page dirty and release buffer content lock and pin
 	done if no split needed, else enter Split algorithm below
@@ -256,11 +294,13 @@ bucket that is being actively scanned, because readers can cope with this
 as explained above.  We only need the short-term buffer locks to ensure
 that readers do not see a partially-updated page.
 
-It is clearly impossible for readers and inserters to deadlock, and in
-fact this algorithm allows them a very high degree of concurrency.
-(The exclusive metapage lock taken to update the tuple count is stronger
-than necessary, since readers do not care about the tuple count, but the
-lock is held for such a short time that this is probably not an issue.)
+To avoid deadlock between readers and inserters, whenever there is a need
+to lock multiple buckets, we always take the locks in the order suggested in
+the Lock Definitions section above.  This algorithm allows them a very high
+degree of concurrency.  (The exclusive metapage lock taken to update the
+tuple count is stronger than necessary, since readers do not care about the
+tuple count, but the lock is held for such a short time that this is
+probably not an issue.)
 
 When an inserter cannot find space in any existing page of a bucket, it
 must obtain an overflow page and add that page to the bucket's chain.
@@ -271,46 +311,79 @@ index is overfull (has a higher-than-wanted ratio of tuples to buckets).
 The algorithm attempts, but does not necessarily succeed, to split one
 existing bucket in two, thereby lowering the fill ratio:
 
-	pin meta page and take buffer content lock in exclusive mode
-	check split still needed
-	if split not needed anymore, drop buffer content lock and pin and exit
-	decide which bucket to split
-	Attempt to X-lock old bucket number (definitely could fail)
-	Attempt to X-lock new bucket number (shouldn't fail, but...)
-	if above fail, drop locks and pin and exit
+	expand:
+		take buffer content lock in exclusive mode on meta page
+		check split still needed
+		if split not needed anymore, drop buffer content lock and exit
+		decide which bucket to split
+		Attempt to acquire cleanup lock on old bucket number (definitely could fail)
+		if above fail, release lock and pin and exit
+		if the split-in-progress flag is set, then finish the split
+			conditionally get the content lock on the new bucket which was involved in the split
+			if we got the lock on the new bucket
+				finish the split using the split algorithm mentioned below
+				release the buffer content lock and pin on the old and new buckets
+				try to expand from the start
+			else
+				release the buffer content lock and pin on the old bucket and exit
+		if the garbage flag (indicating tuples moved by a split) is set on the bucket
+			release the buffer content lock on meta page
+			remove the tuples that don't belong to this bucket; see bucket cleanup below
+	Attempt to acquire cleanup lock on new bucket number (shouldn't fail, but...)
 	update meta page to reflect new number of buckets
-	mark meta page dirty and release buffer content lock and pin
+	mark meta page dirty and release buffer content lock
 	-- now, accesses to all other buckets can proceed.
 	Perform actual split of bucket, moving tuples as needed
 	>> see below about acquiring needed extra space
 	Release X-locks of old and new buckets
 
+	split guts
+	mark the old and new buckets indicating split-in-progress
+	mark the old bucket indicating has-garbage
+	copy the tuples that belong to the new bucket from the old bucket
+	during the copy, mark such tuples as moved-by-split
+	release lock but not pin for primary bucket page of old bucket,
+	read/shared-lock next page; repeat as needed
+	>> see below if no space in bucket page of new bucket
+	take exclusive buffer content locks on both the old and new buckets, in that order
+	clear the split-in-progress flag from both the buckets
+	mark buffers dirty and release the locks and pins on both old and new buckets
+
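The moved-by-split marking in the split guts above is done on a copy of the
index tuple; condensed from _hash_splitbucket_guts() in this patch
(INDEX_MOVED_BY_SPLIT_MASK is the t_info bit the patch introduces):

	IndexTuple	new_itup = CopyIndexTuple(itup);
	Size		itupsize = new_itup->t_info & INDEX_SIZE_MASK;

	/* set the moved-by-split bit while preserving the stored tuple size */
	new_itup->t_info &= ~INDEX_SIZE_MASK;
	new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
	new_itup->t_info |= itupsize;

	(void) _hash_pgaddtup(rel, nbuf, MAXALIGN(IndexTupleDSize(*new_itup)), new_itup);
	pfree(new_itup);
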
 Note the metapage lock is not held while the actual tuple rearrangement is
 performed, so accesses to other buckets can proceed in parallel; in fact,
 it's possible for multiple bucket splits to proceed in parallel.
 
-Split's attempt to X-lock the old bucket number could fail if another
-process holds S-lock on it.  We do not want to wait if that happens, first
-because we don't want to wait while holding the metapage exclusive-lock,
-and second because it could very easily result in deadlock.  (The other
-process might be out of the hash AM altogether, and could do something
-that blocks on another lock this process holds; so even if the hash
-algorithm itself is deadlock-free, a user-induced deadlock could occur.)
-So, this is a conditional LockAcquire operation, and if it fails we just
-abandon the attempt to split.  This is all right since the index is
-overfull but perfectly functional.  Every subsequent inserter will try to
-split, and eventually one will succeed.  If multiple inserters failed to
-split, the index might still be overfull, but eventually, the index will
+Split's attempt to acquire cleanup-lock on the old bucket number could fail
+if another process holds any lock or pin on it.  We do not want to wait if
+that happens, because we don't want to wait while holding the metapage
+exclusive-lock.  So, this is a conditional LWLockAcquire operation, and if
+it fails we just abandon the attempt to split.  This is all right since the
+index is overfull but perfectly functional.  Every subsequent inserter will
+try to split, and eventually one will succeed.  If multiple inserters failed
+to split, the index might still be overfull, but eventually, the index will
 not be overfull and split attempts will stop.  (We could make a successful
 splitter loop to see if the index is still overfull, but it seems better to
 distribute the split overhead across successive insertions.)
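
The conditional acquisition mentioned above is a conditional cleanup lock on
the old bucket's primary page; condensed from _hash_expandtable() in this
patch:

	buf_oblkno = _hash_getbuf_with_condlock_cleanup(rel, start_oblkno,
													LH_BUCKET_PAGE);
	if (!buf_oblkno)
		goto fail;				/* abandon the split; a later inserter will retry */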
 
+The has_garbage flag indicates that the bucket contains tuples that were moved
+to another bucket by a split.  It is set only on the old bucket.  We need it in
+addition to the split-in-progress flag in order to recognize the case where the
+split is already over (i.e. the split-in-progress flag has been cleared) but the
+moved tuples have not yet been removed.  It is used both by vacuum and by the
+re-split operation.  Vacuum uses it to decide whether it needs to remove the
+moved-by-split tuples from the bucket along with the dead tuples.  A re-split
+uses it to ensure that a new split is not started from a bucket before the
+tuples left over from the previous split have been cleared.  This helps keep
+bloat under control and makes the design somewhat simpler, since we never have
+to handle a bucket that contains dead tuples from more than one split.
+
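The way cleanup recognizes a tuple left behind by a split follows directly from
the hash mapping; condensed from hashbucketcleanup() in this patch:

	/* a tuple that no longer maps to this bucket must have been moved by the split */
	bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
								  maxbucket, highmask, lowmask);
	if (bucket != cur_bucket)
	{
		/* only the most recent split can have left garbage here */
		Assert(bucket == new_bucket);
		deletable[ndeletable++] = offno;
	}
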
 A problem is that if a split fails partway through (eg due to insufficient
-disk space) the index is left corrupt.  The probability of that could be
-made quite low if we grab a free page or two before we update the meta
-page, but the only real solution is to treat a split as a WAL-loggable,
+disk space or crash) the index is left corrupt.  The probability of that
+could be made quite low if we grab a free page or two before we update the
+meta page, but the only real solution is to treat a split as a WAL-loggable,
 must-complete action.  I'm not planning to teach hash about WAL in this
-go-round.
+go-round.  However, we do try to finish any incomplete split during subsequent
+insert and split operations.
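
For example, the insert path finishes a pending split like this, condensed
from _hash_doinsert() in this patch:

	if (H_OLD_INCOMPLETE_SPLIT(pageopaque) && CheckBufferForCleanup(buf))
	{
		BlockNumber nblkno = _hash_get_newblk(rel, pageopaque);
		Buffer		nbuf = _hash_getbuf_with_condlock_cleanup(rel, nblkno,
															  LH_BUCKET_PAGE);

		if (nbuf)
		{
			/* complete the interrupted split, then release the new bucket */
			_hash_finish_split(rel, metabuf, buf, nbuf, maxbucket,
							   highmask, lowmask);
			_hash_relbuf(rel, nbuf);
		}
	}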
 
 The fourth operation is garbage collection (bulk deletion):
 
@@ -319,9 +392,13 @@ The fourth operation is garbage collection (bulk deletion):
 	fetch current max bucket number
 	release meta page buffer content lock and pin
 	while next bucket <= max bucket do
-		Acquire X lock on target bucket
-		Scan and remove tuples, compact free space as needed
-		Release X lock
+		Acquire cleanup lock on target bucket
+		Scan and remove tuples
+		For overflow pages, first lock the next page and then release the
+		lock on the current page
+		Reacquire the exclusive lock on the primary bucket page
+		If the buffer pin count is one, then compact free space as needed
+		Release lock
 		next bucket ++
 	end loop
 	pin metapage and take buffer content lock in exclusive mode
@@ -330,20 +407,23 @@ The fourth operation is garbage collection (bulk deletion):
 	else update metapage tuple count
 	mark meta page dirty and release buffer content lock and pin
 
-Note that this is designed to allow concurrent splits.  If a split occurs,
-tuples relocated into the new bucket will be visited twice by the scan,
-but that does no harm.  (We must however be careful about the statistics
-reported by the VACUUM operation.  What we can do is count the number of
-tuples scanned, and believe this in preference to the stored tuple count
-if the stored tuple count and number of buckets did *not* change at any
-time during the scan.  This provides a way of correcting the stored tuple
-count if it gets out of sync for some reason.  But if a split or insertion
-does occur concurrently, the scan count is untrustworthy; instead,
-subtract the number of tuples deleted from the stored tuple count and
-use that.)
-
-The exclusive lock request could deadlock in some strange scenarios, but
-we can just error out without any great harm being done.
+Note that this is designed to allow concurrent splits and scans.  If a
+split occurs, tuples relocated into the new bucket will be visited twice
+by the scan, but that does no harm.  Because cleanup releases the lock on
+each page of a bucket as it advances, a concurrent scan can start on the
+bucket, but such a scan will always remain behind the cleanup (see the
+sketch below).  It is essential to keep scans behind cleanup, else vacuum
+could remove tuples that are still required to complete the scan, as
+explained in the Lock Definitions section above.  This holds true for
+backward scans as well (backward scans first traverse each bucket's chain
+from the primary page to the last overflow page).
+We must be careful about the statistics reported by the VACUUM operation.
+What we can do is count the number of tuples scanned, and believe this in
+preference to the stored tuple count if the stored tuple count and number
+of buckets did *not* change at any time during the scan.  This provides a
+way of correcting the stored tuple count if it gets out of sync for some
+reason.  But if a split or insertion does occur concurrently, the scan
+count is untrustworthy; instead, subtract the number of tuples deleted
+from the stored tuple count and use that.
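
The page-at-a-time hand-over that keeps scans behind cleanup (the sketch
referred to above) looks like this, condensed from hashbucketcleanup() in this
patch:

	/* lock the next overflow page before giving up the lock on the current one */
	next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
										  LH_OVERFLOW_PAGE, bstrategy);
	if (retain_pin)				/* current page is the primary bucket page */
		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
	else
		_hash_relbuf(rel, buf);
	buf = next_buf;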
 
 
 Free Space Management
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..a12a830 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -287,10 +287,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		/*
 		 * An insertion into the current index page could have happened while
 		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold share lock on
-		 * the bucket, no deletions or splits could have occurred; therefore
-		 * we can expect that the TID still exists in the current index page,
-		 * at an offset >= where we were.)
+		 * for the TID we previously returned.  (Because we hold pin on the
+		 * bucket, no deletions or splits could have occurred; therefore we
+		 * can expect that the TID still exists in the current index page, at
+		 * an offset >= where we were.)
 		 */
 		OffsetNumber maxoffnum;
 
@@ -425,12 +425,15 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
 	so->hashso_bucket_valid = false;
-	so->hashso_bucket_blkno = 0;
 	so->hashso_curbuf = InvalidBuffer;
+	so->hashso_bucket_buf = InvalidBuffer;
+	so->hashso_old_bucket_buf = InvalidBuffer;
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
 	ItemPointerSetInvalid(&(so->hashso_heappos));
 
+	so->hashso_skip_moved_tuples = false;
+
 	scan->opaque = so;
 
 	/* register scan in case we change pages it's using */
@@ -449,15 +452,7 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
-
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	_hash_dropscanbuf(rel, so);
 
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
@@ -471,6 +466,8 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 				scan->numberOfKeys * sizeof(ScanKeyData));
 		so->hashso_bucket_valid = false;
 	}
+
+	so->hashso_skip_moved_tuples = false;
 }
 
 /*
@@ -484,16 +481,7 @@ hashendscan(IndexScanDesc scan)
 
 	/* don't need scan registered anymore */
 	_hash_dropscan(scan);
-
-	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
-
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	_hash_dropscanbuf(rel, so);
 
 	pfree(so);
 	scan->opaque = NULL;
@@ -504,6 +492,9 @@ hashendscan(IndexScanDesc scan)
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
  *
+ * This function also deletes the tuples that are moved by a split to
+ * another bucket.
+ *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 IndexBulkDeleteResult *
@@ -548,83 +539,52 @@ loop_top:
 	{
 		BlockNumber bucket_blkno;
 		BlockNumber blkno;
-		bool		bucket_dirty = false;
+		Buffer		bucket_buf;
+		Buffer		buf;
+		HashPageOpaque bucket_opaque;
+		Page		page;
+		bool		bucket_has_garbage = false;
 
 		/* Get address of bucket's start page */
 		bucket_blkno = BUCKET_TO_BLKNO(&local_metapage, cur_bucket);
 
-		/* Exclusive-lock the bucket so we can shrink it */
-		_hash_getlock(rel, bucket_blkno, HASH_EXCLUSIVE);
-
 		/* Shouldn't have any active scans locally, either */
 		if (_hash_has_active_scan(rel, cur_bucket))
 			elog(ERROR, "hash index has active scan during VACUUM");
 
-		/* Scan each page in bucket */
 		blkno = bucket_blkno;
-		while (BlockNumberIsValid(blkno))
-		{
-			Buffer		buf;
-			Page		page;
-			HashPageOpaque opaque;
-			OffsetNumber offno;
-			OffsetNumber maxoffno;
-			OffsetNumber deletable[MaxOffsetNumber];
-			int			ndeletable = 0;
 
-			vacuum_delay_point();
-
-			buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-											 info->strategy);
-			page = BufferGetPage(buf);
-			opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-			Assert(opaque->hasho_bucket == cur_bucket);
-
-			/* Scan each tuple in page */
-			maxoffno = PageGetMaxOffsetNumber(page);
-			for (offno = FirstOffsetNumber;
-				 offno <= maxoffno;
-				 offno = OffsetNumberNext(offno))
-			{
-				IndexTuple	itup;
-				ItemPointer htup;
+		/*
+		 * We need to acquire a cleanup lock on the primary bucket page to
+		 * out-wait concurrent scans.
+		 */
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy);
+		LockBufferForCleanup(buf);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
 
-				itup = (IndexTuple) PageGetItem(page,
-												PageGetItemId(page, offno));
-				htup = &(itup->t_tid);
-				if (callback(htup, callback_state))
-				{
-					/* mark the item for deletion */
-					deletable[ndeletable++] = offno;
-					tuples_removed += 1;
-				}
-				else
-					num_index_tuples += 1;
-			}
+		page = BufferGetPage(buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
-			/*
-			 * Apply deletions and write page if needed, advance to next page.
-			 */
-			blkno = opaque->hasho_nextblkno;
+		/*
+		 * If the bucket contains tuples that were moved by a split, then we
+		 * need to delete such tuples once the split is complete.  Before
+		 * cleaning, we need to out-wait any scans that started while the
+		 * split was in progress.
+		 */
+		if (H_HAS_GARBAGE(bucket_opaque) &&
+			!H_INCOMPLETE_SPLIT(bucket_opaque))
+			bucket_has_garbage = true;
 
-			if (ndeletable > 0)
-			{
-				PageIndexMultiDelete(page, deletable, ndeletable);
-				_hash_wrtbuf(rel, buf);
-				bucket_dirty = true;
-			}
-			else
-				_hash_relbuf(rel, buf);
-		}
+		bucket_buf = buf;
 
-		/* If we deleted anything, try to compact free space */
-		if (bucket_dirty)
-			_hash_squeezebucket(rel, cur_bucket, bucket_blkno,
-								info->strategy);
+		hashbucketcleanup(rel, bucket_buf, blkno, info->strategy,
+						  local_metapage.hashm_maxbucket,
+						  local_metapage.hashm_highmask,
+						  local_metapage.hashm_lowmask, &tuples_removed,
+						  &num_index_tuples, bucket_has_garbage, true,
+						  callback, callback_state);
 
-		/* Release bucket lock */
-		_hash_droplock(rel, bucket_blkno, HASH_EXCLUSIVE);
+		_hash_relbuf(rel, bucket_buf);
 
 		/* Advance to next bucket */
 		cur_bucket++;
@@ -705,6 +665,197 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	return stats;
 }
 
+/*
+ * Helper function to perform deletion of index entries from a bucket.
+ *
+ * This expects that the caller has acquired a cleanup lock on the target
+ * bucket (primary page of a bucket) and that it is the caller's
+ * responsibility to release that lock.
+ *
+ * During the scan of overflow pages, we first lock the next page and only
+ * then release the lock on the current page.  This ensures that any
+ * concurrent scan started after we start cleaning the bucket will always be
+ * behind the cleanup.  If scans were allowed to overtake the cleanup, vacuum
+ * could remove tuples that are still required by the scan.
+ *
+ * We need to retain a pin on the primary bucket to ensure that no concurrent
+ * split can start.
+ */
+void
+hashbucketcleanup(Relation rel, Buffer bucket_buf,
+				  BlockNumber bucket_blkno,
+				  BufferAccessStrategy bstrategy,
+				  uint32 maxbucket,
+				  uint32 highmask, uint32 lowmask,
+				  double *tuples_removed,
+				  double *num_index_tuples,
+				  bool bucket_has_garbage,
+				  bool delay,
+				  IndexBulkDeleteCallback callback,
+				  void *callback_state)
+{
+	BlockNumber blkno;
+	Buffer		buf;
+	Bucket		cur_bucket;
+	Bucket new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
+	Page		page;
+	bool		bucket_dirty = false;
+
+	blkno = bucket_blkno;
+	buf = bucket_buf;
+	page = BufferGetPage(buf);
+	cur_bucket = ((HashPageOpaque) PageGetSpecialPointer(page))->hasho_bucket;
+
+	if (bucket_has_garbage)
+		new_bucket = _hash_get_newbucket(rel, cur_bucket,
+										 lowmask, maxbucket);
+
+	/* Scan each page in bucket */
+	for (;;)
+	{
+		HashPageOpaque opaque;
+		OffsetNumber offno;
+		OffsetNumber maxoffno;
+		Buffer		next_buf;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+		bool		retain_pin = false;
+		bool		curr_page_dirty = false;
+
+		if (delay)
+			vacuum_delay_point();
+
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		/* Scan each tuple in page */
+		maxoffno = PageGetMaxOffsetNumber(page);
+		for (offno = FirstOffsetNumber;
+			 offno <= maxoffno;
+			 offno = OffsetNumberNext(offno))
+		{
+			IndexTuple	itup;
+			ItemPointer htup;
+			Bucket		bucket;
+
+			itup = (IndexTuple) PageGetItem(page,
+											PageGetItemId(page, offno));
+			htup = &(itup->t_tid);
+			if (callback && callback(htup, callback_state))
+			{
+				/* mark the item for deletion */
+				deletable[ndeletable++] = offno;
+				if (tuples_removed)
+					*tuples_removed += 1;
+			}
+			else if (bucket_has_garbage)
+			{
+				/* delete the tuples that are moved by split. */
+				bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+											  maxbucket,
+											  highmask,
+											  lowmask);
+				/* mark the item for deletion */
+				if (bucket != cur_bucket)
+				{
+					/*
+					 * We expect tuples to belong either to the current bucket
+					 * or to new_bucket.  This is ensured because we don't
+					 * allow further splits from a bucket that contains
+					 * garbage.  See comments in _hash_expandtable.
+					 */
+					Assert(bucket == new_bucket);
+					deletable[ndeletable++] = offno;
+				}
+				else if (num_index_tuples)
+					*num_index_tuples += 1;
+			}
+			else if (num_index_tuples)
+				*num_index_tuples += 1;
+		}
+
+		/* retain the pin on primary bucket till end of bucket scan */
+		if (blkno == bucket_blkno)
+			retain_pin = true;
+		else
+			retain_pin = false;
+
+		blkno = opaque->hasho_nextblkno;
+
+		/*
+		 * Apply deletions and write page if needed, advance to next page.
+		 */
+		if (ndeletable > 0)
+		{
+			PageIndexMultiDelete(page, deletable, ndeletable);
+			bucket_dirty = true;
+			curr_page_dirty = true;
+		}
+
+		/* bail out if there are no more pages to scan. */
+		if (!BlockNumberIsValid(blkno))
+			break;
+
+		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											  LH_OVERFLOW_PAGE,
+											  bstrategy);
+
+		/*
+		 * release the lock on previous page after acquiring the lock on next
+		 * page
+		 */
+		if (curr_page_dirty)
+		{
+			if (retain_pin)
+				_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+			else
+				_hash_wrtbuf(rel, buf);
+			curr_page_dirty = false;
+		}
+		else if (retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
+
+		buf = next_buf;
+	}
+
+	/*
+	 * Lock the bucket page to clear the garbage flag and squeeze the bucket.
+	 * If the current buffer is the same as the bucket buffer, then we already
+	 * have a lock on the bucket page.
+	 */
+	if (buf != bucket_buf)
+	{
+		_hash_relbuf(rel, buf);
+		_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+	}
+
+	/*
+	 * Clear the garbage flag from the bucket after deleting the tuples that
+	 * were moved by a split.  We purposefully clear the flag before squeezing
+	 * the bucket, so that after a restart vacuum doesn't again try to delete
+	 * the moved-by-split tuples.
+	 */
+	if (bucket_has_garbage)
+	{
+		HashPageOpaque bucket_opaque;
+
+		page = BufferGetPage(bucket_buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+	}
+
+	/*
+	 * If we deleted anything, try to compact free space.  For squeezing the
+	 * bucket, we must have a cleanup lock, else it could change the ordering
+	 * of tuples seen by a scan that started before the squeeze.
+	 */
+	if (bucket_dirty && CheckBufferForCleanup(bucket_buf))
+		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
+							bstrategy);
+}
 
 void
 hash_redo(XLogReaderState *record)
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index acd2e64..5cfd0aa 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -28,7 +28,8 @@
 void
 _hash_doinsert(Relation rel, IndexTuple itup)
 {
-	Buffer		buf;
+	Buffer		buf = InvalidBuffer;
+	Buffer		bucket_buf;
 	Buffer		metabuf;
 	HashMetaPage metap;
 	BlockNumber blkno;
@@ -40,6 +41,9 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	bool		do_expand;
 	uint32		hashkey;
 	Bucket		bucket;
+	uint32		maxbucket;
+	uint32		highmask;
+	uint32		lowmask;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -70,51 +74,131 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			errhint("Values larger than a buffer page cannot be indexed.")));
 
 	/*
-	 * Loop until we get a lock on the correct target bucket.
+	 * Copy bucket mapping info now; the comment in _hash_expandtable where
+	 * we copy this information and call _hash_splitbucket explains why this
+	 * is OK.
 	 */
-	for (;;)
-	{
-		/*
-		 * Compute the target bucket number, and convert to block number.
-		 */
-		bucket = _hash_hashkey2bucket(hashkey,
-									  metap->hashm_maxbucket,
-									  metap->hashm_highmask,
-									  metap->hashm_lowmask);
+	maxbucket = metap->hashm_maxbucket;
+	highmask = metap->hashm_highmask;
+	lowmask = metap->hashm_lowmask;
 
-		blkno = BUCKET_TO_BLKNO(metap, bucket);
+	/*
+	 * Conditionally get the lock on the primary bucket page for insertion
+	 * while holding the lock on the meta page.  If we have to wait, release
+	 * the meta page lock, take the bucket lock unconditionally, and retry.
+	 */
+	bucket = _hash_hashkey2bucket(hashkey,
+								  maxbucket,
+								  highmask,
+								  lowmask);
+
+	blkno = BUCKET_TO_BLKNO(metap, bucket);
 
-		/* Release metapage lock, but keep pin. */
+	/* Fetch the primary bucket page for the bucket */
+	buf = ReadBuffer(rel, blkno);
+	if (!ConditionalLockBuffer(buf))
+	{
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(buf, HASH_WRITE);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+		oldblkno = blkno;
+		retry = true;
+	}
+	else
+	{
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
 		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+	}
 
+	if (retry)
+	{
 		/*
-		 * If the previous iteration of this loop locked what is still the
-		 * correct target bucket, we are done.  Otherwise, drop any old lock
-		 * and lock what now appears to be the correct bucket.
+		 * Loop until we get a lock on the correct target bucket.  We take the
+		 * lock on the primary bucket page and retain the pin on it for the
+		 * duration of the insert operation to prevent concurrent splits: a
+		 * split cannot start because it needs a cleanup lock on the primary
+		 * bucket page.  Acquiring the lock on the primary bucket and
+		 * rechecking that it is still the target bucket is mandatory, as
+		 * otherwise a concurrent split might cause this insertion to land in
+		 * the wrong bucket.
 		 */
-		if (retry)
+		for (;;)
 		{
+			/*
+			 * Compute the target bucket number, and convert to block number.
+			 */
+			bucket = _hash_hashkey2bucket(hashkey,
+										  metap->hashm_maxbucket,
+										  metap->hashm_highmask,
+										  metap->hashm_lowmask);
+
+			blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+			/* Release metapage lock, but keep pin. */
+			_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+			/*
+			 * If the previous iteration of this loop locked what is still the
+			 * correct target bucket, we are done.  Otherwise, drop any old
+			 * lock and lock what now appears to be the correct bucket.
+			 */
 			if (oldblkno == blkno)
 				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
-		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+			_hash_relbuf(rel, buf);
 
-		/*
-		 * Reacquire metapage lock and check that no bucket split has taken
-		 * place while we were awaiting the bucket lock.
-		 */
-		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
-		oldblkno = blkno;
-		retry = true;
+			/* Fetch the primary bucket page for the bucket */
+			buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+
+			/*
+			 * Reacquire metapage lock and check that no bucket split has
+			 * taken place while we were awaiting the bucket lock.
+			 */
+			_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+			oldblkno = blkno;
+		}
 	}
 
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+	/* remember the primary bucket buffer to release the pin on it at end. */
+	bucket_buf = buf;
+
 	page = BufferGetPage(buf);
 	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(pageopaque->hasho_bucket == bucket);
 
+	/*
+	 * If there is any pending split, try to finish it before proceeding with
+	 * the insertion.  We only do this when inserting into the old bucket, as
+	 * that allows us to remove the moved tuples from the old bucket and reuse
+	 * the space.  There is no comparable benefit to finishing the split when
+	 * inserting into the new bucket.
+	 *
+	 * In the future, if we want to finish splits during insertion into the
+	 * new bucket, we must ensure a locking order such that the old bucket is
+	 * locked before the new bucket.
+	 */
+	if (H_OLD_INCOMPLETE_SPLIT(pageopaque) && CheckBufferForCleanup(buf))
+	{
+		BlockNumber nblkno;
+		Buffer		nbuf;
+
+		nblkno = _hash_get_newblk(rel, pageopaque);
+
+		/* Fetch the primary bucket page for the new bucket */
+		nbuf = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+		if (nbuf)
+		{
+			_hash_finish_split(rel, metabuf, buf, nbuf, maxbucket,
+							   highmask, lowmask);
+
+			/*
+			 * release the buffer here as the insertion will happen in old
+			 * bucket.
+			 */
+			_hash_relbuf(rel, nbuf);
+		}
+	}
+
 	/* Do the insertion */
 	while (PageGetFreeSpace(page) < itemsz)
 	{
@@ -127,14 +211,23 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		{
 			/*
 			 * ovfl page exists; go get it.  if it doesn't have room, we'll
-			 * find out next pass through the loop test above.
+			 * find out next pass through the loop test above.  Retain the pin
+			 * if it is a primary bucket.
 			 */
-			_hash_relbuf(rel, buf);
+			if (pageopaque->hasho_flag & LH_BUCKET_PAGE)
+				_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+			else
+				_hash_relbuf(rel, buf);
 			buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 			page = BufferGetPage(buf);
 		}
 		else
 		{
+			bool		retain_pin = false;
+
+			/* page flags must be accessed before releasing lock on a page. */
+			retain_pin = pageopaque->hasho_flag & LH_BUCKET_PAGE;
+
 			/*
 			 * we're at the end of the bucket chain and we haven't found a
 			 * page with enough room.  allocate a new overflow page.
@@ -144,7 +237,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 
 			/* chain to a new overflow page */
-			buf = _hash_addovflpage(rel, metabuf, buf);
+			buf = _hash_addovflpage(rel, metabuf, buf, retain_pin);
 			page = BufferGetPage(buf);
 
 			/* should fit now, given test above */
@@ -158,11 +251,13 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	/* found page with enough space, so add the item here */
 	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
 
-	/* write and release the modified page */
+	 * write and release the modified page, and make sure to release the pin
+	 * on the primary bucket page.
+	 * primary page.
+	 */
 	_hash_wrtbuf(rel, buf);
-
-	/* We can drop the bucket lock now */
-	_hash_droplock(rel, blkno, HASH_SHARE);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
 
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index db3e268..760563a 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -82,23 +82,20 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *
  *	On entry, the caller must hold a pin but no lock on 'buf'.  The pin is
  *	dropped before exiting (we assume the caller is not interested in 'buf'
- *	anymore).  The returned overflow page will be pinned and write-locked;
- *	it is guaranteed to be empty.
+ *	anymore) if not asked to retain.  The pin will be retained only for the
+ *	primary bucket.  The returned overflow page will be pinned and
+ *	write-locked; it is guaranteed to be empty.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
  *
- *	The caller must hold at least share lock on the bucket, to ensure that
- *	no one else tries to compact the bucket meanwhile.  This guarantees that
- *	'buf' won't stop being part of the bucket while it's unlocked.
- *
  * NB: since this could be executed concurrently by multiple processes,
  * one should not assume that the returned overflow page will be the
  * immediate successor of the originally passed 'buf'.  Additional overflow
  * pages might have been added to the bucket chain in between.
  */
 Buffer
-_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
+_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 {
 	Buffer		ovflbuf;
 	Page		page;
@@ -131,7 +128,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 			break;
 
 		/* we assume we do not need to write the unmodified page */
-		_hash_relbuf(rel, buf);
+		if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
 
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
@@ -149,7 +149,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 
 	/* logically chain overflow page to previous page */
 	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	_hash_wrtbuf(rel, buf);
+	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+		_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_wrtbuf(rel, buf);
 
 	return ovflbuf;
 }
@@ -370,11 +373,11 @@ _hash_firstfreebit(uint32 map)
  *	in the bucket, or InvalidBlockNumber if no following page.
  *
  *	NB: caller must not hold lock on metapage, nor on either page that's
- *	adjacent in the bucket chain.  The caller had better hold exclusive lock
- *	on the bucket, too.
+ *	adjacent in the bucket chain, except for the primary bucket page.  The
+ *	caller had better hold a cleanup lock on the primary bucket page.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf,
+_hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
@@ -413,22 +416,41 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf,
 	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
-	 * deleted.  No concurrency issues since we hold exclusive lock on the
-	 * entire bucket.
+	 * deleted.  No concurrency issues since we hold the cleanup lock on
+	 * primary bucket.  We don't need to acquire a buffer lock to fix the
+	 * primary bucket page, as we already have that lock.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
-														 prevblkno,
-														 HASH_WRITE,
+		if (prevblkno == bucket_blkno)
+		{
+			Buffer		prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
+													 prevblkno,
+													 RBM_NORMAL,
+													 bstrategy);
+
+			Page		prevpage = BufferGetPage(prevbuf);
+			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+			Assert(prevopaque->hasho_bucket == bucket);
+			prevopaque->hasho_nextblkno = nextblkno;
+			MarkBufferDirty(prevbuf);
+			ReleaseBuffer(prevbuf);
+		}
+		else
+		{
+			Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
+															 prevblkno,
+															 HASH_WRITE,
 										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		prevpage = BufferGetPage(prevbuf);
-		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+															 bstrategy);
+			Page		prevpage = BufferGetPage(prevbuf);
+			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
 
-		Assert(prevopaque->hasho_bucket == bucket);
-		prevopaque->hasho_nextblkno = nextblkno;
-		_hash_wrtbuf(rel, prevbuf);
+			Assert(prevopaque->hasho_bucket == bucket);
+			prevopaque->hasho_nextblkno = nextblkno;
+			_hash_wrtbuf(rel, prevbuf);
+		}
 	}
 	if (BlockNumberIsValid(nextblkno))
 	{
@@ -570,7 +592,7 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
  *	required that to be true on entry as well, but it's a lot easier for
  *	callers to leave empty overflow pages and let this guy clean it up.
  *
- *	Caller must hold exclusive lock on the target bucket.  This allows
+ *	Caller must hold cleanup lock on the target bucket.  This allows
  *	us to safely lock multiple pages in the bucket.
  *
  *	Since this function is invoked in VACUUM, we provide an access strategy
@@ -580,6 +602,7 @@ void
 _hash_squeezebucket(Relation rel,
 					Bucket bucket,
 					BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy)
 {
 	BlockNumber wblkno;
@@ -591,27 +614,22 @@ _hash_squeezebucket(Relation rel,
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
 	bool		wbuf_dirty;
+	bool		release_buf = false;
 
 	/*
 	 * start squeezing into the base bucket page.
 	 */
 	wblkno = bucket_blkno;
-	wbuf = _hash_getbuf_with_strategy(rel,
-									  wblkno,
-									  HASH_WRITE,
-									  LH_BUCKET_PAGE,
-									  bstrategy);
+	wbuf = bucket_buf;
 	wpage = BufferGetPage(wbuf);
 	wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 
 	/*
-	 * if there aren't any overflow pages, there's nothing to squeeze.
+	 * if there aren't any overflow pages, there's nothing to squeeze.  The
+	 * caller is responsible for releasing the lock on the primary bucket page.
 	 */
 	if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
-	{
-		_hash_relbuf(rel, wbuf);
 		return;
-	}
 
 	/*
 	 * Find the last page in the bucket chain by starting at the base bucket
@@ -669,12 +687,17 @@ _hash_squeezebucket(Relation rel,
 			{
 				Assert(!PageIsEmpty(wpage));
 
+				if (wblkno != bucket_blkno)
+					release_buf = true;
+
 				wblkno = wopaque->hasho_nextblkno;
 				Assert(BlockNumberIsValid(wblkno));
 
-				if (wbuf_dirty)
+				if (wbuf_dirty && release_buf)
 					_hash_wrtbuf(rel, wbuf);
-				else
+				else if (wbuf_dirty)
+					MarkBufferDirty(wbuf);
+				else if (release_buf)
 					_hash_relbuf(rel, wbuf);
 
 				/* nothing more to do if we reached the read page */
@@ -700,6 +723,7 @@ _hash_squeezebucket(Relation rel,
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
 				wbuf_dirty = false;
+				release_buf = false;
 			}
 
 			/*
@@ -733,19 +757,25 @@ _hash_squeezebucket(Relation rel,
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
 		{
-			/* yes, so release wbuf lock first */
-			if (wbuf_dirty)
+			if (wblkno != bucket_blkno)
+				release_buf = true;
+
+			/* yes, so release wbuf lock first if needed */
+			if (wbuf_dirty && release_buf)
 				_hash_wrtbuf(rel, wbuf);
-			else
+			else if (wbuf_dirty)
+				MarkBufferDirty(wbuf);
+			else if (release_buf)
 				_hash_relbuf(rel, wbuf);
+
 			/* free this overflow page (releases rbuf) */
-			_hash_freeovflpage(rel, rbuf, bstrategy);
+			_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
 			/* done */
 			return;
 		}
 
 		/* free this overflow page, then get the previous one */
-		_hash_freeovflpage(rel, rbuf, bstrategy);
+		_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
 
 		rbuf = _hash_getbuf_with_strategy(rel,
 										  rblkno,
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 178463f..f51c313 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -38,10 +38,14 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
 					uint32 nblocks);
 static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
+static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
+					   Bucket obucket, Bucket nbucket, Buffer obuf,
+					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
+					   uint32 highmask, uint32 lowmask);
 
 
 /*
@@ -55,46 +59,6 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 
 
 /*
- * _hash_getlock() -- Acquire an lmgr lock.
- *
- * 'whichlock' should the block number of a bucket's primary bucket page to
- * acquire the per-bucket lock.  (See README for details of the use of these
- * locks.)
- *
- * 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
- */
-void
-_hash_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		LockPage(rel, whichlock, access);
-}
-
-/*
- * _hash_try_getlock() -- Acquire an lmgr lock, but only if it's free.
- *
- * Same as above except we return FALSE without blocking if lock isn't free.
- */
-bool
-_hash_try_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		return ConditionalLockPage(rel, whichlock, access);
-	else
-		return true;
-}
-
-/*
- * _hash_droplock() -- Release an lmgr lock.
- */
-void
-_hash_droplock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		UnlockPage(rel, whichlock, access);
-}
-
-/*
  *	_hash_getbuf() -- Get a buffer by block number for read or write.
  *
  *		'access' must be HASH_READ, HASH_WRITE, or HASH_NOLOCK.
@@ -132,6 +96,35 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access, int flags)
 }
 
 /*
+ * _hash_getbuf_with_condlock_cleanup() -- as above, but get the buffer for write.
+ *
+ *		We try to acquire the cleanup lock conditionally; if we get it we
+ *		return the buffer, otherwise we return InvalidBuffer.
+ */
+Buffer
+_hash_getbuf_with_condlock_cleanup(Relation rel, BlockNumber blkno, int flags)
+{
+	Buffer		buf;
+
+	if (blkno == P_NEW)
+		elog(ERROR, "hash AM does not use P_NEW");
+
+	buf = ReadBuffer(rel, blkno);
+
+	if (!ConditionalLockBufferForCleanup(buf))
+	{
+		ReleaseBuffer(buf);
+		return InvalidBuffer;
+	}
+
+	/* ref count and lock type are correct */
+
+	_hash_checkpage(rel, buf, flags);
+
+	return buf;
+}
+
+/*
  *	_hash_getinitbuf() -- Get and initialize a buffer by block number.
  *
  *		This must be used only to fetch pages that are known to be before
@@ -266,6 +259,33 @@ _hash_dropbuf(Relation rel, Buffer buf)
 }
 
 /*
+ *	_hash_dropscanbuf() -- release buffers used in scan.
+ *
+ * This routine unpins the buffers used during scan on which we
+ * hold no lock.
+ */
+void
+_hash_dropscanbuf(Relation rel, HashScanOpaque so)
+{
+	/* release pin we hold on primary bucket */
+	if (BufferIsValid(so->hashso_bucket_buf) &&
+		so->hashso_bucket_buf != so->hashso_curbuf)
+		_hash_dropbuf(rel, so->hashso_bucket_buf);
+	so->hashso_bucket_buf = InvalidBuffer;
+
+	/* release pin we hold on old primary bucket */
+	if (BufferIsValid(so->hashso_old_bucket_buf) &&
+		so->hashso_old_bucket_buf != so->hashso_curbuf)
+		_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+	so->hashso_old_bucket_buf = InvalidBuffer;
+
+	/* release any pin we still hold */
+	if (BufferIsValid(so->hashso_curbuf))
+		_hash_dropbuf(rel, so->hashso_curbuf);
+	so->hashso_curbuf = InvalidBuffer;
+}
+
+/*
  *	_hash_wrtbuf() -- write a hash page to disk.
  *
  *		This routine releases the lock held on the buffer and our refcount
@@ -489,9 +509,11 @@ _hash_pageinit(Page page, Size size)
 /*
  * Attempt to expand the hash table by creating one new bucket.
  *
- * This will silently do nothing if it cannot get the needed locks.
+ * This will silently do nothing if there are active scans in our own
+ * backend or if we don't get the cleanup lock on the old or the new bucket.
  *
- * The caller should hold no locks on the hash index.
+ * It also completes any pending split and removes tuples left over in the
+ * old bucket from a previous split.
  *
  * The caller must hold a pin, but no lock, on the metapage buffer.
  * The buffer is returned in the same state.
@@ -506,10 +528,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	BlockNumber start_oblkno;
 	BlockNumber start_nblkno;
 	Buffer		buf_nblkno;
+	Buffer		buf_oblkno;
+	Page		opage;
+	HashPageOpaque oopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
 
+restart_expand:
+
 	/*
 	 * Write-lock the meta page.  It used to be necessary to acquire a
 	 * heavyweight lock to begin a split, but that is no longer required.
@@ -548,11 +575,16 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		goto fail;
 
 	/*
-	 * Determine which bucket is to be split, and attempt to lock the old
-	 * bucket.  If we can't get the lock, give up.
+	 * Determine which bucket is to be split, and attempt to take cleanup lock
+	 * on the old bucket.  If we can't get the lock, give up.
 	 *
-	 * The lock protects us against other backends, but not against our own
-	 * backend.  Must check for active scans separately.
+	 * The cleanup lock protects us against other backends, but not against
+	 * our own backend.  Must check for active scans separately.
+	 *
+	 * The cleanup lock is mainly to protect the split from concurrent
+	 * inserts.  See src/backend/access/hash/README, Lock Definitions, for
+	 * further details.  Due to this locking restriction, if there is any
+	 * pending scan, the split will give up, which is not good, but harmless.
 	 */
 	new_bucket = metap->hashm_maxbucket + 1;
 
@@ -563,11 +595,90 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	if (_hash_has_active_scan(rel, old_bucket))
 		goto fail;
 
-	if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
+	buf_oblkno = _hash_getbuf_with_condlock_cleanup(rel, start_oblkno, LH_BUCKET_PAGE);
+	if (!buf_oblkno)
 		goto fail;
 
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
 	/*
-	 * Likewise lock the new bucket (should never fail).
+	 * We want to finish any pending split from this bucket before starting a
+	 * new one; there is no apparent benefit in deferring it, and allowing
+	 * splits from multiple generations to be pending at once would complicate
+	 * the code, especially if the new split were to fail as well.  We don't
+	 * need to consider the new bucket for completing the split here, as it is
+	 * not possible for a re-split of the new bucket to start while there is
+	 * still a pending split from the old bucket.
+	 */
+	if (H_OLD_INCOMPLETE_SPLIT(oopaque))
+	{
+		BlockNumber nblkno;
+		Buffer		buf_nblkno;
+
+		/*
+		 * Copy bucket mapping info now;  The comment in code below where we
+		 * copy this information and calls _hash_splitbucket explains why this
+		 * is OK.
+		 */
+		maxbucket = metap->hashm_maxbucket;
+		highmask = metap->hashm_highmask;
+		lowmask = metap->hashm_lowmask;
+
+		/* Release the metapage lock, before completing the split. */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+		nblkno = _hash_get_newblk(rel, oopaque);
+
+		/* Fetch the primary bucket page for the new bucket */
+		buf_nblkno = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+		if (!buf_nblkno)
+		{
+			_hash_relbuf(rel, buf_oblkno);
+			goto fail;
+		}
+
+		_hash_finish_split(rel, metabuf, buf_oblkno, buf_nblkno, maxbucket,
+						   highmask, lowmask);
+
+		/*
+		 * release the buffers and retry for expand.
+		 */
+		_hash_relbuf(rel, buf_oblkno);
+		_hash_relbuf(rel, buf_nblkno);
+
+		goto restart_expand;
+	}
+
+	/*
+	 * Clean up the tuples left over from the previous split.  This operation
+	 * requires a cleanup lock and we already have one on the old bucket, so
+	 * let's do it.  We also don't want to allow further splits from the
+	 * bucket until the garbage of the previous split has been cleaned.  This
+	 * has two advantages: first, it helps avoid bloat due to garbage; second,
+	 * during cleanup of the bucket we can always be sure that the garbage
+	 * tuples belong to the most recently split bucket.  On the contrary, if
+	 * we allowed cleanup of a bucket after the meta page had been updated to
+	 * indicate a new split but before the actual split, the cleanup operation
+	 * would not be able to decide whether a tuple had been moved to the newly
+	 * created bucket and might end up deleting such tuples.
+	 */
+	if (H_HAS_GARBAGE(oopaque))
+	{
+		/* Release the metapage lock. */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+		hashbucketcleanup(rel, buf_oblkno, start_oblkno, NULL,
+						  metap->hashm_maxbucket, metap->hashm_highmask,
+						  metap->hashm_lowmask, NULL,
+						  NULL, true, false, NULL, NULL);
+
+		_hash_relbuf(rel, buf_oblkno);
+
+		goto restart_expand;
+	}
+
+	/*
+	 * There shouldn't be any active scan on new bucket.
 	 *
 	 * Note: it is safe to compute the new bucket's blkno here, even though we
 	 * may still need to update the BUCKET_TO_BLKNO mapping.  This is because
@@ -579,9 +690,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	if (_hash_has_active_scan(rel, new_bucket))
 		elog(ERROR, "scan in progress on supposedly new bucket");
 
-	if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
-		elog(ERROR, "could not get lock on supposedly new bucket");
-
 	/*
 	 * If the split point is increasing (hashm_maxbucket's log base 2
 	 * increases), we need to allocate a new batch of bucket pages.
@@ -600,8 +708,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
 			/* can't split due to BlockNumber overflow */
-			_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-			_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
+			_hash_relbuf(rel, buf_oblkno);
 			goto fail;
 		}
 	}
@@ -609,9 +716,18 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/*
 	 * Physically allocate the new bucket's primary page.  We want to do this
 	 * before changing the metapage's mapping info, in case we can't get the
-	 * disk space.
+	 * disk space.  Ideally, we don't need to check for cleanup lock on new
+	 * disk space.  Ideally, we wouldn't need to check for a cleanup lock on
+	 * the new bucket, as no other backend can find this bucket until the meta
+	 * page is updated.  However, it is good to be consistent with the old
+	 * bucket's locking.
 	buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM);
+	if (!CheckBufferForCleanup(buf_nblkno))
+	{
+		_hash_relbuf(rel, buf_oblkno);
+		_hash_relbuf(rel, buf_nblkno);
+		goto fail;
+	}
+
 
 	/*
 	 * Okay to proceed with split.  Update the metapage bucket mapping info.
@@ -665,13 +781,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  start_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno,
 					  maxbucket, highmask, lowmask);
 
-	/* Release bucket locks, allowing others to access them */
-	_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-	_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
-
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -745,6 +857,10 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
  * The buffer is returned in the same state.  (The metapage is only
  * touched if it becomes necessary to add or remove overflow pages.)
  *
+ * Split needs to hold pins on the primary bucket pages of both the old and
+ * new buckets until the end of the operation.  This prevents vacuum from
+ * starting while the split is in progress.
+ *
  * In addition, the caller must have created the new bucket's base page,
  * which is passed in buffer nbuf, pinned and write-locked.  That lock and
  * pin are released here.  (The API is set up this way because we must do
@@ -756,37 +872,87 @@ _hash_splitbucket(Relation rel,
 				  Buffer metabuf,
 				  Bucket obucket,
 				  Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Buffer		obuf;
 	Page		opage;
 	Page		npage;
 	HashPageOpaque oopaque;
 	HashPageOpaque nopaque;
 
-	/*
-	 * It should be okay to simultaneously write-lock pages from each bucket,
-	 * since no one else can be trying to acquire buffer lock on pages of
-	 * either bucket.
-	 */
-	obuf = _hash_getbuf(rel, start_oblkno, HASH_WRITE, LH_BUCKET_PAGE);
 	opage = BufferGetPage(obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 
+	/*
+	 * Mark the old bucket to indicate that a split is in progress and that it
+	 * has deletable tuples.  At the end of the operation we clear the
+	 * split-in-progress flag; vacuum will clear the has-garbage flag after
+	 * deleting such tuples.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
+
 	npage = BufferGetPage(nbuf);
 
-	/* initialize the new bucket's primary page */
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 	nopaque->hasho_prevblkno = InvalidBlockNumber;
 	nopaque->hasho_nextblkno = InvalidBlockNumber;
 	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
 	nopaque->hasho_page_id = HASHO_PAGE_ID;
 
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, nbuf, NULL,
+						   maxbucket, highmask, lowmask);
+
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, obuf);
+	_hash_relbuf(rel, nbuf);
+}
+
+/*
+ * _hash_splitbucket_guts -- Helper function to perform the split operation
+ *
+ * This routine partitions the tuples between the old and new buckets and is
+ * also used to finish incomplete split operations.  To finish a previously
+ * interrupted split, the caller needs to fill htab with the TIDs already
+ * moved to the new bucket; tuples found in htab are skipped.  A NULL htab
+ * means that all tuples belonging to the new bucket are moved.
+ *
+ * Caller needs to lock and unlock the old and new primary buckets.
+ */
+static void
+_hash_splitbucket_guts(Relation rel,
+					   Buffer metabuf,
+					   Bucket obucket,
+					   Bucket nbucket,
+					   Buffer obuf,
+					   Buffer nbuf,
+					   HTAB *htab,
+					   uint32 maxbucket,
+					   uint32 highmask,
+					   uint32 lowmask)
+{
+	Buffer		bucket_obuf;
+	Buffer		bucket_nbuf;
+	Page		opage;
+	Page		npage;
+	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
+
+	bucket_obuf = obuf;
+	opage = BufferGetPage(obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	bucket_nbuf = nbuf;
+	npage = BufferGetPage(nbuf);
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
 	/*
 	 * Partition the tuples in the old bucket between the old bucket and the
 	 * new bucket, advancing along the old bucket's overflow bucket chain and
@@ -798,8 +964,6 @@ _hash_splitbucket(Relation rel,
 		BlockNumber oblkno;
 		OffsetNumber ooffnum;
 		OffsetNumber omaxoffnum;
-		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
 
 		/* Scan each tuple in old page */
 		omaxoffnum = PageGetMaxOffsetNumber(opage);
@@ -810,18 +974,45 @@ _hash_splitbucket(Relation rel,
 			IndexTuple	itup;
 			Size		itemsz;
 			Bucket		bucket;
+			bool		found = false;
 
 			/*
-			 * Fetch the item's hash key (conveniently stored in the item) and
-			 * determine which bucket it now belongs in.
+			 * Before inserting the tuple, probe the hash table containing the
+			 * TIDs of tuples belonging to the new bucket; if we find a match,
+			 * skip that tuple, else fetch the item's hash key (conveniently
+			 * stored in the item) and determine which bucket it now belongs in.
 			 */
 			itup = (IndexTuple) PageGetItem(opage,
 											PageGetItemId(opage, ooffnum));
+
+			if (htab)
+				(void) hash_search(htab, &itup->t_tid, HASH_FIND, &found);
+
+			if (found)
+				continue;
+
 			bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
 										  maxbucket, highmask, lowmask);
 
 			if (bucket == nbucket)
 			{
+				Size		itupsize = 0;
+				IndexTuple	new_itup;
+
+				/*
+				 * make a copy of index tuple as we have to scribble on it.
+				 */
+				new_itup = CopyIndexTuple(itup);
+
+				/*
+				 * mark the index tuple as moved by split; such tuples are
+				 * skipped by a scan if a split is in progress for the bucket.
+				 */
+				itupsize = new_itup->t_info & INDEX_SIZE_MASK;
+				new_itup->t_info &= ~INDEX_SIZE_MASK;
+				new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
+				new_itup->t_info |= itupsize;
+
 				/*
 				 * insert the tuple into the new bucket.  if it doesn't fit on
 				 * the current page in the new bucket, we must allocate a new
@@ -832,17 +1023,25 @@ _hash_splitbucket(Relation rel,
 				 * only partially complete, meaning the index is corrupt,
 				 * since searches may fail to find entries they should find.
 				 */
-				itemsz = IndexTupleDSize(*itup);
+				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
 				if (PageGetFreeSpace(npage) < itemsz)
 				{
+					bool		retain_pin = false;
+
+					/*
+					 * page flags must be accessed before releasing lock on a
+					 * page.
+					 */
+					retain_pin = nopaque->hasho_flag & LH_BUCKET_PAGE;
+
 					/* write out nbuf and drop lock, but keep pin */
 					_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
 					/* chain to a new overflow page */
-					nbuf = _hash_addovflpage(rel, metabuf, nbuf);
+					nbuf = _hash_addovflpage(rel, metabuf, nbuf, retain_pin);
 					npage = BufferGetPage(nbuf);
-					/* we don't need nopaque within the loop */
+					nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 				}
 
 				/*
@@ -852,12 +1051,10 @@ _hash_splitbucket(Relation rel,
 				 * Possible future improvement: accumulate all the items for
 				 * the new page and qsort them before insertion.
 				 */
-				(void) _hash_pgaddtup(rel, nbuf, itemsz, itup);
+				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
-				/*
-				 * Mark tuple for deletion from old page.
-				 */
-				deletable[ndeletable++] = ooffnum;
+				/* be tidy */
+				pfree(new_itup);
 			}
 			else
 			{
@@ -870,15 +1067,9 @@ _hash_splitbucket(Relation rel,
 
 		oblkno = oopaque->hasho_nextblkno;
 
-		/*
-		 * Done scanning this old page.  If we moved any tuples, delete them
-		 * from the old page.
-		 */
-		if (ndeletable > 0)
-		{
-			PageIndexMultiDelete(opage, deletable, ndeletable);
-			_hash_wrtbuf(rel, obuf);
-		}
+		/* retain the pin on the old primary bucket */
+		if (obuf == bucket_obuf)
+			_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
 		else
 			_hash_relbuf(rel, obuf);
 
@@ -887,18 +1078,153 @@ _hash_splitbucket(Relation rel,
 			break;
 
 		/* Else, advance to next old page */
-		obuf = _hash_getbuf(rel, oblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
+		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
 		opage = BufferGetPage(obuf);
 		oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 	}
 
 	/*
 	 * We're at the end of the old bucket chain, so we're done partitioning
-	 * the tuples.  Before quitting, call _hash_squeezebucket to ensure the
-	 * tuples remaining in the old bucket (including the overflow pages) are
-	 * packed as tightly as possible.  The new bucket is already tight.
+	 * the tuples.  Mark the old and new buckets to indicate split is
+	 * finished.
+	 *
+	 * To avoid deadlocks due to locking order of buckets, first lock the old
+	 * bucket and then the new bucket.
 	 */
-	_hash_wrtbuf(rel, nbuf);
+	if (nopaque->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_wrtbuf(rel, nbuf);
+
+	/*
+	 * Acquiring cleanup lock to clear the split-in-progress flag ensures that
+	 * there is no pending scan that has seen the flag after it is cleared.
+	 */
+	_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+	opage = BufferGetPage(bucket_obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	_hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);
+	npage = BufferGetPage(bucket_nbuf);
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
+	/* indicate that split is finished */
+	oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
+	nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;
+
+	/*
+	 * now mark the buffers dirty; we don't release the locks here, as the
+	 * caller is responsible for releasing them.
+	 */
+	MarkBufferDirty(bucket_obuf);
+	MarkBufferDirty(bucket_nbuf);
+}
+
+/*
+ *	_hash_finish_split() -- Finish the previously interrupted split operation
+ *
+ * To complete the split operation, we build a hash table of the TIDs already
+ * present in the new bucket; the split operation then uses it to skip tuples
+ * that were moved before the split was previously interrupted.
+ *
+ * The caller must hold a pin, but no lock, on the metapage buffer.
+ * The buffer is returned in the same state.  (The metapage is only
+ * touched if it becomes necessary to add or remove overflow pages.)
+ *
+ * 'obuf' and 'nbuf' must be locked by the caller which is also responsible
+ * for unlocking them.
+ */
+void
+_hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Buffer nbuf,
+				   uint32 maxbucket, uint32 highmask, uint32 lowmask)
+{
+	HASHCTL		hash_ctl;
+	HTAB	   *tidhtab;
+	Buffer		bucket_nbuf;
+	Page		opage;
+	Page		npage;
+	HashPageOpaque opageopaque;
+	HashPageOpaque npageopaque;
+	Bucket		obucket;
+	Bucket		nbucket;
+	bool		found;
+
+	/* Initialize hash tables used to track TIDs */
+	memset(&hash_ctl, 0, sizeof(hash_ctl));
+	hash_ctl.keysize = sizeof(ItemPointerData);
+	hash_ctl.entrysize = sizeof(ItemPointerData);
+	hash_ctl.hcxt = CurrentMemoryContext;
+
+	tidhtab =
+		hash_create("bucket ctids",
+					256,		/* arbitrary initial size */
+					&hash_ctl,
+					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+	/*
+	 * Scan the new bucket and build hash table of TIDs
+	 */
+	bucket_nbuf = nbuf;
+	npage = BufferGetPage(nbuf);
+	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	for (;;)
+	{
+		BlockNumber nblkno;
+		OffsetNumber noffnum;
+		OffsetNumber nmaxoffnum;
+
+		/* Scan each tuple in new page */
+		nmaxoffnum = PageGetMaxOffsetNumber(npage);
+		for (noffnum = FirstOffsetNumber;
+			 noffnum <= nmaxoffnum;
+			 noffnum = OffsetNumberNext(noffnum))
+		{
+			IndexTuple	itup;
+
+			/* Fetch the item's TID and insert it in hash table. */
+			itup = (IndexTuple) PageGetItem(npage,
+											PageGetItemId(npage, noffnum));
+
+			(void) hash_search(tidhtab, &itup->t_tid, HASH_ENTER, &found);
+
+			Assert(!found);
+		}
+
+		nblkno = npageopaque->hasho_nextblkno;
+
+		/*
+		 * release our lock without modifying the buffer, being careful to
+		 * retain the pin on the primary bucket.
+		 */
+		if (nbuf == bucket_nbuf)
+			_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, nbuf);
+
+		/* Exit loop if no more overflow pages in new bucket */
+		if (!BlockNumberIsValid(nblkno))
+			break;
+
+		/* Else, advance to next page */
+		nbuf = _hash_getbuf(rel, nblkno, HASH_READ, LH_OVERFLOW_PAGE);
+		npage = BufferGetPage(nbuf);
+		npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	}
+
+	/* Need a cleanup lock to perform split operation. */
+	LockBufferForCleanup(bucket_nbuf);
+
+	npage = BufferGetPage(bucket_nbuf);
+	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nbucket = npageopaque->hasho_bucket;
+
+	opage = BufferGetPage(obuf);
+	opageopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+	obucket = opageopaque->hasho_bucket;
+
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, bucket_nbuf, tidhtab,
+						   maxbucket, highmask, lowmask);
 
-	_hash_squeezebucket(rel, obucket, start_oblkno, NULL);
+	hash_destroy(tidhtab);
 }
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..6ec3bea 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -72,7 +72,19 @@ _hash_readnext(Relation rel,
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_nextblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on the primary bucket page till the end of scan to
+	 * ensure that vacuum can't delete tuples that were moved to the new
+	 * bucket by a split.  Such tuples are needed by scans that started on
+	 * split buckets before the new bucket's split-in-progress flag
+	 * (LH_BUCKET_NEW_PAGE_SPLIT) is cleared.
+	 */
+	if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -94,7 +106,16 @@ _hash_readprev(Relation rel,
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_prevblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on primary bucket page till the end of scan. See
+	 * comments in _hash_readnext to know the reason of retaining pin.
+	 */
+	if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -104,6 +125,13 @@ _hash_readprev(Relation rel,
 							 LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 		*pagep = BufferGetPage(*bufp);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
+
+		/*
+		 * We always maintain the pin on bucket page for whole scan operation,
+		 * so releasing the additional pin we have acquired here.
+		 */
+		if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+			_hash_dropbuf(rel, *bufp);
 	}
 }
 
@@ -192,43 +220,81 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	metap = HashPageGetMeta(page);
 
 	/*
-	 * Loop until we get a lock on the correct target bucket.
+	 * Conditionally get the lock on the primary bucket page for search while
+	 * holding the lock on the meta page. If we have to wait, then release the
+	 * meta page lock and retry the hard way.
 	 */
-	for (;;)
-	{
-		/*
-		 * Compute the target bucket number, and convert to block number.
-		 */
-		bucket = _hash_hashkey2bucket(hashkey,
-									  metap->hashm_maxbucket,
-									  metap->hashm_highmask,
-									  metap->hashm_lowmask);
+	bucket = _hash_hashkey2bucket(hashkey,
+								  metap->hashm_maxbucket,
+								  metap->hashm_highmask,
+								  metap->hashm_lowmask);
 
-		blkno = BUCKET_TO_BLKNO(metap, bucket);
+	blkno = BUCKET_TO_BLKNO(metap, bucket);
 
-		/* Release metapage lock, but keep pin. */
+	/* Fetch the primary bucket page for the bucket */
+	buf = ReadBuffer(rel, blkno);
+	if (!ConditionalLockBufferShared(buf))
+	{
 		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(buf, HASH_READ);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+		oldblkno = blkno;
+		retry = true;
+	}
+	else
+	{
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+	}
 
+	if (retry)
+	{
 		/*
-		 * If the previous iteration of this loop locked what is still the
-		 * correct target bucket, we are done.  Otherwise, drop any old lock
-		 * and lock what now appears to be the correct bucket.
+		 * Loop until we get a lock on the correct target bucket.  We get the
+		 * lock on primary bucket page and retain the pin on it during read
+		 * operation to prevent the concurrent splits.  Retaining pin on a
+		 * primary bucket page ensures that split can't happen as it needs to
+		 * acquire the cleanup lock on primary bucket page.  Acquiring lock on
+		 * primary bucket and rechecking if it is a target bucket is mandatory
+		 * as otherwise a concurrent split followed by vacuum could remove
+		 * tuples from the selected bucket which otherwise would have been
+		 * visible.
 		 */
-		if (retry)
+		for (;;)
 		{
+			/*
+			 * Compute the target bucket number, and convert to block number.
+			 */
+			bucket = _hash_hashkey2bucket(hashkey,
+										  metap->hashm_maxbucket,
+										  metap->hashm_highmask,
+										  metap->hashm_lowmask);
+
+			blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+			/* Release metapage lock, but keep pin. */
+			_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+			/*
+			 * If the previous iteration of this loop locked what is still the
+			 * correct target bucket, we are done.  Otherwise, drop any old
+			 * lock and lock what now appears to be the correct bucket.
+			 */
 			if (oldblkno == blkno)
 				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
-		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+			_hash_relbuf(rel, buf);
 
-		/*
-		 * Reacquire metapage lock and check that no bucket split has taken
-		 * place while we were awaiting the bucket lock.
-		 */
-		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
-		oldblkno = blkno;
-		retry = true;
+			/* Fetch the primary bucket page for the bucket */
+			buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
+			/*
+			 * Reacquire metapage lock and check that no bucket split has
+			 * taken place while we were awaiting the bucket lock.
+			 */
+			_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+			oldblkno = blkno;
+		}
 	}
 
 	/* done with the metapage */
@@ -237,14 +303,60 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	/* Update scan opaque state to show we have lock on the bucket */
 	so->hashso_bucket = bucket;
 	so->hashso_bucket_valid = true;
-	so->hashso_bucket_blkno = blkno;
 
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
 	page = BufferGetPage(buf);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(opaque->hasho_bucket == bucket);
 
+	so->hashso_bucket_buf = buf;
+
+	/*
+	 * If the bucket split is in progress, then we need to skip tuples that
+	 * are moved from old bucket.  To ensure that vacuum doesn't clean any
+	 * tuples from old or new buckets till this scan is in progress, maintain
+	 * a pin on both of the buckets.  Here, we have to be cautious about lock
+	 * ordering: first acquire the lock on the old bucket, release that lock
+	 * (but not the pin), then re-acquire the lock on the new bucket and
+	 * re-verify whether the bucket split is still in progress.  Acquiring the
+	 * lock on the old bucket first ensures that vacuum waits for this scan
+	 * to finish.
+	 */
+	if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+	{
+		BlockNumber old_blkno;
+		Buffer		old_buf;
+
+		old_blkno = _hash_get_oldblk(rel, opaque);
+
+		/*
+		 * release the lock on new bucket and re-acquire it after acquiring
+		 * the lock on old bucket.
+		 */
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+
+		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+
+		/*
+		 * remember the old bucket buffer so as to use it later for scanning.
+		 */
+		so->hashso_old_bucket_buf = old_buf;
+		_hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK);
+
+		_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		Assert(opaque->hasho_bucket == bucket);
+
+		if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+			so->hashso_skip_moved_tuples = true;
+		else
+		{
+			_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+			so->hashso_old_bucket_buf = InvalidBuffer;
+		}
+	}
+
 	/* If a backwards scan is requested, move to the end of the chain */
 	if (ScanDirectionIsBackward(dir))
 	{
@@ -273,6 +385,13 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
  *		false.  Else, return true and set the hashso_curpos for the
  *		scan to the right thing.
  *
+ *		Here we also scan the old bucket if the split for the current
+ *		bucket was in progress at the start of the scan.  The basic idea
+ *		is to skip the tuples that were moved by the split while scanning
+ *		the current bucket, and then scan the old bucket to cover all such
+ *		tuples.  This ensures that we don't miss any tuples in scans that
+ *		started during the split.
+ *
  *		'bufP' points to the current buffer, which is pinned and read-locked.
  *		On success exit, we have pin and read-lock on whichever page
  *		contains the right item; on failure, we have released all buffers.
@@ -338,6 +457,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum >= FirstOffsetNumber);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip tuples that were moved by a split operation,
+						 * since this scan started while that split was in
+						 * progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberNext(offnum);	/* move forward */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -353,9 +485,42 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							buf = so->hashso_old_bucket_buf;
+
+							/*
+							 * The old bucket buffer must be valid, as we
+							 * acquire the pin on it before the start of the
+							 * scan and retain it till the end of the scan.
+							 */
+							Assert(BufferIsValid(buf));
+
+							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+							page = BufferGetPage(buf);
+							opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * Setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for moved-by-split
+							 * tuples in the old bucket, and also that we
+							 * won't rescan the old bucket once its scan is
+							 * finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
@@ -379,6 +544,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum <= maxoff);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip tuples that were moved by a split operation,
+						 * since this scan started while that split was in
+						 * progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberPrev(offnum);	/* move back */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -394,9 +572,42 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							buf = so->hashso_old_bucket_buf;
+
+							/*
+							 * The old bucket buffer must be valid, as we
+							 * acquire the pin on it before the start of the
+							 * scan and retain it till the end of the scan.
+							 */
+							Assert(BufferIsValid(buf));
+
+							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+							page = BufferGetPage(buf);
+							opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * Setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for moved-by-split
+							 * tuples in the old bucket, and also that we
+							 * won't rescan the old bucket once its scan is
+							 * finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
@@ -410,9 +621,16 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 
 		if (itup == NULL)
 		{
-			/* we ran off the end of the bucket without finding a match */
+			/*
+			 * We ran off the end of the bucket without finding a match.
+			 * Release the pins on the bucket buffers.  Normally, such pins
+			 * are released at the end of the scan; however, scrolling cursors
+			 * can reacquire the bucket lock and pin multiple times within the
+			 * same scan.
+			 */
 			*bufP = so->hashso_curbuf = InvalidBuffer;
 			ItemPointerSetInvalid(current);
+			_hash_dropscanbuf(rel, so);
 			return false;
 		}
 
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 822862d..b5164d7 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -147,6 +147,23 @@ _hash_log2(uint32 num)
 }
 
 /*
+ * _hash_msb -- returns the position of the most significant set bit.
+ */
+static uint32
+_hash_msb(uint32 num)
+{
+	uint32		i = 0;
+
+	while (num)
+	{
+		num = num >> 1;
+		++i;
+	}
+
+	return i - 1;
+}
+
+/*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
  * If flags is not zero, it is a bitwise OR of the acceptable values of
@@ -352,3 +369,123 @@ _hash_binsearch_last(Page page, uint32 hash_value)
 
 	return lower;
 }
+
+/*
+ *	_hash_get_oldblk() -- get the block number of the bucket from which
+ *			the current bucket is being split.
+ */
+BlockNumber
+_hash_get_oldblk(Relation rel, HashPageOpaque opaque)
+{
+	Bucket		curr_bucket;
+	Bucket		old_bucket;
+	uint32		mask;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	/*
+	 * To get the old bucket from the current bucket, we need a mask to modulo
+	 * into lower half of table.  This mask is stored in meta page as
+	 * hashm_lowmask, but here we can't rely on the same, because we need a
+	 * value of lowmask that was prevalent at the time when bucket split was
+	 * started.  Masking the most significant bit of new bucket would give us
+	 * old bucket.
+	 */
+	curr_bucket = opaque->hasho_bucket;
+	mask = (((uint32) 1) << _hash_msb(curr_bucket)) - 1;
+	old_bucket = curr_bucket & mask;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	blkno = BUCKET_TO_BLKNO(metap, old_bucket);
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_newblk() -- get the block number of the new bucket that will
+ *			be generated after a split of the current bucket.
+ *
+ * This is used to find the new bucket from the old bucket based on the
+ * current table half.  It is mainly required to finish incomplete splits,
+ * where we are sure that no more than one split can be in progress from the
+ * old bucket.
+ */
+BlockNumber
+_hash_get_newblk(Relation rel, HashPageOpaque opaque)
+{
+	Bucket		curr_bucket;
+	Bucket		new_bucket;
+	uint32		lowmask;
+	uint32		mask;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	curr_bucket = opaque->hasho_bucket;
+
+	/*
+	 * The new bucket can be obtained by OR'ing the old bucket with the most
+	 * significant bit of the current table half.  Multiple buckets could
+	 * have split from the current bucket; we need the first such bucket
+	 * that exists based on the current table half.
+	 */
+	lowmask = metap->hashm_lowmask;
+
+	for (;;)
+	{
+		mask = lowmask + 1;
+		new_bucket = curr_bucket | mask;
+		if (new_bucket > metap->hashm_maxbucket)
+		{
+			lowmask = lowmask >> 1;
+			continue;
+		}
+		blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+		break;
+	}
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_newbucket() -- get the new bucket that will be generated after
+ *			a split of the current bucket.
+ *
+ * This is used to find the new bucket from the old bucket.  The new bucket
+ * can be obtained by OR'ing the old bucket with the most significant bit of
+ * the table half for the lowmask passed to this function.  Multiple buckets
+ * could have split from the current bucket; we need the first such bucket
+ * that exists.  The caller must ensure that no more than one split has
+ * happened from the old bucket.
+ */
+Bucket
+_hash_get_newbucket(Relation rel, Bucket curr_bucket,
+					uint32 lowmask, uint32 maxbucket)
+{
+	Bucket		new_bucket;
+	uint32		mask;
+
+	for (;;)
+	{
+		mask = lowmask + 1;
+		new_bucket = curr_bucket | mask;
+		if (new_bucket > maxbucket)
+		{
+			lowmask = lowmask >> 1;
+			continue;
+		}
+		break;
+	}
+
+	return new_bucket;
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 90804a3..3e5b1d2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3567,6 +3567,26 @@ ConditionalLockBuffer(Buffer buffer)
 }
 
 /*
+ * Acquire the content_lock for the buffer, but only if we don't have to wait.
+ *
+ * This assumes the caller wants BUFFER_LOCK_SHARED mode.
+ */
+bool
+ConditionalLockBufferShared(Buffer buffer)
+{
+	BufferDesc *buf;
+
+	Assert(BufferIsValid(buffer));
+	if (BufferIsLocal(buffer))
+		return true;			/* act as though we got it */
+
+	buf = GetBufferDescriptor(buffer - 1);
+
+	return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
+									LW_SHARED);
+}
+
+/*
  * LockBufferForCleanup - lock a buffer in preparation for deleting items
  *
  * Items may be deleted from a disk page only when the caller (a) holds an
@@ -3750,6 +3770,49 @@ ConditionalLockBufferForCleanup(Buffer buffer)
 	return false;
 }
 
+/*
+ * CheckBufferForCleanup - as above, but don't attempt to take lock
+ *
+ * We won't loop, but just check once to see if the pin count is OK.  If
+ * not, return FALSE.
+ */
+bool
+CheckBufferForCleanup(Buffer buffer)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(BufferIsValid(buffer));
+
+	if (BufferIsLocal(buffer))
+	{
+		/* There should be exactly one pin */
+		if (LocalRefCount[-buffer - 1] != 1)
+			return false;
+		/* Nobody else to wait for */
+		return true;
+	}
+
+	/* There should be exactly one local pin */
+	if (GetPrivateRefCount(buffer) != 1)
+		return false;
+
+	bufHdr = GetBufferDescriptor(buffer - 1);
+
+	buf_state = LockBufHdr(bufHdr);
+
+	Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+	if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+	{
+		/* pincount is OK. */
+		UnlockBufHdr(bufHdr, buf_state);
+		return true;
+	}
+
+	UnlockBufHdr(bufHdr, buf_state);
+	return false;
+}
+
 
 /*
  *	Functions for buffer I/O handling
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index d9df904..bbf822b 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -24,6 +24,7 @@
 #include "lib/stringinfo.h"
 #include "storage/bufmgr.h"
 #include "storage/lockdefs.h"
+#include "utils/hsearch.h"
 #include "utils/relcache.h"
 
 /*
@@ -32,6 +33,8 @@
  */
 typedef uint32 Bucket;
 
+#define InvalidBucket	((Bucket) 0xFFFFFFFF)
+
 #define BUCKET_TO_BLKNO(metap,B) \
 		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
 
@@ -51,6 +54,9 @@ typedef uint32 Bucket;
 #define LH_BUCKET_PAGE			(1 << 1)
 #define LH_BITMAP_PAGE			(1 << 2)
 #define LH_META_PAGE			(1 << 3)
+#define LH_BUCKET_NEW_PAGE_SPLIT	(1 << 4)
+#define LH_BUCKET_OLD_PAGE_SPLIT	(1 << 5)
+#define LH_BUCKET_PAGE_HAS_GARBAGE	(1 << 6)
 
 typedef struct HashPageOpaqueData
 {
@@ -63,6 +69,12 @@ typedef struct HashPageOpaqueData
 
 typedef HashPageOpaqueData *HashPageOpaque;
 
+#define H_HAS_GARBAGE(opaque)			((opaque)->hasho_flag & LH_BUCKET_PAGE_HAS_GARBAGE)
+#define H_OLD_INCOMPLETE_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
+#define H_NEW_INCOMPLETE_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+#define H_INCOMPLETE_SPLIT(opaque)		(((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT) || \
+										 ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT))
+
 /*
  * The page ID is for the convenience of pg_filedump and similar utilities,
  * which otherwise would have a hard time telling pages of different index
@@ -87,12 +99,6 @@ typedef struct HashScanOpaqueData
 	bool		hashso_bucket_valid;
 
 	/*
-	 * If we have a share lock on the bucket, we record it here.  When
-	 * hashso_bucket_blkno is zero, we have no such lock.
-	 */
-	BlockNumber hashso_bucket_blkno;
-
-	/*
 	 * We also want to remember which buffer we're currently examining in the
 	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
 	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
@@ -100,11 +106,23 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_curbuf;
 
+	/* remember the buffer associated with primary bucket */
+	Buffer		hashso_bucket_buf;
+
+	/*
+	 * remember the buffer associated with old primary bucket which is
+	 * required during the scan of the bucket for which split is in progress.
+	 */
+	Buffer		hashso_old_bucket_buf;
+
 	/* Current position of the scan, as an index TID */
 	ItemPointerData hashso_curpos;
 
 	/* Current position of the scan, as a heap TID */
 	ItemPointerData hashso_heappos;
+
+	/* Whether scan needs to skip tuples that are moved by split */
+	bool		hashso_skip_moved_tuples;
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -175,6 +193,8 @@ typedef HashMetaPageData *HashMetaPage;
 				  sizeof(ItemIdData) - \
 				  MAXALIGN(sizeof(HashPageOpaqueData)))
 
+#define INDEX_MOVED_BY_SPLIT_MASK	0x2000
+
 #define HASH_MIN_FILLFACTOR			10
 #define HASH_DEFAULT_FILLFACTOR		75
 
@@ -223,9 +243,6 @@ typedef HashMetaPageData *HashMetaPage;
 #define HASH_WRITE		BUFFER_LOCK_EXCLUSIVE
 #define HASH_NOLOCK		(-1)
 
-#define HASH_SHARE		ShareLock
-#define HASH_EXCLUSIVE	ExclusiveLock
-
 /*
  *	Strategy number. There's only one valid strategy for hashing: equality.
  */
@@ -298,21 +315,21 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
 
 /* hashovfl.c */
-extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf);
+extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
 extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
-				   BufferAccessStrategy bstrategy);
+				   BlockNumber bucket_blkno, BufferAccessStrategy bstrategy);
 extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
 				 BlockNumber blkno, ForkNumber forkNum);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy);
 
 /* hashpage.c */
-extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
-extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
-extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
 extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
 			 int access, int flags);
+extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
+								   BlockNumber blkno, int flags);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
@@ -321,6 +338,7 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 						   BufferAccessStrategy bstrategy);
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
+extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
 extern void _hash_wrtbuf(Relation rel, Buffer buf);
 extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
 				   int to_access);
@@ -328,6 +346,9 @@ extern uint32 _hash_metapinit(Relation rel, double num_tuples,
 				ForkNumber forkNum);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
+extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
+				   Buffer nbuf, uint32 maxbucket, uint32 highmask,
+				   uint32 lowmask);
 
 /* hashscan.c */
 extern void _hash_regscan(IndexScanDesc scan);
@@ -363,5 +384,17 @@ extern bool _hash_convert_tuple(Relation index,
 					Datum *index_values, bool *index_isnull);
 extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
 extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
+extern BlockNumber _hash_get_oldblk(Relation rel, HashPageOpaque opaque);
+extern BlockNumber _hash_get_newblk(Relation rel, HashPageOpaque opaque);
+extern Bucket _hash_get_newbucket(Relation rel, Bucket curr_bucket,
+					uint32 lowmask, uint32 maxbucket);
+
+/* hash.c */
+extern void hashbucketcleanup(Relation rel, Buffer bucket_buf,
+				  BlockNumber bucket_blkno, BufferAccessStrategy bstrategy,
+				  uint32 maxbucket, uint32 highmask, uint32 lowmask,
+				  double *tuples_removed, double *num_index_tuples,
+				  bool bucket_has_garbage, bool delay,
+				  IndexBulkDeleteCallback callback, void *callback_state);
 
 #endif   /* HASH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7b6ba96..accbb88 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -225,8 +225,10 @@ extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
 extern void UnlockBuffers(void);
 extern void LockBuffer(Buffer buffer, int mode);
 extern bool ConditionalLockBuffer(Buffer buffer);
+extern bool ConditionalLockBufferShared(Buffer buffer);
 extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
+extern bool CheckBufferForCleanup(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern void AbortBufferIO(void);
#27Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Amit Kapila (#25)
Re: Hash Indexes

On 09/12/2016 10:42 PM, Amit Kapila wrote:

The following script hangs on idx_val creation - just with v5, WAL patch
not applied.

Are you sure it is actually hanging? I see 100% cpu for a few minutes but
the index eventually completes ok for me (v5 patch applied to today's
master).

It completed for me as well. The second index creation is taking more
time and cpu, because it is just inserting duplicate values which need a
lot of overflow pages.

Yeah, sorry for the false alarm. It just took 3m45s to complete on my
machine.

Best regards,
Jesper


#28Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#22)
Re: Hash Indexes

On Thu, Sep 8, 2016 at 12:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Hmm. I think page or block is a concept of database systems and
buckets is a general concept used in hashing technology. I think the
difference is that there are primary buckets and overflow buckets. I
have checked how they are referred in one of the wiki pages [1],
search for overflow on that wiki page. Now, I think we shouldn't be
inconsistent in using them. I will change to make it same if I find
any inconsistency based on what you or other people think is the
better way to refer overflow space.

In the existing source code, the terminology 'overflow page' is
clearly preferred to 'overflow bucket'.

[rhaas pgsql]$ git grep 'overflow page' | wc -l
75
[rhaas pgsql]$ git grep 'overflow bucket' | wc -l
1

In our off-list conversations, I too have found it very confusing when
you've made reference to an overflow bucket. A hash table has a fixed
number of buckets, and depending on the type of hash table the storage
for each bucket may be linked together into some kind of a chain;
here, a chain of pages. The 'bucket' logically refers to all of the
entries that have hash codes such that (hc % nbuckets) == bucketno,
regardless of which pages contain them.
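
To make that concrete, here is a standalone sketch of the mapping
(simplified types and arbitrary example values; the server does this in
_hash_hashkey2bucket() with masks taken from the metapage):

#include <stdio.h>
#include <stdint.h>

typedef uint32_t Bucket;

/*
 * Sketch of the hashkey-to-bucket mapping: highmask and lowmask are the
 * (2^k - 1) masks for the current and previous table halves, so the net
 * effect is the hash code modulo the number of buckets allocated so far.
 */
static Bucket
hashkey2bucket(uint32_t hashkey, uint32_t maxbucket,
               uint32_t highmask, uint32_t lowmask)
{
	Bucket		bucket = hashkey & highmask;

	/* past the last allocated bucket: fall back to the lower table half */
	if (bucket > maxbucket)
		bucket &= lowmask;

	return bucket;
}

int
main(void)
{
	/* six buckets allocated (0..5): highmask = 7, lowmask = 3 */
	printf("%u\n", hashkey2bucket(46, 5, 7, 3));	/* 46 & 7 = 6 > 5, so 6 & 3 = 2 */
	return 0;
}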

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#29Jeff Janes
jeff.janes@gmail.com
In reply to: Amit Kapila (#22)
Re: Hash Indexes

On Wed, Sep 7, 2016 at 9:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Sep 7, 2016 at 11:49 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Thu, Sep 1, 2016 at 8:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have fixed all other issues you have raised. Updated patch is
attached with this mail.

I am finding the comments (particularly README) quite hard to follow.
There are many references to an "overflow bucket", or similar phrases. I
think these should be "overflow pages". A bucket is a conceptual thing
consisting of a primary page for that bucket and zero or more overflow
pages for the same bucket. There are no overflow buckets, unless you are
referring to the new bucket to which things are being moved.

Hmm. I think page or block is a concept of database systems and
buckets is a general concept used in hashing technology. I think the
difference is that there are primary buckets and overflow buckets. I
have checked how they are referred in one of the wiki pages [1],
search for overflow on that wiki page.

That page seems to use "slot" to refer to the primary bucket/page and all
the overflow buckets/pages which cover the same post-masked values. I
don't think that would be an improvement for us, because "slot" is already
pretty well-used for other things. Their use of "bucket" does seem to be
mostly the same as "page" (or maybe "buffer" or "block"?) but I don't think
we gain anything from creating yet another synonym for page/buffer/block.
I think the easiest thing would be to keep using the meanings which the
existing committed code uses, so that we at least have internal consistency.

Now, I think we shouldn't be
inconsistent in using them. I will change to make it same if I find
any inconsistency based on what you or other people think is the
better way to refer overflow space.

I think just "overflow page" or "buffer containing the overflow page".

Here are some more notes I've taken, mostly about the README and comments.

It took me a while to understand that once a tuple is marked as moved by
split, it stays that way forever. It doesn't mean "recently moved by
split", but "ever moved by split". Which works, but is rather subtle.
Perhaps this deserves a parenthetical comment in the README the first time
the flag is mentioned.
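
For reference, here is the flag handling in question, condensed from the
patch hunks into a standalone sketch (uint16_t stands in for
IndexTupleData.t_info, whose low 13 bits carry the tuple size; the
constants are the ones the patch defines):

#include <stdint.h>

#define INDEX_SIZE_MASK				0x1FFF	/* low 13 bits: tuple size */
#define INDEX_MOVED_BY_SPLIT_MASK	0x2000	/* the previously unused bit */

/* What the split does when it copies a tuple into the new bucket. */
void
mark_moved_by_split(uint16_t *t_info)
{
	uint16_t	size = *t_info & INDEX_SIZE_MASK;	/* preserve the size bits */

	*t_info &= ~INDEX_SIZE_MASK;
	*t_info |= INDEX_MOVED_BY_SPLIT_MASK;	/* set once, never cleared */
	*t_info |= size;
}

/* What a scan of the new bucket tests while a split is in progress. */
int
moved_by_split(uint16_t t_info)
{
	return (t_info & INDEX_MOVED_BY_SPLIT_MASK) != 0;
}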

========

#define INDEX_SIZE_MASK 0x1FFF
/* bit 0x2000 is not used at present */

This is no longer true, maybe:
/* bit 0x2000 is reserved for index-AM specific usage */

========

Note that this is designed to allow concurrent splits and scans. If a
split occurs, tuples relocated into the new bucket will be visited twice
by the scan, but that does no harm. As we are releasing the locks during
scan of a bucket, it will allow concurrent scan to start on a bucket and
ensures that scan will always be behind cleanup.

Above, the abrupt transition from splits (first sentence) to cleanup is
confusing. If the cleanup referred to is vacuuming, it should be a new
paragraph or at least have a transition sentence. Or is it referring to
clean-up locks used for control purposes, rather than for actual vacuum
clean-up? I think it is the first one, the vacuum. (I find the committed
version of this comment confusing as well--how in the committed code would
a tuple be visited twice, and why does that not do harm in the committed
coding? So maybe the issue here is me, not the comment.)

=======

+Vacuum acquires cleanup lock on bucket to remove the dead tuples and or
tuples
+that are moved due to split.  The need for cleanup lock to remove dead
tuples
+is to ensure that scans' returns correct results.  Scan that returns
multiple
+tuples from the same bucket page always restart the scan from the previous
+offset number from which it has returned last tuple.

Perhaps it would be better to teach scans to restart anywhere on the page,
than to force more cleanup locks to be taken?

=======
This comment no longer seems accurate (as long as it is just an ERROR and
not a PANIC):

* XXX we have a problem here if we fail to get space for a
* new overflow page: we'll error out leaving the bucket
split
* only partially complete, meaning the index is corrupt,
* since searches may fail to find entries they should find.

The split will still be marked as being in progress, so any scanner will
have to scan the old page and see the tuple there.

========
in _hash_splitbucket comments, this needs updating:

* The caller must hold exclusive locks on both buckets to ensure that
* no one else is trying to access them (see README).

The true prereq here is a buffer clean up lock (pin plus exclusive buffer
content lock), correct?

And then:

* Split needs to hold pin on primary bucket pages of both old and new
* buckets till end of operation.

'retain' is probably better than 'hold', to emphasize that we are dropping
the buffer content lock part of the clean-up lock, but that the pin part of
it is kept continuously (this also matches the variable name used in the
code). Also, the paragraph after that one seems to be obsolete and
contradictory with the newly added comments.

===========

/*
* Acquiring cleanup lock to clear the split-in-progress flag ensures
that
* there is no pending scan that has seen the flag after it is cleared.
*/

But, we are not acquiring a clean up lock. We already have a pin, and we
do acquire a write buffer-content lock, but don't observe that our pin is
the only one. I don't see why it is necessary to have a clean up lock
(what harm is done if a under-way scan thinks it is scanning a bucket that
is being split when it actually just finished the split?), but if it is
necessary then I think this code is wrong. If not necessary, the comment
is wrong.

Also, why must we hold a write lock on both old and new primary bucket
pages simultaneously? Is this in anticipation of the WAL patch? The
contract for the function does say that it returns both pages write locked,
but I don't see a reason for that part of the contract at the moment.
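
To spell out the distinction I mean, a sketch only (take_write_lock and
take_cleanup_lock are hypothetical wrappers, buf is a buffer we already
hold a pin on, and the calls are ones this patch already uses):

#include "postgres.h"
#include "storage/bufmgr.h"

/* What the code takes when clearing the flag: an exclusive content lock.
 * Other backends may still hold pins on the buffer. */
void
take_write_lock(Buffer buf)
{
	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
}

/* What a true cleanup lock additionally guarantees: it waits until our
 * pin is the only one remaining before returning with the exclusive lock. */
void
take_cleanup_lock(Buffer buf)
{
	LockBufferForCleanup(buf);
}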

=========

To avoid deadlock between readers and inserters, whenever there is a need
to lock multiple buckets, we always take in the order suggested in Locking
Definitions above. This algorithm allows them a very high degree of
concurrency.

The section referred to is actually spelled "Lock Definitions", no "ing".

The Lock Definitions sections doesn't mention the meta page at all. I
think there needs be something added to it about how the meta page gets
locked and why that is deadlock free. (But we could be optimistic and
assume the patch to implement caching of the metapage will go in and will
take care of that).

=========

And an operational question on this: A lot of stuff is done conditionally
here. Under high concurrency, do splits ever actually occur? It seems
like they could easily be permanently starved.
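
What I have in mind is that the split path now looks roughly like this (a
hypothetical sketch, not the patch's exact code; try_to_split is a made-up
name):

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/* Sketch: a split attempt that gives up instead of waiting. */
static void
try_to_split(Relation rel, BlockNumber blkno)
{
	Buffer		buf = ReadBuffer(rel, blkno);

	/* proceed only if the cleanup lock is free right now */
	if (!ConditionalLockBufferForCleanup(buf))
	{
		ReleaseBuffer(buf);
		return;					/* any concurrent scan pin defers the split */
	}

	/* ... perform the split while holding the cleanup lock ... */

	UnlockReleaseBuffer(buf);
}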

Cheers,

Jeff

#30Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Amit Kapila (#26)
Re: Hash Indexes

On 09/13/2016 07:26 AM, Amit Kapila wrote:

Attached, new version of patch which contains the fix for problem
reported on write-ahead-log of hash index thread [1].

I have been testing patch in various scenarios, and it has a positive
performance impact in some cases.

This is especially seen in cases where the values of the indexed column
are unique - SELECTs can see a 40-60% benefit over a similar query using
b-tree. UPDATE also sees an improvement.

In cases where the indexed column value isn't unique, it takes a long
time to build the index due to the overflow page creation.

Also in cases where the index column is updated with a high number of
clients, ala

-- ddl.sql --
CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val;
CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id);
CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val);
ANALYZE;

-- test.sql --
\set id random(1,10)
\set val random(0,10)
BEGIN;
UPDATE test SET val = :val WHERE id = :id;
COMMIT;

w/ 100 clients - it takes longer than the b-tree counterpart (2921 tps
for hash, and 10062 tps for b-tree).

Jeff mentioned upthread the idea of moving the lock to a bucket meta
page instead of having it on the main meta page. Likely a question for
the assigned committer.

Thanks for working on this !

Best regards,
Jesper


#31Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#28)
Re: Hash Indexes

On Tue, Sep 13, 2016 at 5:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 8, 2016 at 12:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Hmm. I think page or block is a concept of database systems and
buckets is a general concept used in hashing technology. I think the
difference is that there are primary buckets and overflow buckets. I
have checked how they are referred in one of the wiki pages [1],
search for overflow on that wiki page. Now, I think we shouldn't be
inconsistent in using them. I will change to make it same if I find
any inconsistency based on what you or other people think is the
better way to refer overflow space.

In the existing source code, the terminology 'overflow page' is
clearly preferred to 'overflow bucket'.

Okay, point taken. Will update it in next version of patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#32Amit Kapila
amit.kapila16@gmail.com
In reply to: Jesper Pedersen (#30)
Re: Hash Indexes

On Wed, Sep 14, 2016 at 12:29 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

On 09/13/2016 07:26 AM, Amit Kapila wrote:

Attached, new version of patch which contains the fix for problem
reported on write-ahead-log of hash index thread [1].

I have been testing patch in various scenarios, and it has a positive
performance impact in some cases.

This is especially seen in cases where the values of the indexed column are
unique - SELECTs can see a 40-60% benefit over a similar query using b-tree.

Here, I think it is better if we have the data comparing the situation
of hash index with respect to HEAD as well. What I mean to say is
that you are claiming that after the hash index improvements SELECT
workload is 40-60% better, but where do we stand as of HEAD?

UPDATE also sees an improvement.

Can you explain this more? Is the improvement relative to HEAD or
relative to btree? Isn't this contradictory to what the test in the
mail below shows?

In cases where the indexed column value isn't unique, it takes a long time
to build the index due to the overflow page creation.

Also in cases where the index column is updated with a high number of
clients, ala

-- ddl.sql --
CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val;
CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id);
CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val);
ANALYZE;

-- test.sql --
\set id random(1,10)
\set val random(0,10)
BEGIN;
UPDATE test SET val = :val WHERE id = :id;
COMMIT;

w/ 100 clients - it takes longer than the b-tree counterpart (2921 tps for
hash, and 10062 tps for b-tree).

Thanks for doing the tests. Have you applied both the concurrent-index
and cache-the-meta-page patches for these tests? So from the above
tests, we can say that after this set of patches read-only workloads
will be significantly improved, even better than btree in quite a few
useful cases. However, when the indexed column is updated, there is
still a large gap compared to btree (what about the case when the
indexed column is not updated in a read-write transaction, as in our
pgbench read-write transactions; did you by any chance run any such
test?). I think we need to focus on improving cases where index columns
are updated, but it is better to do that work as a separate patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#33Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Amit Kapila (#32)
3 attachment(s)
Re: Hash Indexes

Hi,

On 09/14/2016 07:24 AM, Amit Kapila wrote:

On Wed, Sep 14, 2016 at 12:29 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

On 09/13/2016 07:26 AM, Amit Kapila wrote:

Attached, new version of patch which contains the fix for problem
reported on write-ahead-log of hash index thread [1].

I have been testing patch in various scenarios, and it has a positive
performance impact in some cases.

This is especially seen in cases where the values of the indexed column are
unique - SELECTs can see a 40-60% benefit over a similar query using b-tree.

Here, I think it is better if we have the data comparing the situation
of hash index with respect to HEAD as well. What I mean to say is
that you are claiming that after the hash index improvements SELECT
workload is 40-60% better, but where do we stand as of HEAD?

The tests I have done are with a copy of a production database, running
the same queries first with a b-tree index for the primary key and then
with a hash index. Those see the mentioned 40-60% speed-up in execution
time - some involve JOINs.

Largest of those tables is 390Mb with a CHAR() based primary key.

UPDATE also sees an improvement.

Can you explain this more? Is the improvement relative to HEAD or
relative to btree? Isn't this contradictory to what the test in the
mail below shows?

Same thing here - where the fields involving the hash index aren't updated.

In cases where the indexed column value isn't unique, it takes a long time
to build the index due to the overflow page creation.

Also in cases where the index column is updated with a high number of
clients, ala

-- ddl.sql --
CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val;
CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id);
CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val);
ANALYZE;

-- test.sql --
\set id random(1,10)
\set val random(0,10)
BEGIN;
UPDATE test SET val = :val WHERE id = :id;
COMMIT;

w/ 100 clients - it takes longer than the b-tree counterpart (2921 tps for
hash, and 10062 tps for b-tree).

Thanks for doing the tests. Have you applied both the concurrent-index
and cache-the-meta-page patches for these tests? So from the above
tests, we can say that after this set of patches read-only workloads
will be significantly improved, even better than btree in quite a few
useful cases.

Agreed.

However, when the indexed column is updated, there is still a large gap
compared to btree (what about the case when the indexed column is not
updated in a read-write transaction, as in our pgbench read-write
transactions; did you by any chance run any such test?).

I have done a run to look at the concurrency / TPS aspect of the
implementation - to try something different than Mark's work on testing
the pgbench setup.

With definitions as above, with SELECT as

-- select.sql --
\set id random(1,10)
BEGIN;
SELECT * FROM test WHERE id = :id;
COMMIT;

and UPDATE/Indexed with an index on 'val', and finally UPDATE/Nonindexed
w/o one.

[1]/messages/by-id/CAA4eK1+ERbP+7mdKkAhJZWQ_dTdkocbpt7LSWFwCQvUHBXzkmA@mail.gmail.com
on master. btree is master too.

Machine is a 28C/56T with 256Gb RAM with 2 x RAID10 SSD for data + wal.
Clients ran with -M prepared.

[1]: /messages/by-id/CAA4eK1+ERbP+7mdKkAhJZWQ_dTdkocbpt7LSWFwCQvUHBXzkmA@mail.gmail.com
[2]: /messages/by-id/CAD__OujvYghFX_XVkgRcJH4VcEbfJNSxySd9x=1Wp5VyLvkf8Q@mail.gmail.com
[3]: /messages/by-id/CAA4eK1JUYr_aB7BxFnSg5+JQhiwgkLKgAcFK9bfD4MLfFK6Oqw@mail.gmail.com

Don't know if you find this useful due to the small number of rows, but
let me know if there are other tests I can run, f.ex. bump the number of
rows.

I
think we need to focus on improving cases where index columns are
updated, but it is better to do that work as a separate patch.

Ok.

Best regards,
Jesper

Attachments:

select.pngimage/png; name=select.pngDownload
�PNG


IHDR^Tz0�*	pHYs����}�D
IDATx���	\���m�"D<	������xDIG�fV��)b���)�G�����x"��i)��"�%������&��
>�W����w�����|���CS*�2�?�t7�Z�r�r�r�r�r�r�r�r�r�r�r�r�r�r�r�r�r�r�M�n�����g�����8��o������e2�/��T*�V��������O�_�������w�0a�������{��!� ""���^�z���#I.>}����ss��)//�1c�NsB���
		��kWpp���b����?���o=zt����o�@@��M@�gd����b��=�L�bii��O?����\�vm�����_711�*�����9r�DcK7�N �@ t���N�L�N�:�
��y����Y.���;���@44I&''��+W�3FKK��������-k�h��M0o�<�D���S\\<l�0����o��V�un�����mnn���y����A44�����������+W�$&&�?���:22����Z��6�!�I����	��d$�l�}��W"������9sf+��H��>0`@K?@;�hh����O������+:w��
���;�=z����M��� ���r�������?���>����}yK			IKK��}{+�� � 00p�������iaaaBB����/^�
�>r�H///2w\�z��	��555566v��
l��	D#@899=zt����������������6m�8q�l��M��M����
[�l!����_r�\33��c�^�r����O� ���j/�����6�Bc6m%��Y��k�(q�� � � � � � � � � � � � � � � � � � G��1<<|��)R������O�g���e�2�p8^^^III����������/@����Hq��]�+$����,--k������BCC����_�vK�����cC�=���B������j���111l6���������f��R�h			'KY�D�������mmm���caaA�fffd`nn���N���"�[��			FFF�{��]trr�1c������K}}}�9B�|>_WW�����Rk6����F��U����[��w�^j�d��/�zzz$��������+�(�O@ggg�[�*����Qc&�9z�h2�LNN4h���H$:::��$#������333����R���N%a��/(>����F�$e��3gN�:u��iAAA���T���5::����,]\\�R�vKu�Q��������5j���T������#$$���!""B)Eh�� eSF���&99��
l6;..N�Eh�� Z�@�@�@�@�@�@�@�@��)C�d0���U�$�t�5���n<���&T�q����fr�x��QG�w����K�y;�jJ5._O�>��C]������
}��w�F�+�����1�>Yt7�(o}�S6��-/c��7&*�kYQzz}��������s���w������tRe;o��PE"�X��P�F���iN�c��TY�Nd������6V]�������������e���u�"�H,Tx��q5���	��|���['������
��h�*v��w���NES������)��&�`*�r�������%"���:WX&���JO(T_+����VPJ�T�\+����\PK���
Y��Ie�E������LW�����kBQ*e�I�@�#x�g��?=�����H��"�H���V���t^{��P�h�2e���T���+))���>22�����������zs��^�.�x�O��(�LR^�I�}�Q5�����YAP&aT��������������w7�E&��//���*DyB���Ax^���Z�g��������.�g0��*=T:I"����v%((���!***44488800��������_��}y��;�2U���eYg�$�����y��g�im�%R���td��ijI*���N'��"Q=�X�)���{f��%i����G1u�Q*a9������&�
���c��
:�����JlllLL��vsssww�b�%�m�X"z\������s.��wE$y�ei��]Sma�H\)aT_�{ZumZ�S�di7�_K�%K��:��&�z,����e�|)��E��%�D�����II��]s?R����J��*}Fu���q}�&��LX�S5����j2���/kP���5���!%���S`1u��-���3����;�(�0uLV����4��y����\:J���I.�5�b���*�!!!���d)�ddd�������yzzz����7��&��ML}z](�q����3!���k����K%U�%H��/y��m����z�FnT|��D�D�_�J���cI|#��i���l�]����cR}$��5z�1.�[A�u�+D�7�Gu�1!!����w����|>_WW������\�m�H�����_����[�*x���B^��j�T ,a�I[�K���E)C5w��Wr�
�V����=��rK���F��U����[����G��,y<���~�ebcc�����:���r��	��V�[^ZFf��!����X�p�15�R��l\'-=��'yr���L�xB[���,M
&S���fj����L=m
j ���EwN"�����R��R����D"~��g.�ScM�;
�Xt��Y��:,q��n�x����$���T~�@��jWB�X,���H,��f�"�P(j��H�E����x�5f2��G�&�HKK���,kk���L+++���(�8;;���\�_������9oe_��q��i.���
��1��;�I�J��P�������4tYde
{�\�_��z�+�~Y,��`<��|^�=��8�nU4>c�u.����
F
����.%9�];0r�6��1�H���} ���=�����5���&�!���%��E�Q��p��F-FY+72���6H.�����������d����rE�PP���o����r/���U
�O��"a3C!�@��k�����w�X{��oZv�T�e�[��Fr���n����,L3b����h�I.3�8�C~��Eo��e�z����������v����i��,���R�S����~��O�N�"i����������GHH���CDDD�T��55������������B~���THm#������k����~�o�3l���Cu�����o����>|�������ZD�\�B*$i���d�(������4��t���F���`��qqquVh�"�*K���K�>.����v��B�U�0�U^�+J����md���Ao�=|��Q���z���U}��Qx�#R'����G��Qa�$�cr�L����E
�@��f�T(��P_�����y)��Vz���gw3
��=~�����]V�����T �0Y���X�am�8j�H���wmj��{$���`���������;�r�m�@���oox�:	�SX�Oj�?�y�?���YjvNW�]����<Fy�qI�V�/����������z����o���P-��?k��A4(��p��|���f������N'3B������O�ymf����N������������5����-G��6��Ao� ��|*��@�d�hb�?v��Y�$w�����n�.C�������R3CK#_����P��>�L��F}��w�s��u�m���b�_���dR�J�7\cT��������Q<f7=i� �M�"�T���!��v��]��
�?�m��~�60n�'��!����P�h�e\xKX��'�H�$�R�����,f]���6�~�n�m�wihjA4(�z6�W��4�4�M;�����!c�FM��e`+v/�hh%,-�D(�j?��������L�p����!�H�l=]������l^�e���(-��B��M�hP�J�xJ����P���B��(C�����_%�t�M�hh�2�`��C��/`����u�l�fV�����IG-������C44KnQ���<���O%�����0�Q�ZsDfM:��"4�����O�>����k�SRI�N�;�����F���Y9�r�k��M�hxEW��8�m(�����:	K���]c=S����*�I��@��n4>x�������}����}���CI���1>>�Za���[�l!�������dooihh��"@��\��������K���VG��!���Coc�|��3g�tss;}�txx��3n��������4KK��k988DEE���6���������P�mRN��m�M�V�9����6J���������������GI4�=�dlllLL��&Q���Ne[3�/������F
��!�p�J������_3}���@YT7I"��X,��{��1c�"�FOO���/����������322�������<==�Z��E���R����ND�!�'.�9���`��A��z/9G1���F���f��=��NNN3f����[�t�����#GH�������Yr�5��Y�C O
�����7��XP���1������5X����D��I����0�?���\��w/U_�d��G===od�������R��������"�a|�$��c.���EIq���L��q��A��N���;P2���5k��y������S���O���
�b�$	��*��������������RJQ����N��b�"�ayE�/�����G���������1_���-���:T7�=*�I:�������*��9�$��i��������ktt4Y�,]\\�R����d��`��I>���:�
0������n4���c���+V�x�����*����b@@��Q�v��M���=<<BBB"""�R ������������eT���p��K�{���1hQ��}��MJJ�S���INN�Sd��qqq�-���1s�����+�T���W��������S�h�Q��;?X���xeY��mG����s��Z�6��P����>����"�K�.3N�^6g�}@+A4<W�S��'o/���*�o'������/��5h=�F�U;������Tp'�f'
-������;��@�B4T)�	<C�U<][~�CU:���>���-�}
�U;���]���S�Qs�@�n��lL��������Fh��r��o�)~�{*�*���=��AG#Z�� �]K�7��Ho������4F����?��u�mh�h�����k[�����&�����g,��b��m��h�vj������Wt9_"�v^d��Y�9q6�}���-=������k�RI�n�Z�k~�r��	U�
���T��o���/�UNU�:����Hz��h�v������^y7�������i���Z��=O'�g�Fh/�=�v�GO��W�]IU�Z��762��1P5��<����q�F��}�o�>t�PR�p8^^^III���������-T�6����W�x�_|�/Q�w��C�������1PA��3g�tss;}�txx��3n��E�AAAQQQ���������-T������]�����v��������b'
PH�����CKK�������*��������l�����T��D�)C�K|��+�r8��4fM�a���t��Ku��$"Y�����w�3�*fdd�������yzzz��
�+W��z����i��44�V~�k��[����H������gBBu�����V��,�\n�A�q�O�4���g�J�3L��l^u��A����<U�F�P������2�����$#K����O��E�����]),��(=�{�%�z�~�g��;�|�cNfIN&����P�h\�f���������S���O---������333���Z�(��\�)$�Au\ytb�����O+����a�{����M����@���[u�n4=zT,�t���?Utuu���&E�tqqi�"����;v_�1'��_(�*o
~wc���m���F��;�O��b���_���������GHH���CDDD�A���_�\p�Vtv|���f'
�q��X�����sPA��'�o��IIIu�l6;..���^��E�N���r9�|��_���������#���R�hh������S��O�M,���6Ge�4�/���xo�[��h������s_��-������a����&���-D#�������k���%Ew+��~G�_����Loc������b��I	��%��e��a�k��wo<���9����C4��)�,^����I9	^��*Z[����hG}�8��j&�������s3�v�(��I��m���viij���
�FP'7�$lL��(�(�<G���E�/~\���d����FPuI�G7%�������[����r�98R!�9*�����}�5�-@��hUt���2^���R�xg�lK�1�;�6�)M��].aT�"SC'd�~�w\�m�D#����~���r�z���H���8,�UN��V��5��!������	m�T��pY���#���T"��\Z���]{����=����e�FP]C����nR�4�O7��SLM���-�*-uo���f%G$�u}�8��-@��h�#��k_�����3j�Z�hwT7=z���s���>}����
>�����f���e�2�p8^^^III����������/]����|���j�T�X�J+��V1���*��EM�&S*��)�;�������T7g��1n��������o��Y�����KKK�����fPP���CTTThhhppp```��@�"n��3>�
o���p0���#KGP}���t���T7���'O����������GI4�Y3666&&��f������S���"�����!g���>�QV�n�d��Y���n��p���{����?�������=M�u��]9t���t�����1bDff&�M.^���~������s�<������N�:511q��a�622"7WX���F���5 Sr�*Sc�$)/_�lkk�g�R���033#ss���tj�f����q"���<>7�RIy_V���w����6�����F<=]������x'��nk7{W��<}��^���$�_������7����+Wd���eyy�D"),,$IF8�D���+Ipn���EE�n4RRRR���Gfu�E'''�����n������G�!E>����KdI>�Pk6��)��o�_]!�
s�s��BY����b�n�5oh��y������������l��Y�6o�\�Z����\���G.�;w���;;v$���H^T��JG����]\\~�����k�L�w�^j�d��/���&�F�<O___)E���~c
��
$�_e��x�+9���ba��mS�����P��<�t��/x���
,�5�Bz.�EW5|C���u*���Uuc����7���+5(--���;5���i�H#�����gg���P2S����
�b���\����������������RJQ��P�B�(�/�+(�������"7W�s�#J����^�~���:;�@M)�_��6�B[�����I�UTTD������� 5����q�j���300�}��"�T7,X0e��>��vq���S�N�6mZPP���#Utuu������%K2�TJZZ^iF��i9%i�i��+�RI�o,��2�.&��4������������#G���={�|����������T�{��5k�,\�0""b������{Q�F����J�+W��.����)y��O�0j����wSW���{xx���888�[)�-�N�����**Kk1\G��5$��A���4�H$��������������?~�������^?00����LIjn����"�T7%I����
��cml6;..N�Eh9�"~K�^$�'��f�1���b��������
�]i�~�/���YVV�����[�����e�6�> �����c�����"�T7���2��X��zI�$���_�����v��562��=
�Z	_X�����3O�����e5������~����Go{2�Fh
����g��?��+�$��5��{�}3���,�*�-.����3>�9�����%Rq��L���s7Or�Iwwu!�eQG����S��R����vy���	tw��r�������M�r������S����������o���o={�T�C�:��'��/����l�������X�^o����('���+///����O������O@@�����C����^D����R��I�7�fc���vW��w��=�('����d�N�8agg7l�02^�j��5vSk��#����,,g�	J����������t��=��)��F


�<v���q��
����
�� �������!����>�������r���������;w>|���U�
�d�������uq'����_r+����{��������p����@1)C���"��O��rd'}l��+'7l�0y�d>��u���^{�T�_�~���9��=8� rG�"�DT��}v��z?����Z�������@���g��s��
�#-�����)))�+�����?�r���dG�#���J9�k���S�M�b��nCo{���'	^�KW#�~�]��_H�n�=��<���5�������������O�hd������znn���S�
v��a2w������700(//733{��I������v����#�	kc�Xd��������n�:///R������o>|��W�M�6�?�<�h����,+++2I=z��n�����R�����w�i�H�{�S����[Y�t�[z�h������I7�������i�/�Fmm����[�n��:y����?�s��8p��C�V�\�x�bWd�DV���O�?^QQq��	ww����#F(�EFu��{NKK;}�4�^*����������/��2==�Q����!C����7o�?�����7�r����{>>>w�������s'�h����';��'��/�,Q��v�B�G�w`���4&�)
.\H��~����'U?w�	2/$AEB�TH9r�D#�DRK�B�/�s�HD���fO�8����*����'O&�C�I������W�n�J��q��I����mS;r4�_��|���>}����Q�,��!�9))���>22������������V�����H.�*j6Fu���{���N�Notn�k��:E����d�E
��X"��%G��n��N�s������|�d*��{wj���C�$&�.]J��g�������
F�d>�a������Q}�+�='$$,X� %%����V;v�X`` ��=zl��Q�h��7�r�x�#��:�\\�l�R�s�����#��}����5�:McPPy�QQQ��������h�"�����-��|nNe��%a���|~�b���-��;`x��7��aj�
�E������?:����w�L%e2�<x���+������lll�x��!��M;!���Y������������w��A�Q$2���333���r��D7�E2 K�)@)�I��,��dz���Gccccbb���������k�"4�:�T*)I���U���<Z[[w��_'�i��@�+�b�u������{��Gl������?w���;U��|��ud��G���7<eT�_�~}���6m"����C�#G�$�L��[ZZ6i����~�%�����#F���fffd`nnN���BEx�����[��?�nd�y��?����=h����m�*gzC���8�;ZXXl���*N�4���w���dLbl��y$ �z�$q��ghhH��d�Hf�dJ�|��9s�|��g�z���}{3�WZ4������&���B^;2��.��|]]]2 K.��rEP��������]���yI�e5_�[�zc���=��Fkw�4������y�����;VgM������{��!7��mmm�������c��G2?~|�}_t��PN463�^����...����P@U���H��%�����o��Ll��/��2qA|��2I��'�N���	�zo���.��}�:��(�r�����Q(''���944���IV���������������j����N��b�b���-���\Iqe�('�XX^�1��{S����6�����37��w�2-4�j���q��S�L�������FGG����%�P�\j;� jG�B�T���\�H����THu�1""��X+W��.����)������GHH���Y���%�@y~8��a���RI���Z:Kl�d�P����W�,�p�[6��
E`�:\��Q

:��)�v0N�m���Q����L����SS3���ib~Ri����Q{v��y���,���@�Q�/T�F����7��2O,������
�������|��Qz�hQ�F��������$|a�8;�HPZ�1����~Z�KGG���Z�J���/��o{�Z�`H�����_��^}����,���@](����=����w#��j�%�����?�K�e��y�%Rq��LCCs����~Aw��do���kd��v1�v����]?/��+U�h�*����S��>����S����R�:t0^�����mZ����k�#��M����loo�������&<<�W�^��MIIqww�����kW���b�����������n�:�������~������n��i����x���E�eeeYYYm��u���/���A4#�4��S
��RF����5�v��sc���V��m^M����OR_�IG�	�ik��%$����u
�<.����{~N��>����!!!uv'�I�&��7o���
wB����[iii�O��?>�ms��!w������/���:	�W_}u���!C���$���?����������_:�S)�H����������Z�Dr������+KH<z�����y�Tx���v4�~�����L�n��a``���K&�uVNJJ��c����?o���"���,���8q��G������98����egd�����C��t��U����Dc�?jGb��!r�9�����_�G���]W����Z(,,����+Y�g���������N��=�������/			,HII��x���;8l��=zl������E75��vJ��F��6��F2�
���1*@[���owj`��;�e�n�r�����\��������H.6p�:�255%�H�pdI�uV644,..&����WxdR�f�WWW&�I��T�w��d&�o�>oo����W�� ��Jw���o�V�t�<��{�T*����bi�������t7J�����H8*�v"�=��O��zW���3��7�����i��E��]�������C�%16g��_��E�m@�~����+�C�|}��������#������,--�9AT��^q����0��bn������U����1���~������|[��,)dL���w����v��m������{�kW�Z����n���;w���J��&$�2�7�L=I��<���b���$k?���^�zm����
���-c���G�<����S'��q�����!e<�V������l�}c`��o��)��������=z�HHH�S��}?h��{��Qc.������a_6���-**��dz���?�����SRRs�W�hlJH.Vp;Nx��(Is/�T<�S�Z�~s�����X��#��������g�x4FFF�Iz������������l��������+))���������_l4�"�>�9�Y�w�:���'���T���G��
�1��~�y���~��RU�hLLL�y�f���XZZ���e�bPP���CTTThhhppp```��m�@Q�=���l
2e�*�b���b�fc���O��o��f�Zxe*��
�n4^�z����U�je����f��������lkf���/�`2�uj\�[�{�D"|>_� ���T��F���_�"�FOO���/��������������3�������S�j~�-)�OeO4��	�^-�Jj>�u�fW��7���&��F����f��agg�t�R__�#G��"�����%��m���b�(�@�@S����AO�@5���{�R�%K��~q���#�F�<O___)E�����m(,��t����e"&�!����q��&O���Y4&''4��bI$�H22++���:33���J)Eg�����8�_T572/�=��N��/���D�|TUj�3g��:u��i����d�#ruu������%K�������N�0Y��E�N�=�u�����(��1����(���P����M�>=  `��Q�w��V������		qpp��'��E��^x����L��$��Q�y�n��������f����2���&��R�hT�����Mrrr�"�����SnQM�p���KY������J�������1d��V�/����7M=�:��Fh����}(e��>�=�\���Y�
]W;���f�va����Fq��G����t^��j����u��3&�]��IP������;�&d�=��%��-����?���;���D���	��L�I�T���"�Q=a4��'l�YS�tw�~��M �w��r���\,�r�[�_�����'���%D�I�zU����$�H�k��������+D���H�Kb>�����S�s�S���f;�����5��Pc�F�$eHW�}������I�H�U�hb�5tu�Y����hTK�O.L)<���������l_�w�~t�����g��e�s"x���b��*��L��;m��
�[h�j������w�\|r�HZ��l��!��z�Iwkm�Q�����)�����s�RaU.t4
]��u�[h;�j#����VW��D(!����$�����5�6���Sbw_�'(d�)�rQ�;t��7����@[�������J$Y���xyy%%%����k


[��R�>J�~�WP*�:[$��E=]��A1�Fwkm��Fcbb���7���1((���!***44488800�����N����>�U>9[,���E]����5��[h�T7�^�JBk��U�����111l6����������(�����?�r�O��ybR���#�E�!��n
��R�h����_���033#ss�����+��'��?�DXQ��t��:�4�W/�j�.���e��
��|]]]2 K.��rE����?�&�s��>��5+�=�����85�F===�dd������[�([�
�E%���d�H%d�(���EK���J
�-���f�hii���emm���iee�rEgg�:N��J���
w����H.�S��������������Wu�Y4���FGG�������K��R)�}�w"���u�y.��~7r���n42���j/������j��(�B$���I9��'g��e5��"���=i�
��Q�h��G#��f����B��I��o�{s�7�OJ�s���q����M��5��Eu��]�J%�EM{Zz)�LQe��Q=W^���y�gt��� �'eH8�M6'����JNM..�v�$��t��!��������<9[\Y$dT��������/�v
�H�5q+���{r��_(�*s��p�����3D#����x5+,�<G���|�r��|z�h���������5����|U�=c�����v�Fz�\?x���9��d����e3���0��8u;n���r
�y5�8�c�S~��+� [��M���y.z2w����v2��Vu���
���_zZ�]IU>�x���W����hl=�s�]y���r~���\��h�w_���+���J2
->������Y|�2����Y/;�:�Dck�/��������Uqu���w���*���
�fG8��.�E�	S��c�X�6
![Vye��{?,���,�&������#T��E���c||<5�={��-[����xyy%%%���GFF6��|o���&�)I�����'�������������&1���fiiY������b�����;?�O���AM.���[���E�f�I4�)��������l777www*��Yl&�T��n��[��\�b?lB��pMM���9�(��FOO���/����������322�������<==�Z�����2�_����q��nUn7n��?��t�����,���f��agg�t�R__�#G��"�����%��rkfi�,��9e�6e������������	�>?}��+>gh]j�{���K�,������G��,y<����R�2���u*$�e��^���T���\���-�������s���������,���
�b�$��N���$#������333����R|5+c~���k��r���o�x��^�,g��9u��i��9::REWW���h___�tqqQJ����|�����5����n���z_��j�aaa��O5j�����������GHH���CDD�R�M�#~��s��)�.��������;��3��Y4���$''�)�����8��$��������)�.������$���O���E�
����]1_>��C]43��{s������.��r����f>�ZL]���O������
����.��k�����J�/�����9��mBs[�<��Wt#��O{&\+�.v2��K�I�n�v��h|9��@������'D{|xy��]�gTO�:u��~��Iw:�%A4���pY���RF�m��$�w�
����r��z�����!e��G����?11�/},��T"�E�{�^��J�hl,����������Jj���n��=Loc�\��&xv�6S�)�EBK��hk�M#K_��3DchjJ�tw-��XT.ZV��b�c~��2(���F����(�s���K������� ����$>�j,�5444M��?�+���@���������dooihhX���bn���5��w���O[�Mh%�FFPP���CTTThhhppp```������������V�Z��*�bbb�l������{�h�v�����033#ss���t���!|>_WW���������)��Da�D#COO��#Y�x<}}�:����X��o����>t����s�[��F���eVV���uff�������WW���h___�tqq��������������Aw;@3D#��f������
D#�Dc����otNt���94�@�@�@�@��	"##?��S�DBw#�U������v�=����v�Z�>}����N��������
�g���e�=*��I�x��Lf��R�����y#��*����\}��m���o��I��P
~��)U��3��w���}����5+99�Q�.--��������I�x�e��m���?���k�������
;W��|[�hl��W��c�V�����R��Z�����{���:::���~~~T���066����R��Z���������K�.QU�5o��[�^����{��B46����.6�'�����g��1b5&&�_���/�����������_Ha�j��S��_?e��=zPU�5o��[�^����{��B4�/
��R������7�|��.:99��1���n������G���;�6�.�yEE��}��_�.���k^���Z]^�z���*Dc����)U����]\\~�����S��{�R�%K����/
�T��<66��w�122�U��5�M�K�.������*Dc��'����qvv

%��e����A��X,�D���Cc{
P��Z�����G?�����x�kS�R��������
������),X0e��>��vq���S�N�6mZPP���#]�5La�j��7n�X�lY��Z���)|����W��|[�hl,��_�@���P��Z���4F�]�r%u155�|��>}z@@��Q�v��Mk�/��I�x��'O��6����k�������
;W��|[�hl,���:6�'�TxPjg/U��I�x�����:U~���V��_a����o��r�r�r�r�r�r�r�r�r�r�r�r�r�r�r��.�<yr���w��e��NNN�W������t�����������			����h�������~�iXX���KAA���k���8m�+��:55�Y-�*A4B��|���?��������_����Ycnn���S�
v��a###�H��[7r���'�l�������csrr��n������h����,++��[��=�5�!(���[�n�����
���;d��C��\�r���$�455KKKo�������'O��'���g�����w�=q����������?�(��@�C4B����S����:w��v����������P(000���===���k��$J��yUkn�@D#�}��w������l�
�����������;w&Kmmm�DR�&��6lX�=6n���������Fh�F�q��2�k�
����{d�������{��aaa���������l������-\�������>�������;w^�v����/��{���f�r��������;w�Ek������u��m���~~~nnnd�ZN	j�m������q��E��������S��r�L�l\g����@2w�����m[�y��~��q8������3���>���������\� �]�P�~]�.�������c���l,��f���)))J���h��h��h��h��h��h��h��h��h��h��h��h��h��h��h��h��h��h��h��h��h��V~:[�
IEND�B`�
update-indexed.png (image/png)
update-nonindexed.png (image/png)
#34Jeff Janes
jeff.janes@gmail.com
In reply to: Jeff Janes (#29)
Re: Hash Indexes

On Tue, Sep 13, 2016 at 9:31 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

=======

+Vacuum acquires cleanup lock on bucket to remove the dead tuples and or
tuples
+that are moved due to split.  The need for cleanup lock to remove dead
tuples
+is to ensure that scans' returns correct results.  Scan that returns
multiple
+tuples from the same bucket page always restart the scan from the previous
+offset number from which it has returned last tuple.

Perhaps it would be better to teach scans to restart anywhere on the page,
than to force more cleanup locks to be taken?

Commenting on one of my own questions:

This won't work when vacuum removes the tuple that an existing scan is
currently examining, which the scan would otherwise use to re-find its
position when it realizes the tuple is not visible and resumes the scan.

The index tuples in a page are stored sorted just by hash value, not by the
combination of (hash value, tid). If they were sorted by both, we could
re-find our position even if the tuple had been removed, because we would
know to start at the slot adjacent to where the missing tuple would be were
it not removed. But unless we are willing to break pg_upgrade, there is no
feasible way to change that now.
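
(Purely as an illustration of that ordering, using existing helpers
rather than anything proposed in the patch, a (hash value, heap TID)
comparator would look roughly like this:)

#include "postgres.h"
#include "access/hash.h"
#include "access/itup.h"
#include "storage/itemptr.h"

/* Sketch only: compare index tuples by (hash value, heap TID). */
static int
hash_itup_cmp(IndexTuple a, IndexTuple b)
{
    uint32  ha = _hash_get_indextuple_hashkey(a);
    uint32  hb = _hash_get_indextuple_hashkey(b);

    if (ha != hb)
        return (ha < hb) ? -1 : 1;
    /* break ties by heap TID, so a removed tuple's slot stays findable */
    return ItemPointerCompare(&a->t_tid, &b->t_tid);
}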

Cheers,

Jeff

#35Jeff Janes
jeff.janes@gmail.com
In reply to: Amit Kapila (#1)
Re: Hash Indexes

On Tue, May 10, 2016 at 5:09 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Although, I don't think it is a very good idea to take any performance
data with WIP patch, still I couldn't resist myself from doing so and below
are the performance numbers. To get the performance data, I have dropped
the primary key constraint on pgbench_accounts and created a hash index on
aid column as below.

alter table pgbench_accounts drop constraint pgbench_accounts_pkey;
create index pgbench_accounts_pkey on pgbench_accounts using hash(aid);

To be rigorously fair, you should probably replace the btree primary key
with a non-unique btree index and use that in the btree comparison case. I
don't know how much difference that would make, probably none at all for a
read-only case.

Below data is for read-only pgbench test and is a median of 3 5-min runs.
The performance tests are executed on a power-8 m/c.

With pgbench -S where everything fits in shared_buffers and the number of
cores I have at my disposal, I am mostly benchmarking interprocess
communication between pgbench and the backend. I am impressed that you can
detect any difference at all.

For this type of thing, I like to create a server side function for use in
benchmarking:

create or replace function pgbench_query(scale integer, size integer)
RETURNS integer AS $$
DECLARE
    sum integer default 0;
    amount integer;
    account_id integer;
BEGIN
    FOR i IN 1..size LOOP
        account_id := 1 + floor(random() * scale);
        SELECT abalance into strict amount
          FROM pgbench_accounts
         WHERE aid = account_id;
        sum := sum + amount;
    END LOOP;
    return sum;
END $$ LANGUAGE plpgsql;

And then run using a command like this:

pgbench -f <(echo 'select pgbench_query(40,1000)') -c$j -j$j -T 300

Where the first argument ('40', here) must be manually set to the same
value as the scale-factor.

With 8 cores and 8 clients, the values I get are, for btree, hash-head,
hash-concurrent, hash-concurrent-cache, respectively:

598.2
577.4
668.7
664.6

(each transaction involves 1000 select statements)

So I do see that the concurrency patch is quite an improvement. The
cache patch does not produce a further improvement, which was somewhat
surprising to me (I thought that patch would really shine in a
read-write workload, but I expected at least some improvement in
read-only as well).

I've run this with 128MB shared_buffers and scale factor 40. Not
everything fits in shared_buffers, but it quite easily fits in RAM, and there is no
meaningful IO caused by the benchmark.

Cheers,

Jeff

#36Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Janes (#29)
1 attachment(s)
Re: Hash Indexes

On Tue, Sep 13, 2016 at 10:01 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Wed, Sep 7, 2016 at 9:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Now, I think we shouldn't be
inconsistent in using them. I will change to make it same if I find
any inconsistency based on what you or other people think is the
better way to refer overflow space.

I think just "overflow page" or "buffer containing the overflow page".

Okay, changed to "overflow page".

Here are some more notes I've taken, mostly about the README and comments.

It took me a while to understand that once a tuple is marked as moved by
split, it stays that way forever. It doesn't mean "recently moved by
split", but "ever moved by split". Which works, but is rather subtle.
Perhaps this deserves a parenthetical comment in the README the first time
the flag is mentioned.

I have added an additional paragraph explaining the moved-by-split
flag along with the explanation of the split operation.

========

#define INDEX_SIZE_MASK 0x1FFF
/* bit 0x2000 is not used at present */

This is no longer true, maybe:
/* bit 0x2000 is reserved for index-AM specific usage */

Changed as per suggestion.
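
(For concreteness, a minimal sketch of how such an AM-specific bit is
typically tested on an index tuple; the mask name below is an assumed
placeholder for the 0x2000 bit discussed above, not necessarily the
identifier the patch uses.)

#include "postgres.h"
#include "access/itup.h"

/* Assumed name for the AM-specific 0x2000 bit discussed above. */
#define INDEX_MOVED_BY_SPLIT_MASK   0x2000

/*
 * Sketch only: does this index tuple carry the moved-by-split bit in its
 * t_info word (which also holds INDEX_SIZE_MASK and the null/varwidth
 * flags)?
 */
static inline bool
itup_moved_by_split(IndexTuple itup)
{
    return (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK) != 0;
}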

========

Note that this is designed to allow concurrent splits and scans. If a
split occurs, tuples relocated into the new bucket will be visited twice
by the scan, but that does no harm. As we are releasing the locks during
scan of a bucket, it will allow concurrent scan to start on a bucket and
ensures that scan will always be behind cleanup.

Above, the abrupt transition from splits (first sentence) to cleanup is
confusing. If the cleanup referred to is vacuuming, it should be a new
paragraph or at least have a transition sentence. Or is it referring to
clean-up locks used for control purposes, rather than for actual vacuum
clean-up? I think it is the first one, the vacuum.

Yes, it is the first one (the vacuum).

(I find the committed
version of this comment confusing as well--how in the committed code would a
tuple be visited twice, and why does that not do harm in the committed
coding? So maybe the issue here is me, not the comment.)

You have to read this scan as the scan during vacuum. Whatever is
written in the committed code is right; let me try to explain with an
example. Suppose there are two buckets at the start of vacuum. After
it completes the vacuuming of the first bucket, and before or during
the vacuum of the second bucket, a split of the first bucket occurs.
Now we have three buckets. If you look at the code (hashbulkdelete),
after completing the vacuum of the first and second buckets, it will
perform the vacuum of the third bucket as well if there has been a
split. This is why the README mentions that tuples relocated into the
new bucket will be visited twice.
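
(To make the shape of that loop concrete, here is a rough outline; it
is not the actual hashbulkdelete code, and the callbacks stand in for
reading the metapage and vacuuming one bucket.)

#include "postgres.h"

/*
 * Outline only: vacuum every bucket that existed when we started, then
 * check whether a concurrent split added buckets; if so, go around again
 * for the new ones.  That extra pass is how tuples relocated into a new
 * bucket end up being visited twice.
 */
static void
bulkdelete_outline(uint32 initial_maxbucket,
                   uint32 (*current_maxbucket) (void),
                   void (*vacuum_bucket) (uint32 bucket))
{
    uint32      cur_bucket = 0;
    uint32      cur_maxbucket = initial_maxbucket;

loop_top:
    while (cur_bucket <= cur_maxbucket)
    {
        vacuum_bucket(cur_bucket);
        cur_bucket++;
    }

    if (current_maxbucket() != cur_maxbucket)
    {
        /* a split added buckets while we worked; vacuum those too */
        cur_maxbucket = current_maxbucket();
        goto loop_top;
    }
}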

This whole explanation is in the garbage-collection section, so to me
it looks clear. However, I have changed some wording; see if it makes
sense to you now.

=======

+Vacuum acquires cleanup lock on bucket to remove the dead tuples and or
tuples
+that are moved due to split.  The need for cleanup lock to remove dead
tuples
+is to ensure that scans' returns correct results.  Scan that returns
multiple
+tuples from the same bucket page always restart the scan from the previous
+offset number from which it has returned last tuple.

Perhaps it would be better to teach scans to restart anywhere on the page,
than to force more cleanup locks to be taken?

Yeah, we can do that by making hash index scans work a page at a
time, as we do for btree scans. However, as mentioned earlier, this
is on my todo list and I think it is better to do it as a separate
patch based on this work. Do you think that's reasonable, or do you
have some strong reason why we should consider it as part of this
patch?
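
(For reference, the essence of the btree-style page-at-a-time approach
is to copy all matching heap TIDs into backend-local memory while the
buffer content lock is held, so the scan never has to re-find its
position on the page afterwards. A rough sketch follows; the struct
and function names are illustrative, not part of the patch.)

#include "postgres.h"
#include "access/hash.h"
#include "access/itup.h"
#include "storage/bufmgr.h"

/* Illustrative only: backend-local copy of one page's matches. */
typedef struct HashScanPosData
{
    BlockNumber currPage;
    int         nitems;
    ItemPointerData heapTids[MaxIndexTuplesPerPage];
} HashScanPosData;

/*
 * Sketch: with the buffer content lock held on buf, remember every tuple
 * whose hash key matches.  Afterwards the lock can be dropped and the
 * scan works purely from *pos, so a concurrent vacuum cannot invalidate
 * its position on the page.
 */
static void
hash_capture_page(Buffer buf, uint32 hashkey, HashScanPosData *pos)
{
    Page        page = BufferGetPage(buf);
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
    OffsetNumber off;

    pos->currPage = BufferGetBlockNumber(buf);
    pos->nitems = 0;

    for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
    {
        IndexTuple  itup = (IndexTuple) PageGetItem(page,
                                                    PageGetItemId(page, off));

        if (_hash_get_indextuple_hashkey(itup) == hashkey)
            pos->heapTids[pos->nitems++] = itup->t_tid;
    }
}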

=======
This comment no longer seems accurate (as long as it is just an ERROR and
not a PANIC):

* XXX we have a problem here if we fail to get space for a
* new overflow page: we'll error out leaving the bucket
split
* only partially complete, meaning the index is corrupt,
* since searches may fail to find entries they should find.

The split will still be marked as being in progress, so any scanner will
have to scan the old page and see the tuple there.

I have removed that part of the comment. I think in the PANIC case
the hash index will be corrupt anyway, so we might not need to mention
anything about it.

========
in _hash_splitbucket comments, this needs updating:

* The caller must hold exclusive locks on both buckets to ensure that
* no one else is trying to access them (see README).

The true prereq here is a buffer clean up lock (pin plus exclusive buffer
content lock), correct?

Right, and I have changed it accordingly.
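
(For concreteness, acquiring such a cleanup lock conditionally looks
roughly like the sketch below; this is an illustration of the
pin-plus-exclusive-content-lock definition, not code from the patch.)

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Sketch only: try to take a buffer cleanup lock (our pin plus an
 * exclusive content lock, granted only when no other backend holds a pin)
 * on the given block.  On failure the pin is dropped and false returned,
 * mirroring how the conditional split/squeeze paths simply give up.
 */
static bool
try_cleanup_lock(Relation rel, BlockNumber blkno, Buffer *bufp)
{
    Buffer      buf = ReadBuffer(rel, blkno);   /* acquires a pin */

    if (ConditionalLockBufferForCleanup(buf))
    {
        *bufp = buf;
        return true;
    }

    ReleaseBuffer(buf);
    return false;
}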

And then:

* Split needs to hold pin on primary bucket pages of both old and new
* buckets till end of operation.

'retain' is probably better than 'hold', to emphasize that we are dropping
the buffer content lock part of the clean-up lock, but that the pin part of
it is kept continuously (this also matches the variable name used in the
code).

Okay, changed to retain.

Also, the paragraph after that one seems to be obsolete and
contradictory with the newly added comments.

Are you talking about:
* In addition, the caller must have created the new bucket's base page,
..

If yes, then I think that is valid. That paragraph mainly highlights
two points. The first is that the new bucket's base page should be
pinned and write-locked before calling this API, and both will be
released in this API. The second is that we must do _hash_getnewbuf()
before releasing the metapage write lock. Both points still seem to be
valid.

===========

/*
* Acquiring cleanup lock to clear the split-in-progress flag ensures
that
* there is no pending scan that has seen the flag after it is cleared.
*/

But, we are not acquiring a clean up lock. We already have a pin, and we do
acquire a write buffer-content lock, but don't observe that our pin is the
only one. I don't see why it is necessary to have a clean up lock (what
harm is done if an under-way scan thinks it is scanning a bucket that is
being split when it actually just finished the split?), but if it is
necessary then I think this code is wrong. If not necessary, the comment is
wrong.

The comment is wrong and I have removed it. It is a remnant of a
previous idea which I wanted to try, but I found problems in it and
didn't pursue it.

Also, why must we hold a write lock on both old and new primary bucket pages
simultaneously? Is this in anticipation of the WAL patch?

Yes, clearing the flag on both buckets needs to be an atomic
operation. Apart from that, it is not good to write two different WAL
records (one for clearing the flag on the old bucket and another on
the new bucket).

The contract for
the function does say that it returns both pages write locked, but I don't
see a reason for that part of the contract at the moment.

Just refer to its usage in the _hash_finish_split() cleanup flow. The
reason is that we need to retain the lock on one of the buckets,
depending on the case.

=========

To avoid deadlock between readers and inserters, whenever there is a need
to lock multiple buckets, we always take in the order suggested in
Locking
Definitions above. This algorithm allows them a very high degree of
concurrency.

The section referred to is actually spelled "Lock Definitions", no "ing".

The Lock Definitions sections doesn't mention the meta page at all.

Okay, changed.

I think
there needs to be something added to it about how the meta page gets locked and
why that is deadlock free. (But we could be optimistic and assume the patch
to implement caching of the metapage will go in and will take care of that).

I don't think caching the meta page will eliminate the need to lock
the meta page. However, this patch has not changed anything relevant
in meta page locking that could impact deadlock detection. I have
thought about it, but I am not sure what more to write beyond what is
already mentioned about the meta page at different places in the
README. Let me know if you have something specific in mind.

=========

And an operational question on this: A lot of stuff is done conditionally
here. Under high concurrency, do splits ever actually occur? It seems like
they could easily be permanently starved.

Maybe, but the situation won't be worse than what we have in head.
Even under high concurrency, it can arise only if there is always a
reader on the bucket before we try to split. A point to note here is
that once the split has started, concurrent readers are allowed, which
was not the case previously. I think the same argument applies to
other places where readers and writers contend for the same lock, for
example ProcArrayLock. In such cases readers can theoretically starve
writers forever, but in practice such situations are rare.

Apart from fixing the above review comments, I have fixed the issue
reported by Ashutosh Sharma [1]/messages/by-id/CAA4eK1+fMUpJoAp5MXKRSv9193JXn25qtG+ZrYUwb4dUuqmHrA@mail.gmail.com.

Many thanks Jeff for the detailed review.

[1]: /messages/by-id/CAA4eK1+fMUpJoAp5MXKRSv9193JXn25qtG+ZrYUwb4dUuqmHrA@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

concurrent_hash_index_v7.patch (application/octet-stream)
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index c1122b4..d02539a 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -400,7 +400,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
 	Buffer		buf;
 	Page		page;
 
-	_hash_getlock(rel, blkno, HASH_SHARE);
 	buf = _hash_getbuf_with_strategy(rel, blkno, HASH_READ, 0, bstrategy);
 	page = BufferGetPage(buf);
 
@@ -431,7 +430,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
 	}
 
 	_hash_relbuf(rel, buf);
-	_hash_droplock(rel, blkno, HASH_SHARE);
 }
 
 /*
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 0a7da89..8d815f0 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -125,49 +125,47 @@ the initially created buckets.
 
 Lock Definitions
 ----------------
-
-We use both lmgr locks ("heavyweight" locks) and buffer context locks
-(LWLocks) to control access to a hash index.  lmgr locks are needed for
-long-term locking since there is a (small) risk of deadlock, which we must
-be able to detect.  Buffer context locks are used for short-term access
-control to individual pages of the index.
-
-LockPage(rel, page), where page is the page number of a hash bucket page,
-represents the right to split or compact an individual bucket.  A process
-splitting a bucket must exclusive-lock both old and new halves of the
-bucket until it is done.  A process doing VACUUM must exclusive-lock the
-bucket it is currently purging tuples from.  Processes doing scans or
-insertions must share-lock the bucket they are scanning or inserting into.
-(It is okay to allow concurrent scans and insertions.)
-
-The lmgr lock IDs corresponding to overflow pages are currently unused.
-These are available for possible future refinements.  LockPage(rel, 0)
-is also currently undefined (it was previously used to represent the right
-to modify the hash-code-to-bucket mapping, but it is no longer needed for
-that purpose).
-
-Note that these lock definitions are conceptually distinct from any sort
-of lock on the pages whose numbers they share.  A process must also obtain
-read or write buffer lock on the metapage or bucket page before accessing
-said page.
-
-Processes performing hash index scans must hold share lock on the bucket
-they are scanning throughout the scan.  This seems to be essential, since
-there is no reasonable way for a scan to cope with its bucket being split
-underneath it.  This creates a possibility of deadlock external to the
-hash index code, since a process holding one of these locks could block
-waiting for an unrelated lock held by another process.  If that process
-then does something that requires exclusive lock on the bucket, we have
-deadlock.  Therefore the bucket locks must be lmgr locks so that deadlock
-can be detected and recovered from.
-
-Processes must obtain read (share) buffer context lock on any hash index
-page while reading it, and write (exclusive) lock while modifying it.
-To prevent deadlock we enforce these coding rules: no buffer lock may be
-held long term (across index AM calls), nor may any buffer lock be held
-while waiting for an lmgr lock, nor may more than one buffer lock
-be held at a time by any one process.  (The third restriction is probably
-stronger than necessary, but it makes the proof of no deadlock obvious.)
+We use buffer content locks (LWLocks) and buffer pins to control access to
+a hash index.  We will refer to buffer content locks as locks in the
+following paragraphs.
+
+Scan will take a lock in shared mode on the primary bucket or on one of the
+overflow page.  Inserts will acquire exclusive lock on the primary bucket or
+on the overflow page in which it has to insert.  Both operations release
+the lock on previous bucket or overflow page before moving to the next overflow
+page.  They will retain a pin on primary bucket till end of operation. Split
+operation must acquire cleanup lock on both old and new halves of the bucket
+and mark split-in-progress on both the buckets.  The cleanup lock at the start
+of split ensures that parallel insert won't get lost.  Consider a case where
+insertion has to add a tuple on some intermediate overflow page in the bucket
+chain, if we allow split when insertion is in progress, split might not move
+this newly inserted tuple.  Like inserts and scans, it releases the lock
+on previous bucket or overflow page before moving to the next overflow page
+both for old bucket or for new bucket.  After partitioning the tuples between
+old and new buckets, it again needs to acquire exclusive lock on both old and
+new buckets to clear the split-in-progress flag.  Like inserts and scans, it
+will also retain pins on both the old and new primary buckets till end of split
+operation, although we can do without that as well.
+
+Vacuum acquires cleanup lock on bucket to remove the dead tuples and or tuples
+that are moved due to split.  The need for cleanup lock to remove dead tuples
+is to ensure that scans' returns correct results.  Scan that returns multiple
+tuples from the same bucket page always restart the scan from the previous
+offset number from which it has returned last tuple.  If we allow vacuum to
+remove the dead tuples with just an exclusive lock, it could remove the tuple
+required to resume the scan.  The need for cleanup lock to remove the tuples
+that are moved by split is to ensure that there is no pending scan that has
+started after the start of split and before the finish of split on bucket.
+If we don't do that, then vacuum can remove tuples that are required by such
+a scan.  We don't need to retain this cleanup lock during whole vacuum
+operation on bucket.  We release the lock as we move ahead in the bucket
+chain.  In the end, for squeeze-phase, we conditionally acquire cleanup lock
+and if we don't get, then we just abandon the squeeze phase.
+
+To avoid deadlocks, we must be consistent about the lock order in which we
+lock the buckets for operations that requires locks on two different buckets.
+We use the rule "first lock the old bucket and then new bucket, basically
+lock the lower-numbered bucket first".
 
 
 Pseudocode Algorithms
@@ -188,63 +186,104 @@ track of available overflow pages.
 The reader algorithm is:
 
 	pin meta page and take buffer content lock in shared mode
-	loop:
-		compute bucket number for target hash key
-		release meta page buffer content lock
-		if (correct bucket page is already locked)
-			break
-		release any existing bucket page lock (if a concurrent split happened)
-		take heavyweight bucket lock
-		retake meta page buffer content lock in shared mode
+	compute bucket number for target hash key
+	read and pin the primary bucket page
+	conditionally get the buffer content lock in shared mode on primary bucket page for search
+	if we didn't get the lock (need to wait for lock)
+		release the buffer content lock on meta page
+		acquire buffer content lock on primary bucket page in shared mode
+		acquire the buffer content lock in shared mode on meta page
+		to check for possibility of split, we need to recompute the bucket and
+		verify, if it is a correct bucket; set the retry flag
+	else if we get the lock, then we can skip the retry path
+	if (retry)
+		loop:
+			compute bucket number for target hash key
+			release meta page buffer content lock
+			if (correct bucket page is already locked)
+				break
+			release any existing content lock on bucket page (if a concurrent split happened)
+			pin primary bucket page and take shared buffer content lock
+			retake meta page buffer content lock in shared mode
 -- then, per read request:
 	release pin on metapage
-	read current page of bucket and take shared buffer content lock
-		step to next page if necessary (no chaining of locks)
+	if the split is in progress for current bucket and this is a new bucket
+		release the buffer content lock on current bucket page
+		pin and acquire the buffer content lock on old bucket in shared mode
+		release the buffer content lock on old bucket, but not pin
+		retake the buffer content lock on new bucket
+		mark the scan such that it skips the tuples that are marked as moved by split
+	step to next page if necessary (no chaining of locks)
+		if the scan indicates moved by split, then move to old bucket after the scan
+		of current bucket is finished
 	get tuple
 	release buffer content lock and pin on current page
 -- at scan shutdown:
-	release bucket share-lock
-
-We can't hold the metapage lock while acquiring a lock on the target bucket,
-because that might result in an undetected deadlock (lwlocks do not participate
-in deadlock detection).  Instead, we relock the metapage after acquiring the
-bucket page lock and check whether the bucket has been split.  If not, we're
-done.  If so, we release our previously-acquired lock and repeat the process
-using the new bucket number.  Holding the bucket sharelock for
+	release any pin we hold on current buffer, old bucket buffer, new bucket buffer
+
+We don't want to hold the meta page lock if we have to wait for acquiring the
+content lock on bucket page, because that might result in poor concurrency.
+Instead, we relock the metapage after acquiring the bucket page content lock
+and check whether the bucket has been split.  If not, we're done.  If so, we
+release our previously-acquired content lock, but not pin and repeat the
+process using the new bucket number.  Holding the buffer pin on bucket page for
 the remainder of the scan prevents the reader's current-tuple pointer from
-being invalidated by splits or compactions.  Notice that the reader's lock
+being invalidated by splits or compactions.  Notice that the reader's pin
 does not prevent other buckets from being split or compacted.
 
 To keep concurrency reasonably good, we require readers to cope with
 concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the page sharelock.  Since
-deletion is not possible while a reader holds the bucket sharelock, and
-we assume that heap tuple TIDs are unique, this can be implemented by
+their current scan position after re-acquiring the buffer content lock on
+page.  Since deletion is not possible while a reader holds the pin on bucket,
+and we assume that heap tuple TIDs are unique, this can be implemented by
 searching for the same heap tuple TID previously returned.  Insertion does
 not move index entries across pages, so the previously-returned index entry
 should always be on the same page, at the same or higher offset number,
 as it was before.
 
+To allow scan during bucket split, if at the start of the scan, bucket is
+marked as split-in-progress, it scan all the tuples in that bucket except for
+those that are marked as moved-by-split.  Once it finishes the scan of all the
+tuples in the current bucket, it scans the old bucket from which this bucket
+is formed by split.  This happens only for the new half bucket.
+
 The insertion algorithm is rather similar:
 
 	pin meta page and take buffer content lock in shared mode
-	loop:
-		compute bucket number for target hash key
-		release meta page buffer content lock
-		if (correct bucket page is already locked)
-			break
-		release any existing bucket page lock (if a concurrent split happened)
-		take heavyweight bucket lock in shared mode
-		retake meta page buffer content lock in shared mode
--- (so far same as reader)
+	compute bucket number for target hash key
+	read and pin the primary bucket page
+	conditionally get the buffer content lock in exclusive mode on primary bucket page for search
+	if we didn't get the lock (need to wait for lock)
+		release the buffer content lock on meta page
+		acquire buffer content lock on primary bucket page in exclusive mode
+		acquire the buffer content lock in shared mode on meta page
+		to check for possibility of split, we need to recompute the bucket and
+		verify, if it is a correct bucket; set the retry flag
+	else if we get the lock, then we can skip the retry path
+	if (retry)
+		loop:
+			compute bucket number for target hash key
+			release meta page buffer content lock
+			if (correct bucket page is already locked)
+				break
+			release any existing content lock on bucket page (if a concurrent split happened)
+			pin primary bucket page and take exclusive buffer content lock
+			retake meta page buffer content lock in shared mode
+-- (so far same as reader, except for acquisition of buffer content lock in
+	exclusive mode on primary bucket page)
 	release pin on metapage
-	pin current page of bucket and take exclusive buffer content lock
-	if full, release, read/exclusive-lock next page; repeat as needed
+	if the split-in-progress flag is set for bucket in old half of split
+	and pin count on it is one, then finish the split
+		we already have a buffer content lock on old bucket, conditionally get the content lock on new bucket
+		if get the lock on new bucket
+			finish the split using algorithm mentioned below for split
+			release the buffer content lock and pin on new bucket
+	if full, release lock but not pin, read/exclusive-lock next page; repeat as needed
 	>> see below if no space in any page of bucket
 	insert tuple at appropriate place in page
 	mark current page dirty and release buffer content lock and pin
-	release heavyweight share-lock
-	pin meta page and take buffer content lock in shared mode
+	if current page is not a bucket page, release the pin on bucket page
+	pin meta page and take buffer content lock in exclusive mode
 	increment tuple count, decide if split needed
 	mark meta page dirty and release buffer content lock and pin
 	done if no split needed, else enter Split algorithm below
@@ -256,11 +295,13 @@ bucket that is being actively scanned, because readers can cope with this
 as explained above.  We only need the short-term buffer locks to ensure
 that readers do not see a partially-updated page.
 
-It is clearly impossible for readers and inserters to deadlock, and in
-fact this algorithm allows them a very high degree of concurrency.
-(The exclusive metapage lock taken to update the tuple count is stronger
-than necessary, since readers do not care about the tuple count, but the
-lock is held for such a short time that this is probably not an issue.)
+To avoid deadlock between readers and inserters, whenever there is a need
+to lock multiple buckets, we always take in the order suggested in Lock
+Definitions above.  This algorithm allows them a very high degree of
+concurrency.  (The exclusive metapage lock taken to update the tuple count
+is stronger than necessary, since readers do not care about the tuple count,
+but the lock is held for such a short time that this is probably not an
+issue.)
 
 When an inserter cannot find space in any existing page of a bucket, it
 must obtain an overflow page and add that page to the bucket's chain.
@@ -271,46 +312,84 @@ index is overfull (has a higher-than-wanted ratio of tuples to buckets).
 The algorithm attempts, but does not necessarily succeed, to split one
 existing bucket in two, thereby lowering the fill ratio:
 
-	pin meta page and take buffer content lock in exclusive mode
-	check split still needed
-	if split not needed anymore, drop buffer content lock and pin and exit
-	decide which bucket to split
-	Attempt to X-lock old bucket number (definitely could fail)
-	Attempt to X-lock new bucket number (shouldn't fail, but...)
-	if above fail, drop locks and pin and exit
+	expand:
+		take buffer content lock in exclusive mode on meta page
+		check split still needed
+		if split not needed anymore, drop buffer content lock and exit
+		decide which bucket to split
+		Attempt to acquire cleanup lock on old bucket number (definitely could fail)
+		if above fail, release lock and pin and exit
+		if the split-in-progress flag is set, then finish the split
+			conditionally get the content lock on new bucket which was involved in split
+			if got the lock on new bucket
+				finish the split using algorithm mentioned below for split
+				release the buffer content lock and pin on old and new buckets
+				try to expand from start
+			else
+				release the buffer content lock and pin on old bucket and exit
+		if the garbage flag (indicates that tuples are moved by split) is set on bucket
+			release the buffer content lock on meta page
+			remove the tuples that doesn't belong to this bucket; see bucket cleanup below
+	Attempt to acquire cleanup lock on new bucket number (shouldn't fail, but...)
 	update meta page to reflect new number of buckets
-	mark meta page dirty and release buffer content lock and pin
+	mark meta page dirty and release buffer content lock
 	-- now, accesses to all other buckets can proceed.
 	Perform actual split of bucket, moving tuples as needed
 	>> see below about acquiring needed extra space
-	Release X-locks of old and new buckets
+
+	split guts
+	mark the old and new buckets indicating split-in-progress
+	mark the old bucket indicating has-garbage
+	copy the tuples that belong to the new bucket from the old bucket
+	during the copy, mark such tuples as moved-by-split
+	release lock but not pin for primary bucket page of old bucket,
+	read/shared-lock next page; repeat as needed
+	>> see below if no space in bucket page of new bucket
+	acquire exclusive-lock on both old and new buckets, in that order
+	clear the split-in-progress flag from both the buckets
+	mark buffers dirty and release the locks and pins on both old and new buckets
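
In terms of the page-opaque flag bits introduced by the patch, the
bookkeeping in the split-guts steps above boils down to roughly the following
sketch (flag names as defined elsewhere in the patch):

	/* begin: mark the old bucket as splitting and as containing garbage */
	oopaque->hasho_flag |= LH_BUCKET_OLD_PAGE_SPLIT | LH_BUCKET_PAGE_HAS_GARBAGE;
	/* the new bucket's freshly initialized primary page */
	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;

	/* ... copy qualifying tuples, marking each copy as moved-by-split ... */

	/* end: with both primary pages exclusive-locked (old first, then new) */
	oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
	nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;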
 
 Note the metapage lock is not held while the actual tuple rearrangement is
 performed, so accesses to other buckets can proceed in parallel; in fact,
 it's possible for multiple bucket splits to proceed in parallel.
 
-Split's attempt to X-lock the old bucket number could fail if another
-process holds S-lock on it.  We do not want to wait if that happens, first
-because we don't want to wait while holding the metapage exclusive-lock,
-and second because it could very easily result in deadlock.  (The other
-process might be out of the hash AM altogether, and could do something
-that blocks on another lock this process holds; so even if the hash
-algorithm itself is deadlock-free, a user-induced deadlock could occur.)
-So, this is a conditional LockAcquire operation, and if it fails we just
-abandon the attempt to split.  This is all right since the index is
-overfull but perfectly functional.  Every subsequent inserter will try to
-split, and eventually one will succeed.  If multiple inserters failed to
-split, the index might still be overfull, but eventually, the index will
+Split's attempt to acquire cleanup-lock on the old bucket number could fail
+if another process holds any lock or pin on it.  We do not want to wait if
+that happens, because we don't want to wait while holding the metapage
+exclusive-lock.  So, this is a conditional LWLockAcquire operation, and if
+it fails we just abandon the attempt to split.  This is all right since the
+index is overfull but perfectly functional.  Every subsequent inserter will
+try to split, and eventually one will succeed.  If multiple inserters failed
+to split, the index might still be overfull, but eventually, the index will
 not be overfull and split attempts will stop.  (We could make a successful
 splitter loop to see if the index is still overfull, but it seems better to
 distribute the split overhead across successive insertions.)
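
A minimal sketch of that conditional acquisition, mirroring the
_hash_getbuf_with_condlock_cleanup() helper added by the patch:

	buf = ReadBuffer(rel, start_oblkno);
	if (!ConditionalLockBufferForCleanup(buf))
	{
		/* someone else holds a lock or pin; give up on this split attempt */
		ReleaseBuffer(buf);
		goto fail;
	}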
 
+While copying tuples from the old bucket to the new bucket, we mark each
+copied tuple as moved-by-split so that concurrent scans can skip such tuples
+until the split operation is finished.  Once a tuple is marked as
+moved-by-split, it will remain so forever, but that does no harm.  We
+intentionally do not clear the flag, as doing so would generate additional
+I/O for no benefit.
+
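The marking itself is a bit-twiddle on the index tuple's t_info word, as done
later in the patch (INDEX_MOVED_BY_SPLIT_MASK is the bit the patch reserves
for this purpose):

	IndexTuple	new_itup = CopyIndexTuple(itup);
	Size		itupsize = new_itup->t_info & INDEX_SIZE_MASK;

	new_itup->t_info &= ~INDEX_SIZE_MASK;
	new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
	new_itup->t_info |= itupsize;		/* size bits preserved */
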
+The has-garbage flag indicates that the bucket contains tuples that were
+moved due to a split; it is set only on the old bucket.  We need it in
+addition to the split-in-progress flag so that the state is still
+recognizable after the split is over (i.e. once the split-in-progress flag
+has been cleared).  It is used both by vacuum and by the re-split operation.
+Vacuum uses it to decide whether it needs to remove the moved-by-split
+tuples from the bucket along with dead tuples.  A re-split uses it to ensure
+that a new split is not started from a bucket before the tuples left over
+from the previous split have been cleared away.  This helps keep bloat under
+control and makes the design somewhat simpler, as we never have to handle a
+bucket containing dead tuples from multiple splits.
+
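During cleanup, moved tuples are not recognized by a per-tuple flag but by
recomputing each tuple's bucket from its hash key; this is unambiguous
precisely because at most one split's garbage can be present.  A condensed
sketch from the hashbucketcleanup() changes below:

	bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
								  maxbucket, highmask, lowmask);
	if (bucket != cur_bucket)
	{
		/* can only belong to the one bucket most recently split off */
		Assert(bucket == new_bucket);
		deletable[ndeletable++] = offno;	/* moved by split: remove */
	}
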
 A problem is that if a split fails partway through (eg due to insufficient
-disk space) the index is left corrupt.  The probability of that could be
-made quite low if we grab a free page or two before we update the meta
-page, but the only real solution is to treat a split as a WAL-loggable,
+disk space or crash) the index is left corrupt.  The probability of that
+could be made quite low if we grab a free page or two before we update the
+meta page, but the only real solution is to treat a split as a WAL-loggable,
 must-complete action.  I'm not planning to teach hash about WAL in this
-go-round.
+go-round.  However, we do try to finish incomplete splits during insert
+and split operations.
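
Condensed from the _hash_doinsert() changes below, the insert-side completion
looks roughly like this (the new bucket is locked only conditionally, so a
busy bucket simply defers the completion):

	if (H_OLD_INCOMPLETE_SPLIT(pageopaque) && CheckBufferForCleanup(buf))
	{
		Buffer	nbuf = _hash_getbuf_with_condlock_cleanup(rel,
										_hash_get_newblk(rel, pageopaque),
										LH_BUCKET_PAGE);

		if (BufferIsValid(nbuf))
		{
			_hash_finish_split(rel, metabuf, buf, nbuf,
							   maxbucket, highmask, lowmask);
			_hash_relbuf(rel, nbuf);	/* insertion continues in old bucket */
		}
	}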
 
 The fourth operation is garbage collection (bulk deletion):
 
@@ -319,9 +398,13 @@ The fourth operation is garbage collection (bulk deletion):
 	fetch current max bucket number
 	release meta page buffer content lock and pin
 	while next bucket <= max bucket do
-		Acquire X lock on target bucket
-		Scan and remove tuples, compact free space as needed
-		Release X lock
+		Acquire cleanup lock on target bucket
+		Scan and remove tuples
+		While traversing overflow pages, lock the next page before
+		releasing the lock on the current bucket or overflow page
+		Reacquire the buffer content lock in exclusive mode on the bucket page
+		If buffer pincount is one, then compact free space as needed
+		Release lock
 		next bucket ++
 	end loop
 	pin metapage and take buffer content lock in exclusive mode
@@ -330,20 +413,23 @@ The fourth operation is garbage collection (bulk deletion):
 	else update metapage tuple count
 	mark meta page dirty and release buffer content lock and pin
 
-Note that this is designed to allow concurrent splits.  If a split occurs,
-tuples relocated into the new bucket will be visited twice by the scan,
-but that does no harm.  (We must however be careful about the statistics
-reported by the VACUUM operation.  What we can do is count the number of
-tuples scanned, and believe this in preference to the stored tuple count
-if the stored tuple count and number of buckets did *not* change at any
-time during the scan.  This provides a way of correcting the stored tuple
-count if it gets out of sync for some reason.  But if a split or insertion
-does occur concurrently, the scan count is untrustworthy; instead,
-subtract the number of tuples deleted from the stored tuple count and
-use that.)
-
-The exclusive lock request could deadlock in some strange scenarios, but
-we can just error out without any great harm being done.
+Note that this is designed to allow concurrent splits and scans.  If a
+split occurs, tuples relocated into the new bucket will be visited twice
+by the scan, but that does no harm.  Because we release the locks on the
+bucket page and overflow pages during the cleanup scan of a bucket, a
+concurrent scan may start on the bucket, but it is guaranteed to stay
+behind the cleanup.  Keeping scans behind cleanup is essential, as otherwise
+vacuum could remove tuples that are still needed to complete the scan, as
+explained in the Lock Definitions section above.  This holds true for
+backward scans as well (backward scans first traverse each bucket from the
+primary bucket page to the last overflow page in the chain).  We must be
+careful about the statistics reported
+by the VACUUM operation.  What we can do is count the number of tuples scanned,
+and believe this in preference to the stored tuple count if the stored tuple
+count and number of buckets did *not* change at any time during the scan.  This
+provides a way of correcting the stored tuple count if it gets out of sync for
+some reason.  But if a split or insertion does occur concurrently, the scan
+count is untrustworthy; instead, subtract the number of tuples deleted
+from the stored tuple count and use that.
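
The "scan stays behind cleanup" guarantee above comes from lock coupling
while walking the overflow chain: the next page is locked before the lock on
the current page is released.  A condensed sketch from the
hashbucketcleanup() changes below:

	next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
										  LH_OVERFLOW_PAGE, bstrategy);

	/*
	 * Only now let go of the current page; keep the pin if it is the
	 * primary bucket page so that no new split can begin meanwhile.
	 */
	if (retain_pin)
		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
	else
		_hash_relbuf(rel, buf);

	buf = next_buf;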
 
 
 Free Space Management
@@ -417,13 +503,11 @@ free page; there can be no other process holding lock on it.
 
 Bucket splitting uses a similar algorithm if it has to extend the new
 bucket, but it need not worry about concurrent extension since it has
-exclusive lock on the new bucket.
+buffer content lock in exclusive mode on the new bucket.
 
-Freeing an overflow page is done by garbage collection and by bucket
-splitting (the old bucket may contain no-longer-needed overflow pages).
-In both cases, the process holds exclusive lock on the containing bucket,
-so need not worry about other accessors of pages in the bucket.  The
-algorithm is:
+Freeing an overflow page requires the process to hold buffer content lock in
+exclusive mode on the containing bucket, so it need not worry about other
+accessors of pages in the bucket.  The algorithm is:
 
 	delink overflow page from bucket chain
 	(this requires read/update/write/release of fore and aft siblings)
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..302f6ed 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -287,10 +287,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		/*
 		 * An insertion into the current index page could have happened while
 		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold share lock on
-		 * the bucket, no deletions or splits could have occurred; therefore
-		 * we can expect that the TID still exists in the current index page,
-		 * at an offset >= where we were.)
+		 * for the TID we previously returned.  (Because we hold pin on the
+		 * bucket, no deletions or splits could have occurred; therefore we
+		 * can expect that the TID still exists in the current index page, at
+		 * an offset >= where we were.)
 		 */
 		OffsetNumber maxoffnum;
 
@@ -425,12 +425,15 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
 	so->hashso_bucket_valid = false;
-	so->hashso_bucket_blkno = 0;
 	so->hashso_curbuf = InvalidBuffer;
+	so->hashso_bucket_buf = InvalidBuffer;
+	so->hashso_old_bucket_buf = InvalidBuffer;
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
 	ItemPointerSetInvalid(&(so->hashso_heappos));
 
+	so->hashso_skip_moved_tuples = false;
+
 	scan->opaque = so;
 
 	/* register scan in case we change pages it's using */
@@ -449,15 +452,7 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
-
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	_hash_dropscanbuf(rel, so);
 
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
@@ -471,6 +466,8 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 				scan->numberOfKeys * sizeof(ScanKeyData));
 		so->hashso_bucket_valid = false;
 	}
+
+	so->hashso_skip_moved_tuples = false;
 }
 
 /*
@@ -484,16 +481,7 @@ hashendscan(IndexScanDesc scan)
 
 	/* don't need scan registered anymore */
 	_hash_dropscan(scan);
-
-	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
-
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	_hash_dropscanbuf(rel, so);
 
 	pfree(so);
 	scan->opaque = NULL;
@@ -504,6 +492,9 @@ hashendscan(IndexScanDesc scan)
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
  *
+ * This function also deletes the tuples that were moved by a split to
+ * another bucket.
+ *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 IndexBulkDeleteResult *
@@ -548,83 +539,52 @@ loop_top:
 	{
 		BlockNumber bucket_blkno;
 		BlockNumber blkno;
-		bool		bucket_dirty = false;
+		Buffer		bucket_buf;
+		Buffer		buf;
+		HashPageOpaque bucket_opaque;
+		Page		page;
+		bool		bucket_has_garbage = false;
 
 		/* Get address of bucket's start page */
 		bucket_blkno = BUCKET_TO_BLKNO(&local_metapage, cur_bucket);
 
-		/* Exclusive-lock the bucket so we can shrink it */
-		_hash_getlock(rel, bucket_blkno, HASH_EXCLUSIVE);
-
 		/* Shouldn't have any active scans locally, either */
 		if (_hash_has_active_scan(rel, cur_bucket))
 			elog(ERROR, "hash index has active scan during VACUUM");
 
-		/* Scan each page in bucket */
 		blkno = bucket_blkno;
-		while (BlockNumberIsValid(blkno))
-		{
-			Buffer		buf;
-			Page		page;
-			HashPageOpaque opaque;
-			OffsetNumber offno;
-			OffsetNumber maxoffno;
-			OffsetNumber deletable[MaxOffsetNumber];
-			int			ndeletable = 0;
 
-			vacuum_delay_point();
-
-			buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-											 info->strategy);
-			page = BufferGetPage(buf);
-			opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-			Assert(opaque->hasho_bucket == cur_bucket);
-
-			/* Scan each tuple in page */
-			maxoffno = PageGetMaxOffsetNumber(page);
-			for (offno = FirstOffsetNumber;
-				 offno <= maxoffno;
-				 offno = OffsetNumberNext(offno))
-			{
-				IndexTuple	itup;
-				ItemPointer htup;
+		/*
+		 * We need to acquire a cleanup lock on the primary bucket to wait
+		 * out concurrent scans.
+		 */
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy);
+		LockBufferForCleanup(buf);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
 
-				itup = (IndexTuple) PageGetItem(page,
-												PageGetItemId(page, offno));
-				htup = &(itup->t_tid);
-				if (callback(htup, callback_state))
-				{
-					/* mark the item for deletion */
-					deletable[ndeletable++] = offno;
-					tuples_removed += 1;
-				}
-				else
-					num_index_tuples += 1;
-			}
+		page = BufferGetPage(buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
-			/*
-			 * Apply deletions and write page if needed, advance to next page.
-			 */
-			blkno = opaque->hasho_nextblkno;
+		/*
+		 * If the bucket contains tuples that were moved by a split, then we
+		 * need to delete such tuples once the split is complete.  Before
+		 * cleaning, we must wait out any scans that started while the split
+		 * was in progress for the bucket.
+		 */
+		if (H_HAS_GARBAGE(bucket_opaque) &&
+			!H_INCOMPLETE_SPLIT(bucket_opaque))
+			bucket_has_garbage = true;
 
-			if (ndeletable > 0)
-			{
-				PageIndexMultiDelete(page, deletable, ndeletable);
-				_hash_wrtbuf(rel, buf);
-				bucket_dirty = true;
-			}
-			else
-				_hash_relbuf(rel, buf);
-		}
+		bucket_buf = buf;
 
-		/* If we deleted anything, try to compact free space */
-		if (bucket_dirty)
-			_hash_squeezebucket(rel, cur_bucket, bucket_blkno,
-								info->strategy);
+		hashbucketcleanup(rel, bucket_buf, blkno, info->strategy,
+						  local_metapage.hashm_maxbucket,
+						  local_metapage.hashm_highmask,
+						  local_metapage.hashm_lowmask, &tuples_removed,
+						  &num_index_tuples, bucket_has_garbage, true,
+						  callback, callback_state);
 
-		/* Release bucket lock */
-		_hash_droplock(rel, bucket_blkno, HASH_EXCLUSIVE);
+		_hash_relbuf(rel, bucket_buf);
 
 		/* Advance to next bucket */
 		cur_bucket++;
@@ -705,6 +665,197 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	return stats;
 }
 
+/*
+ * Helper function to perform deletion of index entries from a bucket.
+ *
+ * This expects that the caller has acquired a cleanup lock on the target
+ * bucket (primary page of a bucket), and it is the responsibility of the
+ * caller to release that lock.
+ *
+ * While scanning overflow pages, we first lock the next page and only then
+ * release the lock on the current page.  This ensures that any concurrent
+ * scan started after we start cleaning the bucket will always stay behind
+ * the cleanup.  If a scan were allowed to get ahead of the cleanup, vacuum
+ * could remove tuples that the scan still needs.
+ *
+ * We need to retain a pin on the primary bucket to ensure that no concurrent
+ * split can start.
+ */
+void
+hashbucketcleanup(Relation rel, Buffer bucket_buf,
+				  BlockNumber bucket_blkno,
+				  BufferAccessStrategy bstrategy,
+				  uint32 maxbucket,
+				  uint32 highmask, uint32 lowmask,
+				  double *tuples_removed,
+				  double *num_index_tuples,
+				  bool bucket_has_garbage,
+				  bool delay,
+				  IndexBulkDeleteCallback callback,
+				  void *callback_state)
+{
+	BlockNumber blkno;
+	Buffer		buf;
+	Bucket		cur_bucket;
+	Bucket new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
+	Page		page;
+	bool		bucket_dirty = false;
+
+	blkno = bucket_blkno;
+	buf = bucket_buf;
+	page = BufferGetPage(buf);
+	cur_bucket = ((HashPageOpaque) PageGetSpecialPointer(page))->hasho_bucket;
+
+	if (bucket_has_garbage)
+		new_bucket = _hash_get_newbucket(rel, cur_bucket,
+										 lowmask, maxbucket);
+
+	/* Scan each page in bucket */
+	for (;;)
+	{
+		HashPageOpaque opaque;
+		OffsetNumber offno;
+		OffsetNumber maxoffno;
+		Buffer		next_buf;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+		bool		retain_pin = false;
+		bool		curr_page_dirty = false;
+
+		if (delay)
+			vacuum_delay_point();
+
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		/* Scan each tuple in page */
+		maxoffno = PageGetMaxOffsetNumber(page);
+		for (offno = FirstOffsetNumber;
+			 offno <= maxoffno;
+			 offno = OffsetNumberNext(offno))
+		{
+			IndexTuple	itup;
+			ItemPointer htup;
+			Bucket		bucket;
+
+			itup = (IndexTuple) PageGetItem(page,
+											PageGetItemId(page, offno));
+			htup = &(itup->t_tid);
+			if (callback && callback(htup, callback_state))
+			{
+				/* mark the item for deletion */
+				deletable[ndeletable++] = offno;
+				if (tuples_removed)
+					*tuples_removed += 1;
+			}
+			else if (bucket_has_garbage)
+			{
+				/* delete the tuples that are moved by split. */
+				bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+											  maxbucket,
+											  highmask,
+											  lowmask);
+				/* mark the item for deletion */
+				if (bucket != cur_bucket)
+				{
+					/*
+					 * We expect tuples to belong to either the current bucket
+					 * or new_bucket.  This is ensured because we don't allow
+					 * further splits from a bucket that contains garbage. See
+					 * comments in _hash_expandtable.
+					 */
+					Assert(bucket == new_bucket);
+					deletable[ndeletable++] = offno;
+				}
+				else if (num_index_tuples)
+					*num_index_tuples += 1;
+			}
+			else if (num_index_tuples)
+				*num_index_tuples += 1;
+		}
+
+		/* retain the pin on primary bucket till end of bucket scan */
+		if (blkno == bucket_blkno)
+			retain_pin = true;
+		else
+			retain_pin = false;
+
+		blkno = opaque->hasho_nextblkno;
+
+		/*
+		 * Apply deletions and write page if needed, advance to next page.
+		 */
+		if (ndeletable > 0)
+		{
+			PageIndexMultiDelete(page, deletable, ndeletable);
+			bucket_dirty = true;
+			curr_page_dirty = true;
+		}
+
+		/* bail out if there are no more pages to scan. */
+		if (!BlockNumberIsValid(blkno))
+			break;
+
+		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											  LH_OVERFLOW_PAGE,
+											  bstrategy);
+
+		/*
+		 * release the lock on previous page after acquiring the lock on next
+		 * page
+		 */
+		if (curr_page_dirty)
+		{
+			if (retain_pin)
+				_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+			else
+				_hash_wrtbuf(rel, buf);
+			curr_page_dirty = false;
+		}
+		else if (retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
+
+		buf = next_buf;
+	}
+
+	/*
+	 * Lock the bucket page to clear the garbage flag and squeeze the bucket.
+	 * If the current buffer is the same as the bucket buffer, then we already
+	 * hold a lock on the bucket page.
+	 */
+	if (buf != bucket_buf)
+	{
+		_hash_relbuf(rel, buf);
+		_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+	}
+
+	/*
+	 * Clear the garbage flag from the bucket after deleting the tuples that
+	 * were moved by the split.  We purposefully clear the flag before
+	 * squeezing the bucket, so that after a restart vacuum doesn't again try
+	 * to delete the moved-by-split tuples.
+	 */
+	if (bucket_has_garbage)
+	{
+		HashPageOpaque bucket_opaque;
+
+		page = BufferGetPage(bucket_buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+	}
+
+	/*
+	 * If we deleted anything, try to compact free space.  For squeezing the
+	 * bucket, we must have a cleanup lock, else it can impact the ordering of
+	 * tuples for a scan that has started before it.
+	 */
+	if (bucket_dirty && CheckBufferForCleanup(bucket_buf))
+		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
+							bstrategy);
+}
 
 void
 hash_redo(XLogReaderState *record)
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index acd2e64..5cfd0aa 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -28,7 +28,8 @@
 void
 _hash_doinsert(Relation rel, IndexTuple itup)
 {
-	Buffer		buf;
+	Buffer		buf = InvalidBuffer;
+	Buffer		bucket_buf;
 	Buffer		metabuf;
 	HashMetaPage metap;
 	BlockNumber blkno;
@@ -40,6 +41,9 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	bool		do_expand;
 	uint32		hashkey;
 	Bucket		bucket;
+	uint32		maxbucket;
+	uint32		highmask;
+	uint32		lowmask;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -70,51 +74,131 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			errhint("Values larger than a buffer page cannot be indexed.")));
 
 	/*
-	 * Loop until we get a lock on the correct target bucket.
+	 * Copy bucket mapping info now; the comment in _hash_expandtable where
+	 * we copy this information and call _hash_splitbucket explains why this
+	 * is OK.
 	 */
-	for (;;)
-	{
-		/*
-		 * Compute the target bucket number, and convert to block number.
-		 */
-		bucket = _hash_hashkey2bucket(hashkey,
-									  metap->hashm_maxbucket,
-									  metap->hashm_highmask,
-									  metap->hashm_lowmask);
+	maxbucket = metap->hashm_maxbucket;
+	highmask = metap->hashm_highmask;
+	lowmask = metap->hashm_lowmask;
 
-		blkno = BUCKET_TO_BLKNO(metap, bucket);
+	/*
+	 * Conditionally get the lock on the primary bucket page for insertion
+	 * while holding the lock on the meta page.  If we would have to wait,
+	 * release the meta page lock and retry the hard way.
+	 */
+	bucket = _hash_hashkey2bucket(hashkey,
+								  maxbucket,
+								  highmask,
+								  lowmask);
+
+	blkno = BUCKET_TO_BLKNO(metap, bucket);
 
-		/* Release metapage lock, but keep pin. */
+	/* Fetch the primary bucket page for the bucket */
+	buf = ReadBuffer(rel, blkno);
+	if (!ConditionalLockBuffer(buf))
+	{
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(buf, HASH_WRITE);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+		oldblkno = blkno;
+		retry = true;
+	}
+	else
+	{
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
 		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+	}
 
+	if (retry)
+	{
 		/*
-		 * If the previous iteration of this loop locked what is still the
-		 * correct target bucket, we are done.  Otherwise, drop any old lock
-		 * and lock what now appears to be the correct bucket.
+		 * Loop until we get a lock on the correct target bucket.  We take the
+		 * lock on the primary bucket page and retain the pin on it for the
+		 * duration of the insert, to prevent concurrent splits.  Retaining
+		 * the pin on the primary bucket page ensures that a split can't
+		 * happen, since a split must acquire the cleanup lock on that page.
+		 * Acquiring the lock on the primary bucket and rechecking that it is
+		 * still the target bucket is mandatory, as otherwise a concurrent
+		 * split might cause this insertion to land in the wrong bucket.
 		 */
-		if (retry)
+		for (;;)
 		{
+			/*
+			 * Compute the target bucket number, and convert to block number.
+			 */
+			bucket = _hash_hashkey2bucket(hashkey,
+										  metap->hashm_maxbucket,
+										  metap->hashm_highmask,
+										  metap->hashm_lowmask);
+
+			blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+			/* Release metapage lock, but keep pin. */
+			_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+			/*
+			 * If the previous iteration of this loop locked what is still the
+			 * correct target bucket, we are done.  Otherwise, drop any old
+			 * lock and lock what now appears to be the correct bucket.
+			 */
 			if (oldblkno == blkno)
 				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
-		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+			_hash_relbuf(rel, buf);
 
-		/*
-		 * Reacquire metapage lock and check that no bucket split has taken
-		 * place while we were awaiting the bucket lock.
-		 */
-		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
-		oldblkno = blkno;
-		retry = true;
+			/* Fetch the primary bucket page for the bucket */
+			buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+
+			/*
+			 * Reacquire metapage lock and check that no bucket split has
+			 * taken place while we were awaiting the bucket lock.
+			 */
+			_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+			oldblkno = blkno;
+		}
 	}
 
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+	/* remember the primary bucket buffer to release the pin on it at end. */
+	bucket_buf = buf;
+
 	page = BufferGetPage(buf);
 	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(pageopaque->hasho_bucket == bucket);
 
+	/*
+	 * If there is any pending split, try to finish it before proceeding with
+	 * the insertion.  We only try to finish the split when inserting into the
+	 * old bucket, as that allows us to remove the moved tuples from the old
+	 * bucket and reuse the space.  There is no comparable benefit to
+	 * finishing the split while inserting into the new bucket.
+	 *
+	 * In future, if we want to finish the splits during insertion in new
+	 * bucket, we must ensure the locking order such that old bucket is locked
+	 * before new bucket.
+	 */
+	if (H_OLD_INCOMPLETE_SPLIT(pageopaque) && CheckBufferForCleanup(buf))
+	{
+		BlockNumber nblkno;
+		Buffer		nbuf;
+
+		nblkno = _hash_get_newblk(rel, pageopaque);
+
+		/* Fetch the primary bucket page for the new bucket */
+		nbuf = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+		if (nbuf)
+		{
+			_hash_finish_split(rel, metabuf, buf, nbuf, maxbucket,
+							   highmask, lowmask);
+
+			/*
+			 * Release the buffer here, as the insertion will happen into the
+			 * old bucket.
+			 */
+			_hash_relbuf(rel, nbuf);
+		}
+	}
+
 	/* Do the insertion */
 	while (PageGetFreeSpace(page) < itemsz)
 	{
@@ -127,14 +211,23 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		{
 			/*
 			 * ovfl page exists; go get it.  if it doesn't have room, we'll
-			 * find out next pass through the loop test above.
+			 * find out next pass through the loop test above.  Retain the pin
+			 * if it is a primary bucket.
 			 */
-			_hash_relbuf(rel, buf);
+			if (pageopaque->hasho_flag & LH_BUCKET_PAGE)
+				_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+			else
+				_hash_relbuf(rel, buf);
 			buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 			page = BufferGetPage(buf);
 		}
 		else
 		{
+			bool		retain_pin = false;
+
+			/* page flags must be accessed before releasing lock on a page. */
+			retain_pin = pageopaque->hasho_flag & LH_BUCKET_PAGE;
+
 			/*
 			 * we're at the end of the bucket chain and we haven't found a
 			 * page with enough room.  allocate a new overflow page.
@@ -144,7 +237,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 
 			/* chain to a new overflow page */
-			buf = _hash_addovflpage(rel, metabuf, buf);
+			buf = _hash_addovflpage(rel, metabuf, buf, retain_pin);
 			page = BufferGetPage(buf);
 
 			/* should fit now, given test above */
@@ -158,11 +251,13 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	/* found page with enough space, so add the item here */
 	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
 
-	/* write and release the modified page */
+	/*
+	 * write and release the modified page and ensure to release the pin on
+	 * primary page.
+	 */
 	_hash_wrtbuf(rel, buf);
-
-	/* We can drop the bucket lock now */
-	_hash_droplock(rel, blkno, HASH_SHARE);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
 
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index db3e268..760563a 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -82,23 +82,20 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *
  *	On entry, the caller must hold a pin but no lock on 'buf'.  The pin is
  *	dropped before exiting (we assume the caller is not interested in 'buf'
- *	anymore).  The returned overflow page will be pinned and write-locked;
- *	it is guaranteed to be empty.
+ *	anymore) if not asked to retain.  The pin will be retained only for the
+ *	primary bucket.  The returned overflow page will be pinned and
+ *	write-locked; it is guaranteed to be empty.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
  *
- *	The caller must hold at least share lock on the bucket, to ensure that
- *	no one else tries to compact the bucket meanwhile.  This guarantees that
- *	'buf' won't stop being part of the bucket while it's unlocked.
- *
  * NB: since this could be executed concurrently by multiple processes,
  * one should not assume that the returned overflow page will be the
  * immediate successor of the originally passed 'buf'.  Additional overflow
  * pages might have been added to the bucket chain in between.
  */
 Buffer
-_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
+_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 {
 	Buffer		ovflbuf;
 	Page		page;
@@ -131,7 +128,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 			break;
 
 		/* we assume we do not need to write the unmodified page */
-		_hash_relbuf(rel, buf);
+		if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
 
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
@@ -149,7 +149,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 
 	/* logically chain overflow page to previous page */
 	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	_hash_wrtbuf(rel, buf);
+	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+		_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_wrtbuf(rel, buf);
 
 	return ovflbuf;
 }
@@ -370,11 +373,11 @@ _hash_firstfreebit(uint32 map)
  *	in the bucket, or InvalidBlockNumber if no following page.
  *
  *	NB: caller must not hold lock on metapage, nor on either page that's
- *	adjacent in the bucket chain.  The caller had better hold exclusive lock
- *	on the bucket, too.
+ *	adjacent in the bucket chain except from primary bucket.  The caller had
+ *	better hold cleanup lock on the primary bucket.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf,
+_hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
@@ -413,22 +416,41 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf,
 	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
-	 * deleted.  No concurrency issues since we hold exclusive lock on the
-	 * entire bucket.
+	 * deleted.  No concurrency issues since we hold the cleanup lock on
+	 * primary bucket.  We don't need to acquire a buffer lock to fix the
+	 * primary bucket, as we already have that lock.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
-														 prevblkno,
-														 HASH_WRITE,
+		if (prevblkno == bucket_blkno)
+		{
+			Buffer		prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
+													 prevblkno,
+													 RBM_NORMAL,
+													 bstrategy);
+
+			Page		prevpage = BufferGetPage(prevbuf);
+			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+			Assert(prevopaque->hasho_bucket == bucket);
+			prevopaque->hasho_nextblkno = nextblkno;
+			MarkBufferDirty(prevbuf);
+			ReleaseBuffer(prevbuf);
+		}
+		else
+		{
+			Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
+															 prevblkno,
+															 HASH_WRITE,
 										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		prevpage = BufferGetPage(prevbuf);
-		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+															 bstrategy);
+			Page		prevpage = BufferGetPage(prevbuf);
+			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
 
-		Assert(prevopaque->hasho_bucket == bucket);
-		prevopaque->hasho_nextblkno = nextblkno;
-		_hash_wrtbuf(rel, prevbuf);
+			Assert(prevopaque->hasho_bucket == bucket);
+			prevopaque->hasho_nextblkno = nextblkno;
+			_hash_wrtbuf(rel, prevbuf);
+		}
 	}
 	if (BlockNumberIsValid(nextblkno))
 	{
@@ -570,7 +592,7 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
  *	required that to be true on entry as well, but it's a lot easier for
  *	callers to leave empty overflow pages and let this guy clean it up.
  *
- *	Caller must hold exclusive lock on the target bucket.  This allows
+ *	Caller must hold cleanup lock on the target bucket.  This allows
  *	us to safely lock multiple pages in the bucket.
  *
  *	Since this function is invoked in VACUUM, we provide an access strategy
@@ -580,6 +602,7 @@ void
 _hash_squeezebucket(Relation rel,
 					Bucket bucket,
 					BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy)
 {
 	BlockNumber wblkno;
@@ -591,27 +614,22 @@ _hash_squeezebucket(Relation rel,
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
 	bool		wbuf_dirty;
+	bool		release_buf = false;
 
 	/*
 	 * start squeezing into the base bucket page.
 	 */
 	wblkno = bucket_blkno;
-	wbuf = _hash_getbuf_with_strategy(rel,
-									  wblkno,
-									  HASH_WRITE,
-									  LH_BUCKET_PAGE,
-									  bstrategy);
+	wbuf = bucket_buf;
 	wpage = BufferGetPage(wbuf);
 	wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 
 	/*
-	 * if there aren't any overflow pages, there's nothing to squeeze.
+	 * if there aren't any overflow pages, there's nothing to squeeze.  The
+	 * caller is responsible for releasing the lock on the primary bucket.
 	 */
 	if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
-	{
-		_hash_relbuf(rel, wbuf);
 		return;
-	}
 
 	/*
 	 * Find the last page in the bucket chain by starting at the base bucket
@@ -669,12 +687,17 @@ _hash_squeezebucket(Relation rel,
 			{
 				Assert(!PageIsEmpty(wpage));
 
+				if (wblkno != bucket_blkno)
+					release_buf = true;
+
 				wblkno = wopaque->hasho_nextblkno;
 				Assert(BlockNumberIsValid(wblkno));
 
-				if (wbuf_dirty)
+				if (wbuf_dirty && release_buf)
 					_hash_wrtbuf(rel, wbuf);
-				else
+				else if (wbuf_dirty)
+					MarkBufferDirty(wbuf);
+				else if (release_buf)
 					_hash_relbuf(rel, wbuf);
 
 				/* nothing more to do if we reached the read page */
@@ -700,6 +723,7 @@ _hash_squeezebucket(Relation rel,
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
 				wbuf_dirty = false;
+				release_buf = false;
 			}
 
 			/*
@@ -733,19 +757,25 @@ _hash_squeezebucket(Relation rel,
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
 		{
-			/* yes, so release wbuf lock first */
-			if (wbuf_dirty)
+			if (wblkno != bucket_blkno)
+				release_buf = true;
+
+			/* yes, so release wbuf lock first if needed */
+			if (wbuf_dirty && release_buf)
 				_hash_wrtbuf(rel, wbuf);
-			else
+			else if (wbuf_dirty)
+				MarkBufferDirty(wbuf);
+			else if (release_buf)
 				_hash_relbuf(rel, wbuf);
+
 			/* free this overflow page (releases rbuf) */
-			_hash_freeovflpage(rel, rbuf, bstrategy);
+			_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
 			/* done */
 			return;
 		}
 
 		/* free this overflow page, then get the previous one */
-		_hash_freeovflpage(rel, rbuf, bstrategy);
+		_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
 
 		rbuf = _hash_getbuf_with_strategy(rel,
 										  rblkno,
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 178463f..2a45862 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -38,10 +38,14 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
 					uint32 nblocks);
 static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
+static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
+					   Bucket obucket, Bucket nbucket, Buffer obuf,
+					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
+					   uint32 highmask, uint32 lowmask);
 
 
 /*
@@ -55,46 +59,6 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 
 
 /*
- * _hash_getlock() -- Acquire an lmgr lock.
- *
- * 'whichlock' should the block number of a bucket's primary bucket page to
- * acquire the per-bucket lock.  (See README for details of the use of these
- * locks.)
- *
- * 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
- */
-void
-_hash_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		LockPage(rel, whichlock, access);
-}
-
-/*
- * _hash_try_getlock() -- Acquire an lmgr lock, but only if it's free.
- *
- * Same as above except we return FALSE without blocking if lock isn't free.
- */
-bool
-_hash_try_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		return ConditionalLockPage(rel, whichlock, access);
-	else
-		return true;
-}
-
-/*
- * _hash_droplock() -- Release an lmgr lock.
- */
-void
-_hash_droplock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		UnlockPage(rel, whichlock, access);
-}
-
-/*
  *	_hash_getbuf() -- Get a buffer by block number for read or write.
  *
  *		'access' must be HASH_READ, HASH_WRITE, or HASH_NOLOCK.
@@ -132,6 +96,35 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access, int flags)
 }
 
 /*
+ * _hash_getbuf_with_condlock_cleanup() -- as above, but get the buffer for write.
+ *
+ *		We try to take the conditional cleanup lock and if we get it then
+ *		return the buffer, else return InvalidBuffer.
+ */
+Buffer
+_hash_getbuf_with_condlock_cleanup(Relation rel, BlockNumber blkno, int flags)
+{
+	Buffer		buf;
+
+	if (blkno == P_NEW)
+		elog(ERROR, "hash AM does not use P_NEW");
+
+	buf = ReadBuffer(rel, blkno);
+
+	if (!ConditionalLockBufferForCleanup(buf))
+	{
+		ReleaseBuffer(buf);
+		return InvalidBuffer;
+	}
+
+	/* ref count and lock type are correct */
+
+	_hash_checkpage(rel, buf, flags);
+
+	return buf;
+}
+
+/*
  *	_hash_getinitbuf() -- Get and initialize a buffer by block number.
  *
  *		This must be used only to fetch pages that are known to be before
@@ -266,6 +259,33 @@ _hash_dropbuf(Relation rel, Buffer buf)
 }
 
 /*
+ *	_hash_dropscanbuf() -- release buffers used in scan.
+ *
+ * This routine unpins the buffers used during scan on which we
+ * hold no lock.
+ */
+void
+_hash_dropscanbuf(Relation rel, HashScanOpaque so)
+{
+	/* release pin we hold on primary bucket */
+	if (BufferIsValid(so->hashso_bucket_buf) &&
+		so->hashso_bucket_buf != so->hashso_curbuf)
+		_hash_dropbuf(rel, so->hashso_bucket_buf);
+	so->hashso_bucket_buf = InvalidBuffer;
+
+	/* release pin we hold on old primary bucket */
+	if (BufferIsValid(so->hashso_old_bucket_buf) &&
+		so->hashso_old_bucket_buf != so->hashso_curbuf)
+		_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+	so->hashso_old_bucket_buf = InvalidBuffer;
+
+	/* release any pin we still hold */
+	if (BufferIsValid(so->hashso_curbuf))
+		_hash_dropbuf(rel, so->hashso_curbuf);
+	so->hashso_curbuf = InvalidBuffer;
+}
+
+/*
  *	_hash_wrtbuf() -- write a hash page to disk.
  *
  *		This routine releases the lock held on the buffer and our refcount
@@ -489,9 +509,11 @@ _hash_pageinit(Page page, Size size)
 /*
  * Attempt to expand the hash table by creating one new bucket.
  *
- * This will silently do nothing if it cannot get the needed locks.
+ * This will silently do nothing if there are active scans in our own
+ * backend or if we cannot get a cleanup lock on the old or new bucket.
  *
- * The caller should hold no locks on the hash index.
+ * It completes any pending split and removes tuples left over in the old
+ * bucket from a previous split.
  *
  * The caller must hold a pin, but no lock, on the metapage buffer.
  * The buffer is returned in the same state.
@@ -506,10 +528,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	BlockNumber start_oblkno;
 	BlockNumber start_nblkno;
 	Buffer		buf_nblkno;
+	Buffer		buf_oblkno;
+	Page		opage;
+	HashPageOpaque oopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
 
+restart_expand:
+
 	/*
 	 * Write-lock the meta page.  It used to be necessary to acquire a
 	 * heavyweight lock to begin a split, but that is no longer required.
@@ -548,11 +575,16 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		goto fail;
 
 	/*
-	 * Determine which bucket is to be split, and attempt to lock the old
-	 * bucket.  If we can't get the lock, give up.
+	 * Determine which bucket is to be split, and attempt to take cleanup lock
+	 * on the old bucket.  If we can't get the lock, give up.
 	 *
-	 * The lock protects us against other backends, but not against our own
-	 * backend.  Must check for active scans separately.
+	 * The cleanup lock protects us against other backends, but not against
+	 * our own backend.  Must check for active scans separately.
+	 *
+	 * The cleanup lock is mainly to protect the split from concurrent
+	 * inserts. See src/backend/access/hash/README, Lock Definitions for
+	 * further details.  Due to this locking restriction, if there is any
+	 * pending scan, the split will give up, which is unfortunate but harmless.
 	 */
 	new_bucket = metap->hashm_maxbucket + 1;
 
@@ -563,11 +595,90 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	if (_hash_has_active_scan(rel, old_bucket))
 		goto fail;
 
-	if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
+	buf_oblkno = _hash_getbuf_with_condlock_cleanup(rel, start_oblkno, LH_BUCKET_PAGE);
+	if (!buf_oblkno)
 		goto fail;
 
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	/*
+	 * We want to finish any pending split from this bucket before starting a
+	 * new one; there is no apparent benefit in deferring it, and the code to
+	 * finish a split involving multiple buckets (considering that the new
+	 * split could also fail) would be complicated.  We don't need to consider
+	 * the new bucket for completing the split here, because a re-split of the
+	 * new bucket cannot start while there is still a pending split from the
+	 * old bucket.
+	 */
+	if (H_OLD_INCOMPLETE_SPLIT(oopaque))
+	{
+		BlockNumber nblkno;
+		Buffer		buf_nblkno;
+
+		/*
+		 * Copy bucket mapping info now; the comment in the code below where
+		 * we copy this information and call _hash_splitbucket explains why this
+		 * is OK.
+		 */
+		maxbucket = metap->hashm_maxbucket;
+		highmask = metap->hashm_highmask;
+		lowmask = metap->hashm_lowmask;
+
+		/* Release the metapage lock, before completing the split. */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+		nblkno = _hash_get_newblk(rel, oopaque);
+
+		/* Fetch the primary bucket page for the new bucket */
+		buf_nblkno = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+		if (!buf_nblkno)
+		{
+			_hash_relbuf(rel, buf_oblkno);
+			return;
+		}
+
+		_hash_finish_split(rel, metabuf, buf_oblkno, buf_nblkno, maxbucket,
+						   highmask, lowmask);
+
+		/*
+		 * Release the buffers and retry the expansion.
+		 */
+		_hash_relbuf(rel, buf_oblkno);
+		_hash_relbuf(rel, buf_nblkno);
+
+		goto restart_expand;
+	}
+
+	/*
+	 * Clean up the tuples remaining from the previous split.  This operation
+	 * requires a cleanup lock, and we already have one on the old bucket, so
+	 * do it now.  We also don't want to allow further splits from the bucket
+	 * until the garbage of the previous split has been cleaned.  This has two
+	 * advantages: first, it helps avoid bloat due to garbage, and second,
+	 * during cleanup of a bucket we can always be sure that the garbage
+	 * tuples belong to the most recently split bucket.  In contrast, if we
+	 * allowed cleanup of a bucket after the meta page had been updated to
+	 * indicate a new split but before the actual split, the cleanup operation
+	 * would not be able to decide whether a tuple had been moved to the newly
+	 * created bucket, and could end up deleting tuples that are still needed.
+	 */
+	if (H_HAS_GARBAGE(oopaque))
+	{
+		/* Release the metapage lock. */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+		hashbucketcleanup(rel, buf_oblkno, start_oblkno, NULL,
+						  metap->hashm_maxbucket, metap->hashm_highmask,
+						  metap->hashm_lowmask, NULL,
+						  NULL, true, false, NULL, NULL);
+
+		_hash_relbuf(rel, buf_oblkno);
+
+		goto restart_expand;
+	}
+
 	/*
-	 * Likewise lock the new bucket (should never fail).
+	 * There shouldn't be any active scan on new bucket.
 	 *
 	 * Note: it is safe to compute the new bucket's blkno here, even though we
 	 * may still need to update the BUCKET_TO_BLKNO mapping.  This is because
@@ -579,9 +690,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	if (_hash_has_active_scan(rel, new_bucket))
 		elog(ERROR, "scan in progress on supposedly new bucket");
 
-	if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
-		elog(ERROR, "could not get lock on supposedly new bucket");
-
 	/*
 	 * If the split point is increasing (hashm_maxbucket's log base 2
 	 * increases), we need to allocate a new batch of bucket pages.
@@ -600,8 +708,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
 			/* can't split due to BlockNumber overflow */
-			_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-			_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
+			_hash_relbuf(rel, buf_oblkno);
 			goto fail;
 		}
 	}
@@ -609,9 +716,18 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/*
 	 * Physically allocate the new bucket's primary page.  We want to do this
 	 * before changing the metapage's mapping info, in case we can't get the
-	 * disk space.
+	 * disk space.  Ideally, we don't need to check for a cleanup lock on the
+	 * new bucket, as no other backend could find this bucket until the meta
+	 * page is updated.  However, it is good to be consistent with the old
+	 * bucket's locking.
 	 */
 	buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM);
+	if (!CheckBufferForCleanup(buf_nblkno))
+	{
+		_hash_relbuf(rel, buf_oblkno);
+		_hash_relbuf(rel, buf_nblkno);
+		goto fail;
+	}
+
 
 	/*
 	 * Okay to proceed with split.  Update the metapage bucket mapping info.
@@ -665,13 +781,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  start_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno,
 					  maxbucket, highmask, lowmask);
 
-	/* Release bucket locks, allowing others to access them */
-	_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-	_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
-
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -738,13 +850,17 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
  * belong in the new bucket, and compress out any free space in the old
  * bucket.
  *
- * The caller must hold exclusive locks on both buckets to ensure that
+ * The caller must hold cleanup locks on both buckets to ensure that
  * no one else is trying to access them (see README).
  *
  * The caller must hold a pin, but no lock, on the metapage buffer.
  * The buffer is returned in the same state.  (The metapage is only
  * touched if it becomes necessary to add or remove overflow pages.)
  *
+ * The split needs to retain pins on the primary bucket pages of both the old
+ * and new buckets until the end of the operation, to prevent vacuum from
+ * starting while the split is in progress.
+ *
  * In addition, the caller must have created the new bucket's base page,
  * which is passed in buffer nbuf, pinned and write-locked.  That lock and
  * pin are released here.  (The API is set up this way because we must do
@@ -756,37 +872,87 @@ _hash_splitbucket(Relation rel,
 				  Buffer metabuf,
 				  Bucket obucket,
 				  Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Buffer		obuf;
 	Page		opage;
 	Page		npage;
 	HashPageOpaque oopaque;
 	HashPageOpaque nopaque;
 
-	/*
-	 * It should be okay to simultaneously write-lock pages from each bucket,
-	 * since no one else can be trying to acquire buffer lock on pages of
-	 * either bucket.
-	 */
-	obuf = _hash_getbuf(rel, start_oblkno, HASH_WRITE, LH_BUCKET_PAGE);
 	opage = BufferGetPage(obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 
+	/*
+	 * Mark the old bucket to indicate that a split is in progress and that it
+	 * has deletable tuples.  At the end of the operation we clear the
+	 * split-in-progress flag; vacuum will clear the has-garbage flag after
+	 * deleting such tuples.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
+
 	npage = BufferGetPage(nbuf);
 
-	/* initialize the new bucket's primary page */
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 	nopaque->hasho_prevblkno = InvalidBlockNumber;
 	nopaque->hasho_nextblkno = InvalidBlockNumber;
 	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
 	nopaque->hasho_page_id = HASHO_PAGE_ID;
 
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, nbuf, NULL,
+						   maxbucket, highmask, lowmask);
+
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, obuf);
+	_hash_relbuf(rel, nbuf);
+}
+
+/*
+ * _hash_splitbucket_guts -- Helper function to perform the split operation
+ *
+ * This routine is used to partition the tuples between old and new bucket and
+ * is used to finish the incomplete split operations.  To finish the previously
+ * interrupted split operation, caller needs to fill htab.  If htab is set, then
+ * we skip the movement of tuples that exists in htab, otherwise NULL value of
+ * htab indicates movement of all the tuples that belong to new bucket.
+ *
+ * Caller needs to lock and unlock the old and new primary buckets.
+ */
+static void
+_hash_splitbucket_guts(Relation rel,
+					   Buffer metabuf,
+					   Bucket obucket,
+					   Bucket nbucket,
+					   Buffer obuf,
+					   Buffer nbuf,
+					   HTAB *htab,
+					   uint32 maxbucket,
+					   uint32 highmask,
+					   uint32 lowmask)
+{
+	Buffer		bucket_obuf;
+	Buffer		bucket_nbuf;
+	Page		opage;
+	Page		npage;
+	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
+
+	bucket_obuf = obuf;
+	opage = BufferGetPage(obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	bucket_nbuf = nbuf;
+	npage = BufferGetPage(nbuf);
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
 	/*
 	 * Partition the tuples in the old bucket between the old bucket and the
 	 * new bucket, advancing along the old bucket's overflow bucket chain and
@@ -798,8 +964,6 @@ _hash_splitbucket(Relation rel,
 		BlockNumber oblkno;
 		OffsetNumber ooffnum;
 		OffsetNumber omaxoffnum;
-		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
 
 		/* Scan each tuple in old page */
 		omaxoffnum = PageGetMaxOffsetNumber(opage);
@@ -810,39 +974,69 @@ _hash_splitbucket(Relation rel,
 			IndexTuple	itup;
 			Size		itemsz;
 			Bucket		bucket;
+			bool		found = false;
 
 			/*
-			 * Fetch the item's hash key (conveniently stored in the item) and
-			 * determine which bucket it now belongs in.
+			 * Before inserting the tuple, probe the hash table containing the
+			 * TIDs of tuples belonging to the new bucket; if we find a match,
+			 * skip that tuple, else fetch the item's hash key (conveniently stored
+			 * in the item) and determine which bucket it now belongs in.
 			 */
 			itup = (IndexTuple) PageGetItem(opage,
 											PageGetItemId(opage, ooffnum));
+
+			if (htab)
+				(void) hash_search(htab, &itup->t_tid, HASH_FIND, &found);
+
+			if (found)
+				continue;
+
 			bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
 										  maxbucket, highmask, lowmask);
 
 			if (bucket == nbucket)
 			{
+				Size		itupsize = 0;
+				IndexTuple	new_itup;
+
+				/*
+				 * make a copy of index tuple as we have to scribble on it.
+				 */
+				new_itup = CopyIndexTuple(itup);
+
+				/*
+				 * Mark the index tuple as moved-by-split; such tuples are
+				 * skipped by scans while a split is in progress for the bucket.
+				 */
+				itupsize = new_itup->t_info & INDEX_SIZE_MASK;
+				new_itup->t_info &= ~INDEX_SIZE_MASK;
+				new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
+				new_itup->t_info |= itupsize;
+
 				/*
 				 * insert the tuple into the new bucket.  if it doesn't fit on
 				 * the current page in the new bucket, we must allocate a new
 				 * overflow page and place the tuple on that page instead.
-				 *
-				 * XXX we have a problem here if we fail to get space for a
-				 * new overflow page: we'll error out leaving the bucket split
-				 * only partially complete, meaning the index is corrupt,
-				 * since searches may fail to find entries they should find.
 				 */
-				itemsz = IndexTupleDSize(*itup);
+				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
 				if (PageGetFreeSpace(npage) < itemsz)
 				{
+					bool		retain_pin = false;
+
+					/*
+					 * page flags must be accessed before releasing lock on a
+					 * page.
+					 */
+					retain_pin = nopaque->hasho_flag & LH_BUCKET_PAGE;
+
 					/* write out nbuf and drop lock, but keep pin */
 					_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
 					/* chain to a new overflow page */
-					nbuf = _hash_addovflpage(rel, metabuf, nbuf);
+					nbuf = _hash_addovflpage(rel, metabuf, nbuf, retain_pin);
 					npage = BufferGetPage(nbuf);
-					/* we don't need nopaque within the loop */
+					nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 				}
 
 				/*
@@ -852,12 +1046,10 @@ _hash_splitbucket(Relation rel,
 				 * Possible future improvement: accumulate all the items for
 				 * the new page and qsort them before insertion.
 				 */
-				(void) _hash_pgaddtup(rel, nbuf, itemsz, itup);
+				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
-				/*
-				 * Mark tuple for deletion from old page.
-				 */
-				deletable[ndeletable++] = ooffnum;
+				/* be tidy */
+				pfree(new_itup);
 			}
 			else
 			{
@@ -870,15 +1062,9 @@ _hash_splitbucket(Relation rel,
 
 		oblkno = oopaque->hasho_nextblkno;
 
-		/*
-		 * Done scanning this old page.  If we moved any tuples, delete them
-		 * from the old page.
-		 */
-		if (ndeletable > 0)
-		{
-			PageIndexMultiDelete(opage, deletable, ndeletable);
-			_hash_wrtbuf(rel, obuf);
-		}
+		/* retain the pin on the old primary bucket */
+		if (obuf == bucket_obuf)
+			_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
 		else
 			_hash_relbuf(rel, obuf);
 
@@ -887,18 +1073,153 @@ _hash_splitbucket(Relation rel,
 			break;
 
 		/* Else, advance to next old page */
-		obuf = _hash_getbuf(rel, oblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
+		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
 		opage = BufferGetPage(obuf);
 		oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 	}
 
 	/*
 	 * We're at the end of the old bucket chain, so we're done partitioning
-	 * the tuples.  Before quitting, call _hash_squeezebucket to ensure the
-	 * tuples remaining in the old bucket (including the overflow pages) are
-	 * packed as tightly as possible.  The new bucket is already tight.
+	 * the tuples.  Mark the old and new buckets to indicate split is
+	 * finished.
+	 *
+	 * To avoid deadlocks due to locking order of buckets, first lock the old
+	 * bucket and then the new bucket.
 	 */
-	_hash_wrtbuf(rel, nbuf);
+	if (nopaque->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_wrtbuf(rel, nbuf);
+
+	/*
+	 * Acquiring cleanup lock to clear the split-in-progress flag ensures that
+	 * there is no pending scan that has seen the flag after it is cleared.
+	 */
+	_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+	opage = BufferGetPage(bucket_obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	_hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);
+	npage = BufferGetPage(bucket_nbuf);
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
+	/* indicate that split is finished */
+	oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
+	nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;
+
+	/*
+	 * now write the buffers; we don't release the locks here as the caller
+	 * is responsible for releasing them.
+	 */
+	MarkBufferDirty(bucket_obuf);
+	MarkBufferDirty(bucket_nbuf);
+}
+
+/*
+ *	_hash_finish_split() -- Finish the previously interrupted split operation
+ *
+ * To complete the split operation, we form a hash table of the TIDs in the
+ * new bucket, which the split operation then uses to skip tuples that were
+ * already moved before the split operation was interrupted.
+ *
+ * The caller must hold a pin, but no lock, on the metapage buffer.
+ * The buffer is returned in the same state.  (The metapage is only
+ * touched if it becomes necessary to add or remove overflow pages.)
+ *
+ * 'obuf' and 'nbuf' must be locked by the caller which is also responsible
+ * for unlocking them.
+ */
+void
+_hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Buffer nbuf,
+				   uint32 maxbucket, uint32 highmask, uint32 lowmask)
+{
+	HASHCTL		hash_ctl;
+	HTAB	   *tidhtab;
+	Buffer		bucket_nbuf;
+	Page		opage;
+	Page		npage;
+	HashPageOpaque opageopaque;
+	HashPageOpaque npageopaque;
+	Bucket		obucket;
+	Bucket		nbucket;
+	bool		found;
+
+	/* Initialize hash tables used to track TIDs */
+	memset(&hash_ctl, 0, sizeof(hash_ctl));
+	hash_ctl.keysize = sizeof(ItemPointerData);
+	hash_ctl.entrysize = sizeof(ItemPointerData);
+	hash_ctl.hcxt = CurrentMemoryContext;
+
+	tidhtab =
+		hash_create("bucket ctids",
+					256,		/* arbitrary initial size */
+					&hash_ctl,
+					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+	/*
+	 * Scan the new bucket and build hash table of TIDs
+	 */
+	bucket_nbuf = nbuf;
+	npage = BufferGetPage(nbuf);
+	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	for (;;)
+	{
+		BlockNumber nblkno;
+		OffsetNumber noffnum;
+		OffsetNumber nmaxoffnum;
+
+		/* Scan each tuple in new page */
+		nmaxoffnum = PageGetMaxOffsetNumber(npage);
+		for (noffnum = FirstOffsetNumber;
+			 noffnum <= nmaxoffnum;
+			 noffnum = OffsetNumberNext(noffnum))
+		{
+			IndexTuple	itup;
+
+			/* Fetch the item's TID and insert it in hash table. */
+			itup = (IndexTuple) PageGetItem(npage,
+											PageGetItemId(npage, noffnum));
+
+			(void) hash_search(tidhtab, &itup->t_tid, HASH_ENTER, &found);
+
+			Assert(!found);
+		}
+
+		nblkno = npageopaque->hasho_nextblkno;
+
+		/*
+		 * release our write lock without modifying buffer and ensure to
+		 * retain the pin on primary bucket.
+		 */
+		if (nbuf == bucket_nbuf)
+			_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, nbuf);
+
+		/* Exit loop if no more overflow pages in new bucket */
+		if (!BlockNumberIsValid(nblkno))
+			break;
+
+		/* Else, advance to next page */
+		nbuf = _hash_getbuf(rel, nblkno, HASH_READ, LH_OVERFLOW_PAGE);
+		npage = BufferGetPage(nbuf);
+		npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	}
+
+	/* Need a cleanup lock to perform split operation. */
+	LockBufferForCleanup(bucket_nbuf);
+
+	npage = BufferGetPage(bucket_nbuf);
+	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nbucket = npageopaque->hasho_bucket;
+
+	opage = BufferGetPage(obuf);
+	opageopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+	obucket = opageopaque->hasho_bucket;
+
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, bucket_nbuf, tidhtab,
+						   maxbucket, highmask, lowmask);
 
-	_hash_squeezebucket(rel, obucket, start_oblkno, NULL);
+	hash_destroy(tidhtab);
 }
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..6ec3bea 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -72,7 +72,19 @@ _hash_readnext(Relation rel,
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_nextblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on primary bucket page till the end of scan to ensure
+	 * that vacuum can't delete the tuples that are moved by split to new
+	 * that vacuum can't delete the tuples that are moved by split to the new
+	 * bucket.  Such tuples are required by scans that started on the split
+	 * buckets before the new bucket's split-in-progress flag
+	 */
+	if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -94,7 +106,16 @@ _hash_readprev(Relation rel,
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_prevblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on primary bucket page till the end of scan. See
+	 * comments in _hash_readnext to know the reason of retaining pin.
+	 */
+	if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -104,6 +125,13 @@ _hash_readprev(Relation rel,
 							 LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 		*pagep = BufferGetPage(*bufp);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
+
+		/*
+		 * We always maintain the pin on bucket page for whole scan operation,
+		 * so releasing the additional pin we have acquired here.
+		 */
+		if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+			_hash_dropbuf(rel, *bufp);
 	}
 }
 
@@ -192,43 +220,81 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	metap = HashPageGetMeta(page);
 
 	/*
-	 * Loop until we get a lock on the correct target bucket.
+	 * Conditionally get the lock on primary bucket page for search while
+	 * holding lock on meta page. If we have to wait, then release the meta
+	 * page lock and retry it in a hard way.
 	 */
-	for (;;)
-	{
-		/*
-		 * Compute the target bucket number, and convert to block number.
-		 */
-		bucket = _hash_hashkey2bucket(hashkey,
-									  metap->hashm_maxbucket,
-									  metap->hashm_highmask,
-									  metap->hashm_lowmask);
+	bucket = _hash_hashkey2bucket(hashkey,
+								  metap->hashm_maxbucket,
+								  metap->hashm_highmask,
+								  metap->hashm_lowmask);
 
-		blkno = BUCKET_TO_BLKNO(metap, bucket);
+	blkno = BUCKET_TO_BLKNO(metap, bucket);
 
-		/* Release metapage lock, but keep pin. */
+	/* Fetch the primary bucket page for the bucket */
+	buf = ReadBuffer(rel, blkno);
+	if (!ConditionalLockBufferShared(buf))
+	{
 		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(buf, HASH_READ);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+		oldblkno = blkno;
+		retry = true;
+	}
+	else
+	{
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+	}
 
+	if (retry)
+	{
 		/*
-		 * If the previous iteration of this loop locked what is still the
-		 * correct target bucket, we are done.  Otherwise, drop any old lock
-		 * and lock what now appears to be the correct bucket.
+		 * Loop until we get a lock on the correct target bucket.  We get the
+		 * lock on primary bucket page and retain the pin on it during read
+		 * operation to prevent the concurrent splits.  Retaining pin on a
+		 * primary bucket page ensures that split can't happen as it needs to
+		 * acquire the cleanup lock on primary bucket page.  Acquiring lock on
+		 * primary bucket and rechecking if it is a target bucket is mandatory
+		 * as otherwise a concurrent split followed by vacuum could remove
+		 * tuples from the selected bucket which otherwise would have been
+		 * visible.
 		 */
-		if (retry)
+		for (;;)
 		{
+			/*
+			 * Compute the target bucket number, and convert to block number.
+			 */
+			bucket = _hash_hashkey2bucket(hashkey,
+										  metap->hashm_maxbucket,
+										  metap->hashm_highmask,
+										  metap->hashm_lowmask);
+
+			blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+			/* Release metapage lock, but keep pin. */
+			_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+			/*
+			 * If the previous iteration of this loop locked what is still the
+			 * correct target bucket, we are done.  Otherwise, drop any old
+			 * lock and lock what now appears to be the correct bucket.
+			 */
 			if (oldblkno == blkno)
 				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
-		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+			_hash_relbuf(rel, buf);
 
-		/*
-		 * Reacquire metapage lock and check that no bucket split has taken
-		 * place while we were awaiting the bucket lock.
-		 */
-		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
-		oldblkno = blkno;
-		retry = true;
+			/* Fetch the primary bucket page for the bucket */
+			buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
+			/*
+			 * Reacquire metapage lock and check that no bucket split has
+			 * taken place while we were awaiting the bucket lock.
+			 */
+			_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+			oldblkno = blkno;
+		}
 	}
 
 	/* done with the metapage */
@@ -237,14 +303,60 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	/* Update scan opaque state to show we have lock on the bucket */
 	so->hashso_bucket = bucket;
 	so->hashso_bucket_valid = true;
-	so->hashso_bucket_blkno = blkno;
 
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
 	page = BufferGetPage(buf);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(opaque->hasho_bucket == bucket);
 
+	so->hashso_bucket_buf = buf;
+
+	/*
+	 * If the bucket split is in progress, then we need to skip tuples that
+	 * are moved from old bucket.  To ensure that vacuum doesn't clean any
+	 * tuples from old or new buckets till this scan is in progress, maintain
+	 * a pin on both of the buckets.  Here, we have to be cautious about lock
+	 * ordering, first acquire the lock on old bucket, release the lock on old
+	 * bucket, but not pin, then acquire the lock on new bucket and again
+	 * re-verify whether the bucket split still is in progress. Acquiring lock
+	 * on old bucket first ensures that the vacuum waits for this scan to
+	 * re-verify whether the bucket split is still in progress.  Acquiring lock
+	 */
+	if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+	{
+		BlockNumber old_blkno;
+		Buffer		old_buf;
+
+		old_blkno = _hash_get_oldblk(rel, opaque);
+
+		/*
+		 * release the lock on new bucket and re-acquire it after acquiring
+		 * the lock on old bucket.
+		 */
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+
+		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+
+		/*
+		 * remember the old bucket buffer so as to use it later for scanning.
+		 */
+		so->hashso_old_bucket_buf = old_buf;
+		_hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK);
+
+		_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		Assert(opaque->hasho_bucket == bucket);
+
+		if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+			so->hashso_skip_moved_tuples = true;
+		else
+		{
+			_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+			so->hashso_old_bucket_buf = InvalidBuffer;
+		}
+	}
+
 	/* If a backwards scan is requested, move to the end of the chain */
 	if (ScanDirectionIsBackward(dir))
 	{
@@ -273,6 +385,13 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
  *		false.  Else, return true and set the hashso_curpos for the
  *		scan to the right thing.
  *
+ *		Here we also scan the old bucket if the split for current bucket
+ *		was in progress at the start of the scan.  The basic idea is to
+ *		skip the tuples that are moved by split while scanning the current
+ *		bucket and then scan the old bucket to cover all such tuples.  This
+ *		is done to ensure that we don't miss any tuples in the scans that
+ *		started during split.
+ *
  *		'bufP' points to the current buffer, which is pinned and read-locked.
  *		On success exit, we have pin and read-lock on whichever page
  *		contains the right item; on failure, we have released all buffers.
@@ -338,6 +457,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum >= FirstOffsetNumber);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip the tuples that are moved by split operation
+						 * for the scan that has started when split was in
+						 * progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberNext(offnum);	/* move forward */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -353,9 +485,42 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							buf = so->hashso_old_bucket_buf;
+
+							/*
+							 * old bucket buffer must be valid as we acquire
+							 * the pin on it before the start of the scan and
+							 * retain it till the end of the scan.
+							 */
+							Assert(BufferIsValid(buf));
+
+							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+							page = BufferGetPage(buf);
+							opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for moved-by-split
+							 * tuples in the old bucket, and also ensures that
+							 * we won't retry scanning the old bucket once its
+							 * scan is finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
@@ -379,6 +544,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum <= maxoff);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip the tuples that are moved by split operation
+						 * for the scan that has started when split was in
+						 * progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberPrev(offnum);	/* move back */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -394,9 +572,42 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							buf = so->hashso_old_bucket_buf;
+
+							/*
+							 * old bucket buffer must be valid as we acquire
+							 * the pin on it before the start of the scan and
+							 * retain it till the end of the scan.
+							 */
+							Assert(BufferIsValid(buf));
+
+							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+							page = BufferGetPage(buf);
+							opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for moved-by-split
+							 * tuples in the old bucket, and also ensures that
+							 * we won't retry scanning the old bucket once its
+							 * scan is finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
@@ -410,9 +621,16 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 
 		if (itup == NULL)
 		{
-			/* we ran off the end of the bucket without finding a match */
+			/*
+			 * We ran off the end of the bucket without finding a match.
+			 * Release the pin on bucket buffers.  Normally, such pins are
+			 * released at end of scan, however scrolling cursors can
+			 * reacquire the bucket lock and pin in the same scan multiple
+			 * times.
+			 */
 			*bufP = so->hashso_curbuf = InvalidBuffer;
 			ItemPointerSetInvalid(current);
+			_hash_dropscanbuf(rel, so);
 			return false;
 		}
 
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 822862d..b5164d7 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -147,6 +147,23 @@ _hash_log2(uint32 num)
 }
 
 /*
+ * _hash_msb-- returns most significant bit position.
+ */
+static uint32
+_hash_msb(uint32 num)
+{
+	uint32		i = 0;
+
+	while (num)
+	{
+		num = num >> 1;
+		++i;
+	}
+
+	return i - 1;
+}
+
+/*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
  * If flags is not zero, it is a bitwise OR of the acceptable values of
@@ -352,3 +369,123 @@ _hash_binsearch_last(Page page, uint32 hash_value)
 
 	return lower;
 }
+
+/*
+ *	_hash_get_oldblk() -- get the block number from which current bucket
+ *			is being splitted.
+ */
+BlockNumber
+_hash_get_oldblk(Relation rel, HashPageOpaque opaque)
+{
+	Bucket		curr_bucket;
+	Bucket		old_bucket;
+	uint32		mask;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	/*
+	 * To get the old bucket from the current bucket, we need a mask to modulo
+	 * into lower half of table.  This mask is stored in meta page as
+	 * hashm_lowmask, but here we can't rely on the same, because we need a
+	 * value of lowmask that was prevalent at the time when bucket split was
+	 * started.  Masking the most significant bit of new bucket would give us
+	 * old bucket.
+	 */
+	curr_bucket = opaque->hasho_bucket;
+	mask = (((uint32) 1) << _hash_msb(curr_bucket)) - 1;
+	old_bucket = curr_bucket & mask;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	blkno = BUCKET_TO_BLKNO(metap, old_bucket);
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_newblk() -- get the block number of bucket for the new bucket
+ *			that will be generated after split from current bucket.
+ *
+ * This is used to find the new bucket from old bucket based on current table
+ * half.  It is mainly required to finish the incomplete splits where we are
+ * sure that not more than one bucket could have split in progress from old
+ * bucket.
+ */
+BlockNumber
+_hash_get_newblk(Relation rel, HashPageOpaque opaque)
+{
+	Bucket		curr_bucket;
+	Bucket		new_bucket;
+	uint32		lowmask;
+	uint32		mask;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	curr_bucket = opaque->hasho_bucket;
+
+	/*
+	 * new bucket can be obtained by OR'ing old bucket with most significant
+	 * bit of current table half.  There could be multiple buckets that could
+	 * have split from the current bucket.  We need the first such bucket that
+	 * exists based on current table half.
+	 */
+	lowmask = metap->hashm_lowmask;
+
+	for (;;)
+	{
+		mask = lowmask + 1;
+		new_bucket = curr_bucket | mask;
+		if (new_bucket > metap->hashm_maxbucket)
+		{
+			lowmask = lowmask >> 1;
+			continue;
+		}
+		blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+		break;
+	}
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_newbucket() -- get the new bucket that will be generated after
+ *			split from current bucket.
+ *
+ * This is used to find the new bucket from old bucket.  New bucket can be
+ * obtained by OR'ing old bucket with most significant bit of table half
+ * for lowmask passed in this function.  There could be multiple buckets that
+ * could have split from the current bucket.  We need the first such bucket that
+ * exists.  Caller must ensure that no more than one split has happened from
+ * old bucket.
+ */
+Bucket
+_hash_get_newbucket(Relation rel, Bucket curr_bucket,
+					uint32 lowmask, uint32 maxbucket)
+{
+	Bucket		new_bucket;
+	uint32		mask;
+
+	for (;;)
+	{
+		mask = lowmask + 1;
+		new_bucket = curr_bucket | mask;
+		if (new_bucket > maxbucket)
+		{
+			lowmask = lowmask >> 1;
+			continue;
+		}
+		break;
+	}
+
+	return new_bucket;
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 90804a3..3e5b1d2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3567,6 +3567,26 @@ ConditionalLockBuffer(Buffer buffer)
 }
 
 /*
+ * Acquire the content_lock for the buffer, but only if we don't have to wait.
+ *
+ * This assumes the caller wants BUFFER_LOCK_SHARED mode.
+ */
+bool
+ConditionalLockBufferShared(Buffer buffer)
+{
+	BufferDesc *buf;
+
+	Assert(BufferIsValid(buffer));
+	if (BufferIsLocal(buffer))
+		return true;			/* act as though we got it */
+
+	buf = GetBufferDescriptor(buffer - 1);
+
+	return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
+									LW_SHARED);
+}
+
+/*
  * LockBufferForCleanup - lock a buffer in preparation for deleting items
  *
  * Items may be deleted from a disk page only when the caller (a) holds an
@@ -3750,6 +3770,49 @@ ConditionalLockBufferForCleanup(Buffer buffer)
 	return false;
 }
 
+/*
+ * CheckBufferForCleanup - as above, but don't attempt to take lock
+ *
+ * We won't loop, but just check once to see if the pin count is OK.  If
+ * not, return FALSE.
+ */
+bool
+CheckBufferForCleanup(Buffer buffer)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(BufferIsValid(buffer));
+
+	if (BufferIsLocal(buffer))
+	{
+		/* There should be exactly one pin */
+		if (LocalRefCount[-buffer - 1] != 1)
+			return false;
+		/* Nobody else to wait for */
+		return true;
+	}
+
+	/* There should be exactly one local pin */
+	if (GetPrivateRefCount(buffer) != 1)
+		return false;
+
+	bufHdr = GetBufferDescriptor(buffer - 1);
+
+	buf_state = LockBufHdr(bufHdr);
+
+	Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+	if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+	{
+		/* pincount is OK. */
+		UnlockBufHdr(bufHdr, buf_state);
+		return true;
+	}
+
+	UnlockBufHdr(bufHdr, buf_state);
+	return false;
+}
+
 
 /*
  *	Functions for buffer I/O handling
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index d9df904..bbf822b 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -24,6 +24,7 @@
 #include "lib/stringinfo.h"
 #include "storage/bufmgr.h"
 #include "storage/lockdefs.h"
+#include "utils/hsearch.h"
 #include "utils/relcache.h"
 
 /*
@@ -32,6 +33,8 @@
  */
 typedef uint32 Bucket;
 
+#define InvalidBucket	((Bucket) 0xFFFFFFFF)
+
 #define BUCKET_TO_BLKNO(metap,B) \
 		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
 
@@ -51,6 +54,9 @@ typedef uint32 Bucket;
 #define LH_BUCKET_PAGE			(1 << 1)
 #define LH_BITMAP_PAGE			(1 << 2)
 #define LH_META_PAGE			(1 << 3)
+#define LH_BUCKET_NEW_PAGE_SPLIT	(1 << 4)
+#define LH_BUCKET_OLD_PAGE_SPLIT	(1 << 5)
+#define LH_BUCKET_PAGE_HAS_GARBAGE	(1 << 6)
 
 typedef struct HashPageOpaqueData
 {
@@ -63,6 +69,12 @@ typedef struct HashPageOpaqueData
 
 typedef HashPageOpaqueData *HashPageOpaque;
 
+#define H_HAS_GARBAGE(opaque)			((opaque)->hasho_flag & LH_BUCKET_PAGE_HAS_GARBAGE)
+#define H_OLD_INCOMPLETE_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
+#define H_NEW_INCOMPLETE_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+#define H_INCOMPLETE_SPLIT(opaque)		(((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT) || \
+										 ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT))
+
 /*
  * The page ID is for the convenience of pg_filedump and similar utilities,
  * which otherwise would have a hard time telling pages of different index
@@ -87,12 +99,6 @@ typedef struct HashScanOpaqueData
 	bool		hashso_bucket_valid;
 
 	/*
-	 * If we have a share lock on the bucket, we record it here.  When
-	 * hashso_bucket_blkno is zero, we have no such lock.
-	 */
-	BlockNumber hashso_bucket_blkno;
-
-	/*
 	 * We also want to remember which buffer we're currently examining in the
 	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
 	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
@@ -100,11 +106,23 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_curbuf;
 
+	/* remember the buffer associated with primary bucket */
+	Buffer		hashso_bucket_buf;
+
+	/*
+	 * remember the buffer associated with old primary bucket which is
+	 * required during the scan of the bucket for which split is in progress.
+	 */
+	Buffer		hashso_old_bucket_buf;
+
 	/* Current position of the scan, as an index TID */
 	ItemPointerData hashso_curpos;
 
 	/* Current position of the scan, as a heap TID */
 	ItemPointerData hashso_heappos;
+
+	/* Whether scan needs to skip tuples that are moved by split */
+	bool		hashso_skip_moved_tuples;
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -175,6 +193,8 @@ typedef HashMetaPageData *HashMetaPage;
 				  sizeof(ItemIdData) - \
 				  MAXALIGN(sizeof(HashPageOpaqueData)))
 
+#define INDEX_MOVED_BY_SPLIT_MASK	0x2000
+
 #define HASH_MIN_FILLFACTOR			10
 #define HASH_DEFAULT_FILLFACTOR		75
 
@@ -223,9 +243,6 @@ typedef HashMetaPageData *HashMetaPage;
 #define HASH_WRITE		BUFFER_LOCK_EXCLUSIVE
 #define HASH_NOLOCK		(-1)
 
-#define HASH_SHARE		ShareLock
-#define HASH_EXCLUSIVE	ExclusiveLock
-
 /*
  *	Strategy number. There's only one valid strategy for hashing: equality.
  */
@@ -298,21 +315,21 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
 
 /* hashovfl.c */
-extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf);
+extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
 extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
-				   BufferAccessStrategy bstrategy);
+				   BlockNumber bucket_blkno, BufferAccessStrategy bstrategy);
 extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
 				 BlockNumber blkno, ForkNumber forkNum);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy);
 
 /* hashpage.c */
-extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
-extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
-extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
 extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
 			 int access, int flags);
+extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
+								   BlockNumber blkno, int flags);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
@@ -321,6 +338,7 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 						   BufferAccessStrategy bstrategy);
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
+extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
 extern void _hash_wrtbuf(Relation rel, Buffer buf);
 extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
 				   int to_access);
@@ -328,6 +346,9 @@ extern uint32 _hash_metapinit(Relation rel, double num_tuples,
 				ForkNumber forkNum);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
+extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
+				   Buffer nbuf, uint32 maxbucket, uint32 highmask,
+				   uint32 lowmask);
 
 /* hashscan.c */
 extern void _hash_regscan(IndexScanDesc scan);
@@ -363,5 +384,17 @@ extern bool _hash_convert_tuple(Relation index,
 					Datum *index_values, bool *index_isnull);
 extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
 extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
+extern BlockNumber _hash_get_oldblk(Relation rel, HashPageOpaque opaque);
+extern BlockNumber _hash_get_newblk(Relation rel, HashPageOpaque opaque);
+extern Bucket _hash_get_newbucket(Relation rel, Bucket curr_bucket,
+					uint32 lowmask, uint32 maxbucket);
+
+/* hash.c */
+extern void hashbucketcleanup(Relation rel, Buffer bucket_buf,
+				  BlockNumber bucket_blkno, BufferAccessStrategy bstrategy,
+				  uint32 maxbucket, uint32 highmask, uint32 lowmask,
+				  double *tuples_removed, double *num_index_tuples,
+				  bool bucket_has_garbage, bool delay,
+				  IndexBulkDeleteCallback callback, void *callback_state);
 
 #endif   /* HASH_H */
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 8350fa0..788ba9f 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -63,7 +63,7 @@ typedef IndexAttributeBitMapData *IndexAttributeBitMap;
  * t_info manipulation macros
  */
 #define INDEX_SIZE_MASK 0x1FFF
-/* bit 0x2000 is not used at present */
+/* bit 0x2000 is reserved for index-AM specific usage */
 #define INDEX_VAR_MASK	0x4000
 #define INDEX_NULL_MASK 0x8000
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7b6ba96..accbb88 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -225,8 +225,10 @@ extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
 extern void UnlockBuffers(void);
 extern void LockBuffer(Buffer buffer, int mode);
 extern bool ConditionalLockBuffer(Buffer buffer);
+extern bool ConditionalLockBufferShared(Buffer buffer);
 extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
+extern bool CheckBufferForCleanup(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern void AbortBufferIO(void);
#37Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Janes (#34)
Re: Hash Indexes

On Thu, Sep 15, 2016 at 4:04 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Tue, Sep 13, 2016 at 9:31 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

=======

+Vacuum acquires cleanup lock on bucket to remove the dead tuples and or tuples
+that are moved due to split.  The need for cleanup lock to remove dead tuples
+is to ensure that scans' returns correct results.  Scan that returns multiple
+tuples from the same bucket page always restart the scan from the previous
+offset number from which it has returned last tuple.

Perhaps it would be better to teach scans to restart anywhere on the page,
than to force more cleanup locks to be taken?

Commenting on one of my own questions:

This won't work when the vacuum removes the tuple which an existing scan is
currently examining and thus will be used to re-find its position when it
realizes it is not visible and so takes up the scan again.

The index tuples in a page are stored sorted just by hash value, not by the
combination of (hash value, tid). If they were sorted by both, we could
re-find our position even if the tuple had been removed, because we would
know to start at the slot adjacent to where the missing tuple would be were
it not removed. But unless we are willing to break pg_upgrade, there is no
feasible way to change that now.

I think it is possible without breaking pg_upgrade, if we match all
items of a page at once (and save them as a local copy), rather than
matching item-by-item as we do now.  We are already doing something
similar for btree; refer to the explanation of BTScanPosItem and
BTScanPosData in nbtree.h.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#38Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Janes (#35)
Re: Hash Indexes

On Thu, Sep 15, 2016 at 4:44 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Tue, May 10, 2016 at 5:09 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Although, I don't think it is a very good idea to take any performance
data with WIP patch, still I couldn't resist myself from doing so and below
are the performance numbers. To get the performance data, I have dropped
the primary key constraint on pgbench_accounts and created a hash index on
aid column as below.

alter table pgbench_accounts drop constraint pgbench_accounts_pkey;
create index pgbench_accounts_pkey on pgbench_accounts using hash(aid);

To be rigorously fair, you should probably replace the btree primary key
with a non-unique btree index and use that in the btree comparison case. I
don't know how much difference that would make, probably none at all for a
read-only case.
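
Concretely, the comparison setup would look something like this (the btree
index name below is just illustrative):

-- hash case, as above
alter table pgbench_accounts drop constraint pgbench_accounts_pkey;
create index pgbench_accounts_pkey on pgbench_accounts using hash(aid);

-- "rigorously fair" btree case: a non-unique btree index instead of the PK
alter table pgbench_accounts drop constraint pgbench_accounts_pkey;
create index pgbench_accounts_aid_idx on pgbench_accounts using btree(aid);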

Below data is for read-only pgbench test and is a median of 3 5-min runs.
The performance tests are executed on a power-8 m/c.

With pgbench -S where everything fits in shared_buffers and the number of
cores I have at my disposal, I am mostly benchmarking interprocess
communication between pgbench and the backend. I am impressed that you can
detect any difference at all.

For this type of thing, I like to create a server side function for use in
benchmarking:

create or replace function pgbench_query(scale integer,size integer)
RETURNS integer AS $$
DECLARE sum integer default 0;
amount integer;
account_id integer;
BEGIN FOR i IN 1..size LOOP
account_id=1+floor(random()*scale);
SELECT abalance into strict amount FROM pgbench_accounts
WHERE aid = account_id;
sum := sum + amount;
END LOOP;
return sum;
END $$ LANGUAGE plpgsql;

And then run using a command like this:

pgbench -f <(echo 'select pgbench_query(40,1000)') -c$j -j$j -T 300

Where the first argument ('40', here) must be manually set to the same value
as the scale-factor.

With 8 cores and 8 clients, the values I get are, for btree, hash-head,
hash-concurrent, hash-concurrent-cache, respectively:

598.2
577.4
668.7
664.6

(each transaction involves 1000 select statements)

So I do see that the concurrency patch is quite an improvement. The cache
patch does not produce a further improvement, which was somewhat surprising
to me (I thought that that patch would really shine in a read-write
workload, but I expected at least improvement in read only)

To see the benefit from cache meta page patch, you might want to test
with more clients than the number of cores; at least that is what the data
by Mithun [1]/messages/by-id/CAD__OugX0aOa7qopz3d-nbBAoVmvSmdFJOX4mv5tFRpijqH47A@mail.gmail.com indicates, or probably test on a somewhat larger m/c.

[1]: /messages/by-id/CAD__OugX0aOa7qopz3d-nbBAoVmvSmdFJOX4mv5tFRpijqH47A@mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#39Amit Kapila
amit.kapila16@gmail.com
In reply to: Jesper Pedersen (#33)
Re: Hash Indexes

On Thu, Sep 15, 2016 at 12:43 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

Hi,

On 09/14/2016 07:24 AM, Amit Kapila wrote:

UPDATE also sees an improvement.

Can you explain this more? Is it more compared to HEAD or more
compared to Btree? Isn't this contradictory to what the test in the
mail below shows?

Same thing here - where the fields involving the hash index aren't updated.

Do you mean that for such cases you also see a 40-60% gain?

I have done a run to look at the concurrency / TPS aspect of the
implementation - to try something different than Mark's work on testing the
pgbench setup.

With definitions as above, with SELECT as

-- select.sql --
\set id random(1,10)
BEGIN;
SELECT * FROM test WHERE id = :id;
COMMIT;

and UPDATE/Indexed with an index on 'val', and finally UPDATE/Nonindexed w/o
one.
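
For reference, the UPDATE scenarios could be driven by something along these
lines (only a sketch: the real definitions are in the earlier mail, and the
btree index name and the SET expression on 'val' here are assumptions):

-- UPDATE/Indexed: btree index on the updated column
CREATE INDEX test_val_idx ON test USING btree (val);

-- update.sql --
\set id random(1,10)
BEGIN;
UPDATE test SET val = val + 1 WHERE id = :id;
COMMIT;

UPDATE/Nonindexed is the same script without the index on 'val'.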

[1] [2] [3] is new_hash - old_hash is the existing hash implementation on
master. btree is master too.

Machine is a 28C/56T with 256Gb RAM with 2 x RAID10 SSD for data + wal.
Clients ran with -M prepared.

[1]
/messages/by-id/CAA4eK1+ERbP+7mdKkAhJZWQ_dTdkocbpt7LSWFwCQvUHBXzkmA@mail.gmail.com
[2]
/messages/by-id/CAD__OujvYghFX_XVkgRcJH4VcEbfJNSxySd9x=1Wp5VyLvkf8Q@mail.gmail.com
[3]
/messages/by-id/CAA4eK1JUYr_aB7BxFnSg5+JQhiwgkLKgAcFK9bfD4MLfFK6Oqw@mail.gmail.com

Don't know if you find this useful due to the small number of rows, but let
me know if there are other tests I can run, f.ex. bump the number of rows.

It might be useful to test with a higher number of rows because with so
little data contention is not visible, but I think in general with your,
Jeff's and my own tests it is clear that there is a significant win
for read-only cases and for read-write cases where the index column is not
updated. Also, we don't find any regression as compared to HEAD, which
is sufficient to prove the worth of the patch. I think we should not
forget that one of the other main reasons for this patch is to allow
WAL logging for hash indexes. I think for now, we have done
sufficient tests for this patch to ensure its benefit; now if any
committer wants to see something more we can surely do it. I think
the important thing at this stage is to find out what more (if
anything) is left to make this patch "ready for committer".

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#40Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#36)
Re: Hash Indexes

One other point, I would like to discuss is that currently, we have a
concept for tracking active hash scans (hashscan.c) which I think is
mainly to protect splits when the backend which is trying to split has
some scan open. You can read "Other Notes" section of
access/hash/README for further details. I think after this patch we
don't need that mechanism for splits because we always retain a pin on
bucket buffer till all the tuples are fetched or scan is finished
which will defend against a split by our own backend which tries to
ensure cleanup lock on bucket. However, we might need it for vacuum
(hashbulkdelete), if we want to get rid of cleanup lock in vacuum,
once we have a page-at-a-time scan mode implemented for hash indexes.
If you agree with above analysis, then we can remove the checks for
_hash_has_active_scan from both vacuum and split path and also remove
corresponding code from hashbegin/end scan, but retain that hashscan.c
for future improvements.

I am posting this as a separate mail to avoid it getting lost as one
of the points in long list of review points discussed.

Thoughts?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#41Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#40)
Re: Hash Indexes

On Thu, Sep 15, 2016 at 2:13 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

One other point, I would like to discuss is that currently, we have a
concept for tracking active hash scans (hashscan.c) which I think is
mainly to protect splits when the backend which is trying to split has
some scan open. You can read "Other Notes" section of
access/hash/README for further details. I think after this patch we
don't need that mechanism for splits because we always retain a pin on
bucket buffer till all the tuples are fetched or scan is finished
which will defend against a split by our own backend which tries to
ensure cleanup lock on bucket.

Hmm, yeah. It seems like we can remove it.

However, we might need it for vacuum
(hashbulkdelete), if we want to get rid of cleanup lock in vacuum,
once we have a page-at-a-time scan mode implemented for hash indexes.
If you agree with above analysis, then we can remove the checks for
_hash_has_active_scan from both vacuum and split path and also remove
corresponding code from hashbegin/end scan, but retain that hashscan.c
for future improvements.

Do you have a plan for that? I'd be inclined to just blow away
hashscan.c if we don't need it any more, unless you're pretty sure
it's going to get reused. It's not like we can't pull it back out of
git if we decide we want it back after all.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#42Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#37)
Re: Hash Indexes

On Thu, Sep 15, 2016 at 1:41 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think it is possible without breaking pg_upgrade, if we match all
items of a page at once (and save them as a local copy), rather than
matching item-by-item as we do now.  We are already doing something
similar for btree; refer to the explanation of BTScanPosItem and
BTScanPosData in nbtree.h.

If ever we want to sort hash buckets by TID, it would be best to do
that in v10 since we're presumably going to be recommending a REINDEX
anyway. But is that a good thing to do? That's a little harder to
say.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#43Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#1)
Re: Hash Indexes

Hi,

On 2016-05-10 17:39:22 +0530, Amit Kapila wrote:

For making hash indexes usable in production systems, we need to improve
its concurrency and make them crash-safe by WAL logging them.

One earlier question about this is whether that is actually a worthwhile
goal. Are the speed and space benefits big enough in the general case?
Could those benefits not be achieved in a more maintainable manner by
adding a layer that uses a btree over hash(columns), and adds
appropriate rechecks after heap scans?

Note that I'm not saying that hash indexes are not worthwhile, I'm just
doubtful that question has been explored sufficiently.
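
For concreteness, that layered approach can be approximated today with an
expression index plus an explicit recheck in the query; a rough sketch
(table/column names made up, and it assumes a text column so the built-in
hashtext() function applies):

-- btree over hash(column) instead of a native hash index
CREATE INDEX t_col_hash_idx ON t USING btree (hashtext(col));

-- equality search: the hashtext() qual can use the index, while the
-- col = ... qual rechecks the real value to filter out hash collisions
SELECT * FROM t
WHERE hashtext(col) = hashtext('some value')
  AND col = 'some value';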

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#44Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Amit Kapila (#39)
3 attachment(s)
Re: Hash Indexes

On 09/15/2016 02:03 AM, Amit Kapila wrote:

Same thing here - where the fields involving the hash index aren't updated.

Do you mean that for such cases you also see a 40-60% gain?

No, UPDATEs are around 10-20% for our cases.

I have done a run to look at the concurrency / TPS aspect of the
implementation - to try something different than Mark's work on testing the
pgbench setup.

With definitions as above, with SELECT as

-- select.sql --
\set id random(1,10)
BEGIN;
SELECT * FROM test WHERE id = :id;
COMMIT;

and UPDATE/Indexed with an index on 'val', and finally UPDATE/Nonindexed w/o
one.

[1] [2] [3] is new_hash - old_hash is the existing hash implementation on
master. btree is master too.

Machine is a 28C/56T with 256Gb RAM with 2 x RAID10 SSD for data + wal.
Clients ran with -M prepared.

[1]
/messages/by-id/CAA4eK1+ERbP+7mdKkAhJZWQ_dTdkocbpt7LSWFwCQvUHBXzkmA@mail.gmail.com
[2]
/messages/by-id/CAD__OujvYghFX_XVkgRcJH4VcEbfJNSxySd9x=1Wp5VyLvkf8Q@mail.gmail.com
[3]
/messages/by-id/CAA4eK1JUYr_aB7BxFnSg5+JQhiwgkLKgAcFK9bfD4MLfFK6Oqw@mail.gmail.com

Don't know if you find this useful due to the small number of rows, but let
me know if there are other tests I can run, f.ex. bump the number of rows.

It might be useful to test with a higher number of rows because with so
little data contention is not visible,

Attached is a run with 1000 rows.

but I think in general with your,
Jeff's and my own tests it is clear that there is a significant win
for read-only cases and for read-write cases where the index column is not
updated. Also, we don't find any regression as compared to HEAD, which
is sufficient to prove the worth of the patch.

Very much agreed.

I think we should not
forget that one of the other main reasons for this patch is to allow
WAL logging for hash indexes.

Absolutely. There are scenarios that will benefit from switching to
a hash index.

I think for now, we have done
sufficient tests for this patch to ensure its benefit; now if any
committer wants to see something more we can surely do it.

Ok.

I think
the important thing at this stage is to find out what more (if
anything) is left to make this patch "ready for committer".

I think for CHI it would be Robert's and others' feedback. For WAL, there
is [1]/messages/by-id/5f8b4681-1229-92f4-4315-57d780d9c128@redhat.com.

[1]: /messages/by-id/5f8b4681-1229-92f4-4315-57d780d9c128@redhat.com

Best regards,
Jesper

Attachments:

select-1000.png (image/png)
[Attachment: update-indexed-1000.png (image/png)]
[Attachment: update-nonindexed-1000.png (image/png)]

#45Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#41)
Re: Hash Indexes

On Thu, Sep 15, 2016 at 7:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 15, 2016 at 2:13 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

One other point I would like to discuss is that currently we have a
mechanism for tracking active hash scans (hashscan.c), which I think is
mainly there to protect splits when the backend that is trying to split
has some scan open. You can read the "Other Notes" section of
access/hash/README for further details. I think after this patch we
don't need that mechanism for splits, because we always retain a pin on
the bucket buffer until all the tuples are fetched or the scan is
finished; that defends against a split by our own backend, since the
split has to take a cleanup lock on the bucket and cannot get one while
the pin is held.
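
(For illustration, here is a minimal sketch of the buffer-manager
interplay described above. It is not the patch itself: the function
names scan_suspend_on_bucket and split_try_bucket are invented for the
example; only the bufmgr calls are the real PostgreSQL API.)

    /* Illustrative sketch -- not actual PostgreSQL source. */
    #include "postgres.h"
    #include "storage/bufmgr.h"

    /*
     * Scan side: between amgettuple calls, release the content lock on
     * the bucket's primary page but deliberately keep the pin.
     */
    static void
    scan_suspend_on_bucket(Buffer bucket_buf)
    {
        LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
        /* the pin on bucket_buf is still held here */
    }

    /*
     * Split side: a cleanup lock is an exclusive lock that is granted
     * only when nobody else holds a pin on the page, so while any scan
     * still pins the bucket the split either gives up (conditional form
     * below) or has to wait.
     */
    static bool
    split_try_bucket(Buffer bucket_buf)
    {
        if (!ConditionalLockBufferForCleanup(bucket_buf))
            return false;       /* some scan still pins this bucket */

        /* ... safe to redistribute tuples out of this bucket ... */
        LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
        return true;
    }

In other words, the pin the scan has to hold anyway doubles as the
split interlock, which is what makes the separate hashscan.c
bookkeeping redundant for splits.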

Hmm, yeah. It seems like we can remove it.

However, we might still need it for vacuum (hashbulkdelete) if we want
to get rid of the cleanup lock in vacuum, once we have a page-at-a-time
scan mode implemented for hash indexes. If you agree with the above
analysis, then we can remove the checks for _hash_has_active_scan from
both the vacuum and split paths, and also remove the corresponding code
from hashbeginscan/hashendscan, but retain hashscan.c for future
improvements.

Do you have a plan for that? I'd be inclined to just blow away
hashscan.c if we don't need it any more, unless you're pretty sure
it's going to get reused. It's not like we can't pull it back out of
git if we decide we want it back after all.

I do want to work on it, but it is always possible that this gets
delayed due to other work. There is also a chance that while doing that
work we hit some problem that prevents us from using that optimization.
So I will go with your suggestion of removing hashscan.c and its usage
for now, and if required we can pull it back later. If nobody else
thinks otherwise, I will update this in the next patch version.
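
(To make the planned removal concrete, here is a rough sketch of the
kind of guard that goes away. It is paraphrased, not the real
_hash_expandtable(); can_split_bucket and OLD_SCHEME are made-up names,
while _hash_has_active_scan and ConditionalLockBufferForCleanup are the
existing functions being discussed.)

    /* Paraphrased sketch -- not actual PostgreSQL source. */
    #include "postgres.h"
    #include "access/hash.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    static bool
    can_split_bucket(Relation rel, Bucket obucket, Buffer obuf)
    {
    #ifdef OLD_SCHEME
        /* Today: back off if this backend has a scan registered on the
         * bucket in hashscan.c. */
        if (_hash_has_active_scan(rel, obucket))
            return false;
    #endif
        /* With the patch: the cleanup lock on the primary bucket page is
         * the only interlock needed, because every scan keeps a pin on
         * that page until it is done with the bucket. */
        return ConditionalLockBufferForCleanup(obuf);
    }

Once the cleanup lock carries that responsibility, hashbeginscan and
hashendscan no longer need to register and deregister scans, which is
why hashscan.c and its callers can go.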

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#46Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#43)
Re: Hash Indexes

On Thu, Sep 15, 2016 at 7:53 PM, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2016-05-10 17:39:22 +0530, Amit Kapila wrote:

For making hash indexes usable in production systems, we need to improve
its concurrency and make them crash-safe by WAL logging them.

One earlier question about this is whether that is actually a worthwhile
goal. Are the speed and space benefits big enough in the general case?

I think there will surely be speed benefits w.r.t. reads for larger
indexes. All our measurements till now have shown a benefit varying
from 30~60% (for reads) with a hash index as compared to btree, and I
think it could be even more if we further increase the size of the
index. On the space front, I have not done any detailed study, so it
is not right to conclude anything, but it appears to me that if the
index is on a char/varchar column where the key is 10 or 20 bytes,
hash indexes should be beneficial, as they store just the hash key
rather than the full key value.
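
(For a rough sense of the space difference: assuming an 8 kB page,
64-bit MAXALIGN, a 20-byte text key, an 8-byte index tuple header, a
1-byte short varlena header, and a 4-byte line pointer per entry; these
are illustrative figures, not measurements.)

\[
\text{btree leaf entry} \approx \mathrm{MAXALIGN}(8 + 1 + 20) + 4 = 36\ \text{bytes},
\qquad
\text{hash entry} \approx \mathrm{MAXALIGN}(8 + 4) + 4 = 20\ \text{bytes}
\]

The hash entry carries only the 4-byte hash code as its key, so its
size stays the same no matter how wide the indexed column is.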

Could those benefits not be achieved in a more maintainable manner by
adding a layer that uses a btree over hash(columns), and adds
appropriate rechecks after heap scans?

I don't think it can be faster for reads than using a real hash index,
but surely one can use that as a workaround.

Note that I'm not saying that hash indexes are not worthwhile; I'm just
doubtful that question has been explored sufficiently.

I think theoretically hash indexes are faster than btree, considering
lookup complexity (O(1) vs. O(log n)); also, the results after recent
optimizations indicate that hash indexes are faster than btree for
equality searches. I am not saying that, with the recent set of
patches proposed for hash indexes, they will be better in all kinds of
cases. They could be beneficial for cases where the indexed columns
are not updated heavily.

I think one can definitely argue that we can do some optimizations in
btree and make it equivalent to or better than hash indexes, but I am
not sure that is possible for all kinds of use cases.
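
(A back-of-the-envelope version of the O(1) vs. O(log n) point above;
the fanout f and chain length \ell below are assumed values, not
measurements.)

\[
\text{btree probe} \approx \lceil \log_f N \rceil\ \text{page reads},
\qquad
\text{hash probe} \approx 1 + \ell\ \text{page reads}
\]

With N = 10^8 keys and f \approx 400, a btree lookup touches about 4
pages from root to leaf, while a hash lookup with a short overflow
chain (\ell \le 1) stays at one or two bucket pages regardless of N,
plus the metapage, which is normally cached.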

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#47Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Amit Kapila (#46)
Re: Hash Indexes

On 16/09/16 18:35, Amit Kapila wrote:

On Thu, Sep 15, 2016 at 7:53 PM, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2016-05-10 17:39:22 +0530, Amit Kapila wrote:

For making hash indexes usable in production systems, we need to improve
its concurrency and make them crash-safe by WAL logging them.

One earlier question about this is whether that is actually a worthwhile
goal. Are the speed and space benefits big enough in the general case?

I think there will surely be speed benefits w.r.t. reads for larger
indexes. All our measurements till now have shown a benefit varying
from 30~60% (for reads) with a hash index as compared to btree, and I
think it could be even more if we further increase the size of the
index. On the space front, I have not done any detailed study, so it
is not right to conclude anything, but it appears to me that if the
index is on a char/varchar column where the key is 10 or 20 bytes,
hash indexes should be beneficial, as they store just the hash key
rather than the full key value.

Could those benefits not be achieved in a more maintainable manner by
adding a layer that uses a btree over hash(columns), and adds
appropriate rechecks after heap scans?

I don't think it can be faster for reads than using a real hash index,
but surely one can use that as a workaround.

Note that I'm not saying that hash indexes are not worthwhile; I'm just
doubtful that question has been explored sufficiently.

I think theoretically hash indexes are faster than btree, considering
lookup complexity (O(1) vs. O(log n)); also, the results after recent
optimizations indicate that hash indexes are faster than btree for
equality searches. I am not saying that, with the recent set of
patches proposed for hash indexes, they will be better in all kinds of
cases. They could be beneficial for cases where the indexed columns
are not updated heavily.

I think one can definitely argue that we can do some optimizations in
btree and make it equivalent to or better than hash indexes, but I am
not sure that is possible for all kinds of use cases.

I think having the choice of a more equality-optimized index design is
desirable. Once they are WAL-logged, hash indexes will be first-class
citizens, so to speak. I suspect there are a lot of further speed
optimizations that can be considered to tease out the best performance,
now that the basics of reliability have been sorted out. I think this
patch (or set of patches) is important!

regards

Mark


#48Amit Kapila
amit.kapila16@gmail.com
In reply to: Jesper Pedersen (#44)
Re: Hash Indexes

On Thu, Sep 15, 2016 at 10:38 PM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

On 09/15/2016 02:03 AM, Amit Kapila wrote:

Same thing here - where the fields involving the hash index aren't
updated.

Do you mean that you see a 40-60% gain for such cases as well?

No, UPDATEs are around 10-20% for our cases.

Okay, good to know.

It might be useful to test with a higher number of rows, because with
so little data the contention is not visible.

Attached is a run with 1000 rows.

I think 1000 is still too few; you probably want to run it with 100,000
or more rows. I suspect that the reason you are seeing such a large
difference between btree and hash index is that the range of values is
narrow, so there may be many overflow pages.
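
(To put the overflow-page concern in rough numbers; the ~20-byte entry
size and the 10,000-duplicate figure are assumptions for illustration.)

\[
\frac{8192\ \text{B/page}}{\approx 20\ \text{B/entry}} \approx 400\ \text{entries per bucket page},
\qquad
\frac{10\,000\ \text{duplicates}}{400} \approx 25\ \text{overflow pages in one chain}
\]

So with a narrow range of values, an equality probe for a heavily
duplicated key has to walk a long overflow chain, and that can dominate
the comparison with btree.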

I think for CHI it would be Robert's and the others' feedback. For WAL,
there is [1].

I have fixed your feedback for WAL and posted the patch. I think the
remaining thing to handle for the Concurrent Hash Index patch is to
remove the usage of hashscan.c from the code if no one objects to it;
do let me know if I am missing something here.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#49Jeff Janes
jeff.janes@gmail.com
In reply to: Andres Freund (#43)
Re: Hash Indexes

On Thu, Sep 15, 2016 at 7:23 AM, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2016-05-10 17:39:22 +0530, Amit Kapila wrote:

For making hash indexes usable in production systems, we need to improve
its concurrency and make them crash-safe by WAL logging them.

One earlier question about this is whether that is actually a worthwhile
goal. Are the speed and space benefits big enough in the general case?
Could those benefits not be achieved in a more maintainable manner by
adding a layer that uses a btree over hash(columns), and adds
appropriate rechecks after heap scans?

Note that I'm not saying that hash indexes are not worthwhile; I'm just
doubtful that question has been explored sufficiently.

I think that exploring it well requires good code. If the code is good,
why not commit it? I would certainly be unhappy to try to compare
WAL-logged concurrent hash indexes to btree-over-hash indexes if I had
to wait a few years for the latter to appear, and then had to dig up
the patches for the former, clean up the bitrot, and juggle multiple
patch sets in order to have something to compare.

Cheers,

Jeff

#50Andres Freund
andres@anarazel.de
In reply to: Jeff Janes (#49)
Re: Hash Indexes

On 2016-09-16 09:12:22 -0700, Jeff Janes wrote:

On Thu, Sep 15, 2016 at 7:23 AM, Andres Freund <andres@anarazel.de> wrote:

One earlier question about this is whether that is actually a worthwhile
goal. Are the speed and space benefits big enough in the general case?
Could those benefits not be achieved in a more maintainable manner by
adding a layer that uses a btree over hash(columns), and adds
appropriate rechecks after heap scans?

Note that I'm not saying that hash indexes are not worthwhile; I'm just
doubtful that question has been explored sufficiently.

I think that exploring it well requires good code. If the code is good,
why not commit it?

Because getting there requires a lot of effort, debugging it afterwards
would take effort, and maintaining it would also take a fair amount?
Adding code isn't free.

I'm rather unenthused about having a hash index implementation that's
mildly better in some corner cases, but otherwise doesn't have much
benefit. That'll mean we'll have to step up our user education a lot,
and we'll have to maintain something for little benefit.

Andres


#51Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Amit Kapila (#48)
3 attachment(s)
Re: Hash Indexes

On 09/16/2016 03:18 AM, Amit Kapila wrote:

Attached is a run with 1000 rows.

I think 1000 is still too few; you probably want to run it with 100,000
or more rows. I suspect that the reason you are seeing such a large
difference between btree and hash index is that the range of values is
narrow, so there may be many overflow pages.

Attached is a run with 100,000 rows.

I think for CHI it would be Robert's and the others' feedback. For WAL,
there is [1].

I have fixed your feedback for WAL and posted the patch.

Thanks!

I think the remaining thing to handle for the Concurrent Hash Index
patch is to remove the usage of hashscan.c from the code if no one
objects to it; do let me know if I am missing something here.

Like Robert said, hashscan.c can always come back, and removing it
would take a call stack out of the 'am' methods.

Best regards,
Jesper

Attachments:

[Attachment: select-100000.png (image/png)]

[Attachment: update-indexed-100000.png (image/png)]
���Oi#*�Z�\��m[<��fsmC6a1U�]���<S
���!;�casXq�6�ds`��k�O��\���eS�//��ec�#Y+��	�������gl�h�p\����ng
���'��C.�����-��b���l����
tr����l�X��4<���tr�$v3[��
�����[�<|��'
@.B%)�[��z3��S�}��"TFN��N�H�Zp�9��j�4	�������I�uc���6Z�T����l^q667�vGJ%;�c
�]��XTT������?6n�x��
���SSS}||N�<��G���Hkkk��P!7/s��>�#]|�����y�u�������~��-���ps1<<���299���Cs��9u����]]]������BBB����/B��b���,nD������v��,�y�����vhV(]�]���'�*�����������m����+�O�N���+W�9�����7n��'�t��k����''��ZYYedd�����s�N�:���4��u����/�!�-[����-_��F5T<r����~���f���^��������;f��My����v��^�z�v���L.^�x��?<}������?�<e���w�����?��=eU���m�����W�K�/������[,{zzR�65�P�	�L������a���#j�����+�j]*���3?�Q&�611IKK�p�����\�:uj���w��8w�\z3���+�0|�����gff8p���ccc�u��2�$�h�7n����_(w�`�|	

���e�'�|r��M*��?��
v���r��i���+����\�>���P�G;~�x���-[��5�������m���KW�\������]DDD�������5��C�"��������|/]�������Z��H$����9s&

4b��G��4"���D�
���={(i��_��������2�d����i�|c��Q�����(Y)k5j�#Q��+B3�O�DjPbw�����^�xq��U�SQ�d�������473�~%����,~M5�24�,�
�E}`�����5�n����6�g}�}���n�;;�z��������t=t��y�������C������}���el�A�Lr6+���;6c���W�fgg�V��o_PPm��Q�V�ruu-�����


�������QX�o�KC�>��G��q��Qz�wsss�6��g���R#Ewww�
�b��^����z��W��o���jS�[@j�����������\h"�X6���8�������
����pW�\�7pY�s...���4vrvv���#�;t���a���Q�`i���1���h�������u�������*�M%��Ez����44^�p!UhdM�EYhWA�f���]KOl������L�Pz���j����a_�d�/����!wnaC���%�X���u�D���b
����_?J��3gFDDl��-..�I�J���_�|9��P�`Q��m��i�&//o�������_�7������===�M�RCC�4�fG�m��Q4[�vm����r�����Oo������A7%?=`�x����"}F�������k��%�v�����-<9�yN�v��ZTG"����|}}i�����n�:�H����;x�`jS�M�6����[��uss����"
i����`��)S��3�T���W�����N�:]�zU���k��~�M�m�SF�e[[[�8$_��������4��!)�Tq1�Z�6|��#a�-��Y�~��TM%�����6�������f�
rss������]�-S"������}�~����x����2���W�p������w�RQ,���j���\�����.6��N���^�}�����+Wh�|�����;o�����I#�������bOJ_��E�z���'�"��dQ��������n�:j��l�nH������m��� 6k=�-5F���fr�������������6�&�;��c���8�0�=��bvv6�"5�:33S#��j��G������L���{�'	���~7P��
�Y�J��U=�`sy[�]������m�}q,;�}�N�.6u��-��J�}
��"�Og<�������.���ne�
��%!�L.�yVe�!7.q�~|W�����k�O������{Y���	�pna�R ������,����[zy���T�~7������o����
��E��MG��	@�4���?�x��Xs��T��h�������i#{-v
@��=j�s�:����w�9���&���F����mbVe�@�!kY(�1��!���q�S�2���Z�Bq\'����t14Q�=/�*C�I��A(Tr��P�Dd��X����o��!k�����5��H���P�0�b�P\���V.r�X�P�������),v�KE
E�����2����7o;v�l���T��'O���#22���Z��n+*dK>c�l�.vq��>�I�,T��s�������b%88���5***,,,$$$((H��+�g�L`���.v����36e�� �@���o��7�|�����JLL�����b������ljuUA[�����.��d����4���P5��������7������ho�M����p��M�uR~.[�!;y@���06k=w?�G������[�n�-�999fff�����,�uOn6�r;L���86m%w�"�M���x���S�*���)��:;;���R#E������PY�"���.{�6H��/�~���6n���}/� ����|[$����F�NNN����[�NJJrvv�oU�(���|�^
��Em�x�fe/B�
�k6iQ�R�|	�$�O��D��(;6�BQ�8p`tt���]{xxh��3�S��!��9����l��v�n.����������nH}�fb7/K��CrQ�$Vb�866Vi5�: �!�n���t��l��v����\�w���/<��\[$�N/<t���Pk!��a2������������k�O�rQ��\�B��=�m`��X����v�j9��P%���dO�sm
��0�6B�}�����t�*����\���}��9P�}��E����fz��O�6���M������@.
�?q����k�����X�w��'=�\�K����,+�k�Y�E���^���~A.
F�q6��egrmK1��^���>���0�=����N E�X��;�+���'}�\��?�cY^����Bv�6oh�Oz
��m��fA��|�]�[���xM�}�_�E�:���p�����=��+���^C.jO�����Yq�nd���cM���'}�\������������E��Z� �c��f:��+���-���i�O�A.���������c�tk`���r�f)��s{�����j��%����@�e����w����V;���5���}7��&]l������JS�/*(\`�������L5�������r��mG�	B�\�~���������b�lgu4����Z�s�>~�u���L��V9�i�������{�:��vw���n����={�l��-����v�&�NMM���9y�d�="##����/V��B�l
;�E���6�����-^e���d����N�#%:���56��n.N�0���m���[�n�4iR||<���]]]������BBB����/V#
�%��_"��]�c_of&5W|:��h��q���M��D��iSqC�
������"���c��������������e���"~z-1��`j�~oVW�����7G�=j�(SS�#F����������w��bOOO///>��,V��|��/�m�t�{6�Gf\�o�G�Z^�NS�N�[�M��|FI���M������tQ�&#���[)�D"�:�
���!j�����/�8|�p�n��vbb���=5n����b�(�c��1���C��pfd���S\\�����?l����� '�.��&��"�{Q��!
y
��M�XH����&��|�R�M�X>O���>�����f��q4F����_���ps�w���i��� �_���13�������,�eh4Y�*�e3,���wy�����m{��.>�se�S��'�O��y|�t��'%o�7���y�����U�-���E��U|�y�Wn(
-�3�SQ��(�/(��S�3��~)d,��s�7��yO�601�^,)D�E�F�u,��P��h��$+P�_SV������\)����_jt�i���A����=<<���;�bnnN�F�������)����+U(K������^�"0�����"�~�ST��w���6��{��E�%W0�4�����!�(W��X�'�[��m�-20xuh���(�z���C����������������z��iE��l4n�y����8�yzq��^P���Y�����E����g��7BE�����)��Z#CS�:��V40�Jj�"5�--h\�7h +i[���U3ne3n���t)Kt�q�xn�t������*|���$�\�w�eRXXX���eE''������['%%9;;k��I���Ko��t���}��i������(��������Ik�w��t������c(�"�DT$KD�����d��z��������bi�2.\)bS����ZH,	�l>\�d���F����g4�&9�F�U��k�X�����<D�)e�,M�V/��x��=��)�5�t�psq��c��0`�bq������~~~tMCI�5&�9�=�]9+]���&-R��Ii������/)����V�z�1������wy]LD&9���D�V�t]���{��?*���NI'�1�2����9���(e�����I��u�b��g�a�5@�*�1���Z��������y=��Z�CSi��5�;���Q�?�*�*}��?
zH��A����k���/  ���;44����V�oR����l�'���tq�T6q�:����u����W�Rz����������J���V"���M�n�\�OD��w�h=k+����
Ci`J#T�R#?3+�y:�5�&
Zj��%#�F�y\1���MK����s�,��i,5-����������b�������k�W���@i���"��_�Bun.I�pQ"�ccc5[����l� v��tq�l6vV�7����#������I���������c����J~H��h����QM�^���
�2k.}���K�����\�R�fdR�R�2U�O���N���*���K������b+�7��V����+��v�A��?��K���3Rr���7���'�l�gU��a�������n��,�O���M��?�a��:Mj���>�7���pB�F`���9�]������~�E��o�<|���Np�T�h�"��JrT��/��*��jijh��q�O�n����Q��N���������&&���tMm�:����BS��)��e������(t�E�<���{���7G
���l�*��{�.����m�S(���9

;���O���w*�>�'��OD�T��(�D��1Tv%��S��QWDa���_O��o�@�(5�)��8����F�&fu����u,M��k�E#�q�4`�E����r�k��2]�[�u��+�vw��E5<Lf_����\�r�o	4���N���������L���X����;�����A&��1��:�������j8Mi�@U��M�`V������R�JB���������(����S�dS1v,�bU����K~40d����GU�~�y��=����c-*|��������w�8�~K���2yuH?��o����9��i\xt�JY�����S�����X������>�^��i��5��H�#VsR�
�����Q_^f�Tk:~��t �b�$���dO�sm
�����������K	��ya����/�hf������V��/�P��]X����lZCc�v#<z�����;w�$T����9p�P�P���T�[i�>O��4=��������G�VKM������������.���b����~���R%����/`=��zvJ���g��� ���Mv��Q�>z�E�z��_���������yW����U�����*��|(���$��s�IB�KS���\K���TM����r=Qq#|��C`E�A.V��0���pmcS6o�����?O������)�/P��N����t������2@���Vg��,�l"j���!������Y����$�X���O������2 +���l��.����-�`��)����7���S��_��[~������8���������J6F�P��G���\aO�9�YV�6�`�Q�����9��/E�$�o'?����8w7�����4�����>:'��������r�b���^����R���^���R\T�|�<��/l�������������O��;�|�@ ��>���
u=�\,]���<�b�G�<���u�����.��'���m�����n�>�e�`���=��C��/)�/����������Uf����%���k�B�4m�O��?�#JN�f�`�~����|��p�F�@]���,4i��8����s�1
E2�������,a���-��SR�����cL�������
��[���"cn8��gp����{�NUZ��S�N���X��@P����aI�������(���V+�F�
������o��N��!���S�������|�'&l������EV��c�6��
�XQ�s;��u.�:U�W�N���vC.��PT��q^�/����/��D��E�i��%D�&<0M��D��Ee.c<�	��bd_blY�f��
:A�r155�������=z������Vq������ou���v�W.���FEE������)���L��~�bLL�����b������W�\=�_����hooO
��714e���999fff�����,�[U�,g�+�������t���mi�<�����N6�%�:=�y:�s�6���k�P��E''������['%%9;;k�; 8����������kmwG�r1  ���;44���5""B�����\��������~�"@���e����z�
:�s�6C��� ���r�E9�"�rQ�����#Gi�#U��9�����u�������-[����+���s��~�O?���o��jUS�I�?�"�Hq����	�	��k[�O~����^� U8q�DBB���Tv���M
��	�������u��I�&���3���7������������s��d��uG���B~�+�����������+�E��9C��/^���T�����&G�=j�(SS�#F����Ez��W��v;V.������������?~�_�^�����|�=���^A.� {��
����M�������w���o���k�:u�S�N?�������:X*�������X�b���vvv������������������
r��*�d��r���i����}~���&L�����y��������������:��gffn��������N<��T>�:���t��W���V�'���/zxx|��w...|e��-|������.*;��yLLL��=���+����H���O>���^A.�Z:q��{���������GfY1>>���_700(**255�b�����:�����w��!��x��|�u�����^A.�Z:q��3f�;v�����'�7��?������V6����������?�bE'�pE*�g�x�u��W��*�����?`Ce�u�d��1�m`` �x��5��>~��Y�f���[�6m�jK���:����sG��
O�Ox�_�B{�U�\w_�z�����P�����&U����cG��.!S�I�x�����*B~�+�����������+�E9�"�r@� �\�C.�!���r�E9�"�r@� �\�C.�!����G<8w����/������/Y��Q�FLr�X�ly�F�������c��[�
�"����}�������?^�lY@@�����pJ�k��i�� �E����,6l����W�XQr�x�������;q�D�.]v��U�n����&M����O�N��\��R�o������{]�x�����g�NNNvvv^�vm�^�j��& A_\�p!,,��+O�:�C�;w��;w.����QZZZBB����<HcM��������w�9p��k����M;w�\u>�.�E��=�M����8���S��k��|1??��YVVV�
1b��]7nL9J�GB�]��\}akk{��]''���LC��
�mSSSY�����MLL������o�����.]�����Z����UC��\}��[�������"+�����+44���[�h�������o��u���IIIU�)hr����3igoo?l������7�={v��*W������K�.��m���+m�
4x��A�&M�w�������IC���I��E�...4^�={�G}dmm���2v�	

����Q�����u�����5k��m����`��)S��3�Y�f�����G5�z�}��u�a��F��
���W�j�m�^6������W5�g�a�E9�"�r@� �\�C.�!���r�E9�"�r@� �\�C.�!���r�E9�"�r@� �\��?,a������IEND�B`�
update-nonindexed-100000.pngimage/png; name=update-nonindexed-100000.pngDownload
�PNG


IHDR]T�)	pHYs��:�:�IDATx���\Te�?�g.�(� � ����w���lu�L5-�u&nbi��%��RR�Y�S�p�����*����"�b�"����?3�f�a@��s���_�N����3�����3s.b�a���� L��&�E�"�	r��`�\0A.� L��&�E�"�	r��`�\0A.� L��&�E�� �i�N�u�T:`���3g��?_$Y, �������c�������4_m}}�}��w��	��+7G��[�����`c�!�T*������.\����/@�555.\����~��U�VEGG[��������^�~}BB����&4���������p����\d�`���#�y��B1l����P�4����6m�4e��W_}U(vz�r����q�����
��+�|��k��5w��E3u��!���}��O>��=�"@G���:y������|����������0��\�r��E������;� :r����qww6l�o�1y����\ZZj�={�lYY���>J�?�pmm��'x���?�~t�+�E�����M�t�8�|�,2�^�z��u��`�p�Ba�#"t:�y����W�����mWTTl��!==����7. �H�-[�������!��������}}}��W�\���o������L�8�m�]����^{��w���4)/^����!���u��-Y����~�z�]f�Vmm-��-[���F�IKK��w��23f�x��G.\��������!��������0�a�}�Y:j�|�2��<�����am*����������{����W/Z��������,		���_��6m�
�	��p���s�����zk��Q4�&O�|��a�������X,����0a��M�����W�Z5}�t�w�C��+W���-�
���r��m����[��@w\����<������	�"�	r��`�\0A.� L��&�E�"�	r��`�\0A.� L��&�E�"�	r��`�\0A.� L��&�E��#����;�3�`I$�����G5/�$E�C��\�Ds���_�������u_��!,���[o���s����4����111��o����y�����]����T*���222^}������������c���[Y�����_��cG���W�Ze�����6o�L�>}�����Z��q�^z�%�Ov�/��r������n����s++s�EK4�h�������.�[�/_^__��	?������H$UUU������w��>���W^���_��������n��]�x�fX||��.��-���S�N���������{��m���h����F����?/^�����R����|�VV����}���tG�g��a��M�6�<c�b��_�I�U*UJJ
M)Z����}�������[�l��k����o��}�v��������h{��y�c4�z��I�S�N�����L��~������ne�`�`����_|1k�����d2c������l;$$��o4h�pww������.|�����&���K�Y:(dC�Y�bM>�s'Nl}��V���-1b���>K�\�������`��P(4�E"Q�Wnua�r�
�b��'����
�x6��������?��[Z9���@k^{�5:���tqqa+���EEE����]PP��o_>������7����;f�]�|�8�4:p���}���������[���^he��V����_}��B��
�Cc��W_�j��1����M�4��>X�xqmm�?�����!}�%K�xxx�]�v��=�v���7{�l�t��>���n��3x�`���r0�\���={�\�2<<��}��7^y�:^
����o���
+%%%..���w��b�������j''��^zi���C�mhh>|�����JeTT���s����t�?���t^^^��[_�
�� ,5?+����g�N��]\\>3h�^VOmc,���\.��y��^__�6�b�
�{=z�|6��m7_����9�"�	r��`�\0A.� L��&�E�"�	r��`�\0A.� L��&�E��bff������~�h;>>�������]�n����i���2666777,,�������E#^���C��?nq���t�LVRR�g����>|��.]�P(������RRR����_0�E.9r�&��e����6mZ�f���W�[�����u�\.����E6��Y0�E.&$$4/�:u���3�?����FF���i���(  �6�%�Y0�E.ZU]]}�����������7o�����R�$	m�i}}=�d;�Ft4�O^xx8�]��7���g��M�,���7[�J�4��T�T�d2����RiR������<��\���OMM
m0�������KJJBCC���CBBlR0�o.N�2e���4�V�\��#�����������x:����I���(��,������>|8
Hv���������T�B���a�"�/r���E�L&��m�EQ.������E#^�"O L��&�E�"�	r��`�\0A.� L��&�E��p���j���v�T�4��<�����������d>��7!����B_��.A���������O��t����z��Y�nTVV�����������xxx����h��
�Ry��
N�Owz��5���C�?~����h��/�0�,]�T�Pdee������$''�����S���/Nk���%y���v,�"�9Bk��e�����9��~0Vrrr�n�*���������`kg���F���4���b�.���X�={�>5���%���R�mG�/r1!!�j=55�����Sc���(  �6mRh������x��%MmME�EM�<��\�������AA�})�R�$	m�i}}�M�Ft4��V���D���
W���������+�S�@?�����P�D�b���~Vi���n�uj��\?	�3����e����oQ�J�4��T�T�d2�����-*4��W����k�l�M��V�a_�����\�pK�P��4�5J"�a�������y(<i�50n�{l�s��l[ �7�� ���KJJBCC���CBB�[�Y��4���P�����-��Z�FGj��VMT�TkU�jUC�R�Ti�*
����:g��dj��>���E���4����������x:����I������,�
:�y����#�K�������)�<�~j6��S�Nx�JE�x���@@�""�w����S���|�bc��~Q��$�D� ��E�|���l%&&������*����������T7���>S�zQ��dz7_l���W#l�f�\�
A�T�H��R�D�$�O�v�'{.�;����7)�bx�����<��\l%-n���;w��X��E��:�w�������������	?�O�-K,hp'en��A�DP/3R�H��"q�I]<$.�$.�'����.4�l��Mh:�;.$��gi:�8
��E.��Pi��}���k�7��E����P4�\��T��N?e�U�-�Sa�D�,q�pwq'�����a:\�R.�>|����5�E��Ej[���q�?*���Vit���yU��i�a�Wk�jH��
�F&��.(s�&\�U3�Pc,�r�dtSK(3�^oC�7��z���p����:�����lQk$��41��o���Sw�����h��AW���iD��}�����n�rwR�),v#���en��Cx���8����&)�sm�����Yx��%@�����1�y"ZPi�o;55����F}���j�-�L������wl�F�����<%n���a�����������"$bO"�2�#bCC���<	�!�K������7���-�m��5�E�s>9�����v������c�5:�������w	�]�����������H�\p�]��\�(u��
n��)���'��{�K{M�V��(]�����t`s�E�������JU�����T��
������eK��mY�DP�������SX�ooy9�$N�D$�����!N��l�[��z�x,#��t`S�E�.��
{
��������r���F�M�o�.�����?y!��	n�W�i�	+��:w�R��tw��;����}�������qo���>�).��`�47
�nh�W�XE#�f��
]�
]���CU����x��~wg�����E&u#���p�m�nE��E~�����F%5��Z�[Lo:
��Z�L�ez�eeA���-o��R�S���������Q?��\�':%����[�����+���K���)#����5^�b�����_��]�<V��H\��J����s}	�-s�!�x�/zG�}�b��}e��u�KC.����Q��i�[>�5�,��D}�0Z-q��������`9��J&��	�0w>�P�x�h�y�\zH=��GJ\Dw:���������1�/��]:�����������^�3W^��h�����_�E���B�������7�?�����d��(`�$���
B}�xKN���D{,��K�.�C������� ���#�FF�����r���m���5�/�b�4�(��^���i��HD�� �� �)(1^������t�8��/����9}�t��t��K�.���������/==}�h��+++cccsss����]<<<�_�w}���q�MD���������x�����G��w�����h���;���/r���C�������9s&L��c���7��;���c��t�R�B��������������"����H�7��g4���P/��}��?w\�P��=:4����;_�:/r���#4��-[f^�9s��3\\\�M�����srr�n�*���������`kg��hkH�����j���9��$��~"�2~���������6�E.c�\\\���w��1c�vQQQ@@m���V�
R����/����	JS���Dl9�CC\���x���8w���� ��U�T�~�;:����I���&�w�j�T\,�����H�X�yDw^;��������a�����x������O>�d���lE*��l�S�R)��lR4
���Pl^��p��J����y�?k�X�n��I�A��.��W���y���XZZJ�Y���&M�d,�����������`��4��9M����f�O��?���dd��]��
��<|K
��7.\8k����'�###��������%mR�T�)�����`��i�<��V�bA�/����%A�]�P�t
��E�!�l�=`###�6���������/111&&&55U�P����Y��N�������#�5��b���	g�MF����i	����"-�\d��o$��w��i�"����U��Q�P|Z��&�:3��v/�`���g�g��_��E�����,r��������y��j_����_��"6;��r������c��F�_5�'�Qb:��H@B��_����.���x47H����E��G4�������p�����D��a�3�Ep$�%������Ni&�l����X�n2�G�;"�������T���������c���������n$�����~Wp(�E��#^�*?�y��6Ug��	��n�D�!���/�E��j1�W���h�.�2�E�D�� #|�+��vxW��Gx~C�g�k��YQ��2�����!t'b��� �a��j���������t1������XB.B�`8<����#��s��w���������{���pG�E�s�[��.����8���y�[��t�!�+��em�7�[���������^��4���U3�WL��B��
�r�P����my7<�i^Q�1��xh����
��/o�����O7��FeeelllnnnXX�������`OTgJ����u����N�&T?�-�'�p��K�/�D:t��q��M-]�T�Pdee������$''wP������%��>\�������F��>������M�"�9Bk��e������[���������h6�:��b������2�@Xw���WKQ=Q�7�5PZ9��c��W��E.&$$4/�F```aaa��cg�5��3������F������4���&���+�x��<;��]P�%�.�R�""���F 4�����?�w�����C��E�T*�D�?��N���;�hDG���a�6��u������Ddg�6��x��T��
�kn:��������S��%��o|�����o�8��\�J�4��T�T�d��+���[Th(6/��M&���H��������Kt�������X�`�w��x�i[6���,�}(~���s��������o<f����b:�\�d��o���W����/]���F�]�6{��C��5j��-�����V��o.���������qE��������r��\��`t/� ���-�v�u��_kHB����-)����m�6�8G������/_^�j����NS���~2~�J����:�����t���!Ch�&%%��\�fMKE�7###�������4""����!�w������li�i�����4�����V?���47��h����Kr�|���}����4 �MR��P�}���:u���m���G�f��Z�/rQ���l����������P(222�:������u'�}z������UUUeu___�Q]]�����]\\Z)r���|_&�~���sg'�Cu���
7u���4j�3�]�"��Nt-eR8�H��I};8�Mkv�eQ�����w��M�V�a�����9���n~��"�x����.���r����$����;���j_N�`�$�mZl�%r*H`��qq�#
�����[7v�X���W������{;;��Iw�����/_�hQFF��M�����R�C�E����S�~�3��D�D��#�����}������Dj�w�Q�P��c34����������F�;v��4iReeeyy������qqqt�H#��O?m��!�"t6�!?�����JUS
��4�zs�qrv��#���<���P�!�=��555�u���C�5�h�KY�=�o�n��E!�S5j4[�^�W5�b7����������M$7~he�:C���J���yn��d��������>}������t
�"t��[e�]�tM{��Ar`��B�����\��p������Z�D��,j���32d����%�"t�����m;����9$p"�]h	r:P���z��eS(z	Jb��z{�|�!��T7��9[���/;(:1u���������ju�7����O�6�e����������!���\/�V,�0���
�v����"B)���3�"����J~�`�Tq&�Oyo
����
x�VM�:���5-��q����7�?�������������k�
F����������aaa����/���?������
dg����O����m��-����2�{M�v��M�7���������M�6��3��_���K�*���������������&j��7g�_kh:l?@tbj�F��c��
�ai�k�`�&��'n�����	����Q����W�^+V�x�����l�����O��k����=���C�F���e�J�
.++sww���
�r����[MM�����K��h4'
7l���� �?��C:����������}��Y�z�O<A��7o~��7JJJBBB��Y3n���������'O>��s?����������y��]�z������N7������st������5���-���l��U.�GEEEGG����"�_Y]c���jm�+�~��������
��e\ Em���U)��v��Q�7�������O�8A�j��]���l.��?��!�~�mRR��%KhV�=�.0u��;v���}����Mx���c�����lt�/^��{7�]6�h���>��ct���������������p��a4 ,X������-l���O�!i(�gK�y4{���������.��N:����/g����7��?�-���n��������V$Q3��H�4[1�q"��q���	�Z�h�":|��'�Mk��d��}�N��#B�R4i���w�}Gs��)[���V��h���H�������a�t�lc��qqql�&+��X�V�����c���a"m��1b8���e�BC����t���C�����d�t0.�Hh�N���mR4������Z�Jw���#�n5N�^������9E�\�:}�&���N��
,D����m�n<S�Ly�����i��=?��3�	��#��+W����������k>p������;�T*��m��=99�������UtX�����f�/�D"�s&L`+�s�V������K���>�D���9w�\�-hQ*��l�S�Ed2�d;�F��������2$'�zQ�;�.�-�w�_�BvOThI�|��J��/f���<v[e������O����9C��
�
:4))���BBB|�A�6lX�=���&t8�|����H1��}��g���7n�9sfqq�]���mr�>��{���th��{��
Y���1����C�����h4����������:�O�nY�����7go�64�bO���>%r����
�hz��#���~�:5Yq��"l����'��Z�hQFF��M����G_�&$$|�������O'&&�>X�j���hll\�z5�����f>v�X����(�&WCC�l�������C�5k�����V�=��O?m�����I������nV�!�"�����Mc�N#""lR��u�N�u��J����O�'2������������5�d�q�-999..��{��������)S������z��i�-X����������	<<<��������w��7o���>��O��k�����������;w��2z����=�\�~��/�H�*
E���"�|���JG�4/mR��RP���@����~2a.���8���W`��w}������6�m���b����744�mc���L�����m?������i��'��������=~�>y:��(����;w��mw��z�Uw����X�0�}���D$��_6c�\<s�2�:uj��t����-Cv\�:q��An������}�?Z�]X���=g�k'��8e��}��}������]�RC6���\�tH���\��!����t
F�U�������;w����P|��wl�N��r%�:_S�n
����O�8y���W�6��T*i(������d��+�������F:g=:H �m�:~������I���$"��]W?:�8������f�h���y�o_C[���R�/7��YW��g<>��o�q�1��f�\D�u%J
��������!����g`"v=G`�w�v^U��BE����ll
�~�����c!���4�Ia5�m~�J�tF��N��%����e]r����uY�#�+���	���'��\��Su�~7���*���E�������RA�����#����@`�\D���F-��@YP-eg�	�b�}���b"�r�1N�{T�v��d�W�hh��`����I����
�����:�.>>�������]�n����i���2666777,,,33�����E�URK�/4�k%����{3�������{�bzz�L&+))��g����>L�K�.U(YYYiii)))����/:���3��tZ�^6�}�e��~Y������6mZ�f���W�[�����u�\.����E6��Yt4�3�W���D����"���&���@\B�������N�:s����?������1p�@Z,**
�������Bv�vJ��l-P_�rbg��c�z�M����O�7���O�8QPP�|��y������U*�D��=�N����%�Y4������Z�S�BI�����G`�������?��p����s���}���t�`����{�E�TJ��N�J�L&�I�(<<��BC�y�N]�%���:M�H�Aq�$�ba�/���O��t�O�]s�O�>555��S��S��ypppIIIhhhqqqHH�M���t�^��0��( ��N+���O���sq��)7n���r��Gy�-FFFfgg����iDD�M�]��W��	��l�I}�dq���D6��~�sq�����~~~����cbbRSS
EFF�M�]�ZG�^���������[=��!.A�v�����2�l��mE�\�s�N��������j4�
WF�l����_�z}.�����\��>�7���5�;�����tt���m>?�{����Z�w��b�Ar����%�_5d[�PH|���[v��U8����M|�wjg�r��H���B���E�JV���������jB�����C.���d[�V����(��N����xVt��C.���R���B�\6N��)�Ez=H�1o�YI��t��u�r�^it$���TE��Aw�o��o�- �!��?h(����4�d�%��h�j�$�BCi�;�K���in}�I�������~����I�r����i�V7�� QN�o��?�����`����5���9[���aC�Q8}�����r�-���hOrK5�����k��l$o
~���r�/��������_��5�av���2666777,,,33�����E�����Pu������e����H���@��\�q��_�W�.]�P(������RRR����_�?�^6��J�2vV���w�[`
J��@���\��}���#��c%''g���r�<***::�
�vyN�����jMS(����U$�}����x�����_�5�+EEE�omR���7��.����a�Q8}��O��p�-�����x��OO��� ��J��H����i}}�M�Ft4��V����[����l&����> �J�W]�������-�?�EQ*��l�S�R)��lR4
���Pl^���l.V�T��Y�^6���ED��������-����6`��`��qt\RRZ\\����"����9WY��dg�{���w���8��C�o.����hlGFFfgg����iDD�M��RV�d����z���D�G5����i�s����������T�B���a�"������@�����zE��D���p$v����"%��w��i�@;�<�{���k����s�������_��r����eSp�T�;�����f��������1�^6go�64�b/����'�z�En �TV��?�����$�"����W��������.K���a�QH6)(�s ��pp�En�_�_�o���s��~3����~8:�bg��es����>��~/���@����A.v��FM������P�%:>�o�[����
�������:���Z���r�9 P,�m��r���-������aO��(\�*>JDwn�\�����p�?c�Q��M�-�c��w����4:&�|�������,��)?���
Z�\�@5
��g�������D�����?�m�����RVu-�@\����r>y_����5����K�������������>z�hZ���������
������h�#�-=��j��Hs��m�b�8"�t�����7���3a��;vl��q��������K�*��������������m.���*������)���SG<�sq���3f�pqq�6mZBB[�����u�\.������f���E�hs��>�|��u^��}��B�n�7�������{�������h#00����&E[�Q�g��,�4�b/����n�8l���7Y���[�`���*�J"��JG����6)��d�X-6��Z{N�H
���i0���o7\u"�h��,p���x������O>�d���Q�T*��F�J�R&���hnQ���������m�=�f~���#W��V���6p���XZZJ3)--m��I�bpppIIIhhhqqqHH�M���hr�P��i/�sz=d�5���.�5k���������������tJ��6)�3��2���S�q����Ftp�����\-p�������0LRR;���OGx���111���
��.����������}A]�
cg{9�O��M��=���7u:]��\.��s�m�����H����LOvv��D�}��"�nLh#���-�l��m�T);��vF���;6�\���������I�^6��
�y?���A.��FU�s����(v�]x3:D���m�����mRSu,��S�v<;���h�@?7n{6�\������}i%�L7(���Z��Q�d#���}���x��Rp�	n~�<�K��qV�]������N�\����M%��c���B��'E�P�������T�����-�C����I��� oS�@>>I����H�����>�����\�8�\���#���<��A.Z�OG����8�
��A.�@r������^�#��]���X�XYY�������������&cG��bR9X�I�9V..]�T�Pdee������$''���S����������A8V.���l��U.�GEEEGG�!�a��0���������,,,��;�;���*�J"������[�JG���b�]�c��T*��H�J�R&�Y�nQ����h���g�=��n;�9�]��8V.���������p�������������x:�����;�;������111���
�"##����8V.����;wr��/��E��![c�{���v�s;�6A��� L��&�E�"�	r���������t:�;�V�;|w����K�������������>z�hZ|�������.���/������:����6����o�6��y�����~_��h��C��?��e�����Mrd��9&L��c������{��1bx��x�bpp0��k��N��_$�~��?��������������s�}�;��G��c��-��#me�����d��9s��3\\\�M������������#����mNUTT���<x����o�k�o�j����P��V_���j���b�qqqlc���c��a��m��k>|x���_}�U�����`��v�.�9�b��Y�f�����|��mm�m�[�����
r��j�b��r��������I����3b����~;>>������w�Y��]l�����7=z�X��
n��v�������CA.vY�_l�?N�<��'�:��l���m���[����j'�b����<��������]lpsV��]l|b�/x��\����b�������iii�#��x���x@(�t:��
����m�m����~��b����l�~_���e���&.\8k����'�_|����g?��sK�.}�����[��v�.������w�y��b����l�~_��h��/���6�v�..6I;F{���������O�����?�������C}����v�EV;i����+�=nX|��mm�m�[�����
r�
������MZ=s��>���gV;i��������
���6�6������� L��&�E�"�	r��`�\0A.� L��&�E�"�	r��`�\0A.���k��%KN�>-��'M����������^-��h�Gy����[�r����������GDD���������G��KR��������Ep�������3�<C�+V�h��q�x�����g:th��Q[�l����h4={���z����2+W��)��������{�<y���so��FIIIHH��5k�����l�����iiim\x�����
���o����,YBsN,WWW?~���`��]t�Isq����(?~���O��y�������l�� �Q\�~��5�-���G�nnn4�F���j����������ON�6��.���4G�C���������(����^�������G�l����X����Sggg�Ngq����'''�5�����U
��F�N�\G1f��o������0��9s�
�������>�,==}���3g�,..�������(-ZD�p�<�LMM�������6o�lu��'._���%##c��M���ki���w����z��9v�������(:$m>�{�\G1t�P:^|��7^x��I�&��Nrrr\\5�����O?me�}����+++�}��y��=���}��Y�vm<��Ep 2h^7�hl���c���--f�6�e��O�;w��}�N�\0A.� L��&�E�"�	r��`�\0A.� L��&�E�"�	r��`�\0A.� L��&�E���W�Z�V4tIEND�B`�
#52Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Andres Freund (#50)
Re: Hash Indexes

On 17/09/16 06:38, Andres Freund wrote:

On 2016-09-16 09:12:22 -0700, Jeff Janes wrote:

On Thu, Sep 15, 2016 at 7:23 AM, Andres Freund <andres@anarazel.de> wrote:

One earlier question about this is whether that is actually a worthwhile
goal. Are the speed and space benefits big enough in the general case?
Could those benefits not be achieved in a more maintainable manner by
adding a layer that uses a btree over hash(columns), and adds
appropriate rechecks after heap scans?

Note that I'm not saying that hash indexes are not worthwhile, I'm just
doubtful that question has been explored sufficiently.

I think that exploring it well requires good code. If the code is good,
why not commit it?

Because getting there requires a lot of effort, debugging it afterwards
would take effort, and maintaining it would also takes a fair amount?
Adding code isn't free.

I'm rather unenthused about having a hash index implementation that's
mildly better in some corner cases, but otherwise doesn't have much
benefit. That'll mean we'll have to step up our user education a lot,
and we'll have to maintain something for little benefit.
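
For concreteness, the "btree over hash(columns)" layer suggested above can be
approximated today with an expression index plus an explicit recheck, roughly
like this (a sketch only; the table t, column col, and the use of hashtext()
are illustrative assumptions, not part of any proposal):

    -- btree on the hash of the column, instead of a hash index
    CREATE INDEX t_col_hash_btree ON t ((hashtext(col)));

    -- equality lookups probe the btree on the hash value, then recheck the
    -- original column to filter out hash collisions
    SELECT *
      FROM t
     WHERE hashtext(col) = hashtext('some value')
       AND col = 'some value';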

While I see the point of what you are saying here, I recall all previous
discussions about hash indexes tended to go a bit like this:

- until WAL logging of hash indexes is written it is not worthwhile
trying to make improvements to them
- WAL logging will be a lot of work, patches 1st please

Now someone has done that work, and we seem to be objecting that because
they are not improved, the patches are (maybe) not worthwhile. I
think that is - essentially - somewhat unfair.

regards

Mark

#53Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Kirkwood (#52)
Re: Hash Indexes

On Mon, Sep 19, 2016 at 11:20 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:

On 17/09/16 06:38, Andres Freund wrote:

On 2016-09-16 09:12:22 -0700, Jeff Janes wrote:

On Thu, Sep 15, 2016 at 7:23 AM, Andres Freund <andres@anarazel.de>
wrote:

One earlier question about this is whether that is actually a worthwhile
goal. Are the speed and space benefits big enough in the general case?
Could those benefits not be achieved in a more maintainable manner by
adding a layer that uses a btree over hash(columns), and adds
appropriate rechecks after heap scans?

Note that I'm not saying that hash indexes are not worthwhile, I'm just
doubtful that question has been explored sufficiently.

I think that exploring it well requires good code. If the code is good,
why not commit it?

Because getting there requires a lot of effort, debugging it afterwards
would take effort, and maintaining it would also takes a fair amount?
Adding code isn't free.

I'm rather unenthused about having a hash index implementation that's
mildly better in some corner cases, but otherwise doesn't have much
benefit. That'll mean we'll have to step up our user education a lot,
and we'll have to maintain something for little benefit.

While I see the point of what you are saying here, I recall all previous
discussions about hash indexes tended to go a bit like this:

- until WAL logging of hash indexes is written it is not worthwhile trying
to make improvements to them
- WAL logging will be a lot of work, patches 1st please

Now someone has done that work, and we seem to be objecting that because
they are not improved, the patches are (maybe) not worthwhile.

I think saying hash indexes are not improved after the proposed set of
patches is an understatement. The read performance has improved by
more than 80% as compared to HEAD [1] (refer to the data in Mithun's mail).
Also, tests by Mithun and Jesper have indicated that in multiple
workloads, they are better than BTREE by 30~60% (in fact Jesper
mentioned that he is seeing a 40~60% benefit on a production database;
Jesper, correct me if I am wrong). I agree that when the index column is
updated they are much worse than btree as of now, but no work has been
done to improve it and I am sure that it can be improved for those cases
as well.

In general, I thought the tests done till now are sufficient to prove
the importance of this work, but if Andres and others still have doubts and
want to test some specific cases, then sure, we can do more
performance benchmarking.

Mark, thanks for supporting the case for improving Hash Indexes.

[1] - /messages/by-id/CAD__OugX0aOa7qopz3d-nbBAoVmvSmdFJOX4mv5tFRpijqH47A@mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#54AP
ap@zip.com.au
In reply to: Mark Kirkwood (#52)
Re: Hash Indexes

On Mon, Sep 19, 2016 at 05:50:13PM +1200, Mark Kirkwood wrote:

I'm rather unenthused about having a hash index implementation that's
mildly better in some corner cases, but otherwise doesn't have much
benefit. That'll mean we'll have to step up our user education a lot,
and we'll have to maintain something for little benefit.

While I see the point of what you are saying here, I recall all previous
discussions about hash indexes tended to go a bit like this:

- until WAL logging of hash indexes is written it is not worthwhile trying
to make improvements to them
- WAL logging will be a lot of work, patches 1st please

Now someone has done that work, and we seem to be objecting that because
they are not improved, the patches are (maybe) not worthwhile. I think
that is - essentially - somewhat unfair.

My understanding of hash indexes is that they'd be good for indexing
random(esque) data (such as UUIDs or, well, hashes like shaX). If so
then I've got a DB that'll be rather big that is the very embodiment
of such a use case. It indexes such data for equality comparisons
and runs on SELECT, INSERT and, eventually, DELETE.
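
A minimal sketch of that kind of schema and workload (the table, column, and
literal values here are made up purely for illustration; only equality
lookups benefit, since hash indexes support neither range nor LIKE scans):

    CREATE TABLE objects (
        digest  text,    -- e.g. a sha256 hex digest or a UUID rendered as text
        payload bytea
    );
    CREATE INDEX objects_digest_hash ON objects USING hash (digest);

    -- the access pattern described: equality probes, inserts, eventual deletes
    SELECT payload FROM objects WHERE digest = 'ab12cd34';
    INSERT INTO objects (digest, payload) VALUES ('ab12cd34', '\xdeadbeef');
    DELETE FROM objects WHERE digest = 'ab12cd34';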

Lack of WAL and that big warning in the docs is why I haven't used it.

Given the above, many lamentations from me that it won't be available
for 9.6. :( When 10.0 comes I'd probably go to the bother of re-indexing
with hash indexes.

Andrew

In reply to: Amit Kapila (#53)
Re: Hash Indexes

On Mon, Sep 19, 2016 at 12:14:26PM +0530, Amit Kapila wrote:

On Mon, Sep 19, 2016 at 11:20 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:

On 17/09/16 06:38, Andres Freund wrote:

On 2016-09-16 09:12:22 -0700, Jeff Janes wrote:

On Thu, Sep 15, 2016 at 7:23 AM, Andres Freund <andres@anarazel.de>
wrote:

One earlier question about this is whether that is actually a worthwhile
goal. Are the speed and space benefits big enough in the general case?
Could those benefits not be achieved in a more maintainable manner by
adding a layer that uses a btree over hash(columns), and adds
appropriate rechecks after heap scans?

Note that I'm not saying that hash indexes are not worthwhile, I'm just
doubtful that question has been explored sufficiently.

I think that exploring it well requires good code. If the code is good,
why not commit it?

Because getting there requires a lot of effort, debugging it afterwards
would take effort, and maintaining it would also takes a fair amount?
Adding code isn't free.

I'm rather unenthused about having a hash index implementation that's
mildly better in some corner cases, but otherwise doesn't have much
benefit. That'll mean we'll have to step up our user education a lot,
and we'll have to maintain something for little benefit.

While I see the point of what you are saying here, I recall all previous
discussions about hash indexes tended to go a bit like this:

- until WAL logging of hash indexes is written it is not worthwhile trying
to make improvements to them
- WAL logging will be a lot of work, patches 1st please

Now someone has done that work, and we seem to be objecting that because
they are not improved, the patches are (maybe) not worthwhile.

I think saying hash indexes are not improved after the proposed set of
patches is an understatement. The read performance has improved by
more than 80% as compared to HEAD [1] (refer to the data in Mithun's mail).
Also, tests by Mithun and Jesper have indicated that in multiple
workloads, they are better than BTREE by 30~60% (in fact Jesper
mentioned that he is seeing a 40~60% benefit on a production database;
Jesper, correct me if I am wrong). I agree that when the index column is
updated they are much worse than btree as of now, but no work has been
done to improve it and I am sure that it can be improved for those cases
as well.

In general, I thought the tests done till now are sufficient to prove
the importance of this work, but if Andres and others still have doubts and
want to test some specific cases, then sure, we can do more
performance benchmarking.

Mark, thanks for supporting the case for improving Hash Indexes.

[1] - /messages/by-id/CAD__OugX0aOa7qopz3d-nbBAoVmvSmdFJOX4mv5tFRpijqH47A@mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

+1

Throughout the years, I have seen benchmarks that demonstrated the
performance advantages of even the initial hash index (without WAL)
over the btree-over-hash variant. It is pretty hard to dismiss the
O(1) versus O(log(n)) difference. There are classes of problems for
which a hash index is the best solution. Lack of WAL has hamstrung
development in those areas for years.
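
To put rough numbers on that (illustrative arithmetic only, assuming 10^8
keys and a btree fanout of roughly 300 index tuples per 8KB page): a btree
descent costs about log_300(10^8), i.e. 3-4 page visits per lookup, while a
hash lookup is essentially a metapage read plus one primary bucket page
(plus any overflow pages), independent of table size.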

Regards,
Ken

#56Jeff Janes
jeff.janes@gmail.com
In reply to: Amit Kapila (#53)
Re: Hash Indexes

On Sun, Sep 18, 2016 at 11:44 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Mon, Sep 19, 2016 at 11:20 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:

On 17/09/16 06:38, Andres Freund wrote:

While I see the point of what you are saying here, I recall all previous
discussions about hash indexes tended to go a bit like this:

- until WAL logging of hash indexes is written it is not worthwhile

trying

to make improvements to them
- WAL logging will be a lot of work, patches 1st please

Now someone has done that work, and we seem to be objecting that because
they are not improved, the patches are (maybe) not worthwhile.

+1

I think saying hash indexes are not improved after the proposed set of
patches is an understatement. The read performance has improved by
more than 80% as compared to HEAD [1] (refer to the data in Mithun's mail).
Also, tests by Mithun and Jesper have indicated that in multiple
workloads, they are better than BTREE by 30~60% (in fact Jesper
mentioned that he is seeing a 40~60% benefit on a production database;
Jesper, correct me if I am wrong). I agree that when the index column is
updated they are much worse than btree as of now,

Has anyone tested that with the relcache patch applied? I would expect
that to improve things by a lot (compared to hash-HEAD, not necessarily
compared to btree-HEAD), but if I am following the emails correctly, that
has not been done.

but no work has been
done to improve it and I am sure that it can be improved for those cases
as well.

In general, I thought the tests done till now are sufficient to prove
the importance of this work, but if Andres and others still have doubts and
want to test some specific cases, then sure, we can do more
performance benchmarking.

I think that being a precursor to WAL logging is enough to justify it even if
the verified performance improvements were not impressive. But they are pretty
impressive, at least for some situations.

Cheers,

Jeff

#57Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#50)
Re: Hash Indexes

On Fri, Sep 16, 2016 at 2:38 PM, Andres Freund <andres@anarazel.de> wrote:

I think that exploring it well requires good code. If the code is good,
why not commit it?

Because getting there requires a lot of effort, debugging it afterwards
would take effort, and maintaining it would also takes a fair amount?
Adding code isn't free.

Of course not, but nobody's saying you have to be the one to put in
any of that effort. I was a bit afraid that nobody outside of
EnterpriseDB was going to take any interest in this patch, and I'm
really pretty pleased by the amount of interest that it's generated.
It's pretty clear that multiple smart people are working pretty hard
to break this, and Amit is fixing it, and at least for me that makes
me a lot less scared that the final result will be horribly broken.
It will probably have some bugs, but they probably won't be worse than
the status quo:

WARNING: hash indexes are not WAL-logged and their use is discouraged

Personally, I think it's outright embarrassing that we've had that
limitation for years; it boils down to "hey, we have this feature but
it doesn't work", which is a pretty crummy position for the world's
most advanced open-source database to take.

I'm rather unenthused about having a hash index implementation that's
mildly better in some corner cases, but otherwise doesn't have much
benefit. That'll mean we'll have to step up our user education a lot,
and we'll have to maintain something for little benefit.

If it turns out that it has little benefit, then we don't really need
to step up our user education. People can just keep using btree like
they do now and that will be fine. The only time we *really* need to
step up our user education is if it *does* have a benefit. I think
that's a real possibility, because it's pretty clear to me - based in
part on off-list conversations with Amit - that the hash index code
has gotten very little love compared to btree, and there are lots of
optimizations that have been done for btree that have not been done
for hash indexes, but which could be done. So I think there's a very
good chance that once we fix hash indexes to the point where they can
realistically be used, there will be further patches - either from
Amit or others - which improve performance even more. Even the
preliminary results are not bad, though.

Also, Oracle offers hash indexes, and SQL Server offers them for
memory-optimized tables. DB2 offers a "hash access path" which is not
described as an index but seems to work like one. MySQL, like SQL
Server, offers them only for memory-optimized tables. When all of the
other database products that we're competing against offer something,
it's not crazy to think that we should have it, too - and that it
should actually work, rather than being some kind of half-supported
wart.

By the way, I think that one thing which limits the performance
improvement we can get from hash indexes is the overall slowness of
the executor. You can't save more by speeding something up than the
percentage of time you were spending on it in the first place. IOW,
if you're spending all of your time in src/backend/executor then you
can't be spending it in src/backend/access, so making
src/backend/access faster doesn't help much. However, as the executor
gets faster, which I hope it will, the potential gains from a faster
index go up.
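
(That is just Amdahl's law: if index access is a fraction p of total query
time and it becomes s times faster, the overall speedup is
1 / ((1 - p) + p/s). For example, with p = 0.2 and s = 2 the whole query
gets only about 1.11x faster; with p = 0.5 it gets about 1.33x. These
numbers are illustrative, not measurements.)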

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#58Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#45)
1 attachment(s)
Re: Hash Indexes

On Fri, Sep 16, 2016 at 11:22 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I do want to work on it, but it is always possible that due to some
other work this might get delayed. Also, I think there is always a
chance that while doing that work, we face some problem due to which
we might not be able to use that optimization. So I will go with your
suggestion of removing hashscan.c and its usage for now and then, if
required, we will pull it back. If nobody else thinks otherwise, I
will update this in the next patch version.

In the attached patch, I have removed the support for hash scans. I
think it might improve performance by a few percent (especially for
single-row fetch transactions), since we no longer incur the registration
and destruction of hash scans.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

concurrent_hash_index_v8.patch (application/octet-stream):
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index c1122b4..d02539a 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -400,7 +400,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
 	Buffer		buf;
 	Page		page;
 
-	_hash_getlock(rel, blkno, HASH_SHARE);
 	buf = _hash_getbuf_with_strategy(rel, blkno, HASH_READ, 0, bstrategy);
 	page = BufferGetPage(buf);
 
@@ -431,7 +430,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
 	}
 
 	_hash_relbuf(rel, buf);
-	_hash_droplock(rel, blkno, HASH_SHARE);
 }
 
 /*
diff --git a/src/backend/access/hash/Makefile b/src/backend/access/hash/Makefile
index 5d3bd94..e2e7e91 100644
--- a/src/backend/access/hash/Makefile
+++ b/src/backend/access/hash/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/access/hash
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashscan.o \
-       hashsearch.o hashsort.o hashutil.o hashvalidate.o
+OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashsearch.o \
+       hashsort.o hashutil.o hashvalidate.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 0a7da89..1974fad 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -125,49 +125,47 @@ the initially created buckets.
 
 Lock Definitions
 ----------------
-
-We use both lmgr locks ("heavyweight" locks) and buffer context locks
-(LWLocks) to control access to a hash index.  lmgr locks are needed for
-long-term locking since there is a (small) risk of deadlock, which we must
-be able to detect.  Buffer context locks are used for short-term access
-control to individual pages of the index.
-
-LockPage(rel, page), where page is the page number of a hash bucket page,
-represents the right to split or compact an individual bucket.  A process
-splitting a bucket must exclusive-lock both old and new halves of the
-bucket until it is done.  A process doing VACUUM must exclusive-lock the
-bucket it is currently purging tuples from.  Processes doing scans or
-insertions must share-lock the bucket they are scanning or inserting into.
-(It is okay to allow concurrent scans and insertions.)
-
-The lmgr lock IDs corresponding to overflow pages are currently unused.
-These are available for possible future refinements.  LockPage(rel, 0)
-is also currently undefined (it was previously used to represent the right
-to modify the hash-code-to-bucket mapping, but it is no longer needed for
-that purpose).
-
-Note that these lock definitions are conceptually distinct from any sort
-of lock on the pages whose numbers they share.  A process must also obtain
-read or write buffer lock on the metapage or bucket page before accessing
-said page.
-
-Processes performing hash index scans must hold share lock on the bucket
-they are scanning throughout the scan.  This seems to be essential, since
-there is no reasonable way for a scan to cope with its bucket being split
-underneath it.  This creates a possibility of deadlock external to the
-hash index code, since a process holding one of these locks could block
-waiting for an unrelated lock held by another process.  If that process
-then does something that requires exclusive lock on the bucket, we have
-deadlock.  Therefore the bucket locks must be lmgr locks so that deadlock
-can be detected and recovered from.
-
-Processes must obtain read (share) buffer context lock on any hash index
-page while reading it, and write (exclusive) lock while modifying it.
-To prevent deadlock we enforce these coding rules: no buffer lock may be
-held long term (across index AM calls), nor may any buffer lock be held
-while waiting for an lmgr lock, nor may more than one buffer lock
-be held at a time by any one process.  (The third restriction is probably
-stronger than necessary, but it makes the proof of no deadlock obvious.)
+We use buffer content locks (LWLocks) and buffer pins to control access to
+a hash index.  We will refer to buffer content locks as locks in the
+following paragraphs.
+
+Scan will take a lock in shared mode on the primary bucket or on one of the
+overflow page.  Inserts will acquire exclusive lock on the primary bucket or
+on the overflow page in which it has to insert.  Both operations release
+the lock on the previous bucket or overflow page before moving to the next overflow
+page.  They will retain a pin on primary bucket till end of operation. Split
+operation must acquire cleanup lock on both old and new halves of the bucket
+and mark split-in-progress on both the buckets.  The cleanup lock at the start
+of split ensures that parallel insert won't get lost.  Consider a case where
+insertion has to add a tuple on some intermediate overflow page in the bucket
+chain; if we allow split when insertion is in progress, the split might not move
+this newly inserted tuple.  Like inserts and scans, it releases the lock
+on previous bucket or overflow page before moving to the next overflow page
+both for old bucket or for new bucket.  After partitioning the tuples between
+old and new buckets, it again needs to acquire exclusive lock on both old and
+new buckets to clear the split-in-progress flag.  Like inserts and scans, it
+will also retain pins on both the old and new primary buckets till end of split
+operation, although we can do without that as well.
+
+Vacuum acquires cleanup lock on bucket to remove the dead tuples and/or tuples
+that are moved due to split.  The need for cleanup lock to remove dead tuples
+is to ensure that scans return correct results.  A scan that returns multiple
+tuples from the same bucket page always restarts the scan from the previous
+offset number from which it has returned last tuple.  If we allow vacuum to
+remove the dead tuples with just an exclusive lock, it could remove the tuple
+required to resume the scan.  The need for cleanup lock to remove the tuples
+that are moved by split is to ensure that there is no pending scan that has
+started after the start of split and before the finish of split on bucket.
+If we don't do that, then vacuum can remove tuples that are required by such
+a scan.  We don't need to retain this cleanup lock during the whole vacuum
+operation on the bucket.  We release the lock as we move ahead in the bucket
+chain.  In the end, for the squeeze phase, we conditionally acquire the cleanup
+lock and if we don't get it, then we just abandon the squeeze phase.
+
+To avoid deadlocks, we must be consistent about the lock order in which we
+lock the buckets for operations that require locks on two different buckets.
+We use the rule "first lock the old bucket and then the new bucket", i.e.,
+lock the lower-numbered bucket first.
 
 
 Pseudocode Algorithms
@@ -188,63 +186,104 @@ track of available overflow pages.
 The reader algorithm is:
 
 	pin meta page and take buffer content lock in shared mode
-	loop:
-		compute bucket number for target hash key
-		release meta page buffer content lock
-		if (correct bucket page is already locked)
-			break
-		release any existing bucket page lock (if a concurrent split happened)
-		take heavyweight bucket lock
-		retake meta page buffer content lock in shared mode
+	compute bucket number for target hash key
+	read and pin the primary bucket page
+	conditionally get the buffer content lock in shared mode on primary bucket page for search
+	if we didn't get the lock (need to wait for lock)
+		release the buffer content lock on meta page
+		acquire buffer content lock on primary bucket page in shared mode
+		acquire the buffer content lock in shared mode on meta page
+		to check for possibility of split, we need to recompute the bucket and
+		verify, if it is a correct bucket; set the retry flag
+	else if we get the lock, then we can skip the retry path
+	if (retry)
+		loop:
+			compute bucket number for target hash key
+			release meta page buffer content lock
+			if (correct bucket page is already locked)
+				break
+			release any existing content lock on bucket page (if a concurrent split happened)
+			pin primary bucket page and take shared buffer content lock
+			retake meta page buffer content lock in shared mode
 -- then, per read request:
 	release pin on metapage
-	read current page of bucket and take shared buffer content lock
-		step to next page if necessary (no chaining of locks)
+	if the split is in progress for current bucket and this is a new bucket
+		release the buffer content lock on current bucket page
+		pin and acquire the buffer content lock on old bucket in shared mode
+		release the buffer content lock on old bucket, but not pin
+		retake the buffer content lock on new bucket
+		mark the scan such that it skips the tuples that are marked as moved by split
+	step to next page if necessary (no chaining of locks)
+		if the scan indicates moved by split, then move to old bucket after the scan
+		of current bucket is finished
 	get tuple
 	release buffer content lock and pin on current page
 -- at scan shutdown:
-	release bucket share-lock
-
-We can't hold the metapage lock while acquiring a lock on the target bucket,
-because that might result in an undetected deadlock (lwlocks do not participate
-in deadlock detection).  Instead, we relock the metapage after acquiring the
-bucket page lock and check whether the bucket has been split.  If not, we're
-done.  If so, we release our previously-acquired lock and repeat the process
-using the new bucket number.  Holding the bucket sharelock for
+	release any pin we hold on current buffer, old bucket buffer, new bucket buffer
+
+We don't want to hold the meta page lock if we have to wait for acquiring the
+content lock on bucket page, because that might result in poor concurrency.
+Instead, we relock the metapage after acquiring the bucket page content lock
+and check whether the bucket has been split.  If not, we're done.  If so, we
+release our previously-acquired content lock, but not pin and repeat the
+process using the new bucket number.  Holding the buffer pin on bucket page for
 the remainder of the scan prevents the reader's current-tuple pointer from
-being invalidated by splits or compactions.  Notice that the reader's lock
+being invalidated by splits or compactions.  Notice that the reader's pin
 does not prevent other buckets from being split or compacted.
 
 To keep concurrency reasonably good, we require readers to cope with
 concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the page sharelock.  Since
-deletion is not possible while a reader holds the bucket sharelock, and
-we assume that heap tuple TIDs are unique, this can be implemented by
+their current scan position after re-acquiring the buffer content lock on
+page.  Since deletion is not possible while a reader holds the pin on bucket,
+and we assume that heap tuple TIDs are unique, this can be implemented by
 searching for the same heap tuple TID previously returned.  Insertion does
 not move index entries across pages, so the previously-returned index entry
 should always be on the same page, at the same or higher offset number,
 as it was before.
 
+To allow scans during a bucket split, if at the start of the scan the bucket is
+marked as split-in-progress, it scans all the tuples in that bucket except for
+those that are marked as moved-by-split.  Once it finishes the scan of all the
+tuples in the current bucket, it scans the old bucket from which this bucket
+is formed by split.  This happens only for the new half bucket.
+
 The insertion algorithm is rather similar:
 
 	pin meta page and take buffer content lock in shared mode
-	loop:
-		compute bucket number for target hash key
-		release meta page buffer content lock
-		if (correct bucket page is already locked)
-			break
-		release any existing bucket page lock (if a concurrent split happened)
-		take heavyweight bucket lock in shared mode
-		retake meta page buffer content lock in shared mode
--- (so far same as reader)
+	compute bucket number for target hash key
+	read and pin the primary bucket page
+	conditionally get the buffer content lock in exclusive mode on primary bucket page for search
+	if we didn't get the lock (need to wait for lock)
+		release the buffer content lock on meta page
+		acquire buffer content lock on primary bucket page in exclusive mode
+		acquire the buffer content lock in shared mode on meta page
+		to check for possibility of split, we need to recompute the bucket and
+		verify, if it is a correct bucket; set the retry flag
+	else if we get the lock, then we can skip the retry path
+	if (retry)
+		loop:
+			compute bucket number for target hash key
+			release meta page buffer content lock
+			if (correct bucket page is already locked)
+				break
+			release any existing content lock on bucket page (if a concurrent split happened)
+			pin primary bucket page and take exclusive buffer content lock
+			retake meta page buffer content lock in shared mode
+-- (so far same as reader, except for acquisition of buffer content lock in
+	exclusive mode on primary bucket page)
 	release pin on metapage
-	pin current page of bucket and take exclusive buffer content lock
-	if full, release, read/exclusive-lock next page; repeat as needed
+	if the split-in-progress flag is set for bucket in old half of split
+	and pin count on it is one, then finish the split
+		we already have a buffer content lock on old bucket, conditionally get the content lock on new bucket
+		if get the lock on new bucket
+			finish the split using algorithm mentioned below for split
+			release the buffer content lock and pin on new bucket
+	if full, release lock but not pin, read/exclusive-lock next page; repeat as needed
 	>> see below if no space in any page of bucket
 	insert tuple at appropriate place in page
 	mark current page dirty and release buffer content lock and pin
-	release heavyweight share-lock
-	pin meta page and take buffer content lock in shared mode
+	if current page is not a bucket page, release the pin on bucket page
+	pin meta page and take buffer content lock in exclusive mode
 	increment tuple count, decide if split needed
 	mark meta page dirty and release buffer content lock and pin
 	done if no split needed, else enter Split algorithm below
@@ -256,11 +295,13 @@ bucket that is being actively scanned, because readers can cope with this
 as explained above.  We only need the short-term buffer locks to ensure
 that readers do not see a partially-updated page.
 
-It is clearly impossible for readers and inserters to deadlock, and in
-fact this algorithm allows them a very high degree of concurrency.
-(The exclusive metapage lock taken to update the tuple count is stronger
-than necessary, since readers do not care about the tuple count, but the
-lock is held for such a short time that this is probably not an issue.)
+To avoid deadlock between readers and inserters, whenever there is a need
+to lock multiple buckets, we always take in the order suggested in Lock
+Definitions above.  This algorithm allows them a very high degree of
+concurrency.  (The exclusive metapage lock taken to update the tuple count
+is stronger than necessary, since readers do not care about the tuple count,
+but the lock is held for such a short time that this is probably not an
+issue.)
 
 When an inserter cannot find space in any existing page of a bucket, it
 must obtain an overflow page and add that page to the bucket's chain.
@@ -271,46 +312,84 @@ index is overfull (has a higher-than-wanted ratio of tuples to buckets).
 The algorithm attempts, but does not necessarily succeed, to split one
 existing bucket in two, thereby lowering the fill ratio:
 
-	pin meta page and take buffer content lock in exclusive mode
-	check split still needed
-	if split not needed anymore, drop buffer content lock and pin and exit
-	decide which bucket to split
-	Attempt to X-lock old bucket number (definitely could fail)
-	Attempt to X-lock new bucket number (shouldn't fail, but...)
-	if above fail, drop locks and pin and exit
+	expand:
+		take buffer content lock in exclusive mode on meta page
+		check split still needed
+		if split not needed anymore, drop buffer content lock and exit
+		decide which bucket to split
+		Attempt to acquire cleanup lock on old bucket number (definitely could fail)
+		if above fail, release lock and pin and exit
+		if the split-in-progress flag is set, then finish the split
+			conditionally get the content lock on new bucket which was involved in split
+			if got the lock on new bucket
+				finish the split using algorithm mentioned below for split
+				release the buffer content lock and pin on old and new buckets
+				try to expand from start
+			else
+				release the buffer content lock and pin on old bucket and exit
+		if the garbage flag (indicates that tuples are moved by split) is set on bucket
+			release the buffer content lock on meta page
+			remove the tuples that don't belong to this bucket; see bucket cleanup below
+	Attempt to acquire cleanup lock on new bucket number (shouldn't fail, but...)
 	update meta page to reflect new number of buckets
-	mark meta page dirty and release buffer content lock and pin
+	mark meta page dirty and release buffer content lock
 	-- now, accesses to all other buckets can proceed.
 	Perform actual split of bucket, moving tuples as needed
 	>> see below about acquiring needed extra space
-	Release X-locks of old and new buckets
+
+	split guts
+	mark the old and new buckets indicating split-in-progress
+	mark the old bucket indicating has-garbage
+	copy the tuples that belong to the new bucket from the old bucket
+	during the copy, mark such tuples as moved-by-split
+	release lock but not pin for primary bucket page of old bucket,
+	read/shared-lock next page; repeat as needed
+	>> see below if no space in bucket page of new bucket
+	ensure we hold exclusive locks on both old and new buckets, in that order
+	clear the split-in-progress flag from both the buckets
+	mark buffers dirty and release the locks and pins on both old and new buckets
 
 Note the metapage lock is not held while the actual tuple rearrangement is
 performed, so accesses to other buckets can proceed in parallel; in fact,
 it's possible for multiple bucket splits to proceed in parallel.
 
-Split's attempt to X-lock the old bucket number could fail if another
-process holds S-lock on it.  We do not want to wait if that happens, first
-because we don't want to wait while holding the metapage exclusive-lock,
-and second because it could very easily result in deadlock.  (The other
-process might be out of the hash AM altogether, and could do something
-that blocks on another lock this process holds; so even if the hash
-algorithm itself is deadlock-free, a user-induced deadlock could occur.)
-So, this is a conditional LockAcquire operation, and if it fails we just
-abandon the attempt to split.  This is all right since the index is
-overfull but perfectly functional.  Every subsequent inserter will try to
-split, and eventually one will succeed.  If multiple inserters failed to
-split, the index might still be overfull, but eventually, the index will
+Split's attempt to acquire cleanup-lock on the old bucket number could fail
+if another process holds any lock or pin on it.  We do not want to wait if
+that happens, because we don't want to wait while holding the metapage
+exclusive-lock.  So, this is a conditional LWLockAcquire operation, and if
+it fails we just abandon the attempt to split.  This is all right since the
+index is overfull but perfectly functional.  Every subsequent inserter will
+try to split, and eventually one will succeed.  If multiple inserters failed
+to split, the index might still be overfull, but eventually, the index will
 not be overfull and split attempts will stop.  (We could make a successful
 splitter loop to see if the index is still overfull, but it seems better to
 distribute the split overhead across successive insertions.)
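+
+For illustration only, the conditional attempt looks roughly like the sketch
+below (a simplified form of the _hash_getbuf_with_condlock_cleanup helper;
+error handling omitted):
+
+	/* sketch: try to cleanup-lock the old bucket, else give up on the split */
+	buf = ReadBuffer(rel, start_oblkno);
+	if (!ConditionalLockBufferForCleanup(buf))
+	{
+		ReleaseBuffer(buf);
+		return;			/* index stays overfull but perfectly functional */
+	}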
 
+While copying tuples from the old bucket to the new bucket, we mark each
+copied tuple as moved-by-split so that concurrent scans can skip such tuples
+until the split operation is finished.  Once a tuple is marked as
+moved-by-split, it remains so forever, but that does no harm.  We
+intentionally do not clear the flag afterwards, as that would generate
+additional I/O for no benefit.
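+
+A minimal sketch of the scan-side check, assuming the INDEX_MOVED_BY_SPLIT_MASK
+tuple bit (illustrative only, not the exact scan code):
+
+	/* sketch: a scan of the new bucket skips carried-over tuples */
+	static bool
+	sketch_skip_tuple(IndexTuple itup, bool split_in_progress)
+	{
+		return split_in_progress &&
+			(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK) != 0;
+	}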
+
+The has-garbage flag indicates that the bucket contains tuples that were moved
+due to a split.  It is set only on the old bucket.  We need it in addition to
+the split-in-progress flag to identify the case where the split is already
+over (i.e. the split-in-progress flag has been cleared) but the moved tuples
+have not yet been removed.  It is used both by vacuum and by the re-split
+operation.  Vacuum uses it to decide whether it needs to remove the
+moved-by-split tuples from the bucket along with dead tuples.  Re-split uses
+it to ensure that it doesn't start a new split from a bucket before the tuples
+left over from the previous split have been cleared.  This helps to keep bloat
+under control and simplifies the design, since a bucket never has to contain
+dead tuples from more than one split.
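+
+For illustration, vacuum's decision can be sketched as below, mirroring the
+test made in hashbulkdelete (a sketch, not the authoritative logic):
+
+	/* sketch: remove moved-by-split tuples only once the split has finished */
+	if (H_HAS_GARBAGE(bucket_opaque) && !H_INCOMPLETE_SPLIT(bucket_opaque))
+		bucket_has_garbage = true;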
+
 A problem is that if a split fails partway through (eg due to insufficient
-disk space) the index is left corrupt.  The probability of that could be
-made quite low if we grab a free page or two before we update the meta
-page, but the only real solution is to treat a split as a WAL-loggable,
+disk space or a crash) the index is left corrupt.  The probability of that
+could be made quite low if we grab a free page or two before we update the
+meta page, but the only real solution is to treat a split as a WAL-loggable,
 must-complete action.  I'm not planning to teach hash about WAL in this
-go-round.
+go-round.  However, we do try to finish incomplete splits during insert and
+split operations.
 
 The fourth operation is garbage collection (bulk deletion):
 
@@ -319,9 +398,13 @@ The fourth operation is garbage collection (bulk deletion):
 	fetch current max bucket number
 	release meta page buffer content lock and pin
 	while next bucket <= max bucket do
-		Acquire X lock on target bucket
-		Scan and remove tuples, compact free space as needed
-		Release X lock
+		Acquire cleanup lock on target bucket
+		Scan and remove tuples
+		For each overflow page, first lock the next page, then release the
+		lock on the current bucket or overflow page
+		Ensure we hold the buffer content lock in exclusive mode on the bucket page
+		If buffer pincount is one, then compact free space as needed
+		Release lock
 		next bucket ++
 	end loop
 	pin metapage and take buffer content lock in exclusive mode
@@ -330,20 +413,23 @@ The fourth operation is garbage collection (bulk deletion):
 	else update metapage tuple count
 	mark meta page dirty and release buffer content lock and pin
 
-Note that this is designed to allow concurrent splits.  If a split occurs,
-tuples relocated into the new bucket will be visited twice by the scan,
-but that does no harm.  (We must however be careful about the statistics
-reported by the VACUUM operation.  What we can do is count the number of
-tuples scanned, and believe this in preference to the stored tuple count
-if the stored tuple count and number of buckets did *not* change at any
-time during the scan.  This provides a way of correcting the stored tuple
-count if it gets out of sync for some reason.  But if a split or insertion
-does occur concurrently, the scan count is untrustworthy; instead,
-subtract the number of tuples deleted from the stored tuple count and
-use that.)
-
-The exclusive lock request could deadlock in some strange scenarios, but
-we can just error out without any great harm being done.
+Note that this is designed to allow concurrent splits and scans.  If a
+split occurs, tuples relocated into the new bucket will be visited twice
+by the scan, but that does no harm.  Because we release the locks on the
+bucket page and overflow pages during the cleanup scan of a bucket, a
+concurrent scan can start on the bucket, but it will always remain behind the
+cleanup.  Scans must stay behind cleanup, else vacuum could remove tuples that
+are required to complete the scan, as explained in the Lock Definitions
+section above.  This holds true for backward scans as well (a backward scan
+traverses each bucket starting from the primary bucket page to the last
+overflow page in the chain).  We must be careful about the statistics reported
+by the VACUUM operation.  What we can do is count the number of tuples scanned,
+and believe this in preference to the stored tuple count if the stored tuple
+count and number of buckets did *not* change at any time during the scan.  This
+provides a way of correcting the stored tuple count if it gets out of sync for
+some reason.  But if a split or insertion does occur concurrently, the scan
+count is untrustworthy; instead, subtract the number of tuples deleted
+from the stored tuple count and use that.
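+
+A minimal sketch of that correction (the names here are illustrative, not part
+of the actual VACUUM code):
+
+	/* sketch: choose which tuple count to believe after a bulk delete */
+	static double
+	sketch_new_tuple_count(double scanned, double stored, double deleted,
+						   bool meta_changed_during_scan)
+	{
+		if (!meta_changed_during_scan)
+			return scanned;			/* full-scan count is trustworthy */
+		return stored - deleted;	/* otherwise adjust the stored count */
+	}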
 
 
 Free Space Management
@@ -417,13 +503,11 @@ free page; there can be no other process holding lock on it.
 
 Bucket splitting uses a similar algorithm if it has to extend the new
 bucket, but it need not worry about concurrent extension since it has
-exclusive lock on the new bucket.
+buffer content lock in exclusive mode on the new bucket.
 
-Freeing an overflow page is done by garbage collection and by bucket
-splitting (the old bucket may contain no-longer-needed overflow pages).
-In both cases, the process holds exclusive lock on the containing bucket,
-so need not worry about other accessors of pages in the bucket.  The
-algorithm is:
+Freeing an overflow page requires the process to hold a buffer content lock in
+exclusive mode on the containing bucket, so it need not worry about other
+accessors of pages in the bucket.  The algorithm is:
 
 	delink overflow page from bucket chain
 	(this requires read/update/write/release of fore and aft siblings)
@@ -454,14 +538,6 @@ locks.  Since they need no lmgr locks, deadlock is not possible.
 Other Notes
 -----------
 
-All the shenanigans with locking prevent a split occurring while *another*
-process is stopped in a given bucket.  They do not ensure that one of
-our *own* backend's scans is not stopped in the bucket, because lmgr
-doesn't consider a process's own locks to conflict.  So the Split
-algorithm must check for that case separately before deciding it can go
-ahead with the split.  VACUUM does not have this problem since nothing
-else can be happening within the vacuuming backend.
-
-Should we instead try to fix the state of any conflicting local scan?
-Seems mighty ugly --- got to move the held bucket S-lock as well as lots
-of other messiness.  For now, just punt and don't split.
+Cleanup locks prevent a split from occurring while *another* process is
+stopped in a given bucket.  They also ensure that one of our *own* backend's
+scans is not stopped in the bucket.
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..ba9a1c2 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -287,10 +287,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		/*
 		 * An insertion into the current index page could have happened while
 		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold share lock on
-		 * the bucket, no deletions or splits could have occurred; therefore
-		 * we can expect that the TID still exists in the current index page,
-		 * at an offset >= where we were.)
+		 * for the TID we previously returned.  (Because we hold pin on the
+		 * bucket, no deletions or splits could have occurred; therefore we
+		 * can expect that the TID still exists in the current index page, at
+		 * an offset >= where we were.)
 		 */
 		OffsetNumber maxoffnum;
 
@@ -424,17 +424,16 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_bucket_valid = false;
-	so->hashso_bucket_blkno = 0;
 	so->hashso_curbuf = InvalidBuffer;
+	so->hashso_bucket_buf = InvalidBuffer;
+	so->hashso_old_bucket_buf = InvalidBuffer;
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
 	ItemPointerSetInvalid(&(so->hashso_heappos));
 
-	scan->opaque = so;
+	so->hashso_skip_moved_tuples = false;
 
-	/* register scan in case we change pages it's using */
-	_hash_regscan(scan);
+	scan->opaque = so;
 
 	return scan;
 }
@@ -449,15 +448,7 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
-
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	_hash_dropscanbuf(rel, so);
 
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
@@ -469,8 +460,9 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 		memmove(scan->keyData,
 				scankey,
 				scan->numberOfKeys * sizeof(ScanKeyData));
-		so->hashso_bucket_valid = false;
 	}
+
+	so->hashso_skip_moved_tuples = false;
 }
 
 /*
@@ -482,18 +474,7 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/* don't need scan registered anymore */
-	_hash_dropscan(scan);
-
-	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
-
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	_hash_dropscanbuf(rel, so);
 
 	pfree(so);
 	scan->opaque = NULL;
@@ -504,6 +485,9 @@ hashendscan(IndexScanDesc scan)
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
  *
+ * This function also deletes the tuples that were moved by a split to another
+ * bucket.
+ *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 IndexBulkDeleteResult *
@@ -548,83 +532,48 @@ loop_top:
 	{
 		BlockNumber bucket_blkno;
 		BlockNumber blkno;
-		bool		bucket_dirty = false;
+		Buffer		bucket_buf;
+		Buffer		buf;
+		HashPageOpaque bucket_opaque;
+		Page		page;
+		bool		bucket_has_garbage = false;
 
 		/* Get address of bucket's start page */
 		bucket_blkno = BUCKET_TO_BLKNO(&local_metapage, cur_bucket);
 
-		/* Exclusive-lock the bucket so we can shrink it */
-		_hash_getlock(rel, bucket_blkno, HASH_EXCLUSIVE);
-
-		/* Shouldn't have any active scans locally, either */
-		if (_hash_has_active_scan(rel, cur_bucket))
-			elog(ERROR, "hash index has active scan during VACUUM");
-
-		/* Scan each page in bucket */
 		blkno = bucket_blkno;
-		while (BlockNumberIsValid(blkno))
-		{
-			Buffer		buf;
-			Page		page;
-			HashPageOpaque opaque;
-			OffsetNumber offno;
-			OffsetNumber maxoffno;
-			OffsetNumber deletable[MaxOffsetNumber];
-			int			ndeletable = 0;
-
-			vacuum_delay_point();
 
-			buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-											 info->strategy);
-			page = BufferGetPage(buf);
-			opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-			Assert(opaque->hasho_bucket == cur_bucket);
-
-			/* Scan each tuple in page */
-			maxoffno = PageGetMaxOffsetNumber(page);
-			for (offno = FirstOffsetNumber;
-				 offno <= maxoffno;
-				 offno = OffsetNumberNext(offno))
-			{
-				IndexTuple	itup;
-				ItemPointer htup;
+		/*
+		 * We need to acquire a cleanup lock on the primary bucket to wait out
+		 * concurrent scans.
+		 */
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy);
+		LockBufferForCleanup(buf);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
 
-				itup = (IndexTuple) PageGetItem(page,
-												PageGetItemId(page, offno));
-				htup = &(itup->t_tid);
-				if (callback(htup, callback_state))
-				{
-					/* mark the item for deletion */
-					deletable[ndeletable++] = offno;
-					tuples_removed += 1;
-				}
-				else
-					num_index_tuples += 1;
-			}
+		page = BufferGetPage(buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
-			/*
-			 * Apply deletions and write page if needed, advance to next page.
-			 */
-			blkno = opaque->hasho_nextblkno;
+		/*
+		 * If the bucket contains tuples that were moved by a split, then we
+		 * need to delete such tuples once the split is complete.  Before
+		 * cleaning, we need to wait out any scans that started while the
+		 * split was in progress for the bucket.
+		 */
+		if (H_HAS_GARBAGE(bucket_opaque) &&
+			!H_INCOMPLETE_SPLIT(bucket_opaque))
+			bucket_has_garbage = true;
 
-			if (ndeletable > 0)
-			{
-				PageIndexMultiDelete(page, deletable, ndeletable);
-				_hash_wrtbuf(rel, buf);
-				bucket_dirty = true;
-			}
-			else
-				_hash_relbuf(rel, buf);
-		}
+		bucket_buf = buf;
 
-		/* If we deleted anything, try to compact free space */
-		if (bucket_dirty)
-			_hash_squeezebucket(rel, cur_bucket, bucket_blkno,
-								info->strategy);
+		hashbucketcleanup(rel, bucket_buf, blkno, info->strategy,
+						  local_metapage.hashm_maxbucket,
+						  local_metapage.hashm_highmask,
+						  local_metapage.hashm_lowmask, &tuples_removed,
+						  &num_index_tuples, bucket_has_garbage, true,
+						  callback, callback_state);
 
-		/* Release bucket lock */
-		_hash_droplock(rel, bucket_blkno, HASH_EXCLUSIVE);
+		_hash_relbuf(rel, bucket_buf);
 
 		/* Advance to next bucket */
 		cur_bucket++;
@@ -705,6 +654,197 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	return stats;
 }
 
+/*
+ * Helper function to perform deletion of index entries from a bucket.
+ *
+ * This expects that the caller has acquired a cleanup lock on the target
+ * bucket (primary page of a bucket) and it is the responsibility of the
+ * caller to release that lock.
+ *
+ * During the scan of overflow pages, we first lock the next page and only
+ * then release the lock on the current page.  This ensures that any
+ * concurrent scan started after we begin cleaning the bucket will always
+ * remain behind the cleanup.  If scans were allowed to overtake the cleanup,
+ * vacuum could remove tuples that the scan still needs.
+ *
+ * We need to retain a pin on the primary bucket to ensure that no concurrent
+ * split can start.
+ */
+void
+hashbucketcleanup(Relation rel, Buffer bucket_buf,
+				  BlockNumber bucket_blkno,
+				  BufferAccessStrategy bstrategy,
+				  uint32 maxbucket,
+				  uint32 highmask, uint32 lowmask,
+				  double *tuples_removed,
+				  double *num_index_tuples,
+				  bool bucket_has_garbage,
+				  bool delay,
+				  IndexBulkDeleteCallback callback,
+				  void *callback_state)
+{
+	BlockNumber blkno;
+	Buffer		buf;
+	Bucket		cur_bucket;
+	Bucket new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
+	Page		page;
+	bool		bucket_dirty = false;
+
+	blkno = bucket_blkno;
+	buf = bucket_buf;
+	page = BufferGetPage(buf);
+	cur_bucket = ((HashPageOpaque) PageGetSpecialPointer(page))->hasho_bucket;
+
+	if (bucket_has_garbage)
+		new_bucket = _hash_get_newbucket(rel, cur_bucket,
+										 lowmask, maxbucket);
+
+	/* Scan each page in bucket */
+	for (;;)
+	{
+		HashPageOpaque opaque;
+		OffsetNumber offno;
+		OffsetNumber maxoffno;
+		Buffer		next_buf;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+		bool		retain_pin = false;
+		bool		curr_page_dirty = false;
+
+		if (delay)
+			vacuum_delay_point();
+
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		/* Scan each tuple in page */
+		maxoffno = PageGetMaxOffsetNumber(page);
+		for (offno = FirstOffsetNumber;
+			 offno <= maxoffno;
+			 offno = OffsetNumberNext(offno))
+		{
+			IndexTuple	itup;
+			ItemPointer htup;
+			Bucket		bucket;
+
+			itup = (IndexTuple) PageGetItem(page,
+											PageGetItemId(page, offno));
+			htup = &(itup->t_tid);
+			if (callback && callback(htup, callback_state))
+			{
+				/* mark the item for deletion */
+				deletable[ndeletable++] = offno;
+				if (tuples_removed)
+					*tuples_removed += 1;
+			}
+			else if (bucket_has_garbage)
+			{
+				/* delete the tuples that are moved by split. */
+				bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+											  maxbucket,
+											  highmask,
+											  lowmask);
+				/* mark the item for deletion */
+				if (bucket != cur_bucket)
+				{
+					/*
+					 * We expect tuples to belong either to the current bucket
+					 * or to new_bucket.  This is ensured because we don't allow
+					 * further splits from a bucket that contains garbage. See
+					 * comments in _hash_expandtable.
+					 */
+					Assert(bucket == new_bucket);
+					deletable[ndeletable++] = offno;
+				}
+				else if (num_index_tuples)
+					*num_index_tuples += 1;
+			}
+			else if (num_index_tuples)
+				*num_index_tuples += 1;
+		}
+
+		/* retain the pin on primary bucket till end of bucket scan */
+		if (blkno == bucket_blkno)
+			retain_pin = true;
+		else
+			retain_pin = false;
+
+		blkno = opaque->hasho_nextblkno;
+
+		/*
+		 * Apply deletions and write page if needed, advance to next page.
+		 */
+		if (ndeletable > 0)
+		{
+			PageIndexMultiDelete(page, deletable, ndeletable);
+			bucket_dirty = true;
+			curr_page_dirty = true;
+		}
+
+		/* bail out if there are no more pages to scan. */
+		if (!BlockNumberIsValid(blkno))
+			break;
+
+		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											  LH_OVERFLOW_PAGE,
+											  bstrategy);
+
+		/*
+		 * release the lock on previous page after acquiring the lock on next
+		 * page
+		 */
+		if (curr_page_dirty)
+		{
+			if (retain_pin)
+				_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+			else
+				_hash_wrtbuf(rel, buf);
+			curr_page_dirty = false;
+		}
+		else if (retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
+
+		buf = next_buf;
+	}
+
+	/*
+	 * Lock the bucket page to clear the garbage flag and squeeze the bucket.
+	 * If the current buffer is the same as the bucket buffer, then we already
+	 * have a lock on the bucket page.
+	 */
+	if (buf != bucket_buf)
+	{
+		_hash_relbuf(rel, buf);
+		_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+	}
+
+	/*
+	 * Clear the garbage flag from the bucket after deleting the tuples that
+	 * were moved by a split.  We purposely clear the flag before squeezing the
+	 * bucket, so that after a restart vacuum doesn't again try to delete the
+	 * moved-by-split tuples.
+	 */
+	if (bucket_has_garbage)
+	{
+		HashPageOpaque bucket_opaque;
+
+		page = BufferGetPage(bucket_buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+	}
+
+	/*
+	 * If we deleted anything, try to compact free space.  For squeezing the
+	 * bucket, we must have a cleanup lock, else it can impact the ordering of
+	 * tuples for a scan that has started before it.
+	 */
+	if (bucket_dirty && CheckBufferForCleanup(bucket_buf))
+		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
+							bstrategy);
+}
 
 void
 hash_redo(XLogReaderState *record)
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index acd2e64..5cfd0aa 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -28,7 +28,8 @@
 void
 _hash_doinsert(Relation rel, IndexTuple itup)
 {
-	Buffer		buf;
+	Buffer		buf = InvalidBuffer;
+	Buffer		bucket_buf;
 	Buffer		metabuf;
 	HashMetaPage metap;
 	BlockNumber blkno;
@@ -40,6 +41,9 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	bool		do_expand;
 	uint32		hashkey;
 	Bucket		bucket;
+	uint32		maxbucket;
+	uint32		highmask;
+	uint32		lowmask;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -70,51 +74,131 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			errhint("Values larger than a buffer page cannot be indexed.")));
 
 	/*
-	 * Loop until we get a lock on the correct target bucket.
+	 * Copy the bucket mapping info now; the comment in _hash_expandtable,
+	 * where we copy this information and call _hash_splitbucket, explains why
+	 * this is OK.
 	 */
-	for (;;)
-	{
-		/*
-		 * Compute the target bucket number, and convert to block number.
-		 */
-		bucket = _hash_hashkey2bucket(hashkey,
-									  metap->hashm_maxbucket,
-									  metap->hashm_highmask,
-									  metap->hashm_lowmask);
+	maxbucket = metap->hashm_maxbucket;
+	highmask = metap->hashm_highmask;
+	lowmask = metap->hashm_lowmask;
 
-		blkno = BUCKET_TO_BLKNO(metap, bucket);
+	/*
+	 * Conditionally get the lock on the primary bucket page for insertion
+	 * while holding the lock on the meta page.  If we would have to wait,
+	 * release the meta page lock and retry the hard way.
+	 */
+	bucket = _hash_hashkey2bucket(hashkey,
+								  maxbucket,
+								  highmask,
+								  lowmask);
+
+	blkno = BUCKET_TO_BLKNO(metap, bucket);
 
-		/* Release metapage lock, but keep pin. */
+	/* Fetch the primary bucket page for the bucket */
+	buf = ReadBuffer(rel, blkno);
+	if (!ConditionalLockBuffer(buf))
+	{
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(buf, HASH_WRITE);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+		oldblkno = blkno;
+		retry = true;
+	}
+	else
+	{
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
 		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+	}
 
+	if (retry)
+	{
 		/*
-		 * If the previous iteration of this loop locked what is still the
-		 * correct target bucket, we are done.  Otherwise, drop any old lock
-		 * and lock what now appears to be the correct bucket.
+		 * Loop until we get a lock on the correct target bucket.  We take the
+		 * lock on the primary bucket page and retain the pin on it for the
+		 * duration of the insert operation to prevent concurrent splits.
+		 * Retaining a pin on the primary bucket page ensures that a split
+		 * can't happen, because the split needs to acquire the cleanup lock
+		 * on the primary bucket page.  Locking the primary bucket and
+		 * rechecking that it is still the target bucket is mandatory, as
+		 * otherwise a concurrent split might cause this insertion to land in
+		 * the wrong bucket.
 		 */
-		if (retry)
+		for (;;)
 		{
+			/*
+			 * Compute the target bucket number, and convert to block number.
+			 */
+			bucket = _hash_hashkey2bucket(hashkey,
+										  metap->hashm_maxbucket,
+										  metap->hashm_highmask,
+										  metap->hashm_lowmask);
+
+			blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+			/* Release metapage lock, but keep pin. */
+			_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+			/*
+			 * If the previous iteration of this loop locked what is still the
+			 * correct target bucket, we are done.  Otherwise, drop any old
+			 * lock and lock what now appears to be the correct bucket.
+			 */
 			if (oldblkno == blkno)
 				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
-		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+			_hash_relbuf(rel, buf);
 
-		/*
-		 * Reacquire metapage lock and check that no bucket split has taken
-		 * place while we were awaiting the bucket lock.
-		 */
-		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
-		oldblkno = blkno;
-		retry = true;
+			/* Fetch the primary bucket page for the bucket */
+			buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+
+			/*
+			 * Reacquire metapage lock and check that no bucket split has
+			 * taken place while we were awaiting the bucket lock.
+			 */
+			_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+			oldblkno = blkno;
+		}
 	}
 
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+	/* remember the primary bucket buffer to release the pin on it at end. */
+	bucket_buf = buf;
+
 	page = BufferGetPage(buf);
 	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(pageopaque->hasho_bucket == bucket);
 
+	/*
+	 * If there is any pending split, try to finish it before proceeding with
+	 * the insertion.  We only do this when inserting into the old bucket, as
+	 * finishing the split lets us remove the moved tuples from the old bucket
+	 * and reuse the space.  There is no comparable benefit to finishing the
+	 * split when inserting into the new bucket.
+	 *
+	 * In future, if we want to finish splits during insertion into the new
+	 * bucket, we must ensure a locking order such that the old bucket is
+	 * locked before the new bucket.
+	 */
+	if (H_OLD_INCOMPLETE_SPLIT(pageopaque) && CheckBufferForCleanup(buf))
+	{
+		BlockNumber nblkno;
+		Buffer		nbuf;
+
+		nblkno = _hash_get_newblk(rel, pageopaque);
+
+		/* Fetch the primary bucket page for the new bucket */
+		nbuf = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+		if (nbuf)
+		{
+			_hash_finish_split(rel, metabuf, buf, nbuf, maxbucket,
+							   highmask, lowmask);
+
+			/*
+			 * release the buffer here as the insertion will happen in old
+			 * bucket.
+			 */
+			_hash_relbuf(rel, nbuf);
+		}
+	}
+
 	/* Do the insertion */
 	while (PageGetFreeSpace(page) < itemsz)
 	{
@@ -127,14 +211,23 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		{
 			/*
 			 * ovfl page exists; go get it.  if it doesn't have room, we'll
-			 * find out next pass through the loop test above.
+			 * find out next pass through the loop test above.  Retain the pin
+			 * if it is a primary bucket.
 			 */
-			_hash_relbuf(rel, buf);
+			if (pageopaque->hasho_flag & LH_BUCKET_PAGE)
+				_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+			else
+				_hash_relbuf(rel, buf);
 			buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 			page = BufferGetPage(buf);
 		}
 		else
 		{
+			bool		retain_pin = false;
+
+			/* page flags must be accessed before releasing lock on a page. */
+			retain_pin = pageopaque->hasho_flag & LH_BUCKET_PAGE;
+
 			/*
 			 * we're at the end of the bucket chain and we haven't found a
 			 * page with enough room.  allocate a new overflow page.
@@ -144,7 +237,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 
 			/* chain to a new overflow page */
-			buf = _hash_addovflpage(rel, metabuf, buf);
+			buf = _hash_addovflpage(rel, metabuf, buf, retain_pin);
 			page = BufferGetPage(buf);
 
 			/* should fit now, given test above */
@@ -158,11 +251,13 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	/* found page with enough space, so add the item here */
 	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
 
-	/* write and release the modified page */
+	/*
+	 * write and release the modified page and ensure to release the pin on
+	 * primary page.
+	 */
 	_hash_wrtbuf(rel, buf);
-
-	/* We can drop the bucket lock now */
-	_hash_droplock(rel, blkno, HASH_SHARE);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
 
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index db3e268..760563a 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -82,23 +82,20 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *
  *	On entry, the caller must hold a pin but no lock on 'buf'.  The pin is
  *	dropped before exiting (we assume the caller is not interested in 'buf'
- *	anymore).  The returned overflow page will be pinned and write-locked;
- *	it is guaranteed to be empty.
+ *	anymore) if not asked to retain.  The pin will be retained only for the
+ *	primary bucket.  The returned overflow page will be pinned and
+ *	write-locked; it is guaranteed to be empty.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
  *
- *	The caller must hold at least share lock on the bucket, to ensure that
- *	no one else tries to compact the bucket meanwhile.  This guarantees that
- *	'buf' won't stop being part of the bucket while it's unlocked.
- *
  * NB: since this could be executed concurrently by multiple processes,
  * one should not assume that the returned overflow page will be the
  * immediate successor of the originally passed 'buf'.  Additional overflow
  * pages might have been added to the bucket chain in between.
  */
 Buffer
-_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
+_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 {
 	Buffer		ovflbuf;
 	Page		page;
@@ -131,7 +128,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 			break;
 
 		/* we assume we do not need to write the unmodified page */
-		_hash_relbuf(rel, buf);
+		if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
 
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
@@ -149,7 +149,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 
 	/* logically chain overflow page to previous page */
 	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	_hash_wrtbuf(rel, buf);
+	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+		_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_wrtbuf(rel, buf);
 
 	return ovflbuf;
 }
@@ -370,11 +373,11 @@ _hash_firstfreebit(uint32 map)
  *	in the bucket, or InvalidBlockNumber if no following page.
  *
  *	NB: caller must not hold lock on metapage, nor on either page that's
- *	adjacent in the bucket chain.  The caller had better hold exclusive lock
- *	on the bucket, too.
+ *	adjacent in the bucket chain except from primary bucket.  The caller had
+ *	better hold cleanup lock on the primary bucket.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf,
+_hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
@@ -413,22 +416,41 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf,
 	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
-	 * deleted.  No concurrency issues since we hold exclusive lock on the
-	 * entire bucket.
+	 * deleted.  No concurrency issues since we hold the cleanup lock on
+	 * primary bucket.  We don't need to acquire the buffer lock to fix the
+	 * primary bucket, as we already have that lock.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
-														 prevblkno,
-														 HASH_WRITE,
+		if (prevblkno == bucket_blkno)
+		{
+			Buffer		prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
+													 prevblkno,
+													 RBM_NORMAL,
+													 bstrategy);
+
+			Page		prevpage = BufferGetPage(prevbuf);
+			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+			Assert(prevopaque->hasho_bucket == bucket);
+			prevopaque->hasho_nextblkno = nextblkno;
+			MarkBufferDirty(prevbuf);
+			ReleaseBuffer(prevbuf);
+		}
+		else
+		{
+			Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
+															 prevblkno,
+															 HASH_WRITE,
 										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		prevpage = BufferGetPage(prevbuf);
-		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+															 bstrategy);
+			Page		prevpage = BufferGetPage(prevbuf);
+			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
 
-		Assert(prevopaque->hasho_bucket == bucket);
-		prevopaque->hasho_nextblkno = nextblkno;
-		_hash_wrtbuf(rel, prevbuf);
+			Assert(prevopaque->hasho_bucket == bucket);
+			prevopaque->hasho_nextblkno = nextblkno;
+			_hash_wrtbuf(rel, prevbuf);
+		}
 	}
 	if (BlockNumberIsValid(nextblkno))
 	{
@@ -570,7 +592,7 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
  *	required that to be true on entry as well, but it's a lot easier for
  *	callers to leave empty overflow pages and let this guy clean it up.
  *
- *	Caller must hold exclusive lock on the target bucket.  This allows
+ *	Caller must hold cleanup lock on the target bucket.  This allows
  *	us to safely lock multiple pages in the bucket.
  *
  *	Since this function is invoked in VACUUM, we provide an access strategy
@@ -580,6 +602,7 @@ void
 _hash_squeezebucket(Relation rel,
 					Bucket bucket,
 					BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy)
 {
 	BlockNumber wblkno;
@@ -591,27 +614,22 @@ _hash_squeezebucket(Relation rel,
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
 	bool		wbuf_dirty;
+	bool		release_buf = false;
 
 	/*
 	 * start squeezing into the base bucket page.
 	 */
 	wblkno = bucket_blkno;
-	wbuf = _hash_getbuf_with_strategy(rel,
-									  wblkno,
-									  HASH_WRITE,
-									  LH_BUCKET_PAGE,
-									  bstrategy);
+	wbuf = bucket_buf;
 	wpage = BufferGetPage(wbuf);
 	wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 
 	/*
-	 * if there aren't any overflow pages, there's nothing to squeeze.
+	 * if there aren't any overflow pages, there's nothing to squeeze.  The
+	 * caller is responsible for releasing the lock on the primary bucket.
 	 */
 	if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
-	{
-		_hash_relbuf(rel, wbuf);
 		return;
-	}
 
 	/*
 	 * Find the last page in the bucket chain by starting at the base bucket
@@ -669,12 +687,17 @@ _hash_squeezebucket(Relation rel,
 			{
 				Assert(!PageIsEmpty(wpage));
 
+				if (wblkno != bucket_blkno)
+					release_buf = true;
+
 				wblkno = wopaque->hasho_nextblkno;
 				Assert(BlockNumberIsValid(wblkno));
 
-				if (wbuf_dirty)
+				if (wbuf_dirty && release_buf)
 					_hash_wrtbuf(rel, wbuf);
-				else
+				else if (wbuf_dirty)
+					MarkBufferDirty(wbuf);
+				else if (release_buf)
 					_hash_relbuf(rel, wbuf);
 
 				/* nothing more to do if we reached the read page */
@@ -700,6 +723,7 @@ _hash_squeezebucket(Relation rel,
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
 				wbuf_dirty = false;
+				release_buf = false;
 			}
 
 			/*
@@ -733,19 +757,25 @@ _hash_squeezebucket(Relation rel,
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
 		{
-			/* yes, so release wbuf lock first */
-			if (wbuf_dirty)
+			if (wblkno != bucket_blkno)
+				release_buf = true;
+
+			/* yes, so release wbuf lock first if needed */
+			if (wbuf_dirty && release_buf)
 				_hash_wrtbuf(rel, wbuf);
-			else
+			else if (wbuf_dirty)
+				MarkBufferDirty(wbuf);
+			else if (release_buf)
 				_hash_relbuf(rel, wbuf);
+
 			/* free this overflow page (releases rbuf) */
-			_hash_freeovflpage(rel, rbuf, bstrategy);
+			_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
 			/* done */
 			return;
 		}
 
 		/* free this overflow page, then get the previous one */
-		_hash_freeovflpage(rel, rbuf, bstrategy);
+		_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
 
 		rbuf = _hash_getbuf_with_strategy(rel,
 										  rblkno,
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 178463f..2c8e4b5 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -38,10 +38,14 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
 					uint32 nblocks);
 static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
+static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
+					   Bucket obucket, Bucket nbucket, Buffer obuf,
+					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
+					   uint32 highmask, uint32 lowmask);
 
 
 /*
@@ -55,46 +59,6 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 
 
 /*
- * _hash_getlock() -- Acquire an lmgr lock.
- *
- * 'whichlock' should the block number of a bucket's primary bucket page to
- * acquire the per-bucket lock.  (See README for details of the use of these
- * locks.)
- *
- * 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
- */
-void
-_hash_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		LockPage(rel, whichlock, access);
-}
-
-/*
- * _hash_try_getlock() -- Acquire an lmgr lock, but only if it's free.
- *
- * Same as above except we return FALSE without blocking if lock isn't free.
- */
-bool
-_hash_try_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		return ConditionalLockPage(rel, whichlock, access);
-	else
-		return true;
-}
-
-/*
- * _hash_droplock() -- Release an lmgr lock.
- */
-void
-_hash_droplock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		UnlockPage(rel, whichlock, access);
-}
-
-/*
  *	_hash_getbuf() -- Get a buffer by block number for read or write.
  *
  *		'access' must be HASH_READ, HASH_WRITE, or HASH_NOLOCK.
@@ -132,6 +96,35 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access, int flags)
 }
 
 /*
+ * _hash_getbuf_with_condlock_cleanup() -- as above, but get the buffer for write.
+ *
+ *		We try to take the conditional cleanup lock and if we get it then
+ *		return the buffer, else return InvalidBuffer.
+ */
+Buffer
+_hash_getbuf_with_condlock_cleanup(Relation rel, BlockNumber blkno, int flags)
+{
+	Buffer		buf;
+
+	if (blkno == P_NEW)
+		elog(ERROR, "hash AM does not use P_NEW");
+
+	buf = ReadBuffer(rel, blkno);
+
+	if (!ConditionalLockBufferForCleanup(buf))
+	{
+		ReleaseBuffer(buf);
+		return InvalidBuffer;
+	}
+
+	/* ref count and lock type are correct */
+
+	_hash_checkpage(rel, buf, flags);
+
+	return buf;
+}
+
+/*
  *	_hash_getinitbuf() -- Get and initialize a buffer by block number.
  *
  *		This must be used only to fetch pages that are known to be before
@@ -266,6 +259,33 @@ _hash_dropbuf(Relation rel, Buffer buf)
 }
 
 /*
+ *	_hash_dropscanbuf() -- release buffers used in scan.
+ *
+ * This routine unpins the buffers used during scan on which we
+ * hold no lock.
+ */
+void
+_hash_dropscanbuf(Relation rel, HashScanOpaque so)
+{
+	/* release pin we hold on primary bucket */
+	if (BufferIsValid(so->hashso_bucket_buf) &&
+		so->hashso_bucket_buf != so->hashso_curbuf)
+		_hash_dropbuf(rel, so->hashso_bucket_buf);
+	so->hashso_bucket_buf = InvalidBuffer;
+
+	/* release pin we hold on old primary bucket */
+	if (BufferIsValid(so->hashso_old_bucket_buf) &&
+		so->hashso_old_bucket_buf != so->hashso_curbuf)
+		_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+	so->hashso_old_bucket_buf = InvalidBuffer;
+
+	/* release any pin we still hold */
+	if (BufferIsValid(so->hashso_curbuf))
+		_hash_dropbuf(rel, so->hashso_curbuf);
+	so->hashso_curbuf = InvalidBuffer;
+}
+
+/*
  *	_hash_wrtbuf() -- write a hash page to disk.
  *
  *		This routine releases the lock held on the buffer and our refcount
@@ -489,9 +509,11 @@ _hash_pageinit(Page page, Size size)
 /*
  * Attempt to expand the hash table by creating one new bucket.
  *
- * This will silently do nothing if it cannot get the needed locks.
+ * This will silently do nothing if we don't get cleanup lock on old or
+ * new bucket.
  *
- * The caller should hold no locks on the hash index.
+ * Complete the pending splits and remove the tuples from old bucket,
+ * if there are any left over from previous split.
  *
  * The caller must hold a pin, but no lock, on the metapage buffer.
  * The buffer is returned in the same state.
@@ -506,10 +528,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	BlockNumber start_oblkno;
 	BlockNumber start_nblkno;
 	Buffer		buf_nblkno;
+	Buffer		buf_oblkno;
+	Page		opage;
+	HashPageOpaque oopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
 
+restart_expand:
+
 	/*
 	 * Write-lock the meta page.  It used to be necessary to acquire a
 	 * heavyweight lock to begin a split, but that is no longer required.
@@ -548,11 +575,16 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		goto fail;
 
 	/*
-	 * Determine which bucket is to be split, and attempt to lock the old
-	 * bucket.  If we can't get the lock, give up.
+	 * Determine which bucket is to be split, and attempt to take cleanup lock
+	 * on the old bucket.  If we can't get the lock, give up.
 	 *
-	 * The lock protects us against other backends, but not against our own
-	 * backend.  Must check for active scans separately.
+	 * The cleanup lock protects us not only against other backends, but
+	 * against our own backend as well.
+	 *
+	 * The cleanup lock is mainly to protect the split from concurrent
+	 * inserts. See src/backend/access/hash/README, Lock Definitions for
+	 * further details.  Due to this locking restriction, if there is any
+	 * pending scan, the split will give up, which is not ideal but harmless.
 	 */
 	new_bucket = metap->hashm_maxbucket + 1;
 
@@ -560,14 +592,90 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 
 	start_oblkno = BUCKET_TO_BLKNO(metap, old_bucket);
 
-	if (_hash_has_active_scan(rel, old_bucket))
+	buf_oblkno = _hash_getbuf_with_condlock_cleanup(rel, start_oblkno, LH_BUCKET_PAGE);
+	if (!buf_oblkno)
 		goto fail;
 
-	if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
-		goto fail;
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 
 	/*
-	 * Likewise lock the new bucket (should never fail).
+	 * We want to finish any pending split from this bucket before starting a
+	 * new one: there is no apparent benefit in postponing it, and letting
+	 * splits involving multiple buckets pile up would complicate the code,
+	 * particularly if the new split were to fail as well.  We don't need to
+	 * consider the new bucket for completing a split here, because a re-split
+	 * of the new bucket cannot start while a split from the old bucket is
+	 * still pending.
+	 */
+	if (H_OLD_INCOMPLETE_SPLIT(oopaque))
+	{
+		BlockNumber nblkno;
+		Buffer		buf_nblkno;
+
+		/*
+		 * Copy bucket mapping info now;  The comment in code below where we
+		 * copy this information and calls _hash_splitbucket explains why this
+		 * is OK.
+		 */
+		maxbucket = metap->hashm_maxbucket;
+		highmask = metap->hashm_highmask;
+		lowmask = metap->hashm_lowmask;
+
+		/* Release the metapage lock, before completing the split. */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+		nblkno = _hash_get_newblk(rel, oopaque);
+
+		/* Fetch the primary bucket page for the new bucket */
+		buf_nblkno = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+		if (!buf_nblkno)
+		{
+			_hash_relbuf(rel, buf_oblkno);
+			return;
+		}
+
+		_hash_finish_split(rel, metabuf, buf_oblkno, buf_nblkno, maxbucket,
+						   highmask, lowmask);
+
+		/*
+		 * release the buffers and retry for expand.
+		 */
+		_hash_relbuf(rel, buf_oblkno);
+		_hash_relbuf(rel, buf_nblkno);
+
+		goto restart_expand;
+	}
+
+	/*
+	 * Clean up the tuples left over from the previous split.  This operation
+	 * requires a cleanup lock, and we already have one on the old bucket, so
+	 * do it now.  We also don't want to allow further splits from the bucket
+	 * until the garbage from the previous split has been cleaned.  This has
+	 * two advantages: first, it helps to avoid bloat due to garbage; second,
+	 * during cleanup of a bucket we can always be sure that the garbage
+	 * tuples belong to the most recently split bucket.  On the contrary, if
+	 * we allowed cleanup of a bucket after the meta page had been updated to
+	 * indicate a new split but before the actual split, the cleanup operation
+	 * would not be able to decide whether a tuple had been moved to the newly
+	 * created bucket and could end up deleting such tuples.
+	 */
+	if (H_HAS_GARBAGE(oopaque))
+	{
+		/* Release the metapage lock. */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+		hashbucketcleanup(rel, buf_oblkno, start_oblkno, NULL,
+						  metap->hashm_maxbucket, metap->hashm_highmask,
+						  metap->hashm_lowmask, NULL,
+						  NULL, true, false, NULL, NULL);
+
+		_hash_relbuf(rel, buf_oblkno);
+
+		goto restart_expand;
+	}
+
+	/*
+	 * There shouldn't be any active scan on new bucket.
 	 *
 	 * Note: it is safe to compute the new bucket's blkno here, even though we
 	 * may still need to update the BUCKET_TO_BLKNO mapping.  This is because
@@ -576,12 +684,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	 */
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
-	if (_hash_has_active_scan(rel, new_bucket))
-		elog(ERROR, "scan in progress on supposedly new bucket");
-
-	if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
-		elog(ERROR, "could not get lock on supposedly new bucket");
-
 	/*
 	 * If the split point is increasing (hashm_maxbucket's log base 2
 	 * increases), we need to allocate a new batch of bucket pages.
@@ -600,8 +702,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
 			/* can't split due to BlockNumber overflow */
-			_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-			_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
+			_hash_relbuf(rel, buf_oblkno);
 			goto fail;
 		}
 	}
@@ -609,9 +710,18 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/*
 	 * Physically allocate the new bucket's primary page.  We want to do this
 	 * before changing the metapage's mapping info, in case we can't get the
-	 * disk space.
+	 * disk space.  Ideally, we don't need to check for a cleanup lock on the
+	 * new bucket, as no other backend can find this bucket until the meta
+	 * page is updated.  However, it is good to be consistent with the old
+	 * bucket's locking.
 	 */
 	buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM);
+	if (!CheckBufferForCleanup(buf_nblkno))
+	{
+		_hash_relbuf(rel, buf_oblkno);
+		_hash_relbuf(rel, buf_nblkno);
+		goto fail;
+	}
+
 
 	/*
 	 * Okay to proceed with split.  Update the metapage bucket mapping info.
@@ -665,13 +775,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  start_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno,
 					  maxbucket, highmask, lowmask);
 
-	/* Release bucket locks, allowing others to access them */
-	_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-	_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
-
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -738,13 +844,17 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
  * belong in the new bucket, and compress out any free space in the old
  * bucket.
  *
- * The caller must hold exclusive locks on both buckets to ensure that
+ * The caller must hold cleanup locks on both buckets to ensure that
  * no one else is trying to access them (see README).
  *
  * The caller must hold a pin, but no lock, on the metapage buffer.
  * The buffer is returned in the same state.  (The metapage is only
  * touched if it becomes necessary to add or remove overflow pages.)
  *
+ * Split needs to retain pins on the primary bucket pages of both the old and
+ * new buckets until the end of the operation.  This is to prevent vacuum from
+ * starting while the split is in progress.
+ *
  * In addition, the caller must have created the new bucket's base page,
  * which is passed in buffer nbuf, pinned and write-locked.  That lock and
  * pin are released here.  (The API is set up this way because we must do
@@ -756,37 +866,87 @@ _hash_splitbucket(Relation rel,
 				  Buffer metabuf,
 				  Bucket obucket,
 				  Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Buffer		obuf;
 	Page		opage;
 	Page		npage;
 	HashPageOpaque oopaque;
 	HashPageOpaque nopaque;
 
-	/*
-	 * It should be okay to simultaneously write-lock pages from each bucket,
-	 * since no one else can be trying to acquire buffer lock on pages of
-	 * either bucket.
-	 */
-	obuf = _hash_getbuf(rel, start_oblkno, HASH_WRITE, LH_BUCKET_PAGE);
 	opage = BufferGetPage(obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 
+	/*
+	 * Mark the old bucket to indicate that a split is in progress and that it
+	 * has deletable tuples.  At the end of the operation we clear the
+	 * split-in-progress flag, and vacuum will clear the page_has_garbage flag
+	 * after deleting such tuples.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
+
 	npage = BufferGetPage(nbuf);
 
-	/* initialize the new bucket's primary page */
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 	nopaque->hasho_prevblkno = InvalidBlockNumber;
 	nopaque->hasho_nextblkno = InvalidBlockNumber;
 	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
 	nopaque->hasho_page_id = HASHO_PAGE_ID;
 
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, nbuf, NULL,
+						   maxbucket, highmask, lowmask);
+
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, obuf);
+	_hash_relbuf(rel, nbuf);
+}
+
+/*
+ * _hash_splitbucket_guts -- Helper function to perform the split operation
+ *
+ * This routine partitions the tuples between the old and new buckets and is
+ * also used to finish incomplete split operations.  To finish a previously
+ * interrupted split, the caller must fill htab with the TIDs of tuples already
+ * present in the new bucket; tuples found in htab are skipped.  A NULL htab
+ * means all tuples that belong to the new bucket are moved.
+ *
+ * Caller needs to lock and unlock the old and new primary buckets.
+ */
+static void
+_hash_splitbucket_guts(Relation rel,
+					   Buffer metabuf,
+					   Bucket obucket,
+					   Bucket nbucket,
+					   Buffer obuf,
+					   Buffer nbuf,
+					   HTAB *htab,
+					   uint32 maxbucket,
+					   uint32 highmask,
+					   uint32 lowmask)
+{
+	Buffer		bucket_obuf;
+	Buffer		bucket_nbuf;
+	Page		opage;
+	Page		npage;
+	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
+
+	bucket_obuf = obuf;
+	opage = BufferGetPage(obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	bucket_nbuf = nbuf;
+	npage = BufferGetPage(nbuf);
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
 	/*
 	 * Partition the tuples in the old bucket between the old bucket and the
 	 * new bucket, advancing along the old bucket's overflow bucket chain and
@@ -798,8 +958,6 @@ _hash_splitbucket(Relation rel,
 		BlockNumber oblkno;
 		OffsetNumber ooffnum;
 		OffsetNumber omaxoffnum;
-		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
 
 		/* Scan each tuple in old page */
 		omaxoffnum = PageGetMaxOffsetNumber(opage);
@@ -810,39 +968,69 @@ _hash_splitbucket(Relation rel,
 			IndexTuple	itup;
 			Size		itemsz;
 			Bucket		bucket;
+			bool		found = false;
 
 			/*
-			 * Fetch the item's hash key (conveniently stored in the item) and
-			 * determine which bucket it now belongs in.
+			 * Before inserting the tuple, probe the hash table containing TIDs
+			 * of tuples already present in the new bucket; if we find a match,
+			 * skip that tuple.  Otherwise, fetch the item's hash key
+			 * (conveniently stored in the item) and determine which bucket it
+			 * now belongs in.
 			 */
 			itup = (IndexTuple) PageGetItem(opage,
 											PageGetItemId(opage, ooffnum));
+
+			if (htab)
+				(void) hash_search(htab, &itup->t_tid, HASH_FIND, &found);
+
+			if (found)
+				continue;
+
 			bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
 										  maxbucket, highmask, lowmask);
 
 			if (bucket == nbucket)
 			{
+				Size		itupsize = 0;
+				IndexTuple	new_itup;
+
+				/*
+				 * make a copy of index tuple as we have to scribble on it.
+				 */
+				new_itup = CopyIndexTuple(itup);
+
+				/*
+				 * mark the index tuple as moved-by-split; such tuples are
+				 * skipped by scans while a split is in progress for the bucket.
+				 */
+				itupsize = new_itup->t_info & INDEX_SIZE_MASK;
+				new_itup->t_info &= ~INDEX_SIZE_MASK;
+				new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
+				new_itup->t_info |= itupsize;
+
 				/*
 				 * insert the tuple into the new bucket.  if it doesn't fit on
 				 * the current page in the new bucket, we must allocate a new
 				 * overflow page and place the tuple on that page instead.
-				 *
-				 * XXX we have a problem here if we fail to get space for a
-				 * new overflow page: we'll error out leaving the bucket split
-				 * only partially complete, meaning the index is corrupt,
-				 * since searches may fail to find entries they should find.
 				 */
-				itemsz = IndexTupleDSize(*itup);
+				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
 				if (PageGetFreeSpace(npage) < itemsz)
 				{
+					bool		retain_pin = false;
+
+					/*
+					 * page flags must be accessed before releasing lock on a
+					 * page.
+					 */
+					retain_pin = nopaque->hasho_flag & LH_BUCKET_PAGE;
+
 					/* write out nbuf and drop lock, but keep pin */
 					_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
 					/* chain to a new overflow page */
-					nbuf = _hash_addovflpage(rel, metabuf, nbuf);
+					nbuf = _hash_addovflpage(rel, metabuf, nbuf, retain_pin);
 					npage = BufferGetPage(nbuf);
-					/* we don't need nopaque within the loop */
+					nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 				}
 
 				/*
@@ -852,12 +1040,10 @@ _hash_splitbucket(Relation rel,
 				 * Possible future improvement: accumulate all the items for
 				 * the new page and qsort them before insertion.
 				 */
-				(void) _hash_pgaddtup(rel, nbuf, itemsz, itup);
+				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
-				/*
-				 * Mark tuple for deletion from old page.
-				 */
-				deletable[ndeletable++] = ooffnum;
+				/* be tidy */
+				pfree(new_itup);
 			}
 			else
 			{
@@ -870,15 +1056,9 @@ _hash_splitbucket(Relation rel,
 
 		oblkno = oopaque->hasho_nextblkno;
 
-		/*
-		 * Done scanning this old page.  If we moved any tuples, delete them
-		 * from the old page.
-		 */
-		if (ndeletable > 0)
-		{
-			PageIndexMultiDelete(opage, deletable, ndeletable);
-			_hash_wrtbuf(rel, obuf);
-		}
+		/* retain the pin on the old primary bucket */
+		if (obuf == bucket_obuf)
+			_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
 		else
 			_hash_relbuf(rel, obuf);
 
@@ -887,18 +1067,153 @@ _hash_splitbucket(Relation rel,
 			break;
 
 		/* Else, advance to next old page */
-		obuf = _hash_getbuf(rel, oblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
+		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
 		opage = BufferGetPage(obuf);
 		oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 	}
 
 	/*
 	 * We're at the end of the old bucket chain, so we're done partitioning
-	 * the tuples.  Before quitting, call _hash_squeezebucket to ensure the
-	 * tuples remaining in the old bucket (including the overflow pages) are
-	 * packed as tightly as possible.  The new bucket is already tight.
+	 * the tuples.  Mark the old and new buckets to indicate split is
+	 * finished.
+	 *
+	 * To avoid deadlocks due to locking order of buckets, first lock the old
+	 * bucket and then the new bucket.
 	 */
-	_hash_wrtbuf(rel, nbuf);
+	if (nopaque->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_wrtbuf(rel, nbuf);
+
+	/*
+	 * Acquiring a cleanup lock to clear the split-in-progress flag ensures
+	 * that no scan that has seen the flag is still pending once it is cleared.
+	 */
+	_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+	opage = BufferGetPage(bucket_obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	_hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);
+	npage = BufferGetPage(bucket_nbuf);
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
+	/* indicate that split is finished */
+	oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
+	nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;
+
+	/*
+	 * now mark the buffers dirty; we don't release the locks here, as the
+	 * caller is responsible for releasing them.
+	 */
+	MarkBufferDirty(bucket_obuf);
+	MarkBufferDirty(bucket_nbuf);
+}
+
+/*
+ *	_hash_finish_split() -- Finish the previously interrupted split operation
+ *
+ * To complete the split operation, we build a hash table of the TIDs already
+ * present in the new bucket; the split operation then uses it to skip tuples
+ * that were moved before the split was previously interrupted.
+ *
+ * The caller must hold a pin, but no lock, on the metapage buffer.
+ * The buffer is returned in the same state.  (The metapage is only
+ * touched if it becomes necessary to add or remove overflow pages.)
+ *
+ * 'obuf' and 'nbuf' must be locked by the caller which is also responsible
+ * for unlocking them.
+ */
+void
+_hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Buffer nbuf,
+				   uint32 maxbucket, uint32 highmask, uint32 lowmask)
+{
+	HASHCTL		hash_ctl;
+	HTAB	   *tidhtab;
+	Buffer		bucket_nbuf;
+	Page		opage;
+	Page		npage;
+	HashPageOpaque opageopaque;
+	HashPageOpaque npageopaque;
+	Bucket		obucket;
+	Bucket		nbucket;
+	bool		found;
+
+	/* Initialize the hash table used to track TIDs */
+	memset(&hash_ctl, 0, sizeof(hash_ctl));
+	hash_ctl.keysize = sizeof(ItemPointerData);
+	hash_ctl.entrysize = sizeof(ItemPointerData);
+	hash_ctl.hcxt = CurrentMemoryContext;
+
+	tidhtab =
+		hash_create("bucket ctids",
+					256,		/* arbitrary initial size */
+					&hash_ctl,
+					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+	/*
+	 * Scan the new bucket and build hash table of TIDs
+	 */
+	bucket_nbuf = nbuf;
+	npage = BufferGetPage(nbuf);
+	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	for (;;)
+	{
+		BlockNumber nblkno;
+		OffsetNumber noffnum;
+		OffsetNumber nmaxoffnum;
+
+		/* Scan each tuple in new page */
+		nmaxoffnum = PageGetMaxOffsetNumber(npage);
+		for (noffnum = FirstOffsetNumber;
+			 noffnum <= nmaxoffnum;
+			 noffnum = OffsetNumberNext(noffnum))
+		{
+			IndexTuple	itup;
+
+			/* Fetch the item's TID and insert it in hash table. */
+			itup = (IndexTuple) PageGetItem(npage,
+											PageGetItemId(npage, noffnum));
+
+			(void) hash_search(tidhtab, &itup->t_tid, HASH_ENTER, &found);
+
+			Assert(!found);
+		}
+
+		nblkno = npageopaque->hasho_nextblkno;
+
+		/*
+		 * release our lock without modifying the buffer, making sure to
+		 * retain the pin on the primary bucket.
+		 */
+		if (nbuf == bucket_nbuf)
+			_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, nbuf);
+
+		/* Exit loop if no more overflow pages in new bucket */
+		if (!BlockNumberIsValid(nblkno))
+			break;
+
+		/* Else, advance to next page */
+		nbuf = _hash_getbuf(rel, nblkno, HASH_READ, LH_OVERFLOW_PAGE);
+		npage = BufferGetPage(nbuf);
+		npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	}
+
+	/* Need a cleanup lock to perform split operation. */
+	LockBufferForCleanup(bucket_nbuf);
+
+	npage = BufferGetPage(bucket_nbuf);
+	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nbucket = npageopaque->hasho_bucket;
+
+	opage = BufferGetPage(obuf);
+	opageopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+	obucket = opageopaque->hasho_bucket;
+
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, bucket_nbuf, tidhtab,
+						   maxbucket, highmask, lowmask);
 
-	_hash_squeezebucket(rel, obucket, start_oblkno, NULL);
+	hash_destroy(tidhtab);
 }
diff --git a/src/backend/access/hash/hashscan.c b/src/backend/access/hash/hashscan.c
deleted file mode 100644
index fe97ef2..0000000
--- a/src/backend/access/hash/hashscan.c
+++ /dev/null
@@ -1,153 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * hashscan.c
- *	  manage scans on hash tables
- *
- * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
- * Portions Copyright (c) 1994, Regents of the University of California
- *
- *
- * IDENTIFICATION
- *	  src/backend/access/hash/hashscan.c
- *
- *-------------------------------------------------------------------------
- */
-
-#include "postgres.h"
-
-#include "access/hash.h"
-#include "access/relscan.h"
-#include "utils/memutils.h"
-#include "utils/rel.h"
-#include "utils/resowner.h"
-
-
-/*
- * We track all of a backend's active scans on hash indexes using a list
- * of HashScanListData structs, which are allocated in TopMemoryContext.
- * It's okay to use a long-lived context because we rely on the ResourceOwner
- * mechanism to clean up unused entries after transaction or subtransaction
- * abort.  We can't safely keep the entries in the executor's per-query
- * context, because that might be already freed before we get a chance to
- * clean up the list.  (XXX seems like there should be a better way to
- * manage this...)
- */
-typedef struct HashScanListData
-{
-	IndexScanDesc hashsl_scan;
-	ResourceOwner hashsl_owner;
-	struct HashScanListData *hashsl_next;
-} HashScanListData;
-
-typedef HashScanListData *HashScanList;
-
-static HashScanList HashScans = NULL;
-
-
-/*
- * ReleaseResources_hash() --- clean up hash subsystem resources.
- *
- * This is here because it needs to touch this module's static var HashScans.
- */
-void
-ReleaseResources_hash(void)
-{
-	HashScanList l;
-	HashScanList prev;
-	HashScanList next;
-
-	/*
-	 * Release all HashScanList items belonging to the current ResourceOwner.
-	 * Note that we do not release the underlying IndexScanDesc; that's in
-	 * executor memory and will go away on its own (in fact quite possibly has
-	 * gone away already, so we mustn't try to touch it here).
-	 *
-	 * Note: this should be a no-op during normal query shutdown. However, in
-	 * an abort situation ExecutorEnd is not called and so there may be open
-	 * index scans to clean up.
-	 */
-	prev = NULL;
-
-	for (l = HashScans; l != NULL; l = next)
-	{
-		next = l->hashsl_next;
-		if (l->hashsl_owner == CurrentResourceOwner)
-		{
-			if (prev == NULL)
-				HashScans = next;
-			else
-				prev->hashsl_next = next;
-
-			pfree(l);
-			/* prev does not change */
-		}
-		else
-			prev = l;
-	}
-}
-
-/*
- *	_hash_regscan() -- register a new scan.
- */
-void
-_hash_regscan(IndexScanDesc scan)
-{
-	HashScanList new_el;
-
-	new_el = (HashScanList) MemoryContextAlloc(TopMemoryContext,
-											   sizeof(HashScanListData));
-	new_el->hashsl_scan = scan;
-	new_el->hashsl_owner = CurrentResourceOwner;
-	new_el->hashsl_next = HashScans;
-	HashScans = new_el;
-}
-
-/*
- *	_hash_dropscan() -- drop a scan from the scan list
- */
-void
-_hash_dropscan(IndexScanDesc scan)
-{
-	HashScanList chk,
-				last;
-
-	last = NULL;
-	for (chk = HashScans;
-		 chk != NULL && chk->hashsl_scan != scan;
-		 chk = chk->hashsl_next)
-		last = chk;
-
-	if (chk == NULL)
-		elog(ERROR, "hash scan list trashed; cannot find 0x%p", (void *) scan);
-
-	if (last == NULL)
-		HashScans = chk->hashsl_next;
-	else
-		last->hashsl_next = chk->hashsl_next;
-
-	pfree(chk);
-}
-
-/*
- * Is there an active scan in this bucket?
- */
-bool
-_hash_has_active_scan(Relation rel, Bucket bucket)
-{
-	Oid			relid = RelationGetRelid(rel);
-	HashScanList l;
-
-	for (l = HashScans; l != NULL; l = l->hashsl_next)
-	{
-		if (relid == l->hashsl_scan->indexRelation->rd_id)
-		{
-			HashScanOpaque so = (HashScanOpaque) l->hashsl_scan->opaque;
-
-			if (so->hashso_bucket_valid &&
-				so->hashso_bucket == bucket)
-				return true;
-		}
-	}
-
-	return false;
-}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..8723e13 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -72,7 +72,19 @@ _hash_readnext(Relation rel,
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_nextblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on the primary bucket page till the end of the scan to
+	 * ensure that vacuum can't delete tuples that were moved by a split to the
+	 * new bucket.  Such tuples are required by scans started on split buckets
+	 * before the new bucket's split-in-progress flag
+	 * (LH_BUCKET_NEW_PAGE_SPLIT) is cleared.
+	 */
+	if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -94,7 +106,16 @@ _hash_readprev(Relation rel,
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_prevblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on the primary bucket page till the end of the scan.  See
+	 * the comments in _hash_readnext for why we retain the pin.
+	 */
+	if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -104,6 +125,13 @@ _hash_readprev(Relation rel,
 							 LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 		*pagep = BufferGetPage(*bufp);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
+
+		/*
+		 * We always maintain the pin on the bucket page for the whole scan
+		 * operation, so release the additional pin we have acquired here.
+		 */
+		if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+			_hash_dropbuf(rel, *bufp);
 	}
 }
 
@@ -192,59 +220,138 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	metap = HashPageGetMeta(page);
 
 	/*
-	 * Loop until we get a lock on the correct target bucket.
+	 * Conditionally get the lock on the primary bucket page for the search
+	 * while holding the lock on the meta page.  If we have to wait, release
+	 * the meta page lock and retry the hard way.
 	 */
-	for (;;)
-	{
-		/*
-		 * Compute the target bucket number, and convert to block number.
-		 */
-		bucket = _hash_hashkey2bucket(hashkey,
-									  metap->hashm_maxbucket,
-									  metap->hashm_highmask,
-									  metap->hashm_lowmask);
+	bucket = _hash_hashkey2bucket(hashkey,
+								  metap->hashm_maxbucket,
+								  metap->hashm_highmask,
+								  metap->hashm_lowmask);
 
-		blkno = BUCKET_TO_BLKNO(metap, bucket);
+	blkno = BUCKET_TO_BLKNO(metap, bucket);
 
-		/* Release metapage lock, but keep pin. */
+	/* Fetch the primary bucket page for the bucket */
+	buf = ReadBuffer(rel, blkno);
+	if (!ConditionalLockBufferShared(buf))
+	{
 		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(buf, HASH_READ);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+		oldblkno = blkno;
+		retry = true;
+	}
+	else
+	{
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+	}
 
+	if (retry)
+	{
 		/*
-		 * If the previous iteration of this loop locked what is still the
-		 * correct target bucket, we are done.  Otherwise, drop any old lock
-		 * and lock what now appears to be the correct bucket.
+		 * Loop until we get a lock on the correct target bucket.  We take the
+		 * lock on the primary bucket page and retain its pin for the duration
+		 * of the read operation to prevent concurrent splits.  Retaining the
+		 * pin on the primary bucket page ensures that a split can't happen,
+		 * since a split needs to acquire a cleanup lock on that page.
+		 * Locking the primary bucket and rechecking that it is still the
+		 * target bucket is mandatory; otherwise a concurrent split followed
+		 * by a vacuum could remove tuples from the selected bucket that would
+		 * otherwise have been visible.
 		 */
-		if (retry)
+		for (;;)
 		{
+			/*
+			 * Compute the target bucket number, and convert to block number.
+			 */
+			bucket = _hash_hashkey2bucket(hashkey,
+										  metap->hashm_maxbucket,
+										  metap->hashm_highmask,
+										  metap->hashm_lowmask);
+
+			blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+			/* Release metapage lock, but keep pin. */
+			_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+			/*
+			 * If the previous iteration of this loop locked what is still the
+			 * correct target bucket, we are done.  Otherwise, drop any old
+			 * lock and lock what now appears to be the correct bucket.
+			 */
 			if (oldblkno == blkno)
 				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
-		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+			_hash_relbuf(rel, buf);
 
-		/*
-		 * Reacquire metapage lock and check that no bucket split has taken
-		 * place while we were awaiting the bucket lock.
-		 */
-		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
-		oldblkno = blkno;
-		retry = true;
+			/* Fetch the primary bucket page for the bucket */
+			buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
+			/*
+			 * Reacquire metapage lock and check that no bucket split has
+			 * taken place while we were awaiting the bucket lock.
+			 */
+			_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+			oldblkno = blkno;
+		}
 	}
 
 	/* done with the metapage */
 	_hash_dropbuf(rel, metabuf);
 
-	/* Update scan opaque state to show we have lock on the bucket */
-	so->hashso_bucket = bucket;
-	so->hashso_bucket_valid = true;
-	so->hashso_bucket_blkno = blkno;
-
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
 	page = BufferGetPage(buf);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(opaque->hasho_bucket == bucket);
 
+	so->hashso_bucket_buf = buf;
+
+	/*
+	 * If a bucket split is in progress, then we need to skip tuples that were
+	 * moved from the old bucket.  To ensure that vacuum doesn't clean any
+	 * tuples from the old or new bucket while this scan is in progress,
+	 * maintain a pin on both buckets.  Here we have to be careful about lock
+	 * ordering: first acquire the lock on the old bucket, release that lock
+	 * (but not the pin), then re-acquire the lock on the new bucket and
+	 * re-verify whether the bucket split is still in progress.  Acquiring the
+	 * lock on the old bucket first ensures that vacuum waits for this scan to
+	 * finish.
+	 */
+	if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+	{
+		BlockNumber old_blkno;
+		Buffer		old_buf;
+
+		old_blkno = _hash_get_oldblk(rel, opaque);
+
+		/*
+		 * release the lock on new bucket and re-acquire it after acquiring
+		 * the lock on old bucket.
+		 */
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+
+		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+
+		/*
+		 * remember the old bucket buffer so as to use it later for scanning.
+		 */
+		so->hashso_old_bucket_buf = old_buf;
+		_hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK);
+
+		_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		Assert(opaque->hasho_bucket == bucket);
+
+		if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+			so->hashso_skip_moved_tuples = true;
+		else
+		{
+			_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+			so->hashso_old_bucket_buf = InvalidBuffer;
+		}
+	}
+
 	/* If a backwards scan is requested, move to the end of the chain */
 	if (ScanDirectionIsBackward(dir))
 	{
@@ -273,6 +380,13 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
  *		false.  Else, return true and set the hashso_curpos for the
  *		scan to the right thing.
  *
+ *		Here we also scan the old bucket if a split for the current bucket
+ *		was in progress at the start of the scan.  The basic idea is to
+ *		skip the tuples moved by the split while scanning the current
+ *		bucket, and then scan the old bucket to cover all such tuples.
+ *		This ensures that we don't miss any tuples in scans that started
+ *		during the split.
+ *
  *		'bufP' points to the current buffer, which is pinned and read-locked.
  *		On success exit, we have pin and read-lock on whichever page
  *		contains the right item; on failure, we have released all buffers.
@@ -338,6 +452,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum >= FirstOffsetNumber);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip tuples that were moved by a split operation
+						 * if this scan started while the split was in
+						 * progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberNext(offnum);	/* move forward */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -353,9 +480,42 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							buf = so->hashso_old_bucket_buf;
+
+							/*
+							 * old bucket buffer must be valid as we acquire
+							 * the pin on it before the start of the scan and
+							 * retain it till the end of the scan.
+							 */
+							Assert(BufferIsValid(buf));
+
+							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+							page = BufferGetPage(buf);
+							opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for moved-by-split
+							 * tuples in the old bucket, and also that we
+							 * won't try to scan the old bucket again once
+							 * its scan is finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
@@ -379,6 +539,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum <= maxoff);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip tuples that were moved by a split operation
+						 * if this scan started while the split was in
+						 * progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberPrev(offnum);	/* move back */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -394,9 +567,42 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							buf = so->hashso_old_bucket_buf;
+
+							/*
+							 * old bucket buffer must be valid as we acquire
+							 * the pin on it before the start of the scan and
+							 * retain it till the end of the scan.
+							 */
+							Assert(BufferIsValid(buf));
+
+							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+							page = BufferGetPage(buf);
+							opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for moved-by-split
+							 * tuples in the old bucket, and also that we
+							 * won't try to scan the old bucket again once
+							 * its scan is finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
@@ -410,9 +616,16 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 
 		if (itup == NULL)
 		{
-			/* we ran off the end of the bucket without finding a match */
+			/*
+			 * We ran off the end of the bucket without finding a match.
+			 * Release the pins on the bucket buffers.  Normally such pins are
+			 * released at the end of the scan; however, scrolling cursors can
+			 * reacquire the bucket lock and pin multiple times within the
+			 * same scan.
+			 */
 			*bufP = so->hashso_curbuf = InvalidBuffer;
 			ItemPointerSetInvalid(current);
+			_hash_dropscanbuf(rel, so);
 			return false;
 		}
 
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 822862d..b5164d7 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -147,6 +147,23 @@ _hash_log2(uint32 num)
 }
 
 /*
+ * _hash_msb-- returns most significant bit position.
+ */
+static uint32
+_hash_msb(uint32 num)
+{
+	uint32		i = 0;
+
+	while (num)
+	{
+		num = num >> 1;
+		++i;
+	}
+
+	return i - 1;
+}
+
+/*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
  * If flags is not zero, it is a bitwise OR of the acceptable values of
@@ -352,3 +369,123 @@ _hash_binsearch_last(Page page, uint32 hash_value)
 
 	return lower;
 }
+
+/*
+ *	_hash_get_oldblk() -- get the block number of the bucket from which the
+ *			current bucket is being split.
+ */
+BlockNumber
+_hash_get_oldblk(Relation rel, HashPageOpaque opaque)
+{
+	Bucket		curr_bucket;
+	Bucket		old_bucket;
+	uint32		mask;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	/*
+	 * To get the old bucket from the current bucket, we need a mask to modulo
+	 * into the lower half of the table.  This mask is stored in the meta page
+	 * as hashm_lowmask, but here we can't rely on it, because we need the
+	 * value of lowmask that was in effect at the time the bucket split was
+	 * started.  Masking off the most significant bit of the new bucket gives
+	 * us the old bucket.
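+	 * For example, if the new bucket is 5 (binary 101), its most significant
+	 * bit is at position 2, so the mask is (1 << 2) - 1 = 3 and the old
+	 * bucket is 5 & 3 = 1.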
+	 */
+	curr_bucket = opaque->hasho_bucket;
+	mask = (((uint32) 1) << _hash_msb(curr_bucket)) - 1;
+	old_bucket = curr_bucket & mask;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	blkno = BUCKET_TO_BLKNO(metap, old_bucket);
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_newblk() -- get the block number of the new bucket that will be
+ *			generated by splitting the current bucket.
+ *
+ * This is used to find the new bucket from the old bucket based on the
+ * current table half.  It is mainly required to finish incomplete splits,
+ * where we are sure that no more than one split from the old bucket can be
+ * in progress.
+ */
+BlockNumber
+_hash_get_newblk(Relation rel, HashPageOpaque opaque)
+{
+	Bucket		curr_bucket;
+	Bucket		new_bucket;
+	uint32		lowmask;
+	uint32		mask;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	curr_bucket = opaque->hasho_bucket;
+
+	/*
+	 * The new bucket can be obtained by OR'ing the old bucket with the most
+	 * significant bit of the current table half.  There could be multiple
+	 * buckets that have split from the current bucket.  We need the first
+	 * such bucket that exists based on the current table half.
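+	 * For example, with old bucket 1 and lowmask 7, the candidate new bucket
+	 * is 1 | 8 = 9; if 9 exceeds hashm_maxbucket, we halve the mask and try
+	 * 1 | 4 = 5 instead.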
+	 */
+	lowmask = metap->hashm_lowmask;
+
+	for (;;)
+	{
+		mask = lowmask + 1;
+		new_bucket = curr_bucket | mask;
+		if (new_bucket > metap->hashm_maxbucket)
+		{
+			lowmask = lowmask >> 1;
+			continue;
+		}
+		blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+		break;
+	}
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_newbucket() -- get the new bucket that will be generated after
+ *			split from current bucket.
+ *
+ * This is used to find the new bucket from the old bucket.  The new bucket can
+ * be obtained by OR'ing the old bucket with the most significant bit of the
+ * table half for the lowmask passed to this function.  There could be multiple
+ * buckets that have split from the current bucket; we need the first such
+ * bucket that exists.  The caller must ensure that no more than one split has
+ * happened from the old bucket.
+ */
+Bucket
+_hash_get_newbucket(Relation rel, Bucket curr_bucket,
+					uint32 lowmask, uint32 maxbucket)
+{
+	Bucket		new_bucket;
+	uint32		mask;
+
+	for (;;)
+	{
+		mask = lowmask + 1;
+		new_bucket = curr_bucket | mask;
+		if (new_bucket > maxbucket)
+		{
+			lowmask = lowmask >> 1;
+			continue;
+		}
+		break;
+	}
+
+	return new_bucket;
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 90804a3..3e5b1d2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3567,6 +3567,26 @@ ConditionalLockBuffer(Buffer buffer)
 }
 
 /*
+ * Acquire the content_lock for the buffer, but only if we don't have to wait.
+ *
+ * This assumes the caller wants BUFFER_LOCK_SHARED mode.
+ */
+bool
+ConditionalLockBufferShared(Buffer buffer)
+{
+	BufferDesc *buf;
+
+	Assert(BufferIsValid(buffer));
+	if (BufferIsLocal(buffer))
+		return true;			/* act as though we got it */
+
+	buf = GetBufferDescriptor(buffer - 1);
+
+	return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
+									LW_SHARED);
+}
+
+/*
  * LockBufferForCleanup - lock a buffer in preparation for deleting items
  *
  * Items may be deleted from a disk page only when the caller (a) holds an
@@ -3750,6 +3770,49 @@ ConditionalLockBufferForCleanup(Buffer buffer)
 	return false;
 }
 
+/*
+ * CheckBufferForCleanup - as above, but don't attempt to take lock
+ *
+ * We won't loop, but just check once to see if the pin count is OK.  If
+ * not, return FALSE.
+ */
+bool
+CheckBufferForCleanup(Buffer buffer)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(BufferIsValid(buffer));
+
+	if (BufferIsLocal(buffer))
+	{
+		/* There should be exactly one pin */
+		if (LocalRefCount[-buffer - 1] != 1)
+			return false;
+		/* Nobody else to wait for */
+		return true;
+	}
+
+	/* There should be exactly one local pin */
+	if (GetPrivateRefCount(buffer) != 1)
+		return false;
+
+	bufHdr = GetBufferDescriptor(buffer - 1);
+
+	buf_state = LockBufHdr(bufHdr);
+
+	Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+	if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+	{
+		/* pincount is OK. */
+		UnlockBufHdr(bufHdr, buf_state);
+		return true;
+	}
+
+	UnlockBufHdr(bufHdr, buf_state);
+	return false;
+}
+
 
 /*
  *	Functions for buffer I/O handling
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 07075ce..cdc460b 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -668,9 +668,6 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintFileLeakWarning(res);
 			FileClose(res);
 		}
-
-		/* Clean up index scans too */
-		ReleaseResources_hash();
 	}
 
 	/* Let add-on modules get a chance too */
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index d9df904..2967ba7 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -24,6 +24,7 @@
 #include "lib/stringinfo.h"
 #include "storage/bufmgr.h"
 #include "storage/lockdefs.h"
+#include "utils/hsearch.h"
 #include "utils/relcache.h"
 
 /*
@@ -32,6 +33,8 @@
  */
 typedef uint32 Bucket;
 
+#define InvalidBucket	((Bucket) 0xFFFFFFFF)
+
 #define BUCKET_TO_BLKNO(metap,B) \
 		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
 
@@ -51,6 +54,9 @@ typedef uint32 Bucket;
 #define LH_BUCKET_PAGE			(1 << 1)
 #define LH_BITMAP_PAGE			(1 << 2)
 #define LH_META_PAGE			(1 << 3)
+#define LH_BUCKET_NEW_PAGE_SPLIT	(1 << 4)
+#define LH_BUCKET_OLD_PAGE_SPLIT	(1 << 5)
+#define LH_BUCKET_PAGE_HAS_GARBAGE	(1 << 6)
 
 typedef struct HashPageOpaqueData
 {
@@ -63,6 +69,12 @@ typedef struct HashPageOpaqueData
 
 typedef HashPageOpaqueData *HashPageOpaque;
 
+#define H_HAS_GARBAGE(opaque)			((opaque)->hasho_flag & LH_BUCKET_PAGE_HAS_GARBAGE)
+#define H_OLD_INCOMPLETE_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
+#define H_NEW_INCOMPLETE_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+#define H_INCOMPLETE_SPLIT(opaque)		(((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT) || \
+										 ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT))
+
 /*
  * The page ID is for the convenience of pg_filedump and similar utilities,
  * which otherwise would have a hard time telling pages of different index
@@ -80,19 +92,6 @@ typedef struct HashScanOpaqueData
 	uint32		hashso_sk_hash;
 
 	/*
-	 * By definition, a hash scan should be examining only one bucket. We
-	 * record the bucket number here as soon as it is known.
-	 */
-	Bucket		hashso_bucket;
-	bool		hashso_bucket_valid;
-
-	/*
-	 * If we have a share lock on the bucket, we record it here.  When
-	 * hashso_bucket_blkno is zero, we have no such lock.
-	 */
-	BlockNumber hashso_bucket_blkno;
-
-	/*
 	 * We also want to remember which buffer we're currently examining in the
 	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
 	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
@@ -100,11 +99,23 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_curbuf;
 
+	/* remember the buffer associated with primary bucket */
+	Buffer		hashso_bucket_buf;
+
+	/*
+	 * remember the buffer associated with the old primary bucket, which is
+	 * required while scanning a bucket for which a split is in progress.
+	 */
+	Buffer		hashso_old_bucket_buf;
+
 	/* Current position of the scan, as an index TID */
 	ItemPointerData hashso_curpos;
 
 	/* Current position of the scan, as a heap TID */
 	ItemPointerData hashso_heappos;
+
+	/* Whether scan needs to skip tuples that are moved by split */
+	bool		hashso_skip_moved_tuples;
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -175,6 +186,8 @@ typedef HashMetaPageData *HashMetaPage;
 				  sizeof(ItemIdData) - \
 				  MAXALIGN(sizeof(HashPageOpaqueData)))
 
+#define INDEX_MOVED_BY_SPLIT_MASK	0x2000
+
 #define HASH_MIN_FILLFACTOR			10
 #define HASH_DEFAULT_FILLFACTOR		75
 
@@ -223,9 +236,6 @@ typedef HashMetaPageData *HashMetaPage;
 #define HASH_WRITE		BUFFER_LOCK_EXCLUSIVE
 #define HASH_NOLOCK		(-1)
 
-#define HASH_SHARE		ShareLock
-#define HASH_EXCLUSIVE	ExclusiveLock
-
 /*
  *	Strategy number. There's only one valid strategy for hashing: equality.
  */
@@ -298,21 +308,21 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
 
 /* hashovfl.c */
-extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf);
+extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
 extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
-				   BufferAccessStrategy bstrategy);
+				   BlockNumber bucket_blkno, BufferAccessStrategy bstrategy);
 extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
 				 BlockNumber blkno, ForkNumber forkNum);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy);
 
 /* hashpage.c */
-extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
-extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
-extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
 extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
 			 int access, int flags);
+extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
+								   BlockNumber blkno, int flags);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
@@ -321,6 +331,7 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 						   BufferAccessStrategy bstrategy);
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
+extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
 extern void _hash_wrtbuf(Relation rel, Buffer buf);
 extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
 				   int to_access);
@@ -328,12 +339,9 @@ extern uint32 _hash_metapinit(Relation rel, double num_tuples,
 				ForkNumber forkNum);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
-
-/* hashscan.c */
-extern void _hash_regscan(IndexScanDesc scan);
-extern void _hash_dropscan(IndexScanDesc scan);
-extern bool _hash_has_active_scan(Relation rel, Bucket bucket);
-extern void ReleaseResources_hash(void);
+extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
+				   Buffer nbuf, uint32 maxbucket, uint32 highmask,
+				   uint32 lowmask);
 
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
@@ -363,5 +371,17 @@ extern bool _hash_convert_tuple(Relation index,
 					Datum *index_values, bool *index_isnull);
 extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
 extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
+extern BlockNumber _hash_get_oldblk(Relation rel, HashPageOpaque opaque);
+extern BlockNumber _hash_get_newblk(Relation rel, HashPageOpaque opaque);
+extern Bucket _hash_get_newbucket(Relation rel, Bucket curr_bucket,
+					uint32 lowmask, uint32 maxbucket);
+
+/* hash.c */
+extern void hashbucketcleanup(Relation rel, Buffer bucket_buf,
+				  BlockNumber bucket_blkno, BufferAccessStrategy bstrategy,
+				  uint32 maxbucket, uint32 highmask, uint32 lowmask,
+				  double *tuples_removed, double *num_index_tuples,
+				  bool bucket_has_garbage, bool delay,
+				  IndexBulkDeleteCallback callback, void *callback_state);
 
 #endif   /* HASH_H */
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 8350fa0..788ba9f 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -63,7 +63,7 @@ typedef IndexAttributeBitMapData *IndexAttributeBitMap;
  * t_info manipulation macros
  */
 #define INDEX_SIZE_MASK 0x1FFF
-/* bit 0x2000 is not used at present */
+/* bit 0x2000 is reserved for index-AM specific usage */
 #define INDEX_VAR_MASK	0x4000
 #define INDEX_NULL_MASK 0x8000
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7b6ba96..accbb88 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -225,8 +225,10 @@ extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
 extern void UnlockBuffers(void);
 extern void LockBuffer(Buffer buffer, int mode);
 extern bool ConditionalLockBuffer(Buffer buffer);
+extern bool ConditionalLockBufferShared(Buffer buffer);
 extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
+extern bool CheckBufferForCleanup(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern void AbortBufferIO(void);
#59Bruce Momjian
bruce@momjian.us
In reply to: Amit Kapila (#37)
Re: Hash Indexes

On Thu, Sep 15, 2016 at 11:11:41AM +0530, Amit Kapila wrote:

I think it is possible without breaking pg_upgrade, if we match all
items of a page at once (and save them as local copy), rather than
matching item-by-item as we do now. We are already doing similar for
btree, refer explanation of BTScanPosItem and BTScanPosData in
nbtree.h.

FYI, pg_upgrade has code to easily mark indexes as invalid and create a
script the user can run to recreate the indexes as valid. I have
received no complaints when this was used.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#60Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#57)
Re: Hash Indexes

On Mon, Sep 19, 2016 at 03:50:38PM -0400, Robert Haas wrote:

It will probably have some bugs, but they probably won't be worse than
the status quo:

WARNING: hash indexes are not WAL-logged and their use is discouraged

Personally, I think it's outright embarrassing that we've had that
limitation for years; it boils down to "hey, we have this feature but
it doesn't work", which is a pretty crummy position for the world's
most advanced open-source database to take.

No question. We inherited the technical debt of hash indexes 20 years
ago and haven't really solved it yet. We keep making incremental
improvements, which keeps it from being removed, but hash is still far
behind other index types.

I'm rather unenthused about having a hash index implementation that's
mildly better in some corner cases, but otherwise doesn't have much
benefit. That'll mean we'll have to step up our user education a lot,
and we'll have to maintain something for little benefit.

If it turns out that it has little benefit, then we don't really need
to step up our user education. People can just keep using btree like

The big problem is people coming from other databases and assuming our
hash indexes have the same benefits over btree that exist in some other
database software. The 9.5 warning at least helps with that.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#61Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#60)
Re: Hash Indexes

On Tue, Sep 20, 2016 at 7:55 PM, Bruce Momjian <bruce@momjian.us> wrote:

If it turns out that it has little benefit, then we don't really need
to step up our user education. People can just keep using btree like

The big problem is people coming from other databases and assuming our
hash indexes have the same benefits over btree that exist in some other
database software. The 9.5 warning at least helps with that.

I'd be curious what benefits people expect to get. For example, I
searched for "Oracle hash indexes" using Google and found this page:

http://logicalread.solarwinds.com/oracle-11g-hash-indexes-mc02/

It implies that their hash indexes are actually clustered indexes;
that is, the table data is physically organized into contiguous chunks
by hash bucket. Also, they can't split buckets on the fly. I think
the DB2 implementation is similar. So our hash indexes, even once we
add write-ahead logging and better concurrency, will be somewhat
different from those products. However, I'm not actually sure how
widely-used those index types are. I wonder if people who use hash
indexes in PostgreSQL are even likely to be familiar with those
technologies, and what expectations they might have.

For PostgreSQL, I expect the benefits of improving hash indexes to be
(1) slightly better raw performance for equality comparisons and (2)
better concurrency. The details aren't very clear at this stage. We
know that write performance is bad right now, even with Amit's
patches, but that's without the kill_prior_tuple optimization which is
probably extremely important but which has never been implemented for
hash indexes. Read performance is good, but there are still further
optimizations that haven't been done there, too, so it may be even
better by the time Amit gets done working in this area.

Of course, if we want to implement clustered indexes, that's going to
require significant changes to the heap format ... or the ability to
support multiple heap storage formats. I'm not opposed to that, but I
think it makes sense to fix the existing implementation first.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#62Oskari Saarenmaa
In reply to: Robert Haas (#61)
Re: Hash Indexes

On 21.09.2016 15:29, Robert Haas wrote:

For PostgreSQL, I expect the benefits of improving hash indexes to be
(1) slightly better raw performance for equality comparisons and (2)
better concurrency.

There's a third benefit: with large columns a hash index is a lot
smaller on disk than a btree index. This is the biggest reason I've
seen people want to use hash indexes instead of btrees. hashtext()
btrees are a workaround, but they require all queries to be adjusted
which is a pain.
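
For illustration, a minimal sketch of the hashtext() workaround described
above (table t and column v are hypothetical names):

    CREATE INDEX t_v_hashtext_idx ON t (hashtext(v));

    -- each query then has to be adjusted to include the hashed qual
    SELECT * FROM t
    WHERE hashtext(v) = hashtext('some very long value')
      AND v = 'some very long value';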

/ Oskari

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#63Jeff Janes
jeff.janes@gmail.com
In reply to: Robert Haas (#42)
Re: Hash Indexes

On Thu, Sep 15, 2016 at 7:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 15, 2016 at 1:41 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

I think it is possible without breaking pg_upgrade, if we match all
items of a page at once (and save them as local copy), rather than
matching item-by-item as we do now. We are already doing similar for
btree, refer explanation of BTScanPosItem and BTScanPosData in
nbtree.h.

If ever we want to sort hash buckets by TID, it would be best to do
that in v10 since we're presumably going to be recommending a REINDEX
anyway.

We are? I thought we were trying to preserve on-disk compatibility so that
we didn't have to rebuild the indexes.

Is the concern that lack of WAL logging has generated some subtle
unrecognized on disk corruption?

If I were using hash indexes on a production system and I experienced a
crash, I would surely reindex immediately after the crash, not wait until
the next pg_upgrade.

But is that a good thing to do? That's a little harder to
say.

How could we go about deciding that? Do you think anything short of coding
it up and seeing how it works would suffice? I agree that if we want to do
it, v10 is the time. But we have about 6 months yet on that.

Cheers,

Jeff

#64Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#61)
Re: Hash Indexes

On Wed, Sep 21, 2016 at 08:29:59AM -0400, Robert Haas wrote:

Of course, if we want to implement clustered indexes, that's going to
require significant changes to the heap format ... or the ability to
support multiple heap storage formats. I'm not opposed to that, but I
think it makes sense to fix the existing implementation first.

For me, there are several measurements for indexes:

Build time
INSERT / UPDATE overhead
Storage size
Access speed

I am guessing people make conclusions based on their Computer Science
education.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#65Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Janes (#63)
Re: Hash Indexes

On Wed, Sep 21, 2016 at 2:11 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

We are? I thought we were trying to preserve on-disk compatibility so that
we didn't have to rebuild the indexes.

Well, that was my initial idea, but ...

Is the concern that lack of WAL logging has generated some subtle
unrecognized on disk corruption?

...this is a consideration in the other direction.

If I were using hash indexes on a production system and I experienced a
crash, I would surely reindex immediately after the crash, not wait until
the next pg_upgrade.

You might be more responsible, and more knowledgeable, than our typical user.

But is that a good thing to do? That's a little harder to
say.

How could we go about deciding that? Do you think anything short of coding
it up and seeing how it works would suffice? I agree that if we want to do
it, v10 is the time. But we have about 6 months yet on that.

Yes, I think some experimentation will be needed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#66Geoff Winkless
pgsqladmin@geoff.dj
In reply to: Robert Haas (#61)
Re: Hash Indexes

On 21 September 2016 at 13:29, Robert Haas <robertmhaas@gmail.com> wrote:

I'd be curious what benefits people expect to get.

An edge case I came across the other day was a unique index on a large
string: postgresql popped up and told me that I couldn't insert a
value into the field because the BTREE-index-based constraint wouldn't
support the size of the string, and that I should use a HASH index
instead. Which, of course, I can't, because it's fairly clearly
deprecated in the documentation...

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#67Jeff Janes
jeff.janes@gmail.com
In reply to: Geoff Winkless (#66)
Re: Hash Indexes

On Wed, Sep 21, 2016 at 12:44 PM, Geoff Winkless <pgsqladmin@geoff.dj>
wrote:

On 21 September 2016 at 13:29, Robert Haas <robertmhaas@gmail.com> wrote:

I'd be curious what benefits people expect to get.

An edge case I came across the other day was a unique index on a large
string: postgresql popped up and told me that I couldn't insert a
value into the field because the BTREE-index-based constraint wouldn't
support the size of string, and that I should use a HASH index
instead. Which, of course, I can't, because it's fairly clearly
deprecated in the documentation...

Yes, this large string issue is why I argued against removing hash indexes
the last couple times people proposed removing them. I'd rather be able to
use something that gets the job done, even if it is deprecated.

You could use btree indexes over hashes of the strings. But then you would
have to rewrite all your queries to inject an additional qualification,
something like:

Where value = 'really long string' and md5(value)=md5('really long string').
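
As a minimal sketch of that workaround (the table name t is hypothetical):

    CREATE INDEX t_value_md5_idx ON t (md5(value));

    -- the extra md5() qual lets the planner use the expression index
    SELECT * FROM t
    WHERE md5(value) = md5('really long string')
      AND value = 'really long string';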

Alas, it still wouldn't support unique indexes. I don't think you can even
use an exclusion constraint, because you would have to exclude on the hash
value alone, not the original value, and so it would also forbid
false-positive collisions.

There has been discussion to make btree-over-hash just work without needing
to rewrite the queries, but discussions aren't patches...

Cheers,

Jeff

#68Andres Freund
andres@anarazel.de
In reply to: Oskari Saarenmaa (#62)
Re: Hash Indexes

On 2016-09-21 19:49:15 +0300, Oskari Saarenmaa wrote:

On 21.09.2016 15:29, Robert Haas wrote:

For PostgreSQL, I expect the benefits of improving hash indexes to be
(1) slightly better raw performance for equality comparisons and (2)
better concurrency.

There's a third benefit: with large columns a hash index is a lot smaller on
disk than a btree index. This is the biggest reason I've seen people want
to use hash indexes instead of btrees. hashtext() btrees are a workaround,
but they require all queries to be adjusted which is a pain.

Sure. But that can be addressed, with a lot less effort than fixing and
maintaining the hash indexes, by adding the ability to do that
transparently using btree indexes + a recheck internally. How that
compares efficiency-wise is unclear as of now. But I do think it's
something we should measure before committing the new code.

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#69AP
ap@zip.com.au
In reply to: Geoff Winkless (#66)
Re: Hash Indexes

On Wed, Sep 21, 2016 at 08:44:15PM +0100, Geoff Winkless wrote:

On 21 September 2016 at 13:29, Robert Haas <robertmhaas@gmail.com> wrote:

I'd be curious what benefits people expect to get.

An edge case I came across the other day was a unique index on a large
string: postgresql popped up and told me that I couldn't insert a
value into the field because the BTREE-index-based constraint wouldn't
support the size of string, and that I should use a HASH index
instead. Which, of course, I can't, because it's fairly clearly
deprecated in the documentation...

Thanks for that. Forgot about that bit of nastiness. I came across the
above migrating a MySQL app to PostgreSQL. MySQL, I believe, handles
this by silently truncating the string on index. PostgreSQL by telling
you it can't index. :( So, as a result, AFAIK, I had a choice between a
trigger that does a left() on the string and inserts it into a new column
on the table that I can then index, or an index directly on left(). Either
way you wind up re-writing a whole bunch of queries. If I wanted to avoid
the re-writes I had the option of making the DB susceptible to poor
recovery from crashes, et al.
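
For illustration, a rough sketch of the left()-based option (table t,
column v, and the prefix length 100 are all hypothetical choices):

    CREATE INDEX t_v_prefix_idx ON t (left(v, 100));

    -- queries then have to be rewritten to include the prefix qual
    SELECT * FROM t
    WHERE left(v, 100) = left('some long value', 100)
      AND v = 'some long value';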

No matter which option I chose, the end result was going to be ugly.

It would be good not to have to go ugly in such situations.

Sometimes one size does not fit all.

For me this would be a second major case where I'd use usable hashed
indexes the moment they showed up.

Andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#70Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#68)
Re: Hash Indexes

Andres Freund <andres@anarazel.de> writes:

Sure. But that can be addressed, with a lot less effort than fixing and
maintaining the hash indexes, by adding the ability to do that
transparently using btree indexes + a recheck internally. How that
compares efficiency-wise is unclear as of now. But I do think it's
something we should measure before committing the new code.

TBH, I think we should reject that argument out of hand. If someone
wants to spend time developing a hash-wrapper-around-btree AM, they're
welcome to do so. But to kick the hash AM as such to the curb is to say
"sorry, there will never be O(1) index lookups in Postgres".

It's certainly conceivable that it's impossible to get decent performance
out of hash indexes, but I do not agree that we should simply stop trying.

Even if I granted the unproven premise that use-a-btree-on-hash-codes will
always be superior, I don't see how it follows that we should refuse to
commit work that's already been done. Is committing it somehow going to
prevent work on the btree-wrapper approach?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#71Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#70)
Re: Hash Indexes

On 2016-09-21 22:23:27 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Sure. But that can be addressed, with a lot less effort than fixing and
maintaining the hash indexes, by adding the ability to do that
transparently using btree indexes + a recheck internally. How that
compares efficiency-wise is unclear as of now. But I do think it's
something we should measure before committing the new code.

TBH, I think we should reject that argument out of hand. If someone
wants to spend time developing a hash-wrapper-around-btree AM, they're
welcome to do so. But to kick the hash AM as such to the curb is to say
"sorry, there will never be O(1) index lookups in Postgres".

Note that I'm explicitly *not* saying that. I just would like to see
actual comparisons being made before investing significant amounts of
code and related effort being invested in fixing the current hash table
implementation. And I haven't seen a lot of that. If the result of that
comparison is that hash-indexes actually perform very well: Great!

always be superior, I don't see how it follows that we should refuse to
commit work that's already been done. Is committing it somehow going to
prevent work on the btree-wrapper approach?

The necessary work seems a good bit from finished.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#72Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#71)
Re: Hash Indexes

On Thu, Sep 22, 2016 at 8:03 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-09-21 22:23:27 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Sure. But that can be addressed, with a lot less effort than fixing and
maintaining the hash indexes, by adding the ability to do that
transparently using btree indexes + a recheck internally. How that
compares efficiency-wise is unclear as of now. But I do think it's
something we should measure before committing the new code.

TBH, I think we should reject that argument out of hand. If someone
wants to spend time developing a hash-wrapper-around-btree AM, they're
welcome to do so. But to kick the hash AM as such to the curb is to say
"sorry, there will never be O(1) index lookups in Postgres".

Note that I'm explicitly *not* saying that. I just would like to see
actual comparisons being made before investing significant amounts of
code and related effort being invested in fixing the current hash table
implementation. And I haven't seen a lot of that.

I think it can be deduced from the testing done till now. Basically,
having an index (btree or hash) on an integer column allows a fair
comparison, because the key size is the same in both the hash and btree
index. In such a case, if we know that the hash index performs better
in certain cases, then that is an indication that it will also perform
better than the scheme you are suggesting, because it doesn't have the
extra recheck in the btree code, which will further worsen the case for
btree.

If the result of that
comparison is that hash-indexes actually perform very well: Great!

always be superior, I don't see how it follows that we should refuse to
commit work that's already been done. Is committing it somehow going to
prevent work on the btree-wrapper approach?

The necessary work seems a good bit away from finished.

Are you saying this about the WAL patch? If yes, then even if it is
still some way from being in shape to be committed, a lot of effort has
been put into taking it to its current stage, and it is not in bad
shape either. It has survived a lot of testing; there are still some
bugs which we are fixing.

One more thing I want to say: don't assume that all the people involved
in the current development of hash indexes, or in its further
development, will run away once the code is committed, leaving the
responsibility of maintenance to other senior members of the community.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#73Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#71)
Re: Hash Indexes

On Wed, Sep 21, 2016 at 10:33 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-09-21 22:23:27 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Sure. But that can be addressed, with a lot less effort than fixing and
maintaining the hash indexes, by adding the ability to do that
transparently using btree indexes + a recheck internally. How that
compares efficiency-wise is unclear as of now. But I do think it's
something we should measure before committing the new code.

TBH, I think we should reject that argument out of hand. If someone
wants to spend time developing a hash-wrapper-around-btree AM, they're
welcome to do so. But to kick the hash AM as such to the curb is to say
"sorry, there will never be O(1) index lookups in Postgres".

Note that I'm explicitly *not* saying that. I just would like to see
actual comparisons being made before investing significant amounts of
code and related effort being invested in fixing the current hash table
implementation. And I haven't seen a lot of that. If the result of that
comparison is that hash-indexes actually perform very well: Great!

Yeah, I just don't agree with that. I don't think we have any policy
that you can't develop A and get it committed unless you try every
alternative that some other community member thinks might be better in
the long run first. If we adopt such a policy, we'll have no
developers and no new features. Also, in this particular case, I
think there's no evidence that the alternative you are proposing would
actually be better or less work to maintain.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#74Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#73)
Re: Hash Indexes

On 2016-09-23 15:19:14 -0400, Robert Haas wrote:

On Wed, Sep 21, 2016 at 10:33 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-09-21 22:23:27 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Sure. But that can be addressed, with a lot less effort than fixing and
maintaining the hash indexes, by adding the ability to do that
transparently using btree indexes + a recheck internally. How that
compares efficiency-wise is unclear as of now. But I do think it's
something we should measure before committing the new code.

TBH, I think we should reject that argument out of hand. If someone
wants to spend time developing a hash-wrapper-around-btree AM, they're
welcome to do so. But to kick the hash AM as such to the curb is to say
"sorry, there will never be O(1) index lookups in Postgres".

Note that I'm explicitly *not* saying that. I just would like to see
actual comparisons being made before investing significant amounts of
code and related effort being invested in fixing the current hash table
implementation. And I haven't seen a lot of that. If the result of that
comparison is that hash-indexes actually perform very well: Great!

Yeah, I just don't agree with that. I don't think we have any policy
that you can't develop A and get it committed unless you try every
alternative that some other community member thinks might be better in
the long run first.

Huh. I think we make such arguments *ALL THE TIME*.

Anyway, I don't see much point in continuing to discuss this, I'm
clearly in the minority.


#75Greg Stark
stark@mit.edu
In reply to: Tom Lane (#70)
Re: Hash Indexes

On Thu, Sep 22, 2016 at 3:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

But to kick the hash AM as such to the curb is to say
"sorry, there will never be O(1) index lookups in Postgres".

Well, there are plenty of halfway solutions for that. We could move hash
indexes to contrib or even have them in core as experimental_hash or
unlogged_hash until the day they achieve their potential.

We definitely shouldn't discourage people from working on hash indexes,
but we probably shouldn't have released ten years' worth of a feature
marked "please don't use this" that's guaranteed to corrupt your
database and cause weird problems if you use it in any of a number of
supported situations (including non-replicated system recovery, which
has been a bedrock feature of Postgres for over a decade).

Arguably, adding a hashed btree opclass and relegating the existing
code to an experimental state would actually encourage development,
since a) users would actually be likely to use the hashed btree
opclass, so any work on a real hash opclass would have a real userbase
ready and waiting for delivery; b) delivering a real hash opclass
wouldn't involve convincing users to unlearn a million instructions
warning not to use this feature; and c) the fear of breaking existing
users' use cases and databases would be less, and pg_upgrade would be an
ignorable problem, at least until the day comes for the big cutover of
the default to the new opclass.

--
greg


#76Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#75)
Re: Hash Indexes

Greg Stark <stark@mit.edu> writes:

On Thu, Sep 22, 2016 at 3:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

But to kick the hash AM as such to the curb is to say
"sorry, there will never be O(1) index lookups in Postgres".

Well there's plenty of halfway solutions for that. We could move hash
indexes to contrib or even have them in core as experimental_hash or
unlogged_hash until the day they achieve their potential.

We definitely shouldn't discourage people from working on hash indexes
but we probably shouldn't have released ten years worth of a feature
marked "please don't use this" that's guaranteed to corrupt your
database and cause weird problems if you use it in any of a number of
supported situations (including non-replicated system recovery that
has been a bedrock feature of Postgres for over a decade).

Obviously that has not been a good situation, but we lack a time
machine to retroactively make it better, so I don't see much point
in fretting over what should have been done in the past.

Arguably adding a hashed btree opclass and relegating the existing
code to an experimental state would actually encourage development
since a) Users would actually be likely to use the hashed btree
opclass so any work on a real hash opclass would have a real userbase
ready and waiting for delivery, b) delivering a real hash opclass
wouldn't involve convincing users to unlearn a million instructions
warning not to use this feature and c) The fear of breaking existing
users' use cases and databases would be less and pg_upgrade would be an
ignorable problem at least until the day comes for the big cutover of
the default to the new opclass.

I'm not following your point here. There is no hash-over-btree AM and
nobody (including Andres) has volunteered to create one. Meanwhile,
we have a patch in hand to WAL-enable the hash AM. Why would we do
anything other than apply that patch and stop saying hash is deprecated?

regards, tom lane


#77Amit Kapila
amit.kapila16@gmail.com
In reply to: Greg Stark (#75)
Re: Hash Indexes

On Sat, Sep 24, 2016 at 10:49 PM, Greg Stark <stark@mit.edu> wrote:

On Thu, Sep 22, 2016 at 3:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

But to kick the hash AM as such to the curb is to say
"sorry, there will never be O(1) index lookups in Postgres".

Well there's plenty of halfway solutions for that. We could move hash
indexes to contrib or even have them in core as experimental_hash or
unlogged_hash until the day they achieve their potential.

We definitely shouldn't discourage people from working on hash indexes

Okay, but to me it appears that naming it experimental_hash or moving
it to contrib could discourage people, or at the very least make them
less motivated. Thinking along those lines a year or so back would have
been a wise direction, but now, when a lot of work has already been done
for hash indexes (patches to make them WAL-enabled, more concurrent and
performant, and a pageinspect module are available) and still more is in
progress, that sounds like a step backward rather than a step forward.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#78Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Amit Kapila (#77)
Re: Hash Indexes

On 25/09/16 18:18, Amit Kapila wrote:

On Sat, Sep 24, 2016 at 10:49 PM, Greg Stark <stark@mit.edu> wrote:

On Thu, Sep 22, 2016 at 3:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

But to kick the hash AM as such to the curb is to say
"sorry, there will never be O(1) index lookups in Postgres".

Well there's plenty of halfway solutions for that. We could move hash
indexes to contrib or even have them in core as experimental_hash or
unlogged_hash until the day they achieve their potential.

We definitely shouldn't discourage people from working on hash indexes

Okay, but to me it appears that naming it as experimental_hash or
moving it to contrib could discourage people or at the very least
people will be less motivated. Thinking on those lines a year or so
back would have been a wise direction, but now when already there is
lot of work done (patches to make it wal-enabled, more concurrent and
performant, page inspect module are available) for hash indexes and
still more is in progress, that sounds like a step backward then step
forward.

+1

I think so too - I've seen many email threads over the years on this
list that essentially state "we need hash indexes WAL-logged to make
progress with them"... and Amit et al. have done this (more than this,
obviously - they made 'em better too), and I'm astonished that folk are
suggesting anything other than 'commit this great patch now!'...

regards

Mark


#79Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Amit Kapila (#58)
Re: Hash Indexes

On 09/20/2016 09:02 AM, Amit Kapila wrote:

On Fri, Sep 16, 2016 at 11:22 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I do want to work on it, but it is always possible that due to some
other work this might get delayed. Also, I think there is always a
chance that while doing that work, we face some problem due to which
we might not be able to use that optimization. So I will go with your
suggestion of removing hashscan.c and its usage for now, and then if
required we will pull it back. If nobody else thinks otherwise, I
will update this in the next patch version.

In the attached patch, I have removed the support for hash scans. I
think it might improve performance by a few percent (especially for
single-row fetch transactions), as we no longer have the registration
and destruction of hash scans.

I have been running various tests and applications with this patch
together with the WAL v5 patch [1].

As I haven't seen any failures and don't currently have additional
feedback, I'm moving this patch to "Ready for Committer" for their feedback.

If others have comments, move the patch status back in the CommitFest
application, please.

[1]: /messages/by-id/CAA4eK1KE=+kkowyYD0vmch=ph4ND3H1tViAB+0cWTHqjZDDfqg@mail.gmail.com

Best regards,
Jesper


#80Robert Haas
robertmhaas@gmail.com
In reply to: Jesper Pedersen (#79)
Re: Hash Indexes

On Tue, Sep 27, 2016 at 3:06 PM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

I have been running various tests and applications with this patch together
with the WAL v5 patch [1].

As I haven't seen any failures and don't currently have additional feedback,
I'm moving this patch to "Ready for Committer" for their feedback.

Cool! Thanks for reviewing.

Amit, can you please split the buffer manager changes in this patch
into a separate patch? I think those changes can be committed first
and then we can try to deal with the rest of it. Instead of adding
ConditionalLockBufferShared, I think we should add an "int mode"
argument to the existing ConditionalLockBuffer() function. That way
is more consistent with LockBuffer(). It means an API break for any
third-party code that's calling this function, but that doesn't seem
like a big problem. There are only 10 callers of
ConditionalLockBuffer() in our source tree and only one of those is in
contrib, so probably there isn't much third-party code that will be
affected by this, and I think it's worth it for the long-term
cleanliness.
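
To make the suggestion concrete, here is a rough sketch (mine, not from
the patch) of what the revised function might look like; it follows the
shape of the existing ConditionalLockBuffer() in bufmgr.c, with the
hypothetical "mode" argument added:

    /*
     * Sketch only: ConditionalLockBuffer() with a LockBuffer()-style
     * mode argument, instead of a separate ConditionalLockBufferShared().
     */
    #include "postgres.h"
    #include "storage/buf_internals.h"
    #include "storage/bufmgr.h"

    bool
    ConditionalLockBuffer(Buffer buffer, int mode)
    {
        BufferDesc *buf;

        Assert(BufferIsValid(buffer));
        if (BufferIsLocal(buffer))
            return true;        /* local buffers need no content lock */

        buf = GetBufferDescriptor(buffer - 1);

        return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
                                        mode == BUFFER_LOCK_SHARE ?
                                        LW_SHARED : LW_EXCLUSIVE);
    }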

As for CheckBufferForCleanup, I think that looks OK, but: (1) please
add an Assert() that we hold an exclusive lock on the buffer, using
LWLockHeldByMeInMode; and (2) I think we should rename it to something
like IsBufferCleanupOK. Then, when it's used, it reads like English:
if (IsBufferCleanupOK(buf)) { /* clean up the buffer */ }.
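
As a hedged illustration of the shape being asked for here (not the
patch's actual code; local-buffer handling is omitted and the body
assumes the usual bufmgr.c internals), the renamed function might look
roughly like this:

    /*
     * Sketch only: return true if the caller, who must already hold an
     * exclusive content lock on the buffer, also holds the only pin,
     * so a cleanup-style reorganization of the page would be safe.
     */
    bool
    IsBufferCleanupOK(Buffer buffer)
    {
        BufferDesc *bufHdr = GetBufferDescriptor(buffer - 1);
        uint32      buf_state;
        bool        result;

        /* per point (1): caller must hold an exclusive content lock */
        Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr),
                                    LW_EXCLUSIVE));

        buf_state = LockBufHdr(bufHdr);
        result = (BUF_STATE_GET_REFCOUNT(buf_state) == 1);
        UnlockBufHdr(bufHdr, buf_state);

        return result;
    }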

I'll write another email with my thoughts about the rest of the patch.
For the record, Amit and I have had extensive discussions about this
effort off-list, and as Amit noted in his original post, the design is
based on suggestions which I previously posted to the list suggesting
how the issues with hash indexes might get fixed. Therefore, I don't
expect to have too many basic disagreements regarding the design of
the patch; if anyone else does, please speak up. Andres already
stated that he thinks working on btree-over-hash would be more
beneficial than fixing hash, but at this point it seems like he's the
only one who takes that position. Even if we accept that working on
the hash AM is a reasonable thing to do, it doesn't follow that the
design Amit has adopted here is ideal. I think it's reasonably good,
but that's only to be expected considering that I drafted the original
version of it and have been involved in subsequent discussions;
someone else might dislike something that I thought was OK, and any
such opinions certainly deserve a fair hearing. To be clear, it's
been a long time since I've looked at any of the actual code in this
patch and I have at no point studied it deeply, so I expect that I may
find a fair number of things that I'm not happy with in detail, and
I'll write those up along with any design-level concerns that I do
have. This should in no way forestall review from anyone else who
wants to get involved.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#81Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#80)
Re: Hash Indexes

On 2016-09-28 15:04:30 -0400, Robert Haas wrote:

Andres already
stated that he thinks working on btree-over-hash would be more
beneficial than fixing hash, but at this point it seems like he's the
only one who takes that position.

Note that I did *NOT* take that position. I was saying that I think we
should evaluate whether that's not a better approach, doing some simple
performance comparisons.

Greetings,

Andres Freund


#82Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#81)
Re: Hash Indexes

On Wed, Sep 28, 2016 at 3:06 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-09-28 15:04:30 -0400, Robert Haas wrote:

Andres already
stated that he thinks working on btree-over-hash would be more
beneficial than fixing hash, but at this point it seems like he's the
only one who takes that position.

Note that I did *NOT* take that position. I was saying that I think we
should evaluate whether that's not a better approach, doing some simple
performance comparisons.

OK, sorry. I evidently misunderstood your position, for which I apologize.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#83Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#80)
Re: Hash Indexes

On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I'll write another email with my thoughts about the rest of the patch.

I think that the README changes for this patch need a fairly large
amount of additional work. Here are a few things I notice:

- The confusion between buckets and pages hasn't been completely
cleared up. If you read the beginning of the README, the terminology
is clearly set forth. It says:

A hash index consists of two or more "buckets", into which tuples are placed whenever their hash key maps to the bucket number. Each bucket in the hash index comprises one or more index pages. The bucket's first page is permanently assigned to it when the bucket is created. Additional pages, called "overflow pages", are added if the bucket receives too many tuples to fit in the primary bucket page."

But later on, you say:

Scan will take a lock in shared mode on the primary bucket or on one of the overflow page.

So the correct terminology here would be "primary bucket page" not
"primary bucket".

- In addition, notice that there are two English errors in this
sentence: the word "the" needs to be added to the beginning of the
sentence, and the last word needs to be "pages" rather than "page".
There are a considerable number of similar minor errors; if you can't
fix them, I'll make a pass over it and clean it up.

- The whole "lock definitions" section seems to me to be pretty loose
and imprecise about what is happening. For example, it uses the term
"split-in-progress" without first defining it. The sentence quoted
above says that scans take a lock in shared mode either on the primary
page or on one of the overflow pages, but it's not to document code by
saying that it will do either A or B without explaining which one! In
fact, I think that a scan will take a content lock first on the
primary bucket page and then on each overflow page in sequence,
retaining a pin on the primary buffer page throughout the scan. So it
is not one or the other but both in a particular sequence, and that
can and should be explained.

Another problem with this section is that even when it's precise about
what is going on, it's probably duplicating what is or should be in
the following sections where the algorithms for each operation are
explained. In the original wording, this section explains what each
lock protects, and then the following sections explain the algorithms
in the context of those definitions. Now, this section contains a
sketch of the algorithm, and then the following sections lay it out
again in more detail. The question of what each lock protects has
been lost. Here's an attempt at some text to replace what you have
here:

===
Concurrency control for hash indexes is provided using buffer content
locks, buffer pins, and cleanup locks. Here as elsewhere in
PostgreSQL, cleanup lock means that we hold an exclusive lock on the
buffer and have observed at some point after acquiring the lock that
we hold the only pin on that buffer. For hash indexes, a cleanup lock
on a primary bucket page represents the right to perform an arbitrary
reorganization of the entire bucket, while a cleanup lock on an
overflow page represents the right to perform a reorganization of just
that page. Therefore, scans retain a pin on both the primary bucket
page and the overflow page they are currently scanning, if any.
Splitting a bucket requires a cleanup lock on both the old and new
primary bucket pages. VACUUM therefore takes a cleanup lock on every
bucket page in turn in order to remove tuples. It can also remove tuples
copied to a new bucket by any previous split operation, because the
cleanup lock taken on the primary bucket page guarantees that no scans
which started prior to the most recent split can still be in progress.
After cleaning each page individually, it attempts to take a cleanup
lock on the primary bucket page in order to "squeeze" the bucket down
to the minimum possible number of pages.
===

As I was looking at the old text regarding deadlock risk, I realized
what may be a serious problem. Suppose process A is performing a scan
of some hash index. While the scan is suspended, it attempts to take
a lock and is blocked by process B. Process B, meanwhile, is running
VACUUM on that hash index. Eventually, it will do
LockBufferForCleanup() on the hash bucket on which process A holds a
buffer pin, resulting in an undetected deadlock. In the current
coding, A would hold a heavyweight lock and B would attempt to acquire
a conflicting heavyweight lock, and the deadlock detector would kill
one of them. This patch probably breaks that. I notice that that's
the only place where we attempt to acquire a buffer cleanup lock
unconditionally; every place else, we acquire the lock conditionally,
so there's no deadlock risk. Once we resolve this problem, the
paragraph about deadlock risk in this section should be revised to
explain whatever solution we come up with.

By the way, since VACUUM must run in its own transaction, B can't be
holding arbitrary locks, but that doesn't seem quite sufficient to get
us out of the woods. It will at least hold ShareUpdateExclusiveLock
on the relation being vacuumed, and process A could attempt to acquire
that same lock.
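
For illustration only (this is a sketch of one possible shape, not what
the patch does, and squeeze_bucket/remove_dead_tuples are made-up helper
names), the undetected deadlock could be avoided by acquiring the
cleanup lock conditionally and falling back to ordinary per-page cleanup
instead of waiting:

    #include "postgres.h"
    #include "access/hash.h"
    #include "storage/bufmgr.h"

    /* Hypothetical helpers, named here purely for illustration. */
    extern void squeeze_bucket(Relation rel, Buffer bucket_buf);
    extern void remove_dead_tuples(Relation rel, Buffer bucket_buf);

    static void
    vacuum_one_bucket(Relation rel, BlockNumber bucket_blkno)
    {
        /* Pin the primary bucket page without a content lock yet. */
        Buffer      bucket_buf = _hash_getbuf(rel, bucket_blkno,
                                              HASH_NOLOCK, LH_BUCKET_PAGE);

        if (ConditionalLockBufferForCleanup(bucket_buf))
        {
            /* Sole pin plus exclusive lock: full reorganization is safe. */
            squeeze_bucket(rel, bucket_buf);
            LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
        }
        else
        {
            /*
             * Some scan still holds a pin.  Rather than waiting (and
             * risking the undetected deadlock described above), settle
             * for ordinary per-page cleanup and skip the squeeze.
             */
            LockBuffer(bucket_buf, BUFFER_LOCK_EXCLUSIVE);
            remove_dead_tuples(rel, bucket_buf);
            LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
        }

        ReleaseBuffer(bucket_buf);
    }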

Also in regards to deadlock, I notice that you added a paragraph
saying that we lock higher-numbered buckets before lower-numbered
buckets. That's fair enough, but what about the metapage? The reader
algorithm suggests that the metapage lock must be taken after the
bucket locks, because it tries to grab the bucket lock conditionally
after acquiring the metapage lock, but that's not documented here.

The reader algorithm itself seems to be a bit oddly explained.

      pin meta page and take buffer content lock in shared mode
+    compute bucket number for target hash key
+    read and pin the primary bucket page

So far, I'm with you.

+    conditionally get the buffer content lock in shared mode on
primary bucket page for search
+    if we didn't get the lock (need to wait for lock)

"didn't get the lock" and "wait for the lock" are saying the same
thing, so this is redundant, and the statement that it is "for search"
on the previous line is redundant with the introductory text
describing this as the reader algorithm.

+        release the buffer content lock on meta page
+        acquire buffer content lock on primary bucket page in shared mode
+        acquire the buffer content lock in shared mode on meta page

OK...

+        to check for possibility of split, we need to recompute the bucket and
+        verify, if it is a correct bucket; set the retry flag

This makes it sound like we set the retry flag whether it was the
correct bucket or not, which isn't sensible.

+ else if we get the lock, then we can skip the retry path

This line is totally redundant. If we don't set the retry flag, then
of course we can skip the part guarded by if (retry).

+    if (retry)
+        loop:
+            compute bucket number for target hash key
+            release meta page buffer content lock
+            if (correct bucket page is already locked)
+                break
+            release any existing content lock on bucket page (if a
concurrent split happened)
+            pin primary bucket page and take shared buffer content lock
+            retake meta page buffer content lock in shared mode

This is the part I *really* don't understand. It makes sense to me
that we need to loop until we get the correct bucket locked with no
concurrent splits, but why is this retry loop separate from the
previous bit of code that set the retry flag? In other words, why is
it not something like this?

pin the meta page and take shared content lock on it
compute bucket number for target hash key
if (we can't get a shared content lock on the target bucket without blocking)
    loop:
        release meta page content lock
        take a shared content lock on the target primary bucket page
        take a shared content lock on the metapage
        if (previously-computed target bucket has not been split)
            break;

Another thing I don't quite understand about this algorithm is that in
order to conditionally lock the target primary bucket page, we'd first
need to read and pin it. And that doesn't seem like a good thing to
do while we're holding a shared content lock on the metapage, because
of the principle that we don't want to hold content locks across I/O.
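
For concreteness, here is one way that retry structure could be
rendered in C if the metapage content lock is dropped before the bucket
page is read (a sketch of mine, not the patch's code;
read_locked_bucket_for_key is a made-up name, the helpers and macros
are the existing hash AM ones, and the conditional fast path discussed
above is omitted):

    #include "postgres.h"
    #include "access/hash.h"
    #include "storage/bufmgr.h"

    static Buffer
    read_locked_bucket_for_key(Relation rel, uint32 hashkey)
    {
        Buffer      metabuf;
        Buffer      buf;
        HashMetaPage metap;
        Bucket      bucket;
        BlockNumber blkno;

        metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
        metap = HashPageGetMeta(BufferGetPage(metabuf));

        for (;;)
        {
            bucket = _hash_hashkey2bucket(hashkey, metap->hashm_maxbucket,
                                          metap->hashm_highmask,
                                          metap->hashm_lowmask);
            blkno = BUCKET_TO_BLKNO(metap, bucket);

            /* drop the metapage content lock before bucket-page I/O */
            LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);

            buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);

            /* retake the metapage lock; did a split intervene? */
            LockBuffer(metabuf, BUFFER_LOCK_SHARE);
            if (_hash_hashkey2bucket(hashkey, metap->hashm_maxbucket,
                                     metap->hashm_highmask,
                                     metap->hashm_lowmask) == bucket)
                break;          /* still the right bucket */

            /* a concurrent split moved our key; release and retry */
            _hash_relbuf(rel, buf);
        }

        /* keep the bucket page locked and pinned; let go of the metapage */
        UnlockReleaseBuffer(metabuf);

        return buf;
    }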

 -- then, per read request:
    release pin on metapage
-    read current page of bucket and take shared buffer content lock
-        step to next page if necessary (no chaining of locks)
+    if the split is in progress for current bucket and this is a new bucket
+        release the buffer content lock on current bucket page
+        pin and acquire the buffer content lock on old bucket in shared mode
+        release the buffer content lock on old bucket, but not pin
+        retake the buffer content lock on new bucket
+        mark the scan such that it skips the tuples that are marked
as moved by split

Aren't these steps done just once per scan? If so, I think they
should appear before "-- then, per read request" which AIUI is
intended to imply a loop over tuples.

+    step to next page if necessary (no chaining of locks)
+        if the scan indicates moved by split, then move to old bucket
after the scan
+        of current bucket is finished
     get tuple
     release buffer content lock and pin on current page
 -- at scan shutdown:
-    release bucket share-lock

Don't we have a pin to release at scan shutdown in the new system?

Well, I was hoping to get through the whole patch in one email, but
I'm not even all the way through the README. However, it's late, so
I'm stopping here for now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#84Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#83)
Re: Hash Indexes

On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I'll write another email with my thoughts about the rest of the patch.

I think that the README changes for this patch need a fairly large
amount of additional work. Here are a few things I notice:

- The confusion between buckets and pages hasn't been completely
cleared up. If you read the beginning of the README, the terminology
is clearly set forth. It says:

A hash index consists of two or more "buckets", into which tuples are placed whenever their hash key maps to the bucket number. Each bucket in the hash index comprises one or more index pages. The bucket's first page is permanently assigned to it when the bucket is created. Additional pages, called "overflow pages", are added if the bucket receives too many tuples to fit in the primary bucket page."

But later on, you say:

Scan will take a lock in shared mode on the primary bucket or on one of the overflow page.

So the correct terminology here would be "primary bucket page" not
"primary bucket".

- In addition, notice that there are two English errors in this
sentence: the word "the" needs to be added to the beginning of the
sentence, and the last word needs to be "pages" rather than "page".
There are a considerable number of similar minor errors; if you can't
fix them, I'll make a pass over it and clean it up.

- The whole "lock definitions" section seems to me to be pretty loose
and imprecise about what is happening. For example, it uses the term
"split-in-progress" without first defining it. The sentence quoted
above says that scans take a lock in shared mode either on the primary
page or on one of the overflow pages, but it's not to document code by
saying that it will do either A or B without explaining which one! In
fact, I think that a scan will take a content lock first on the
primary bucket page and then on each overflow page in sequence,
retaining a pin on the primary buffer page throughout the scan. So it
is not one or the other but both in a particular sequence, and that
can and should be explained.

Another problem with this section is that even when it's precise about
what is going on, it's probably duplicating what is or should be in
the following sections where the algorithms for each operation are
explained. In the original wording, this section explains what each
lock protects, and then the following sections explain the algorithms
in the context of those definitions. Now, this section contains a
sketch of the algorithm, and then the following sections lay it out
again in more detail. The question of what each lock protects has
been lost. Here's an attempt at some text to replace what you have
here:

===
Concurrency control for hash indexes is provided using buffer content
locks, buffer pins, and cleanup locks. Here as elsewhere in
PostgreSQL, cleanup lock means that we hold an exclusive lock on the
buffer and have observed at some point after acquiring the lock that
we hold the only pin on that buffer. For hash indexes, a cleanup lock
on a primary bucket page represents the right to perform an arbitrary
reorganization of the entire bucket, while a cleanup lock on an
overflow page represents the right to perform a reorganization of just
that page. Therefore, scans retain a pin on both the primary bucket
page and the overflow page they are currently scanning, if any.

I don't think we take a cleanup lock on the overflow page, so I will edit that part.

Splitting a bucket requires a cleanup lock on both the old and new
primary bucket pages. VACUUM therefore takes a cleanup lock on every
bucket page in turn order to remove tuples. It can also remove tuples
copied to a new bucket by any previous split operation, because the
cleanup lock taken on the primary bucket page guarantees that no scans
which started prior to the most recent split can still be in progress.
After cleaning each page individually, it attempts to take a cleanup
lock on the primary bucket page in order to "squeeze" the bucket down
to the minimum possible number of pages.
===

As I was looking at the old text regarding deadlock risk, I realized
what may be a serious problem. Suppose process A is performing a scan
of some hash index. While the scan is suspended, it attempts to take
a lock and is blocked by process B. Process B, meanwhile, is running
VACUUM on that hash index. Eventually, it will do
LockBufferForCleanup() on the hash bucket on which process A holds a
buffer pin, resulting in an undetected deadlock. In the current
coding, A would hold a heavyweight lock and B would attempt to acquire
a conflicting heavyweight lock, and the deadlock detector would kill
one of them. This patch probably breaks that. I notice that that's
the only place where we attempt to acquire a buffer cleanup lock
unconditionally; every place else, we acquire the lock conditionally,
so there's no deadlock risk. Once we resolve this problem, the
paragraph about deadlock risk in this section should be revised to
explain whatever solution we come up with.

By the way, since VACUUM must run in its own transaction, B can't be
holding arbitrary locks, but that doesn't seem quite sufficient to get
us out of the woods. It will at least hold ShareUpdateExclusiveLock
on the relation being vacuumed, and process A could attempt to acquire
that same lock.

Right, I think there is a danger of deadlock in the above situation.
It needs some more thought.

Also in regards to deadlock, I notice that you added a paragraph
saying that we lock higher-numbered buckets before lower-numbered
buckets. That's fair enough, but what about the metapage? The reader
algorithm suggests that the metapage lock must be taken after the
bucket locks, because it tries to grab the bucket lock conditionally
after acquiring the metapage lock, but that's not documented here.

That is for efficiency. This patch hasn't changed anything in
metapage locking which can directly impact deadlocks.

The reader algorithm itself seems to be a bit oddly explained.

pin meta page and take buffer content lock in shared mode
+    compute bucket number for target hash key
+    read and pin the primary bucket page

So far, I'm with you.

+    conditionally get the buffer content lock in shared mode on
primary bucket page for search
+    if we didn't get the lock (need to wait for lock)

"didn't get the lock" and "wait for the lock" are saying the same
thing, so this is redundant, and the statement that it is "for search"
on the previous line is redundant with the introductory text
describing this as the reader algorithm.

+        release the buffer content lock on meta page
+        acquire buffer content lock on primary bucket page in shared mode
+        acquire the buffer content lock in shared mode on meta page

OK...

+        to check for possibility of split, we need to recompute the bucket and
+        verify, if it is a correct bucket; set the retry flag

This makes it sound like we set the retry flag whether it was the
correct bucket or not, which isn't sensible.

+ else if we get the lock, then we can skip the retry path

This line is totally redundant. If we don't set the retry flag, then
of course we can skip the part guarded by if (retry).

Will change as per suggestions.

+    if (retry)
+        loop:
+            compute bucket number for target hash key
+            release meta page buffer content lock
+            if (correct bucket page is already locked)
+                break
+            release any existing content lock on bucket page (if a
concurrent split happened)
+            pin primary bucket page and take shared buffer content lock
+            retake meta page buffer content lock in shared mode

This is the part I *really* don't understand. It makes sense to me
that we need to loop until we get the correct bucket locked with no
concurrent splits, but why is this retry loop separate from the
previous bit of code that set the retry flag? In other words, why is
it not something like this?

pin the meta page and take shared content lock on it
compute bucket number for target hash key
if (we can't get a shared content lock on the target bucket without blocking)
    loop:
        release meta page content lock
        take a shared content lock on the target primary bucket page
        take a shared content lock on the metapage
        if (previously-computed target bucket has not been split)
            break;

I think we can write it the way you are suggesting, but I don't want
to change much in the existing for loop in the code, which uses
_hash_getbuf() to acquire the pin and lock together.

Another thing I don't quite understand about this algorithm is that in
order to conditionally lock the target primary bucket page, we'd first
need to read and pin it. And that doesn't seem like a good thing to
do while we're holding a shared content lock on the metapage, because
of the principle that we don't want to hold content locks across I/O.

I think we can release metapage content lock before reading the buffer.

-- then, per read request:
release pin on metapage
-    read current page of bucket and take shared buffer content lock
-        step to next page if necessary (no chaining of locks)
+    if the split is in progress for current bucket and this is a new bucket
+        release the buffer content lock on current bucket page
+        pin and acquire the buffer content lock on old bucket in shared mode
+        release the buffer content lock on old bucket, but not pin
+        retake the buffer content lock on new bucket
+        mark the scan such that it skips the tuples that are marked
as moved by split

Aren't these steps done just once per scan? If so, I think they
should appear before "-- then, per read request" which AIUI is
intended to imply a loop over tuples.

As per the code, there is no such intention (to loop over tuples). It is
about reading the page and getting the tuple.

+    step to next page if necessary (no chaining of locks)
+        if the scan indicates moved by split, then move to old bucket
after the scan
+        of current bucket is finished
get tuple
release buffer content lock and pin on current page
-- at scan shutdown:
-    release bucket share-lock

Don't we have a pin to release at scan shutdown in the new system?

Yes, it is mentioned in line below:

+ release any pin we hold on current buffer, old bucket buffer, new
bucket buffer
+

Well, I was hoping to get through the whole patch in one email, but
I'm not even all the way through the README. However, it's late, so
I'm stopping here for now.

Thanks for the review!

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#85Peter Geoghegan
pg@heroku.com
In reply to: Andres Freund (#81)
Re: Hash Indexes

On Wed, Sep 28, 2016 at 8:06 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-09-28 15:04:30 -0400, Robert Haas wrote:

Andres already
stated that he thinks working on btree-over-hash would be more
beneficial than fixing hash, but at this point it seems like he's the
only one who takes that position.

Note that I did *NOT* take that position. I was saying that I think we
should evaluate whether that's not a better approach, doing some simple
performance comparisons.

I, for one, agree with this position.

--
Peter Geoghegan


#86Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#85)
Re: Hash Indexes

On Thu, Sep 29, 2016 at 8:07 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Sep 28, 2016 at 8:06 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-09-28 15:04:30 -0400, Robert Haas wrote:

Andres already
stated that he thinks working on btree-over-hash would be more
beneficial than fixing hash, but at this point it seems like he's the
only one who takes that position.

Note that I did *NOT* take that position. I was saying that I think we
should evaluate whether that's not a better approach, doing some simple
performance comparisons.

I, for one, agree with this position.

Well, I, for one, find it frustrating. It seems pretty unhelpful to
bring this up only after the code has already been written. The first
post on this thread was on May 10th. The first version of the patch
was posted on June 16th. This position was first articulated on
September 15th.

But, by all means, please feel free to do the performance comparison
and post the results. I'd be curious to see them myself.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#87Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#86)
Re: Hash Indexes

On 2016-09-29 20:14:40 -0400, Robert Haas wrote:

On Thu, Sep 29, 2016 at 8:07 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Sep 28, 2016 at 8:06 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-09-28 15:04:30 -0400, Robert Haas wrote:

Andres already
stated that he thinks working on btree-over-hash would be more
beneficial than fixing hash, but at this point it seems like he's the
only one who takes that position.

Note that I did *NOT* take that position. I was saying that I think we
should evaluate whether that's not a better approach, doing some simple
performance comparisons.

I, for one, agree with this position.

Well, I, for one, find it frustrating. It seems pretty unhelpful to
bring this up only after the code has already been written.

I brought this up in person at pgcon too.


#88Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#87)
Re: Hash Indexes

On Thu, Sep 29, 2016 at 8:16 PM, Andres Freund <andres@anarazel.de> wrote:

Well, I, for one, find it frustrating. It seems pretty unhelpful to
bring this up only after the code has already been written.

I brought this up in person at pgcon too.

To whom? In what context?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#89Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#88)
Re: Hash Indexes

On September 29, 2016 5:28:00 PM PDT, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 29, 2016 at 8:16 PM, Andres Freund <andres@anarazel.de>
wrote:

Well, I, for one, find it frustrating. It seems pretty unhelpful to
bring this up only after the code has already been written.

I brought this up in person at pgcon too.

To whom? In what context?

Amit, over dinner.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


#90Peter Geoghegan
pg@heroku.com
In reply to: Andres Freund (#89)
Re: Hash Indexes

On Fri, Sep 30, 2016 at 1:29 AM, Andres Freund <andres@anarazel.de> wrote:

To whom? In what context?

Amit, over dinner.

In case it matters, I also talked to Amit about this privately.

--
Peter Geoghegan


#91Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#86)
Re: Hash Indexes

On Fri, Sep 30, 2016 at 1:14 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I, for one, agree with this position.

Well, I, for one, find it frustrating. It seems pretty unhelpful to
bring this up only after the code has already been written. The first
post on this thread was on May 10th. The first version of the patch
was posted on June 16th. This position was first articulated on
September 15th.

Really, what do we have to lose at this point? It's not very difficult
to do what Andres proposes.

--
Peter Geoghegan


#92Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#89)
Re: Hash Indexes

On Thu, Sep 29, 2016 at 8:29 PM, Andres Freund <andres@anarazel.de> wrote:

On September 29, 2016 5:28:00 PM PDT, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 29, 2016 at 8:16 PM, Andres Freund <andres@anarazel.de>
wrote:

Well, I, for one, find it frustrating. It seems pretty unhelpful to
bring this up only after the code has already been written.

I brought this up in person at pgcon too.

To whom? In what context?

Amit, over dinner.

OK, well, I can't really comment on that, then, except to say that if
you waited three months to follow up on the mailing list, you probably
can't blame Amit if he thought that it was more of a casual suggestion
than a serious objection. Maybe it was? I don't know.

For my part, I don't really understand how you think that we could
find anything out via relatively simple tests. The hash index code is
horribly under-maintained, which is why Amit is able to get large
performance improvements out of improving it. If you compare it to
btree in some way, it's probably going to lose. But I don't think
that answers the question of whether a hash AM that somebody's put
some work into will win or lose against a hypothetical hash-over-btree
AM that nobody's written. Even if it wins, is that really a reason to
leave the hash index code itself in a state of disrepair? We probably
would have removed it already except that the infrastructure is used
for hash joins and hash aggregation, so we really can't.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#93Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#91)
Re: Hash Indexes

On Thu, Sep 29, 2016 at 8:53 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Fri, Sep 30, 2016 at 1:14 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I, for one, agree with this position.

Well, I, for one, find it frustrating. It seems pretty unhelpful to
bring this up only after the code has already been written. The first
post on this thread was on May 10th. The first version of the patch
was posted on June 16th. This position was first articulated on
September 15th.

Really, what do we have to lose at this point? It's not very difficult
to do what Andres proposes.

Well, first of all, I can't, because I don't really understand what
tests he has in mind. Maybe somebody else does, in which case perhaps
they could do the work and post the results. If the tests really are
simple, that shouldn't be much of a burden.

But, second, suppose we do the tests and find out that the
hash-over-btree idea completely trounces hash indexes. What then? I
don't think that would really prove anything because, as I said in my
email to Andres, the current hash index code is severely
under-optimized, so it's not really an apples-to-apples comparison.
But even if it did prove something, is the idea then that Amit (with
help from Mithun and Ashutosh Sharma) should throw away the ~8 months
of development work that's been done on hash indexes in favor of
starting all over with a new and probably harder project to build a
whole new AM, and just leave hash indexes broken? That doesn't seem
like a very reasonable thing to ask. Leaving hash indexes broken
fixes no problem that we have.

On the other hand, applying those patches (after they've been suitably
reviewed and fixed up) does fix several things. For one thing, we can
stop shipping a totally broken feature in release after release. For
another thing, those hash indexes do in fact outperform btree on some
workloads, and with more work they can probably beat btree on more
workloads. And if somebody later wants to write hash-over-btree and
that turns out to be better still, great! I'm not blocking anyone
from doing that.

The only argument that's been advanced for not fixing hash indexes is
that we'd then have to give people accurate guidance on whether to
choose hash or btree, but that would also be true of a hypothetical
hash-over-btree AM.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#94Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#92)
Re: Hash Indexes

On 30-Sep-2016 6:24 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:

On Thu, Sep 29, 2016 at 8:29 PM, Andres Freund <andres@anarazel.de> wrote:

On September 29, 2016 5:28:00 PM PDT, Robert Haas <robertmhaas@gmail.com>

wrote:

On Thu, Sep 29, 2016 at 8:16 PM, Andres Freund <andres@anarazel.de>
wrote:

Well, I, for one, find it frustrating. It seems pretty unhelpful to
bring this up only after the code has already been written.

I brought this up in person at pgcon too.

To whom? In what context?

Amit, over dinner.

OK, well, I can't really comment on that, then, except to say that if
you waited three months to follow up on the mailing list, you probably
can't blame Amit if he thought that it was more of a casual suggestion
than a serious objection. Maybe it was? I don't know.

Both of them have talked about hash indexes with me offline. Peter
mentioned that it would be better to improve btree rather than hash
indexes. IIRC, Andres asked me mainly about what use cases I have in
mind for hash indexes, and then we had some further discussion on the
same thing, where he was not convinced that there is any big use case
for hash indexes, even though there may be some cases. In that
discussion, as he is saying and I don't doubt him, he would have told me
the alternative, but it was not apparent to me that he was expecting
some sort of comparison.

What I got from both of those discussions was a friendly gesture that it
might be a better use of my time if I worked on some other problem. I
really respect suggestions from both of them, but it was nowhere clear
to me that either of them was expecting any comparison with the other
approach.

Considering that I have missed the real intention of their suggestions,
I think such a serious objection to any work should be discussed on the
list. To answer the actual objection, I have already mentioned upthread
that we can deduce from the current tests done by Jesper and Mithun
(tests done over integer columns) that there are some cases where a hash
index will be better than hash-over-btree. I think any discussion on
whether we should consider not improving the current hash indexes is
only meaningful if someone has code which can prove both theoretically
and practically that it is better than hash indexes for all usages.

Note - excuse me for the formatting of this email, as I am travelling
and using my phone.

With Regards,
Amit Kapila.

#95Peter Geoghegan
pg@heroku.com
In reply to: Amit Kapila (#94)
Re: Hash Indexes

On Fri, Sep 30, 2016 at 9:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Considering, I have missed the real intention of their suggestions, I think
such a serious objection on any work should be discussed on list. To answer
the actual objection, I have already mentioned upthread that we can deduce
from the current tests done by Jesper and Mithun that there are some cases
where hash index will be better than hash-over-btree (tests done over
integer columns). I think any discussion on whether we should consider not
to improve current hash indexes is only meaningful if some one has a code
which can prove both theoretically and practically that it is better than
hash indexes for all usages.

I cannot speak for Andres, but you judged my intent here correctly. I
have no firm position on any of this just yet; I haven't even read the
patch. I just think that it is worth doing some simple analysis of a
hash-over-btree implementation, with simple prototyping and a simple
test-case. I would consider that a due-diligence thing, because,
honestly, it seems obvious to me that it should be at least checked
out informally.

I wasn't aware that there was already some analysis of this. Robert
did just acknowledge that it is *possible* that "the hash-over-btree
idea completely trounces hash indexes", so the general tone of this
thread suggested to me that there was little or no analysis of
hash-over-btree. I'm willing to believe that I'm wrong to be
dismissive of the hash AM in general, and I'm even willing to be
flexible on crediting the hash AM with being less optimized overall
(assuming we can see a way past that).

My only firm position is that it wouldn't be very hard to investigate
hash-over-btree to Andres' satisfaction, say, so, why not? I'm
surprised that this has caused consternation -- ISTM that Andres'
suggestion is *perfectly* reasonable. It doesn't appear to be an
objection to anything in particular.

--
Peter Geoghegan


#96Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#95)
Re: Hash Indexes

On Fri, Sep 30, 2016 at 7:47 AM, Peter Geoghegan <pg@heroku.com> wrote:

My only firm position is that it wouldn't be very hard to investigate
hash-over-btree to Andres' satisfaction, say, so, why not? I'm
surprised that this has caused consternation -- ISTM that Andres'
suggestion is *perfectly* reasonable. It doesn't appear to be an
objection to anything in particular.

I would just be very disappointed if, after the amount of work that
Amit and others have put into this project, the code gets rejected
because somebody thinks a different project would have been more worth
doing. As Tom said upthread: "But to kick the hash AM as such to the
curb is to say 'sorry, there will never be O(1) index lookups in
Postgres'." I
think that's correct and a sufficiently-good reason to pursue this
work, regardless of the merits (or lack of merits) of hash-over-btree.
The fact that we have hash indexes already and cannot remove them
because too much other code depends on hash opclasses is also, in my
opinion, a sufficiently good reason to pursue improving them. I don't
think the project needs the additional justification of outperforming
a hash-over-btree in order to exist, even if such a comparison could
be done fairly, which I suspect is harder than you're crediting.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#97Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#96)
Re: Hash Indexes

On Fri, Sep 30, 2016 at 4:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I would just be very disappointed if, after the amount of work that
Amit and others have put into this project, the code gets rejected
because somebody thinks a different project would have been more worth
doing.

I wouldn't presume to tell anyone else how to spend their time, and am
not concerned about this making the hash index code any less useful
from the user's perspective. If this is how we remove the wart of hash
indexes not being WAL-logged, that's fine by me. I am trying to be
helpful.

As Tom said upthread: "But to kick the hash AM as such to the
curb is to say 'sorry, there will never be O(1) index lookups in
Postgres'." I
think that's correct and a sufficiently-good reason to pursue this
work, regardless of the merits (or lack of merits) of hash-over-btree.

I don't think that "O(1) index lookups" is a useful guarantee with a
very expensive constant factor. Amit said: "I think any discussion on
whether we should consider not to improve current hash indexes is only
meaningful if some one has a code which can prove both theoretically
and practically that it is better than hash indexes for all usages",
so I think that he shares this view.

The fact that we have hash indexes already and cannot remove them
because too much other code depends on hash opclasses is also, in my
opinion, a sufficiently good reason to pursue improving them.

I think that Andres was suggesting that hash index opclasses would be
usable with hash-over-btree, so you might still not end up with the
wart of having hash opclasses without hash indexes (an idea that has
been proposed and rejected at least once before now). Andres?

To be clear: I haven't expressed any opinion on this patch.

--
Peter Geoghegan


#98Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#97)
Re: Hash Indexes

On Fri, Sep 30, 2016 at 4:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I would just be very disappointed if, after the amount of work that
Amit and others have put into this project, the code gets rejected
because somebody thinks a different project would have been more worth
doing.

I wouldn't presume to tell anyone else how to spend their time, and am
not concerned about this patch making the hash index code any less
useful from the user's perspective. If this is how we remove the wart
of hash indexes not being WAL-logged, that's fine by me. I'm trying to
be helpful.

As Tom said upthread: "But to kick the hash AM as such to the
curb is to say 'sorry, there will never be O(1) index lookups in
Postgres'." I
think that's correct and a sufficiently-good reason to pursue this
work, regardless of the merits (or lack of merits) of hash-over-btree.

I don't think that "O(1) index lookups" is a useful guarantee with a
very expensive constant factor. Amit seemed to agree with this, since
he spoke of the importance of both theoretical performance benefits
and practically realizable performance benefits.

The fact that we have hash indexes already and cannot remove them
because too much other code depends on hash opclasses is also, in my
opinion, a sufficiently good reason to pursue improving them.

I think that Andres was suggesting that hash index opclasses would be
usable with hash-over-btree, so you might still not end up with the
wart of having hash opclasses without hash indexes (an idea that has
been proposed and rejected at least once before).

--
Peter Geoghegan


#99Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#97)
Re: Hash Indexes

Peter Geoghegan <pg@heroku.com> writes:

On Fri, Sep 30, 2016 at 4:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:

The fact that we have hash indexes already and cannot remove them
because too much other code depends on hash opclasses is also, in my
opinion, a sufficiently good reason to pursue improving them.

I think that Andres was suggesting that hash index opclasses would be
usable with hash-over-btree, so you might still not end up with the
wart of having hash opclasses without hash indexes (an idea that has
been proposed and rejected at least once before now). Andres?

That's an interesting point. If we were to flat-out replace the hash AM
with a hash-over-btree AM, the existing hash opclasses would just migrate
to that unchanged. But if someone wanted to add hash-over-btree alongside
the hash AM, it would be necessary to clone all those opclass entries,
or else find a way for the two AMs to share pg_opclass etc entries.
Either one of those is kind of annoying. (Although if we did do the work
of implementing the latter, it might come in handy in future; you could
certainly imagine that there will be cases like a next-generation GIST AM
wanting to reuse the opclasses of existing GIST, say.)

But having said that, I remain opposed to removing the hash AM.
If someone wants to implement hash-over-btree, that's cool with me,
but I'd much rather put it in beside plain hash and let them duke
it out in the field.

regards, tom lane


#100Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#97)
Re: Hash Indexes

On 2016-09-30 17:39:04 +0100, Peter Geoghegan wrote:

On Fri, Sep 30, 2016 at 4:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I would just be very disappointed if, after the amount of work that
Amit and others have put into this project, the code gets rejected
because somebody thinks a different project would have been more worth
doing.

I wouldn't presume to tell anyone else how to spend their time, and am
not concerned about this making the hash index code any less useful
from the user's perspective.

Me neither.

I'm concerned that this is a heck of a lot of work, and I don't think
we've reached the end of it by a good bit. I think it would have, and
probably still is, a more efficient use of time to go for the
hash-via-btree method, and rip out the current hash indexes. But that's
just me.

I find it more than a bit odd to be accused of trying to waste others'
time by saying this, and that this is too late because time has already
been invested. Especially the latter never has been a standard in the
community, and while excruciatingly painful when one is the person(s)
having invested the time, it probably shouldn't be.

The fact that we have hash indexes already and cannot remove them
because too much other code depends on hash opclasses is also, in my
opinion, a sufficiently good reason to pursue improving them.

I think that Andres was suggesting that hash index opclasses would be
usable with hash-over-btree, so you might still not end up with the
wart of having hash opclasses without hash indexes (an idea that has
been proposed and rejected at least once before now). Andres?

Yes, that was what I was pretty much thinking. I was kind of guessing
that this might be easiest implemented as a separate AM ("hash2" ;))
that's just a layer on top of nbtree.

Greetings,

Andres Freund


#101Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Geoghegan (#98)
Re: Hash Indexes

On 30-Sep-2016 10:26 PM, "Peter Geoghegan" <pg@heroku.com> wrote:

On Fri, Sep 30, 2016 at 4:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I would just be very disappointed if, after the amount of work that
Amit and others have put into this project, the code gets rejected
because somebody thinks a different project would have been more worth
doing.

I wouldn't presume to tell anyone else how to spend their time, and am
not concerned about this patch making the hash index code any less
useful from the user's perspective. If this is how we remove the wart
of hash indexes not being WAL-logged, that's fine by me. I'm trying to
be helpful.

If that is fine, then I think we should do that. I want to bring to
your notice that we have already seen and reported that, with the proposed
set of patches, hash indexes are a good bit faster than btree, which adds
additional value to making them WAL-logged.

As Tom said upthread: "But to kick the hash AM as such to the
curb is to say 'sorry, there will never be O(1) index lookups in
Postgres'." I
think that's correct and a sufficiently-good reason to pursue this
work, regardless of the merits (or lack of merits) of hash-over-btree.

I don't think that "O(1) index lookups" is a useful guarantee with a
very expensive constant factor.

The constant factor doesn't play much of a role when the data has no
duplicates, or only a few.

Amit seemed to agree with this, since

he spoke of the importance of both theoretical performance benefits
and practically realizable performance benefits.

No, I don't agree with that; rather, I think hash indexes are theoretically
faster than btree, and we have seen that practically as well in quite a few
cases (for read workloads, when used with unique data and also in nested
loops).

With Regards,
Amit Kapila

#102Noname
ktm@rice.edu
In reply to: Andres Freund (#100)
Re: Hash Indexes

Andres Freund <andres@anarazel.de>:

On 2016-09-30 17:39:04 +0100, Peter Geoghegan wrote:

On Fri, Sep 30, 2016 at 4:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I would just be very disappointed if, after the amount of work that
Amit and others have put into this project, the code gets rejected
because somebody thinks a different project would have been more worth
doing.

I wouldn't presume to tell anyone else how to spend their time, and am
not concerned about this making the hash index code any less useful
from the user's perspective.

Me neither.

I'm concerned that this is a heck of a lot of work, and I don't think
we've reached the end of it by a good bit. I think it would have, and
probably still is, a more efficient use of time to go for the
hash-via-btree method, and rip out the current hash indexes. But that's
just me.

I find it more than a bit odd to be accused of trying to waste others
time by saying this, and that this is too late because time has already
been invested. Especially the latter never has been a standard in the
community, and while excruciatingly painful when one is the person(s)
having invested the time, it probably shouldn't be.

The fact that we have hash indexes already and cannot remove them
because too much other code depends on hash opclasses is also, in my
opinion, a sufficiently good reason to pursue improving them.

I think that Andres was suggesting that hash index opclasses would be
usable with hash-over-btree, so you might still not end up with the
wart of having hash opclasses without hash indexes (an idea that has
been proposed and rejected at least once before now). Andres?

Yes, that was what I was pretty much thinking. I was kind of guessing
that this might be easiest implemented as a separate AM ("hash2" ;))
that's just a layer ontop of nbtree.

Greetings,

Andres Freund

Hi,

There have been benchmarks posted over the years where even the non-WAL-logged
hash index outperformed the btree variant. You cannot argue
against O(1) algorithmic behavior. We need to have a usable hash index
so that others can help improve it.

My 2 cents.

Regards,
Ken


#103Greg Stark
stark@mit.edu
In reply to: Robert Haas (#93)
Re: Hash Indexes

On Fri, Sep 30, 2016 at 2:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:

For one thing, we can stop shipping a totally broken feature in release after release

For what it's worth I'm for any patch that can accomplish that.

In retrospect I think we should have done the hash-over-btree thing
ten years ago but we didn't and if Amit's patch makes hash indexes
recoverable today then go for it.

--
greg


#104Michael Paquier
michael.paquier@gmail.com
In reply to: Greg Stark (#103)
Re: Hash Indexes

On Sun, Oct 2, 2016 at 3:31 AM, Greg Stark <stark@mit.edu> wrote:

On Fri, Sep 30, 2016 at 2:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:

For one thing, we can stop shipping a totally broken feature in release after release

For what it's worth I'm for any patch that can accomplish that.

In retrospect I think we should have done the hash-over-btree thing
ten years ago but we didn't and if Amit's patch makes hash indexes
recoverable today then go for it.

+1.
-- 
Michael


#105Pavel Stehule
pavel.stehule@gmail.com
In reply to: Michael Paquier (#104)
Re: Hash Indexes

2016-10-02 12:40 GMT+02:00 Michael Paquier <michael.paquier@gmail.com>:

On Sun, Oct 2, 2016 at 3:31 AM, Greg Stark <stark@mit.edu> wrote:

On Fri, Sep 30, 2016 at 2:11 AM, Robert Haas <robertmhaas@gmail.com>

wrote:

For one thing, we can stop shipping a totally broken feature in release

after release

For what it's worth I'm for any patch that can accomplish that.

In retrospect I think we should have done the hash-over-btree thing
ten years ago but we didn't and if Amit's patch makes hash indexes
recoverable today then go for it.

+1.

+1

Pavel



#106Michael Paquier
michael.paquier@gmail.com
In reply to: Pavel Stehule (#105)
Re: Hash Indexes

On Mon, Oct 3, 2016 at 12:42 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

2016-10-02 12:40 GMT+02:00 Michael Paquier <michael.paquier@gmail.com>:

On Sun, Oct 2, 2016 at 3:31 AM, Greg Stark <stark@mit.edu> wrote:

On Fri, Sep 30, 2016 at 2:11 AM, Robert Haas <robertmhaas@gmail.com>
wrote:

For one thing, we can stop shipping a totally broken feature in release
after release

For what it's worth I'm for any patch that can accomplish that.

In retrospect I think we should have done the hash-over-btree thing
ten years ago but we didn't and if Amit's patch makes hash indexes
recoverable today then go for it.

+1.

+1

And moved to the next CF to make it breathe.
--
Michael


#107Jeff Janes
jeff.janes@gmail.com
In reply to: Robert Haas (#86)
Re: Hash Indexes

On Thu, Sep 29, 2016 at 5:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 29, 2016 at 8:07 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Sep 28, 2016 at 8:06 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-09-28 15:04:30 -0400, Robert Haas wrote:

Andres already
stated that he things working on btree-over-hash would be more
beneficial than fixing hash, but at this point it seems like he's the
only one who takes that position.

Note that I did *NOT* take that position. I was saying that I think we
should evaluate whether that's not a better approach, doing some simple
performance comparisons.

I, for one, agree with this position.

Well, I, for one, find it frustrating. It seems pretty unhelpful to
bring this up only after the code has already been written. The first
post on this thread was on May 10th. The first version of the patch
was posted on June 16th. This position was first articulated on
September 15th.

But, by all means, please feel free to do the performance comparison
and post the results. I'd be curious to see them myself.

I've done a simple comparison using pgbench's default transaction, in which
all the primary keys have been dropped and replaced with indexes of either
hash or btree type, alternating over many rounds.

I run 'pgbench -c16 -j16 -T 900 -M prepared' on an 8 core machine with a
scale of 40. All the data fits in RAM, but not in shared_buffers (128MB).

I find a 4% improvement for hash indexes over btree indexes, 9324.744
vs 9727.766. The difference is significant at p-value of 1.9e-9.

The four versions of hash indexes (HEAD, concurrent, wal, cache, applied
cumulatively) have no statistically significant difference in performance
from each other.

I certainly don't see how btree-over-hash-over-integer could be better than
direct btree-over-integer.

I think I don't see improvement in hash performance with the concurrent and
cache patches because I don't have enough cores to get to the contention
that those patches are targeted at. But since the concurrent patch is a
prerequisite to the wal patch, that is enough to justify it even without a
demonstrated performance boost. A 4% gain is not astonishing, but is nice
to have provided we can get it without giving up crash safety.

Cheers,

Jeff

#108Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Janes (#107)
Re: Hash Indexes

Jeff Janes <jeff.janes@gmail.com> writes:

I've done a simple comparison using pgbench's default transaction, in which
all the primary keys have been dropped and replaced with indexes of either
hash or btree type, alternating over many rounds.

I run 'pgbench -c16 -j16 -T 900 -M prepared' on an 8 core machine with a
scale of 40. All the data fits in RAM, but not in shared_buffers (128MB).

I find a 4% improvement for hash indexes over btree indexes, 9324.744
vs 9727.766. The difference is significant at p-value of 1.9e-9.

Thanks for doing this work!

The four versions of hash indexes (HEAD, concurrent, wal, cache, applied
cumulatively) have no statistically significant difference in performance
from each other.

Interesting.

I think I don't see improvement in hash performance with the concurrent and
cache patches because I don't have enough cores to get to the contention
that those patches are targeted at.

Possibly. However, if the cache patch is not a prerequisite to the WAL
fixes, IMO somebody would have to demonstrate that it has a measurable
performance benefit before it would get in. It certainly doesn't look
like it's simplifying the code, so I wouldn't take it otherwise.

I think, though, that this is enough to put to bed the argument that
we should toss the hash AM entirely. If it's already competitive with
btree today, despite the lack of attention that it's gotten, there is
reason to hope that it will be a significant win (for some use-cases,
obviously) in future. We should now get back to reviewing these patches
on their own merits.

regards, tom lane


#109Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#84)
Re: Hash Indexes

On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

As I was looking at the old text regarding deadlock risk, I realized
what may be a serious problem. Suppose process A is performing a scan
of some hash index. While the scan is suspended, it attempts to take
a lock and is blocked by process B. Process B, meanwhile, is running
VACUUM on that hash index. Eventually, it will do
LockBufferForCleanup() on the hash bucket on which process A holds a
buffer pin, resulting in an undetected deadlock. In the current
coding, A would hold a heavyweight lock and B would attempt to acquire
a conflicting heavyweight lock, and the deadlock detector would kill
one of them. This patch probably breaks that. I notice that that's
the only place where we attempt to acquire a buffer cleanup lock
unconditionally; every place else, we acquire the lock conditionally,
so there's no deadlock risk. Once we resolve this problem, the
paragraph about deadlock risk in this section should be revised to
explain whatever solution we come up with.

By the way, since VACUUM must run in its own transaction, B can't be
holding arbitrary locks, but that doesn't seem quite sufficient to get
us out of the woods. It will at least hold ShareUpdateExclusiveLock
on the relation being vacuumed, and process A could attempt to acquire
that same lock.

Right, I think there is a danger of deadlock in above situation.
Needs some more thoughts.

I think one way to avoid the risk of deadlock in the above scenario is to
take the cleanup lock conditionally: if we get the cleanup lock, then
we delete the items as we are doing in the patch now; otherwise, we
just mark the tuples as dead and ensure that we don't try to remove
tuples that are moved-by-split. Now, I think the question is how these
dead tuples will be removed. We anyway need a separate mechanism to
clear dead tuples for hash indexes, because during scans we mark
tuples as dead if the corresponding heap tuple is dead, and those are
not removed later. This is already taken care of in the btree code via
the kill_prior_tuple optimization. So I think clearing of dead tuples
can be handled by a separate patch.

I have also thought about using the page-scan-at-a-time idea which has
been discussed upthread [1], but I think we can't completely eliminate
the need to out-wait scans (the cleanup lock requirement) for scans that
are started while a split is in progress, or for non-MVCC scans, as
described in that e-mail [1]. We might be able to find some way to solve
the problem with this approach, but I think it will be slightly more
complicated and require much more work compared to the previous approach.

What is your preference among the above approaches to resolve this
problem? Or let me know if you have a better idea to solve it.

[1]: /messages/by-id/CAA4eK1Jj1UqneTXrywr=Gg87vgmnMma87LuscN_r3hKaHd=L2g@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#110Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#109)
Re: Hash Indexes

On Tue, Oct 4, 2016 at 10:06 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

As I was looking at the old text regarding deadlock risk, I realized
what may be a serious problem. Suppose process A is performing a scan
of some hash index. While the scan is suspended, it attempts to take
a lock and is blocked by process B. Process B, meanwhile, is running
VACUUM on that hash index. Eventually, it will do
LockBufferForCleanup() on the hash bucket on which process A holds a
buffer pin, resulting in an undetected deadlock. In the current
coding, A would hold a heavyweight lock and B would attempt to acquire
a conflicting heavyweight lock, and the deadlock detector would kill
one of them. This patch probably breaks that. I notice that that's
the only place where we attempt to acquire a buffer cleanup lock
unconditionally; every place else, we acquire the lock conditionally,
so there's no deadlock risk. Once we resolve this problem, the
paragraph about deadlock risk in this section should be revised to
explain whatever solution we come up with.

By the way, since VACUUM must run in its own transaction, B can't be
holding arbitrary locks, but that doesn't seem quite sufficient to get
us out of the woods. It will at least hold ShareUpdateExclusiveLock
on the relation being vacuumed, and process A could attempt to acquire
that same lock.

Right, I think there is a danger of deadlock in above situation.
Needs some more thoughts.

I think one way to avoid the risk of deadlock in above scenario is to
take the cleanup lock conditionally, if we get the cleanup lock then
we will delete the items as we are doing in patch now, else it will
just mark the tuples as dead and ensure that it won't try to remove
tuples that are moved-by-split. Now, I think the question is how will
these dead tuples be removed. We anyway need a separate mechanism to
clear dead tuples for hash indexes as during scans we are marking the
tuples as dead if corresponding tuple in heap is dead which are not
removed later. This is already taken care in btree code via
kill_prior_tuple optimization. So I think clearing of dead tuples can
be handled by a separate patch.

I think we can also remove the dead tuples the next time vacuum
visits the bucket and is able to acquire the cleanup lock. Right now,
we just check whether the corresponding heap tuple is dead; we can add
an additional check so that if the current item is already marked dead
in the index, it is also considered in the list of deletable items.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#111Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#109)
Re: Hash Indexes

On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think one way to avoid the risk of deadlock in above scenario is to
take the cleanup lock conditionally, if we get the cleanup lock then
we will delete the items as we are doing in patch now, else it will
just mark the tuples as dead and ensure that it won't try to remove
tuples that are moved-by-split. Now, I think the question is how will
these dead tuples be removed. We anyway need a separate mechanism to
clear dead tuples for hash indexes as during scans we are marking the
tuples as dead if corresponding tuple in heap is dead which are not
removed later. This is already taken care in btree code via
kill_prior_tuple optimization. So I think clearing of dead tuples can
be handled by a separate patch.

That seems like it could work. The hash scan code will need to be
made smart enough to ignore any tuples marked dead, if it isn't
already. More aggressive cleanup can be left for another patch.

I have also thought about using page-scan-at-a-time idea which has
been discussed upthread[1], but I think we can't completely eliminate
the need to out-wait scans (cleanup lock requirement) for scans that
are started when split-in-progress or for non-MVCC scans as described
in that e-mail [1]. We might be able to find some way to solve the
problem with this approach, but I think it will be slightly
complicated and much more work is required as compare to previous
approach.

There are several levels of aggressiveness here with different locking
requirements:

1. Mark line items dead without reorganizing the page. Needs an
exclusive content lock, no more. Even a shared content lock may be
OK, as for other opportunistic bit-flipping.
2. Mark line items dead and compact the tuple data. If a pin is
sufficient to look at tuple data, as it is for the heap, then a
cleanup lock is required here. But if we always hold a shared content
lock when looking at the tuple data, it might be possible to do this
with just an exclusive content lock.
3. Remove dead line items completely, compacting the tuple data and
the item-pointer array. Doing this with only an exclusive content
lock certainly needs page-at-a-time mode because otherwise a searcher
that resumes a scan later might resume from the wrong place. It also
needs the guarantee mentioned for point #2, namely that nobody will be
examining the tuple data without a shared content lock.
4. Squeezing the bucket. This is probably always going to require a
cleanup lock, because otherwise it's pretty unclear how a concurrent
scan could be made safe. I suppose the scan could remember every TID
it has seen, somehow detect that a squeeze had happened, and rescan
the whole bucket ignoring TIDs already returned, but that seems to
require the client to use potentially unbounded amounts of memory to
remember already-returned TIDs, plus an as-yet-uninvented mechanism
for detecting that a squeeze has happened. So this seems like a
dead-end to me.

I think that it is very much worthwhile to reduce the required lock
strength from cleanup-lock to exclusive-lock in as many cases as
possible, but I don't think it will be possible to completely
eliminate the need to take the cleanup lock in some cases. However,
if we can always take the cleanup lock conditionally and never be in a
situation where it's absolutely required, we should be OK - and even
level (1) gives you that.
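
To make the fallback concrete, here is a rough C sketch of level (1)
combined with the conditional cleanup-lock idea; hash_tuple_is_deletable()
and the variable names are placeholders for illustration, not code from
the patch:

    if (ConditionalLockBufferForCleanup(bucket_buf))
    {
        /* We are the sole pin holder: free to reorganize the bucket. */
        /* ... physically remove dead tuples, squeeze the bucket ... */
        LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
    }
    else
    {
        Page         page;
        OffsetNumber off,
                     maxoff;

        /* Level (1): only flip LP_DEAD bits under an exclusive content lock. */
        LockBuffer(bucket_buf, BUFFER_LOCK_EXCLUSIVE);
        page = BufferGetPage(bucket_buf);
        maxoff = PageGetMaxOffsetNumber(page);

        for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
        {
            if (hash_tuple_is_deletable(rel, page, off))    /* hypothetical */
                ItemIdMarkDead(PageGetItemId(page, off));
        }

        MarkBufferDirtyHint(bucket_buf, true);
        LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
    }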

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#112Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#111)
Re: Hash Indexes

On Wed, Oct 5, 2016 at 10:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think one way to avoid the risk of deadlock in above scenario is to
take the cleanup lock conditionally, if we get the cleanup lock then
we will delete the items as we are doing in patch now, else it will
just mark the tuples as dead and ensure that it won't try to remove
tuples that are moved-by-split. Now, I think the question is how will
these dead tuples be removed. We anyway need a separate mechanism to
clear dead tuples for hash indexes as during scans we are marking the
tuples as dead if corresponding tuple in heap is dead which are not
removed later. This is already taken care in btree code via
kill_prior_tuple optimization. So I think clearing of dead tuples can
be handled by a separate patch.

That seems like it could work. The hash scan code will need to be
made smart enough to ignore any tuples marked dead, if it isn't
already.

It already takes care of ignoring killed tuples in the code below, though
in a way that is much less efficient than btree's. Basically, it
fetches the matched tuple and then checks whether it is dead, whereas
btree does the same check while matching the key. It might be more
efficient to do it before matching the hash key, but I think that is a
matter for a separate patch.
hashgettuple()
{
..
    /*
     * Skip killed tuples if asked to.
     */
    if (scan->ignore_killed_tuples)
..
}
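
A rough sketch of the "check before matching the hash key" idea
(illustrative only; the loop shape and variables are not from the patch,
though _hash_get_indextuple_hashkey() and the item macros are the existing
ones):

    for (offnum = FirstOffsetNumber; offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      itemid = PageGetItemId(page, offnum);
        IndexTuple  itup;

        /* Skip LP_DEAD items before doing any key comparison. */
        if (scan->ignore_killed_tuples && ItemIdIsDead(itemid))
            continue;

        itup = (IndexTuple) PageGetItem(page, itemid);
        if (_hash_get_indextuple_hashkey(itup) == hashkey)
        {
            /* ... remember or return this match ... */
        }
    }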

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#113Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#84)
Re: Hash Indexes

On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Another thing I don't quite understand about this algorithm is that in
order to conditionally lock the target primary bucket page, we'd first
need to read and pin it. And that doesn't seem like a good thing to
do while we're holding a shared content lock on the metapage, because
of the principle that we don't want to hold content locks across I/O.

Aren't we already doing this during BufferAlloc() when the buffer
selected by StrategyGetBuffer() is dirty?

I think we can release metapage content lock before reading the buffer.

On thinking about this again: if we release the metapage content lock
before reading and pinning the primary bucket page, then we need to
take it again to verify whether a split has happened during the time we
didn't hold a lock on the metapage. Releasing and re-taking the content
lock on the metapage is not good from a performance perspective. Do you
have some other idea for this?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#114Jeff Janes
jeff.janes@gmail.com
In reply to: Amit Kapila (#113)
Re: Hash Indexes

On Mon, Oct 10, 2016 at 5:55 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Another thing I don't quite understand about this algorithm is that in
order to conditionally lock the target primary bucket page, we'd first
need to read and pin it. And that doesn't seem like a good thing to
do while we're holding a shared content lock on the metapage, because
of the principle that we don't want to hold content locks across I/O.

Aren't we already doing this during BufferAlloc() when the buffer
selected by StrategyGetBuffer() is dirty?

Right, you probably shouldn't allocate another buffer while holding a
content lock on a different one, if you can help it. But, BufferAlloc
doesn't do that internally, does it? It is only a problem if you make it
be one by the way you use it. Am I missing something?

I think we can release metapage content lock before reading the buffer.

On thinking about this again, if we release the metapage content lock
before reading and pinning the primary bucket page, then we need to
take it again to verify if the split has happened during the time we
don't have a lock on a metapage. Releasing and again taking content
lock on metapage is not
good from the performance aspect. Do you have some other idea for this?

Doesn't the relcache patch effectively deal with this? If this is a
sticking point, maybe the relcache patch could be incorporated into this
one.

Cheers,

Jeff

#115Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Janes (#114)
Re: Hash Indexes

On Mon, Oct 10, 2016 at 10:07 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Mon, Oct 10, 2016 at 5:55 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas <robertmhaas@gmail.com>
wrote:

Another thing I don't quite understand about this algorithm is that in
order to conditionally lock the target primary bucket page, we'd first
need to read and pin it. And that doesn't seem like a good thing to
do while we're holding a shared content lock on the metapage, because
of the principle that we don't want to hold content locks across I/O.

Aren't we already doing this during BufferAlloc() when the buffer
selected by StrategyGetBuffer() is dirty?

Right, you probably shouldn't allocate another buffer while holding a
content lock on a different one, if you can help it.

I don't see the problem in that, but I guess the simple rule is that
we should not hold content locks for a long duration, which could
happen if we do I/O or need to allocate a new buffer.

But, BufferAlloc
doesn't do that internally, does it?

You are right that BufferAlloc() doesn't allocate a new buffer while
holding a content lock on another buffer, but it does perform I/O while
holding a content lock.

It is only a problem if you make it be
one by the way you use it. Am I missing something?

I think we can release metapage content lock before reading the buffer.

On thinking about this again, if we release the metapage content lock
before reading and pinning the primary bucket page, then we need to
take it again to verify if the split has happened during the time we
don't have a lock on a metapage. Releasing and again taking content
lock on metapage is not
good from the performance aspect. Do you have some other idea for this?

Doesn't the relcache patch effectively deal wit hthis? If this is a
sticking point, maybe the relcache patch could be incorporated into this
one.

Yeah, the relcache patch would eliminate the need for metapage locking,
but that is not a blocking point. As this patch is mainly to enable
WAL logging, there is no urgency to incorporate the relcache patch,
even if we have to go with an algorithm where we need to take the
metapage lock twice to verify splits. Having said that, I am okay with
it if Robert and/or others are also in favour of combining the two
patches (the patch in this thread and the cache-the-metapage patch). If
we don't want to hold a content lock across another ReadBuffer call,
then another option could be to modify the read algorithm as below:

read the metapage
compute bucket number for target hash key based on metapage contents
read the required block
loop:
    acquire shared content lock on metapage
    recompute bucket number for target hash key based on metapage contents
    if the recomputed block number is not same as the block number we read
        release meta page content lock
        read the recomputed block number
    else
        break;
if (we can't get a shared content lock on the target bucket without blocking)
    loop:
        release meta page content lock
        take a shared content lock on the target primary bucket page
        take a shared content lock on the metapage
        if (previously-computed target bucket has not been split)
            break;

The basic change here is that we first compute the target block number
*without* locking the metapage, and then, after locking the metapage, if
the two don't match, we need to read the recomputed block
number again.
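
For reference, the "compute bucket number ... based on metapage contents"
step above boils down to something like the following (a sketch using the
existing helpers _hash_hashkey2bucket() and BUCKET_TO_BLKNO(); variable
names are illustrative):

    metap = HashPageGetMeta(BufferGetPage(metabuf));

    bucket = _hash_hashkey2bucket(hashkey,
                                  metap->hashm_maxbucket,
                                  metap->hashm_highmask,
                                  metap->hashm_lowmask);
    blkno = BUCKET_TO_BLKNO(metap, bucket);

    if (blkno != oldblkno)
    {
        /* a concurrent split moved our target: re-read the new primary page */
        ReleaseBuffer(buf);
        buf = ReadBuffer(rel, blkno);
    }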

Thoughts?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#116Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#111)
Re: Hash Indexes

On Wed, Oct 5, 2016 at 10:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think one way to avoid the risk of deadlock in above scenario is to
take the cleanup lock conditionally, if we get the cleanup lock then
we will delete the items as we are doing in patch now, else it will
just mark the tuples as dead and ensure that it won't try to remove
tuples that are moved-by-split. Now, I think the question is how will
these dead tuples be removed. We anyway need a separate mechanism to
clear dead tuples for hash indexes as during scans we are marking the
tuples as dead if corresponding tuple in heap is dead which are not
removed later. This is already taken care in btree code via
kill_prior_tuple optimization. So I think clearing of dead tuples can
be handled by a separate patch.

That seems like it could work.

I have implemented this idea and it works for MVCC scans. However, I
think this might not work for non-MVCC scans. Consider a case where,
in Process-1, a hash scan has returned one row, and before it can check
the row's validity in the heap, vacuum marks that tuple as dead, removes
the entry from the heap, and some new tuple is placed at that offset in
the heap. Now when Process-1 checks the validity in the heap, it will
check a different tuple than the one the index tuple was supposed to
check. If we want, we can make it work similarly to what btree does, as
being discussed on thread [1], but for that we need to introduce
page-scan mode as well in hash indexes. However, do we really want to
solve this problem as part of this patch when it exists for other index
AMs as well?

[1]: /messages/by-id/CACjxUsNtBXe1OfRp=acB+8QFAVWJ=nr55_HMmqQYceCzVGF4tQ@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#117Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#116)
Re: Hash Indexes

On Tue, Oct 18, 2016 at 5:37 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Oct 5, 2016 at 10:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think one way to avoid the risk of deadlock in above scenario is to
take the cleanup lock conditionally, if we get the cleanup lock then
we will delete the items as we are doing in patch now, else it will
just mark the tuples as dead and ensure that it won't try to remove
tuples that are moved-by-split. Now, I think the question is how will
these dead tuples be removed. We anyway need a separate mechanism to
clear dead tuples for hash indexes as during scans we are marking the
tuples as dead if corresponding tuple in heap is dead which are not
removed later. This is already taken care in btree code via
kill_prior_tuple optimization. So I think clearing of dead tuples can
be handled by a separate patch.

That seems like it could work.

I have implemented this idea and it works for MVCC scans. However, I
think this might not work for non-MVCC scans. Consider a case where
in Process-1, hash scan has returned one row and before it could check
it's validity in heap, vacuum marks that tuple as dead and removed the
entry from heap and some new tuple has been placed at that offset in
heap.

Oops, that's bad.

Now when Process-1 checks the validity in heap, it will check
for different tuple then what the index tuple was suppose to check.
If we want, we can make it work similar to what btree does as being
discussed on thread [1], but for that we need to introduce page-scan
mode as well in hash indexes. However, do we really want to solve
this problem as part of this patch when this exists for other index am
as well?

For what other index AM does this problem exist? Kevin has been
careful not to create this problem for btree, or at least I think he
has. That's why the pin still has to be held on the index page when
it's a non-MVCC scan.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#118Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#117)
Re: Hash Indexes

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Oct 18, 2016 at 5:37 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have implemented this idea and it works for MVCC scans. However, I
think this might not work for non-MVCC scans. Consider a case where
in Process-1, hash scan has returned one row and before it could check
it's validity in heap, vacuum marks that tuple as dead and removed the
entry from heap and some new tuple has been placed at that offset in
heap.

Oops, that's bad.

Do we care? Under what circumstances would a hash index be used for a
non-MVCC scan?

regards, tom lane


#119Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#118)
Re: Hash Indexes

On 2016-10-18 13:38:14 -0400, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Oct 18, 2016 at 5:37 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have implemented this idea and it works for MVCC scans. However, I
think this might not work for non-MVCC scans. Consider a case where
in Process-1, hash scan has returned one row and before it could check
it's validity in heap, vacuum marks that tuple as dead and removed the
entry from heap and some new tuple has been placed at that offset in
heap.

Oops, that's bad.

Do we care? Under what circumstances would a hash index be used for a
non-MVCC scan?

Uniqueness checks, are the most important one that comes to mind.

Andres


#120Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#117)
Re: Hash Indexes

On Tue, Oct 18, 2016 at 10:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 18, 2016 at 5:37 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Oct 5, 2016 at 10:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think one way to avoid the risk of deadlock in above scenario is to
take the cleanup lock conditionally, if we get the cleanup lock then
we will delete the items as we are doing in patch now, else it will
just mark the tuples as dead and ensure that it won't try to remove
tuples that are moved-by-split. Now, I think the question is how will
these dead tuples be removed. We anyway need a separate mechanism to
clear dead tuples for hash indexes as during scans we are marking the
tuples as dead if corresponding tuple in heap is dead which are not
removed later. This is already taken care in btree code via
kill_prior_tuple optimization. So I think clearing of dead tuples can
be handled by a separate patch.

That seems like it could work.

I have implemented this idea and it works for MVCC scans. However, I
think this might not work for non-MVCC scans. Consider a case where
in Process-1, hash scan has returned one row and before it could check
it's validity in heap, vacuum marks that tuple as dead and removed the
entry from heap and some new tuple has been placed at that offset in
heap.

Oops, that's bad.

Now when Process-1 checks the validity in heap, it will check
for different tuple then what the index tuple was suppose to check.
If we want, we can make it work similar to what btree does as being
discussed on thread [1], but for that we need to introduce page-scan
mode as well in hash indexes. However, do we really want to solve
this problem as part of this patch when this exists for other index am
as well?

For what other index AM does this problem exist?

By this problem, I mean deadlocks for suspended scans, which can
happen in btree for non-MVCC or other types of scans where we don't
release the pin during the scan. In my mind, we have the options below:

a. The problem of deadlocks for suspended scans should be tackled as a
separate patch, as it exists for other indexes (at least for some types
of scans).
b. Implement page-scan mode, and then we won't have the deadlock problem
for MVCC scans.
c. Let's not care about non-MVCC scans unless we have some way to hit
those for hash indexes, and proceed with the dead-tuple-marking idea. I
think even if we don't care about non-MVCC scans, we might hit this
problem (deadlocks) when the index relation is unlogged.

Here, even if we want to go with (b), I think we can handle it in a
separate patch, unless you think otherwise.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#121Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#120)
Re: Hash Indexes

On Wed, Oct 19, 2016 at 5:57 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Oct 18, 2016 at 10:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 18, 2016 at 5:37 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Oct 5, 2016 at 10:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think one way to avoid the risk of deadlock in above scenario is to
take the cleanup lock conditionally, if we get the cleanup lock then
we will delete the items as we are doing in patch now, else it will
just mark the tuples as dead and ensure that it won't try to remove
tuples that are moved-by-split. Now, I think the question is how will
these dead tuples be removed. We anyway need a separate mechanism to
clear dead tuples for hash indexes as during scans we are marking the
tuples as dead if corresponding tuple in heap is dead which are not
removed later. This is already taken care in btree code via
kill_prior_tuple optimization. So I think clearing of dead tuples can
be handled by a separate patch.

That seems like it could work.

I have implemented this idea and it works for MVCC scans. However, I
think this might not work for non-MVCC scans. Consider a case where
in Process-1, hash scan has returned one row and before it could check
it's validity in heap, vacuum marks that tuple as dead and removed the
entry from heap and some new tuple has been placed at that offset in
heap.

Oops, that's bad.

Now when Process-1 checks the validity in heap, it will check
for different tuple then what the index tuple was suppose to check.
If we want, we can make it work similar to what btree does as being
discussed on thread [1], but for that we need to introduce page-scan
mode as well in hash indexes. However, do we really want to solve
this problem as part of this patch when this exists for other index am
as well?

For what other index AM does this problem exist?

By this problem, I mean to say deadlocks for suspended scans, that can
happen in btree for non-Mvcc or other type of scans where we don't
release pin during scan. In my mind, we have below options:

a. problem of deadlocks for suspended scans should be tackled as a
separate patch as it exists for other indexes (at least for some type
of scans).
b. Implement page-scan mode and then we won't have deadlock problem
for MVCC scans.
c. Let's not care for non-MVCC scans unless we have some way to hit
those for hash indexes and proceed with Dead tuple marking idea. I
think even if we don't care for non-MVCC scans, we might hit this
problem (deadlocks) when the index relation is unlogged.

oops, my last sentence is wrong. What I wanted to say is: "I think
even if we don't care for non-MVCC scans, we might hit the problem of
TIDs reuse when the index relation is unlogged."

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#122Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#120)
Re: Hash Indexes

On Tue, Oct 18, 2016 at 8:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

By this problem, I mean to say deadlocks for suspended scans, that can
happen in btree for non-Mvcc or other type of scans where we don't
release pin during scan. In my mind, we have below options:

a. problem of deadlocks for suspended scans should be tackled as a
separate patch as it exists for other indexes (at least for some type
of scans).
b. Implement page-scan mode and then we won't have deadlock problem
for MVCC scans.
c. Let's not care for non-MVCC scans unless we have some way to hit
those for hash indexes and proceed with Dead tuple marking idea. I
think even if we don't care for non-MVCC scans, we might hit this
problem (deadlocks) when the index relation is unlogged.

Here, even if we want to go with (b), I think we can handle it in a
separate patch, unless you think otherwise.

After some off-list discussion with Amit, I think I get his point
here: the deadlock hazard which is introduced by this patch already
exists for btree and has for a long time, and nobody's gotten around
to fixing it (although 2ed5b87f96d473962ec5230fd820abfeaccb2069
improved things). So it's probably OK for hash indexes to have the
same issue.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#123Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#122)
2 attachment(s)
Re: Hash Indexes

On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Amit, can you please split the buffer manager changes in this patch
into a separate patch?

Sure, attached patch extend_bufmgr_api_for_hash_index_v1.patch does that.

I think those changes can be committed first
and then we can try to deal with the rest of it. Instead of adding
ConditionalLockBufferShared, I think we should add an "int mode"
argument to the existing ConditionalLockBuffer() function. That way
is more consistent with LockBuffer(). It means an API break for any
third-party code that's calling this function, but that doesn't seem
like a big problem.

That was the reason I had chosen to write a separate API, but now I
have changed it as per your suggestion.
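
For reference, the suggested shape would be roughly as follows (a sketch
only, not necessarily the exact patch code):

    bool
    ConditionalLockBuffer(Buffer buffer, int mode)
    {
        BufferDesc *buf;

        Assert(BufferIsValid(buffer));
        if (BufferIsLocal(buffer))
            return true;            /* local buffers need no content lock */

        buf = GetBufferDescriptor(buffer - 1);
        return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
                                        mode == BUFFER_LOCK_SHARE ?
                                        LW_SHARED : LW_EXCLUSIVE);
    }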

As for CheckBufferForCleanup, I think that looks OK, but: (1) please
add an Assert() that we hold an exclusive lock on the buffer, using
LWLockHeldByMeInMode; and (2) I think we should rename it to something
like IsBufferCleanupOK. Then, when it's used, it reads like English:
if (IsBufferCleanupOK(buf)) { /* clean up the buffer */ }.

Changed as per suggestion.
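
A minimal sketch of what such a check has to verify (illustrative only;
local buffers and error handling are omitted, and this is not the exact
patch code):

    static bool
    IsBufferCleanupOK_sketch(Buffer buffer)
    {
        BufferDesc *bufHdr = GetBufferDescriptor(buffer - 1);
        uint32      buf_state;
        bool        ok;

        /* caller must already hold the buffer's content lock exclusively */
        Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr),
                                    LW_EXCLUSIVE));

        buf_state = LockBufHdr(bufHdr);
        ok = (BUF_STATE_GET_REFCOUNT(buf_state) == 1);  /* only our pin? */
        UnlockBufHdr(bufHdr, buf_state);

        return ok;
    }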

I'll write another email with my thoughts about the rest of the patch.

I think that the README changes for this patch need a fairly large
amount of additional work. Here are a few things I notice:

- The confusion between buckets and pages hasn't been completely
cleared up. If you read the beginning of the README, the terminology
is clearly set forth. It says:

A hash index consists of two or more "buckets", into which tuples are placed whenever their hash key maps to the bucket number. Each bucket in the hash index comprises one or more index pages. The bucket's first page is permanently assigned to it when the bucket is created. Additional pages, called "overflow pages", are added if the bucket receives too many tuples to fit in the primary bucket page."

But later on, you say:

Scan will take a lock in shared mode on the primary bucket or on one of the overflow page.

So the correct terminology here would be "primary bucket page" not
"primary bucket".

- In addition, notice that there are two English errors in this
sentence: the word "the" needs to be added to the beginning of the
sentence, and the last word needs to be "pages" rather than "page".
There are a considerable number of similar minor errors; if you can't
fix them, I'll make a pass over it and clean it up.

I have tried to fix it as per the above suggestion, but I think maybe some
more work is needed.

- The whole "lock definitions" section seems to me to be pretty loose
and imprecise about what is happening. For example, it uses the term
"split-in-progress" without first defining it. The sentence quoted
above says that scans take a lock in shared mode either on the primary
page or on one of the overflow pages, but it's no good to document code by
saying that it will do either A or B without explaining which one! In
fact, I think that a scan will take a content lock first on the
primary bucket page and then on each overflow page in sequence,
retaining a pin on the primary buffer page throughout the scan. So it
is not one or the other but both in a particular sequence, and that
can and should be explained.

Another problem with this section is that even when it's precise about
what is going on, it's probably duplicating what is or should be in
the following sections where the algorithms for each operation are
explained. In the original wording, this section explains what each
lock protects, and then the following sections explain the algorithms
in the context of those definitions. Now, this section contains a
sketch of the algorithm, and then the following sections lay it out
again in more detail. The question of what each lock protects has
been lost. Here's an attempt at some text to replace what you have
here:

===
Concurrency control for hash indexes is provided using buffer content
locks, buffer pins, and cleanup locks. Here as elsewhere in
PostgreSQL, cleanup lock means that we hold an exclusive lock on the
buffer and have observed at some point after acquiring the lock that
we hold the only pin on that buffer. For hash indexes, a cleanup lock
on a primary bucket page represents the right to perform an arbitrary
reorganization of the entire bucket, while a cleanup lock on an
overflow page represents the right to perform a reorganization of just
that page. Therefore, scans retain a pin on both the primary bucket
page and the overflow page they are currently scanning, if any.
Splitting a bucket requires a cleanup lock on both the old and new
primary bucket pages. VACUUM therefore takes a cleanup lock on every
bucket page in turn in order to remove tuples.  It can also remove tuples
copied to a new bucket by any previous split operation, because the
cleanup lock taken on the primary bucket page guarantees that no scans
which started prior to the most recent split can still be in progress.
After cleaning each page individually, it attempts to take a cleanup
lock on the primary bucket page in order to "squeeze" the bucket down
to the minimum possible number of pages.
===

Changed as per suggestion.
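
To make the pin/lock protocol above concrete, here is a rough sketch
(illustrative only, not taken verbatim from the patch) of how a scan steps
from the primary bucket page to an overflow page while retaining the pin
that blocks a cleanup lock.  _hash_getbuf, _hash_chgbufaccess, _hash_relbuf
and _hash_dropbuf are the existing hash AM helpers; rel and bucket_blkno are
assumed to be set up by the caller:

    Buffer          bucket_buf;
    Buffer          ovfl_buf;
    Page            page;
    HashPageOpaque  opaque;
    BlockNumber     next_blkno;

    /* pin and share-lock the primary bucket page */
    bucket_buf = _hash_getbuf(rel, bucket_blkno, HASH_READ, LH_BUCKET_PAGE);
    page = BufferGetPage(bucket_buf);
    opaque = (HashPageOpaque) PageGetSpecialPointer(page);
    next_blkno = opaque->hasho_nextblkno;

    /* done reading this page: drop the content lock but keep the pin, so
     * that no split or squeeze can reorganize the bucket under us */
    _hash_chgbufaccess(rel, bucket_buf, HASH_READ, HASH_NOLOCK);

    if (BlockNumberIsValid(next_blkno))
    {
        /* pin and share-lock the overflow page we are stepping to */
        ovfl_buf = _hash_getbuf(rel, next_blkno, HASH_READ, LH_OVERFLOW_PAGE);
        /* ... return tuples from it, then release its lock and pin ... */
        _hash_relbuf(rel, ovfl_buf);
    }

    /* only at scan shutdown do we give up the pin on the primary bucket page */
    _hash_dropbuf(rel, bucket_buf);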

As I was looking at the old text regarding deadlock risk, I realized
what may be a serious problem. Suppose process A is performing a scan
of some hash index. While the scan is suspended, it attempts to take
a lock and is blocked by process B. Process B, meanwhile, is running
VACUUM on that hash index. Eventually, it will do
LockBufferForCleanup() on the hash bucket on which process A holds a
buffer pin, resulting in an undetected deadlock. In the current
coding, A would hold a heavyweight lock and B would attempt to acquire
a conflicting heavyweight lock, and the deadlock detector would kill
one of them. This patch probably breaks that. I notice that that's
the only place where we attempt to acquire a buffer cleanup lock
unconditionally; every place else, we acquire the lock conditionally,
so there's no deadlock risk. Once we resolve this problem, the
paragraph about deadlock risk in this section should be revised to
explain whatever solution we come up with.

By the way, since VACUUM must run in its own transaction, B can't be
holding arbitrary locks, but that doesn't seem quite sufficient to get
us out of the woods. It will at least hold ShareUpdateExclusiveLock
on the relation being vacuumed, and process A could attempt to acquire
that same lock.

As discussed in [1], this risk exists for btree as well, so leaving it as it
is for now.
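
For clarity, the two locking patterns being contrasted above are roughly the
following (illustrative only; both functions already exist in bufmgr, and
bucket_buf here just stands for the pinned primary bucket page):

    /*
     * Unconditional: blocks until we hold the only pin on the buffer.  If
     * the other pin holder is itself waiting on something we hold, the wait
     * is invisible to the deadlock detector.
     */
    LockBufferForCleanup(bucket_buf);

    /*
     * Conditional: never waits, so there is no deadlock risk, but the caller
     * must be prepared to skip the work or retry later.
     */
    if (ConditionalLockBufferForCleanup(bucket_buf))
    {
        /* we have a cleanup lock; do the work */
    }
    else
    {
        /* give up for now */
    }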

Also in regards to deadlock, I notice that you added a paragraph
saying that we lock higher-numbered buckets before lower-numbered
buckets. That's fair enough, but what about the metapage?

Updated README with regard to metapage as well.

The reader algorithm suggests that the metapage lock must be taken after
the bucket locks, because it tries to grab the bucket lock conditionally
after acquiring the metapage lock, but that's not documented here.

The reader algorithm itself seems to be a bit oddly explained.

pin meta page and take buffer content lock in shared mode
+    compute bucket number for target hash key
+    read and pin the primary bucket page

So far, I'm with you.

+    conditionally get the buffer content lock in shared mode on
primary bucket page for search
+    if we didn't get the lock (need to wait for lock)

"didn't get the lock" and "wait for the lock" are saying the same
thing, so this is redundant, and the statement that it is "for search"
on the previous line is redundant with the introductory text
describing this as the reader algorithm.

+        release the buffer content lock on meta page
+        acquire buffer content lock on primary bucket page in shared mode
+        acquire the buffer content lock in shared mode on meta page

OK...

+        to check for possibility of split, we need to recompute the bucket and
+        verify, if it is a correct bucket; set the retry flag

This makes it sound like we set the retry flag whether it was the
correct bucket or not, which isn't sensible.

+ else if we get the lock, then we can skip the retry path

This line is totally redundant. If we don't set the retry flag, then
of course we can skip the part guarded by if (retry).

+    if (retry)
+        loop:
+            compute bucket number for target hash key
+            release meta page buffer content lock
+            if (correct bucket page is already locked)
+                break
+            release any existing content lock on bucket page (if a
concurrent split happened)
+            pin primary bucket page and take shared buffer content lock
+            retake meta page buffer content lock in shared mode

This is the part I *really* don't understand. It makes sense to me
that we need to loop until we get the correct bucket locked with no
concurrent splits, but why is this retry loop separate from the
previous bit of code that set the retry flag? In other words, why is it
not something like this?

pin the meta page and take shared content lock on it
compute bucket number for target hash key
if (we can't get a shared content lock on the target bucket without blocking)
    loop:
        release meta page content lock
        take a shared content lock on the target primary bucket page
        take a shared content lock on the metapage
        if (previously-computed target bucket has not been split)
            break;

Another thing I don't quite understand about this algorithm is that in
order to conditionally lock the target primary bucket page, we'd first
need to read and pin it. And that doesn't seem like a good thing to
do while we're holding a shared content lock on the metapage, because
of the principle that we don't want to hold content locks across I/O.

I have changed it such that we don't perform I/O while holding a content
lock, but that requires locking the metapage twice, which will hurt
performance; we can buy back that performance by caching the metapage [2].
Updated the README accordingly.
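
Roughly, the reader now does something like the following (a sketch only:
the split-in-progress handling and error paths are omitted, hashkey is
assumed to be computed already, and _hash_getbuf, _hash_chgbufaccess,
_hash_relbuf and _hash_dropbuf are the existing hash AM helpers):

    Buffer       metabuf;
    Buffer       buf;
    HashMetaPage metap;
    Bucket       bucket;
    BlockNumber  blkno;
    BlockNumber  oldblkno;

    /* read the metapage and compute the target bucket */
    metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
    metap = HashPageGetMeta(BufferGetPage(metabuf));
    bucket = _hash_hashkey2bucket(hashkey, metap->hashm_maxbucket,
                                  metap->hashm_highmask,
                                  metap->hashm_lowmask);
    blkno = BUCKET_TO_BLKNO(metap, bucket);

    /* drop the metapage content lock (keep the pin) before doing any I/O */
    _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);

    for (;;)
    {
        /* pin and share-lock the bucket page; no content lock is held here */
        buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);

        /* relock the metapage and check whether the bucket has been split */
        _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
        oldblkno = blkno;
        bucket = _hash_hashkey2bucket(hashkey, metap->hashm_maxbucket,
                                      metap->hashm_highmask,
                                      metap->hashm_lowmask);
        blkno = BUCKET_TO_BLKNO(metap, bucket);
        _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);

        if (blkno == oldblkno)
            break;              /* still the correct bucket; keep the lock */

        /* lost a race against a split; release and retry with the new bucket */
        _hash_relbuf(rel, buf);
    }

    _hash_dropbuf(rel, metabuf);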

-- then, per read request:
release pin on metapage
-    read current page of bucket and take shared buffer content lock
-        step to next page if necessary (no chaining of locks)
+    if the split is in progress for current bucket and this is a new bucket
+        release the buffer content lock on current bucket page
+        pin and acquire the buffer content lock on old bucket in shared mode
+        release the buffer content lock on old bucket, but not pin
+        retake the buffer content lock on new bucket
+        mark the scan such that it skips the tuples that are marked
as moved by split

Aren't these steps done just once per scan? If so, I think they
should appear before "-- then, per read request" which AIUI is
intended to imply a loop over tuples.

+    step to next page if necessary (no chaining of locks)
+        if the scan indicates moved by split, then move to old bucket
after the scan
+        of current bucket is finished
get tuple
release buffer content lock and pin on current page
-- at scan shutdown:
-    release bucket share-lock

Don't we have a pin to release at scan shutdown in the new system?

Already replied to this point in a previous e-mail.
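
For reference, the pins that remain to be dropped at rescan/endscan time are
the ones kept in the new scan-opaque fields.  A rough sketch of what that
cleanup amounts to, using the field names from the attached patch, looks
like this (illustrative only; the real _hash_dropscanbuf is part of the
patch):

    /* Release whatever pins the scan still holds. */
    static void
    _hash_dropscanbuf(Relation rel, HashScanOpaque so)
    {
        /* release pin held on the primary bucket page */
        if (BufferIsValid(so->hashso_bucket_buf) &&
            so->hashso_bucket_buf != so->hashso_curbuf)
            _hash_dropbuf(rel, so->hashso_bucket_buf);
        so->hashso_bucket_buf = InvalidBuffer;

        /* release pin held on the old bucket page, if scanning across a split */
        if (BufferIsValid(so->hashso_old_bucket_buf) &&
            so->hashso_old_bucket_buf != so->hashso_curbuf)
            _hash_dropbuf(rel, so->hashso_old_bucket_buf);
        so->hashso_old_bucket_buf = InvalidBuffer;

        /* release any pin we still hold on the current page */
        if (BufferIsValid(so->hashso_curbuf))
            _hash_dropbuf(rel, so->hashso_curbuf);
        so->hashso_curbuf = InvalidBuffer;
    }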

Well, I was hoping to get through the whole patch in one email, but
I'm not even all the way through the README. However, it's late, so
I'm stopping here for now.

Thanks for the valuable feedback.

[1]: /messages/by-id/CA+TgmoZWH0L=mEq9-7+o-yogbXqDhF35nERcK4HgjCoFKVbCkA@mail.gmail.com
[2]: https://commitfest.postgresql.org/11/715/

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

extend_bufmgr_api_for_hash_index_v1.patch (application/octet-stream)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index b68a0d1..bcbf387 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -356,7 +356,7 @@ BloomNewBuffer(Relation index)
 		 * We have to guard against the possibility that someone else already
 		 * recycled this page; the buffer may be locked if so.
 		 */
-		if (ConditionalLockBuffer(buffer))
+		if (ConditionalLockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE))
 		{
 			Page		page = BufferGetPage(buffer);
 
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index f07eedc..75c516a 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -298,7 +298,7 @@ GinNewBuffer(Relation index)
 		 * We have to guard against the possibility that someone else already
 		 * recycled this page; the buffer may be locked if so.
 		 */
-		if (ConditionalLockBuffer(buffer))
+		if (ConditionalLockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE))
 		{
 			Page		page = BufferGetPage(buffer);
 
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 887c58b..607fe34 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -777,7 +777,7 @@ gistNewBuffer(Relation r)
 		 * We have to guard against the possibility that someone else already
 		 * recycled this page; the buffer may be locked if so.
 		 */
-		if (ConditionalLockBuffer(buffer))
+		if (ConditionalLockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE))
 		{
 			Page		page = BufferGetPage(buffer);
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 2001dc1..6434af0 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -614,7 +614,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
 			if (blkno == InvalidBlockNumber)
 				break;
 			buf = ReadBuffer(rel, blkno);
-			if (ConditionalLockBuffer(buf))
+			if (ConditionalLockBuffer(buf, BUFFER_LOCK_EXCLUSIVE))
 			{
 				page = BufferGetPage(buf);
 				if (_bt_page_recyclable(page))
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 6fc04b2..2d38cb6 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -1997,7 +1997,7 @@ spgdoinsert(Relation index, SpGistState *state,
 			 * held by a reader, or even just background writer/checkpointer
 			 * process.  Perhaps it'd be worth retrying after sleeping a bit?
 			 */
-			if (!ConditionalLockBuffer(current.buffer))
+			if (!ConditionalLockBuffer(current.buffer, BUFFER_LOCK_EXCLUSIVE))
 			{
 				ReleaseBuffer(current.buffer);
 				UnlockReleaseBuffer(parent.buffer);
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index d570ae5..3adfa23 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -205,7 +205,7 @@ SpGistNewBuffer(Relation index)
 		 * We have to guard against the possibility that someone else already
 		 * recycled this page; the buffer may be locked if so.
 		 */
-		if (ConditionalLockBuffer(buffer))
+		if (ConditionalLockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE))
 		{
 			Page		page = BufferGetPage(buffer);
 
@@ -255,7 +255,7 @@ SpGistUpdateMetaPage(Relation index)
 
 		metabuffer = ReadBuffer(index, SPGIST_METAPAGE_BLKNO);
 
-		if (ConditionalLockBuffer(metabuffer))
+		if (ConditionalLockBuffer(metabuffer, BUFFER_LOCK_EXCLUSIVE))
 		{
 			metadata = SpGistPageGetMeta(BufferGetPage(metabuffer));
 			metadata->lastUsedPages = cache->lastUsedPages;
@@ -392,7 +392,7 @@ SpGistGetBuffer(Relation index, int flags, int needSpace, bool *isNew)
 
 		buffer = ReadBuffer(index, lup->blkno);
 
-		if (!ConditionalLockBuffer(buffer))
+		if (!ConditionalLockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE))
 		{
 			/*
 			 * buffer is locked by another process, so return a new buffer
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index df4c9d7..c1ab893 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3548,22 +3548,27 @@ LockBuffer(Buffer buffer, int mode)
 
 /*
  * Acquire the content_lock for the buffer, but only if we don't have to wait.
- *
- * This assumes the caller wants BUFFER_LOCK_EXCLUSIVE mode.
  */
 bool
-ConditionalLockBuffer(Buffer buffer)
+ConditionalLockBuffer(Buffer buffer, int mode)
 {
 	BufferDesc *buf;
 
 	Assert(BufferIsValid(buffer));
+	Assert(mode == BUFFER_LOCK_SHARE || mode == BUFFER_LOCK_EXCLUSIVE);
 	if (BufferIsLocal(buffer))
 		return true;			/* act as though we got it */
 
 	buf = GetBufferDescriptor(buffer - 1);
 
-	return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
-									LW_EXCLUSIVE);
+	if (mode == BUFFER_LOCK_SHARE)
+		return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
+										LW_SHARED);
+	else if (mode == BUFFER_LOCK_EXCLUSIVE)
+		return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
+										LW_EXCLUSIVE);
+
+	return false;
 }
 
 /*
@@ -3724,7 +3729,7 @@ ConditionalLockBufferForCleanup(Buffer buffer)
 		return false;
 
 	/* Try to acquire lock */
-	if (!ConditionalLockBuffer(buffer))
+	if (!ConditionalLockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE))
 		return false;
 
 	bufHdr = GetBufferDescriptor(buffer - 1);
@@ -3745,6 +3750,53 @@ ConditionalLockBufferForCleanup(Buffer buffer)
 	return false;
 }
 
+/*
+ * IsBufferCleanupOK - as above, but don't attempt to take lock
+ *
+ * We won't loop, but just check once to see if the pin count is OK.  If
+ * not, return FALSE.
+ */
+bool
+IsBufferCleanupOK(Buffer buffer)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(BufferIsValid(buffer));
+
+	if (BufferIsLocal(buffer))
+	{
+		/* There should be exactly one pin */
+		if (LocalRefCount[-buffer - 1] != 1)
+			return false;
+		/* Nobody else to wait for */
+		return true;
+	}
+
+	/* There should be exactly one local pin */
+	if (GetPrivateRefCount(buffer) != 1)
+		return false;
+
+	bufHdr = GetBufferDescriptor(buffer - 1);
+
+	/* caller must hold exclusive lock on buffer */
+	Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr),
+								LW_EXCLUSIVE));
+
+	buf_state = LockBufHdr(bufHdr);
+
+	Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+	if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+	{
+		/* pincount is OK. */
+		UnlockBufHdr(bufHdr, buf_state);
+		return true;
+	}
+
+	UnlockBufHdr(bufHdr, buf_state);
+	return false;
+}
+
 
 /*
  *	Functions for buffer I/O handling
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7b6ba96..fcb2bf2 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -224,9 +224,10 @@ extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
 
 extern void UnlockBuffers(void);
 extern void LockBuffer(Buffer buffer, int mode);
-extern bool ConditionalLockBuffer(Buffer buffer);
+extern bool ConditionalLockBuffer(Buffer buffer, int mode);
 extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
+extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern void AbortBufferIO(void);
concurrent_hash_index_v9.patch (application/octet-stream)
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 68b07aa..f48c85d 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -441,7 +441,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
 	Buffer		buf;
 	Page		page;
 
-	_hash_getlock(rel, blkno, HASH_SHARE);
 	buf = _hash_getbuf_with_strategy(rel, blkno, HASH_READ, 0, bstrategy);
 	page = BufferGetPage(buf);
 
@@ -472,7 +471,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
 	}
 
 	_hash_relbuf(rel, buf);
-	_hash_droplock(rel, blkno, HASH_SHARE);
 }
 
 /*
diff --git a/src/backend/access/hash/Makefile b/src/backend/access/hash/Makefile
index 5d3bd94..e2e7e91 100644
--- a/src/backend/access/hash/Makefile
+++ b/src/backend/access/hash/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/access/hash
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashscan.o \
-       hashsearch.o hashsort.o hashutil.o hashvalidate.o
+OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashsearch.o \
+       hashsort.o hashutil.o hashvalidate.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 0a7da89..7972d9d 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -125,54 +125,59 @@ the initially created buckets.
 
 Lock Definitions
 ----------------
-
-We use both lmgr locks ("heavyweight" locks) and buffer context locks
-(LWLocks) to control access to a hash index.  lmgr locks are needed for
-long-term locking since there is a (small) risk of deadlock, which we must
-be able to detect.  Buffer context locks are used for short-term access
-control to individual pages of the index.
-
-LockPage(rel, page), where page is the page number of a hash bucket page,
-represents the right to split or compact an individual bucket.  A process
-splitting a bucket must exclusive-lock both old and new halves of the
-bucket until it is done.  A process doing VACUUM must exclusive-lock the
-bucket it is currently purging tuples from.  Processes doing scans or
-insertions must share-lock the bucket they are scanning or inserting into.
-(It is okay to allow concurrent scans and insertions.)
-
-The lmgr lock IDs corresponding to overflow pages are currently unused.
-These are available for possible future refinements.  LockPage(rel, 0)
-is also currently undefined (it was previously used to represent the right
-to modify the hash-code-to-bucket mapping, but it is no longer needed for
-that purpose).
-
-Note that these lock definitions are conceptually distinct from any sort
-of lock on the pages whose numbers they share.  A process must also obtain
-read or write buffer lock on the metapage or bucket page before accessing
-said page.
-
-Processes performing hash index scans must hold share lock on the bucket
-they are scanning throughout the scan.  This seems to be essential, since
-there is no reasonable way for a scan to cope with its bucket being split
-underneath it.  This creates a possibility of deadlock external to the
-hash index code, since a process holding one of these locks could block
-waiting for an unrelated lock held by another process.  If that process
-then does something that requires exclusive lock on the bucket, we have
-deadlock.  Therefore the bucket locks must be lmgr locks so that deadlock
-can be detected and recovered from.
-
-Processes must obtain read (share) buffer context lock on any hash index
-page while reading it, and write (exclusive) lock while modifying it.
-To prevent deadlock we enforce these coding rules: no buffer lock may be
-held long term (across index AM calls), nor may any buffer lock be held
-while waiting for an lmgr lock, nor may more than one buffer lock
-be held at a time by any one process.  (The third restriction is probably
-stronger than necessary, but it makes the proof of no deadlock obvious.)
+Concurrency control for hash indexes is provided using buffer content
+locks, buffer pins, and cleanup locks.   Here as elsewhere in PostgreSQL,
+cleanup lock means that we hold an exclusive lock on the buffer and have
+observed at some point after acquiring the lock that we hold the only pin
+on that buffer.  For hash indexes, a cleanup lock on a primary bucket page
+represents the right to perform an arbitrary reorganization of the entire
+bucket.  Therefore, scans retain a pin on the primary bucket page for the
+bucket they are currently scanning.  Splitting a bucket requires a cleanup
+lock on both the old and new primary bucket pages.  VACUUM therefore takes
+a cleanup lock on every bucket page in order to remove tuples.  It can also
+remove tuples copied to a new bucket by any previous split operation, because
+the cleanup lock taken on the primary bucket page guarantees that no scans
+which started prior to the most recent split can still be in progress.  After
+cleaning each page individually, it attempts to take a cleanup lock on the
+primary bucket page in order to "squeeze" the bucket down to the minimum
+possible number of pages.
+
+To avoid deadlocks, we must be consistent about the lock order in which we
+lock the buckets for operations that require locks on two different buckets.
+We use the rule "first lock the old bucket and then the new bucket", i.e.
+lock the lower-numbered bucket first.
+
+To avoid deadlock in operations that require locking the metapage and other
+buckets, we always take the lock on the other bucket first, then the metapage.
 
 
 Pseudocode Algorithms
 ---------------------
 
+Various flags that are used in hash index operations are described as below:
+
+split-in-progress flag indicates that a split operation is in progress for a
+bucket.  During the split operation, this flag is set on both the old and new
+buckets.  It is cleared once the split operation is finished.
+
+moved-by-split flag on a tuple indicates that the tuple was moved from the old
+to the new bucket.  Concurrent scans can skip such tuples until the split
+operation is finished.  Once a tuple is marked as moved-by-split, it remains so
+forever, but that does no harm.  We intentionally do not clear the flag, as
+doing so would generate additional unnecessary I/O.
+
+has_garbage flag indicates that the bucket contains tuples that were moved due
+to a split.  This is set only on the old bucket.  We need it in addition to
+the split-in-progress flag to distinguish the case when the split is over
+(i.e. the split-in-progress flag has been cleared).  It is used both by vacuum
+and by the re-split operation.  Vacuum uses it to decide whether it needs to
+clear the tuples (that are moved-by-split) from the bucket along with dead
+tuples.  Re-split of a bucket uses it to ensure that it doesn't start a new
+split from a bucket without first clearing the previously moved tuples from
+the old bucket.  This helps to keep bloat under control and makes the design
+somewhat simpler, as we never have to handle the situation where a bucket can
+contain dead tuples from multiple splits.
+
 The operations we need to support are: readers scanning the index for
 entries of a particular hash code (which by definition are all in the same
 bucket); insertion of a new tuple into the correct bucket; enlarging the
@@ -193,38 +198,51 @@ The reader algorithm is:
 		release meta page buffer content lock
 		if (correct bucket page is already locked)
 			break
-		release any existing bucket page lock (if a concurrent split happened)
-		take heavyweight bucket lock
+		release any existing bucket page buffer content lock (if a concurrent split happened)
+		take the buffer content lock on bucket page in shared mode
 		retake meta page buffer content lock in shared mode
--- then, per read request:
 	release pin on metapage
-	read current page of bucket and take shared buffer content lock
-		step to next page if necessary (no chaining of locks)
+	if the split is in progress for current bucket and this is a new bucket
+		release the buffer content lock on current bucket page
+		pin and acquire the buffer content lock on old bucket in shared mode
+		release the buffer content lock on old bucket, but not pin
+		retake the buffer content lock on new bucket
+		mark the scan such that it skips the tuples that are marked as moved by split
+-- then, per read request:
+	step to next page if necessary (no chaining of locks)
+		if the scan indicates moved by split, then move to old bucket after the scan
+		of current bucket is finished
 	get tuple
 	release buffer content lock and pin on current page
 -- at scan shutdown:
-	release bucket share-lock
-
-We can't hold the metapage lock while acquiring a lock on the target bucket,
-because that might result in an undetected deadlock (lwlocks do not participate
-in deadlock detection).  Instead, we relock the metapage after acquiring the
-bucket page lock and check whether the bucket has been split.  If not, we're
-done.  If so, we release our previously-acquired lock and repeat the process
-using the new bucket number.  Holding the bucket sharelock for
-the remainder of the scan prevents the reader's current-tuple pointer from
-being invalidated by splits or compactions.  Notice that the reader's lock
-does not prevent other buckets from being split or compacted.
+	release any pin we hold on current buffer, old bucket buffer, new bucket buffer
+
+We don't want to hold the meta page lock while acquiring the content lock on
+bucket page, because that might result in poor concurrency.  Instead, we relock
+the metapage after acquiring the bucket page content lock and check whether the
+bucket has been split.  If not, we're done.  If so, we release our
+previously-acquired content lock, but not pin and repeat the process using the
+new bucket number.  Holding the buffer pin on bucket page for the remainder of
+the scan prevents the reader's current-tuple pointer from being invalidated by
+splits or compactions.  Notice that the reader's pin does not prevent other
+buckets from being split or compacted.
 
 To keep concurrency reasonably good, we require readers to cope with
 concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the page sharelock.  Since
-deletion is not possible while a reader holds the bucket sharelock, and
-we assume that heap tuple TIDs are unique, this can be implemented by
+their current scan position after re-acquiring the buffer content lock on
+page.  Since deletion is not possible while a reader holds the pin on bucket,
+and we assume that heap tuple TIDs are unique, this can be implemented by
 searching for the same heap tuple TID previously returned.  Insertion does
 not move index entries across pages, so the previously-returned index entry
 should always be on the same page, at the same or higher offset number,
 as it was before.
 
+To allow scans during a bucket split, if at the start of the scan the bucket
+is marked as split-in-progress, the scan covers all the tuples in that bucket
+except for those that are marked as moved-by-split.  Once it finishes scanning
+all the tuples in the current bucket, it scans the old bucket from which this
+bucket was formed by the split.  This happens only for the new half of a split.
+
 The insertion algorithm is rather similar:
 
 	pin meta page and take buffer content lock in shared mode
@@ -233,18 +251,24 @@ The insertion algorithm is rather similar:
 		release meta page buffer content lock
 		if (correct bucket page is already locked)
 			break
-		release any existing bucket page lock (if a concurrent split happened)
-		take heavyweight bucket lock in shared mode
+		release any existing bucket page buffer content lock (if a concurrent split happened)
+		take the buffer content lock on bucket page in exclusive mode
 		retake meta page buffer content lock in shared mode
--- (so far same as reader)
 	release pin on metapage
-	pin current page of bucket and take exclusive buffer content lock
-	if full, release, read/exclusive-lock next page; repeat as needed
+-- (so far same as reader, except for acquisition of buffer content lock in
+	exclusive mode on primary bucket page)
+	if the split-in-progress flag is set for bucket in old half of split
+	and pin count on it is one, then finish the split
+		we already have a buffer content lock on old bucket, conditionally get the content lock on new bucket
+		if get the lock on new bucket
+			finish the split using algorithm mentioned below for split
+			release the buffer content lock and pin on new bucket
+	if current page is full, release lock but not pin, read/exclusive-lock next page; repeat as needed
 	>> see below if no space in any page of bucket
 	insert tuple at appropriate place in page
 	mark current page dirty and release buffer content lock and pin
-	release heavyweight share-lock
-	pin meta page and take buffer content lock in shared mode
+	if the current page is not a bucket page, release the pin on bucket page
+	pin meta page and take buffer content lock in exclusive mode
 	increment tuple count, decide if split needed
 	mark meta page dirty and release buffer content lock and pin
 	done if no split needed, else enter Split algorithm below
@@ -256,11 +280,13 @@ bucket that is being actively scanned, because readers can cope with this
 as explained above.  We only need the short-term buffer locks to ensure
 that readers do not see a partially-updated page.
 
-It is clearly impossible for readers and inserters to deadlock, and in
-fact this algorithm allows them a very high degree of concurrency.
-(The exclusive metapage lock taken to update the tuple count is stronger
-than necessary, since readers do not care about the tuple count, but the
-lock is held for such a short time that this is probably not an issue.)
+To avoid deadlock between readers and inserters, whenever there is a need
+to lock multiple buckets, we always take in the order suggested in Lock
+Definitions above.  This algorithm allows them a very high degree of
+concurrency.  (The exclusive metapage lock taken to update the tuple count
+is stronger than necessary, since readers do not care about the tuple count,
+but the lock is held for such a short time that this is probably not an
+issue.)
 
 When an inserter cannot find space in any existing page of a bucket, it
 must obtain an overflow page and add that page to the bucket's chain.
@@ -271,46 +297,66 @@ index is overfull (has a higher-than-wanted ratio of tuples to buckets).
 The algorithm attempts, but does not necessarily succeed, to split one
 existing bucket in two, thereby lowering the fill ratio:
 
-	pin meta page and take buffer content lock in exclusive mode
-	check split still needed
-	if split not needed anymore, drop buffer content lock and pin and exit
-	decide which bucket to split
-	Attempt to X-lock old bucket number (definitely could fail)
-	Attempt to X-lock new bucket number (shouldn't fail, but...)
-	if above fail, drop locks and pin and exit
+	expand:
+		take buffer content lock in exclusive mode on meta page
+		check split still needed
+		if split not needed anymore, drop buffer content lock and exit
+		decide which bucket to split
+		Attempt to acquire cleanup lock on old bucket number (definitely could fail)
+		if above fail, release lock and pin and exit
+		if the split-in-progress flag is set, then finish the split
+			conditionally get the content lock on new bucket which was involved in split
+			if got the lock on new bucket
+				finish the split using algorithm mentioned below for split
+				release the buffer content lock and pin on old and new buckets
+				try to expand from start
+			else
+				release the buffer content lock and pin on old bucket and exit
+		if the garbage flag (indicates that tuples are moved by split) is set on bucket
+			release the buffer content lock on meta page
+			remove the tuples that don't belong to this bucket; see bucket cleanup below
+	Attempt to acquire cleanup lock on new bucket number (shouldn't fail, but...)
 	update meta page to reflect new number of buckets
-	mark meta page dirty and release buffer content lock and pin
+	mark meta page dirty and release buffer content lock
 	-- now, accesses to all other buckets can proceed.
 	Perform actual split of bucket, moving tuples as needed
 	>> see below about acquiring needed extra space
-	Release X-locks of old and new buckets
+
+	split guts
+	mark the old and new buckets indicating split-in-progress
+	mark the old bucket indicating has-garbage
+	copy the tuples that belong to the new bucket from the old bucket
+	during copy, mark such tuples as moved-by-split
+	release lock but not pin for primary bucket page of old bucket,
+	read/shared-lock next page; repeat as needed
+	>> see below if no space in bucket page of new bucket
+	ensure to have exclusive-lock on both old and new buckets in that order
+	clear the split-in-progress flag from both the buckets
+	mark buffers dirty and release the locks and pins on both old and new buckets
 
 Note the metapage lock is not held while the actual tuple rearrangement is
 performed, so accesses to other buckets can proceed in parallel; in fact,
 it's possible for multiple bucket splits to proceed in parallel.
 
-Split's attempt to X-lock the old bucket number could fail if another
-process holds S-lock on it.  We do not want to wait if that happens, first
-because we don't want to wait while holding the metapage exclusive-lock,
-and second because it could very easily result in deadlock.  (The other
-process might be out of the hash AM altogether, and could do something
-that blocks on another lock this process holds; so even if the hash
-algorithm itself is deadlock-free, a user-induced deadlock could occur.)
-So, this is a conditional LockAcquire operation, and if it fails we just
-abandon the attempt to split.  This is all right since the index is
-overfull but perfectly functional.  Every subsequent inserter will try to
-split, and eventually one will succeed.  If multiple inserters failed to
-split, the index might still be overfull, but eventually, the index will
+The split operation's attempt to acquire cleanup-lock on the old bucket number
+could fail if another process holds any lock or pin on it.  We do not want to
+wait if that happens, because we don't want to wait while holding the metapage
+exclusive-lock.  So, this is a conditional LWLockAcquire operation, and if
+it fails we just abandon the attempt to split.  This is all right since the
+index is overfull but perfectly functional.  Every subsequent inserter will
+try to split, and eventually one will succeed.  If multiple inserters failed
+to split, the index might still be overfull, but eventually, the index will
 not be overfull and split attempts will stop.  (We could make a successful
 splitter loop to see if the index is still overfull, but it seems better to
 distribute the split overhead across successive insertions.)
 
 A problem is that if a split fails partway through (eg due to insufficient
-disk space) the index is left corrupt.  The probability of that could be
-made quite low if we grab a free page or two before we update the meta
-page, but the only real solution is to treat a split as a WAL-loggable,
+disk space or crash) the index is left corrupt.  The probability of that
+could be made quite low if we grab a free page or two before we update the
+meta page, but the only real solution is to treat a split as a WAL-loggable,
 must-complete action.  I'm not planning to teach hash about WAL in this
-go-round.
+go-round.  However, we do try to finish the incomplete splits during insert
+and split.
 
 The fourth operation is garbage collection (bulk deletion):
 
@@ -319,9 +365,13 @@ The fourth operation is garbage collection (bulk deletion):
 	fetch current max bucket number
 	release meta page buffer content lock and pin
 	while next bucket <= max bucket do
-		Acquire X lock on target bucket
-		Scan and remove tuples, compact free space as needed
-		Release X lock
+		Acquire cleanup lock on target bucket
+		Scan and remove tuples
+		For overflow page, first we need to lock the next page and then
+		release the lock on current bucket or overflow page
+		Ensure to have buffer content lock in exclusive mode on bucket page
+		If buffer pincount is one, then compact free space as needed
+		Release lock
 		next bucket ++
 	end loop
 	pin metapage and take buffer content lock in exclusive mode
@@ -330,20 +380,24 @@ The fourth operation is garbage collection (bulk deletion):
 	else update metapage tuple count
 	mark meta page dirty and release buffer content lock and pin
 
-Note that this is designed to allow concurrent splits.  If a split occurs,
-tuples relocated into the new bucket will be visited twice by the scan,
-but that does no harm.  (We must however be careful about the statistics
-reported by the VACUUM operation.  What we can do is count the number of
-tuples scanned, and believe this in preference to the stored tuple count
-if the stored tuple count and number of buckets did *not* change at any
-time during the scan.  This provides a way of correcting the stored tuple
-count if it gets out of sync for some reason.  But if a split or insertion
-does occur concurrently, the scan count is untrustworthy; instead,
-subtract the number of tuples deleted from the stored tuple count and
-use that.)
-
-The exclusive lock request could deadlock in some strange scenarios, but
-we can just error out without any great harm being done.
+Note that this is designed to allow concurrent splits and scans.  If a
+split occurs, tuples relocated into the new bucket will be visited twice
+by the scan, but that does no harm.  Because we release the lock on the bucket
+page during the cleanup scan of a bucket, a concurrent scan can start on the
+bucket, and such a scan will always be behind the cleanup.  It is essential to
+keep scans behind cleanup, else vacuum could remove tuples that are required
+to complete the scan, as a scan that returns multiple tuples from the same
+bucket page always restarts from the previous offset number from which it
+returned the last tuple.  This holds true for backward scans as well
+(backward scans first traverse each bucket starting from first bucket to last
+overflow page in the chain).  We must be careful about the statistics reported
+by the VACUUM operation.  What we can do is count the number of tuples scanned,
+and believe this in preference to the stored tuple count if the stored tuple
+count and number of buckets did *not* change at any time during the scan.  This
+provides a way of correcting the stored tuple count if it gets out of sync for
+some reason.  But if a split or insertion does occur concurrently, the scan
+count is untrustworthy; instead, subtract the number of tuples deleted from the
+stored tuple count and use that.
 
 
 Free Space Management
@@ -417,13 +471,11 @@ free page; there can be no other process holding lock on it.
 
 Bucket splitting uses a similar algorithm if it has to extend the new
 bucket, but it need not worry about concurrent extension since it has
-exclusive lock on the new bucket.
+buffer content lock in exclusive mode on the new bucket.
 
-Freeing an overflow page is done by garbage collection and by bucket
-splitting (the old bucket may contain no-longer-needed overflow pages).
-In both cases, the process holds exclusive lock on the containing bucket,
-so need not worry about other accessors of pages in the bucket.  The
-algorithm is:
+Freeing an overflow page requires the process to hold buffer content lock in
+exclusive mode on the containing bucket, so it need not worry about other
+accessors of pages in the bucket.  The algorithm is:
 
 	delink overflow page from bucket chain
 	(this requires read/update/write/release of fore and aft siblings)
@@ -454,14 +506,6 @@ locks.  Since they need no lmgr locks, deadlock is not possible.
 Other Notes
 -----------
 
-All the shenanigans with locking prevent a split occurring while *another*
-process is stopped in a given bucket.  They do not ensure that one of
-our *own* backend's scans is not stopped in the bucket, because lmgr
-doesn't consider a process's own locks to conflict.  So the Split
-algorithm must check for that case separately before deciding it can go
-ahead with the split.  VACUUM does not have this problem since nothing
-else can be happening within the vacuuming backend.
-
-Should we instead try to fix the state of any conflicting local scan?
-Seems mighty ugly --- got to move the held bucket S-lock as well as lots
-of other messiness.  For now, just punt and don't split.
+Cleanup locks prevent a split from occurring while *another* process is
+stopped in a given bucket.  They also ensure that one of our *own* backend's
+scans is not stopped in the bucket.
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..4c25269 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -287,10 +287,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		/*
 		 * An insertion into the current index page could have happened while
 		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold share lock on
-		 * the bucket, no deletions or splits could have occurred; therefore
-		 * we can expect that the TID still exists in the current index page,
-		 * at an offset >= where we were.)
+		 * for the TID we previously returned.  (Because we hold pin on the
+		 * bucket, no deletions or splits could have occurred; therefore we
+		 * can expect that the TID still exists in the current index page, at
+		 * an offset >= where we were.)
 		 */
 		OffsetNumber maxoffnum;
 
@@ -424,17 +424,16 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_bucket_valid = false;
-	so->hashso_bucket_blkno = 0;
 	so->hashso_curbuf = InvalidBuffer;
+	so->hashso_bucket_buf = InvalidBuffer;
+	so->hashso_old_bucket_buf = InvalidBuffer;
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
 	ItemPointerSetInvalid(&(so->hashso_heappos));
 
-	scan->opaque = so;
+	so->hashso_skip_moved_tuples = false;
 
-	/* register scan in case we change pages it's using */
-	_hash_regscan(scan);
+	scan->opaque = so;
 
 	return scan;
 }
@@ -449,15 +448,7 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
-
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	_hash_dropscanbuf(rel, so);
 
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
@@ -469,8 +460,9 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 		memmove(scan->keyData,
 				scankey,
 				scan->numberOfKeys * sizeof(ScanKeyData));
-		so->hashso_bucket_valid = false;
 	}
+
+	so->hashso_skip_moved_tuples = false;
 }
 
 /*
@@ -482,18 +474,7 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/* don't need scan registered anymore */
-	_hash_dropscan(scan);
-
-	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
-
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	_hash_dropscanbuf(rel, so);
 
 	pfree(so);
 	scan->opaque = NULL;
@@ -504,6 +485,9 @@ hashendscan(IndexScanDesc scan)
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
  *
+ * This function also deletes the tuples that are moved by split to another
+ * bucket.
+ *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 IndexBulkDeleteResult *
@@ -548,83 +532,48 @@ loop_top:
 	{
 		BlockNumber bucket_blkno;
 		BlockNumber blkno;
-		bool		bucket_dirty = false;
+		Buffer		bucket_buf;
+		Buffer		buf;
+		HashPageOpaque bucket_opaque;
+		Page		page;
+		bool		bucket_has_garbage = false;
 
 		/* Get address of bucket's start page */
 		bucket_blkno = BUCKET_TO_BLKNO(&local_metapage, cur_bucket);
 
-		/* Exclusive-lock the bucket so we can shrink it */
-		_hash_getlock(rel, bucket_blkno, HASH_EXCLUSIVE);
-
-		/* Shouldn't have any active scans locally, either */
-		if (_hash_has_active_scan(rel, cur_bucket))
-			elog(ERROR, "hash index has active scan during VACUUM");
-
-		/* Scan each page in bucket */
 		blkno = bucket_blkno;
-		while (BlockNumberIsValid(blkno))
-		{
-			Buffer		buf;
-			Page		page;
-			HashPageOpaque opaque;
-			OffsetNumber offno;
-			OffsetNumber maxoffno;
-			OffsetNumber deletable[MaxOffsetNumber];
-			int			ndeletable = 0;
-
-			vacuum_delay_point();
 
-			buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-											 info->strategy);
-			page = BufferGetPage(buf);
-			opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-			Assert(opaque->hasho_bucket == cur_bucket);
-
-			/* Scan each tuple in page */
-			maxoffno = PageGetMaxOffsetNumber(page);
-			for (offno = FirstOffsetNumber;
-				 offno <= maxoffno;
-				 offno = OffsetNumberNext(offno))
-			{
-				IndexTuple	itup;
-				ItemPointer htup;
+		/*
+		 * We need to acquire a cleanup lock on the primary bucket page to
+		 * wait out concurrent scans before deleting the dead tuples.
+		 */
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy);
+		LockBufferForCleanup(buf);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
 
-				itup = (IndexTuple) PageGetItem(page,
-												PageGetItemId(page, offno));
-				htup = &(itup->t_tid);
-				if (callback(htup, callback_state))
-				{
-					/* mark the item for deletion */
-					deletable[ndeletable++] = offno;
-					tuples_removed += 1;
-				}
-				else
-					num_index_tuples += 1;
-			}
+		page = BufferGetPage(buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
-			/*
-			 * Apply deletions and write page if needed, advance to next page.
-			 */
-			blkno = opaque->hasho_nextblkno;
+		/*
+		 * If the bucket contains tuples that are moved by split, then we
+		 * need to delete such tuples.  We can't delete such tuples if the
+		 * split operation on the bucket is not finished, as those tuples are
+		 * still needed by scans.
+		 */
+		if (H_HAS_GARBAGE(bucket_opaque) &&
+			!H_INCOMPLETE_SPLIT(bucket_opaque))
+			bucket_has_garbage = true;
 
-			if (ndeletable > 0)
-			{
-				PageIndexMultiDelete(page, deletable, ndeletable);
-				_hash_wrtbuf(rel, buf);
-				bucket_dirty = true;
-			}
-			else
-				_hash_relbuf(rel, buf);
-		}
+		bucket_buf = buf;
 
-		/* If we deleted anything, try to compact free space */
-		if (bucket_dirty)
-			_hash_squeezebucket(rel, cur_bucket, bucket_blkno,
-								info->strategy);
+		hashbucketcleanup(rel, bucket_buf, blkno, info->strategy,
+						  local_metapage.hashm_maxbucket,
+						  local_metapage.hashm_highmask,
+						  local_metapage.hashm_lowmask, &tuples_removed,
+						  &num_index_tuples, bucket_has_garbage, true,
+						  callback, callback_state);
 
-		/* Release bucket lock */
-		_hash_droplock(rel, bucket_blkno, HASH_EXCLUSIVE);
+		_hash_relbuf(rel, bucket_buf);
 
 		/* Advance to next bucket */
 		cur_bucket++;
@@ -705,6 +654,197 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	return stats;
 }
 
+/*
+ * Helper function to perform deletion of index entries from a bucket.
+ *
+ * This expects that the caller has acquired a cleanup lock on the target
+ * bucket (primary page of a bucket), and it is the responsibility of the
+ * caller to release that lock.
+ *
+ * During the scan of overflow pages, we first need to lock the next page and
+ * then release the lock on the current page.  This ensures that any concurrent
+ * scan started after we start cleaning the bucket will always be behind the
+ * cleanup.  If scans were allowed to get ahead of the cleanup, the cleanup
+ * could remove tuples that are still required by the scan.
+ *
+ * We need to retain a pin on the primary bucket to ensure that no concurrent
+ * split can start.
+ */
+void
+hashbucketcleanup(Relation rel, Buffer bucket_buf,
+				  BlockNumber bucket_blkno,
+				  BufferAccessStrategy bstrategy,
+				  uint32 maxbucket,
+				  uint32 highmask, uint32 lowmask,
+				  double *tuples_removed,
+				  double *num_index_tuples,
+				  bool bucket_has_garbage,
+				  bool delay,
+				  IndexBulkDeleteCallback callback,
+				  void *callback_state)
+{
+	BlockNumber blkno;
+	Buffer		buf;
+	Bucket		cur_bucket;
+	Bucket new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
+	Page		page;
+	bool		bucket_dirty = false;
+
+	blkno = bucket_blkno;
+	buf = bucket_buf;
+	page = BufferGetPage(buf);
+	cur_bucket = ((HashPageOpaque) PageGetSpecialPointer(page))->hasho_bucket;
+
+	if (bucket_has_garbage)
+		new_bucket = _hash_get_newbucket(rel, cur_bucket,
+										 lowmask, maxbucket);
+
+	/* Scan each page in bucket */
+	for (;;)
+	{
+		HashPageOpaque opaque;
+		OffsetNumber offno;
+		OffsetNumber maxoffno;
+		Buffer		next_buf;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+		bool		retain_pin = false;
+		bool		curr_page_dirty = false;
+
+		if (delay)
+			vacuum_delay_point();
+
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		/* Scan each tuple in page */
+		maxoffno = PageGetMaxOffsetNumber(page);
+		for (offno = FirstOffsetNumber;
+			 offno <= maxoffno;
+			 offno = OffsetNumberNext(offno))
+		{
+			IndexTuple	itup;
+			ItemPointer htup;
+			Bucket		bucket;
+
+			itup = (IndexTuple) PageGetItem(page,
+											PageGetItemId(page, offno));
+			htup = &(itup->t_tid);
+			if (callback && callback(htup, callback_state))
+			{
+				/* mark the item for deletion */
+				deletable[ndeletable++] = offno;
+				if (tuples_removed)
+					*tuples_removed += 1;
+			}
+			else if (bucket_has_garbage)
+			{
+				/* delete the tuples that are moved by split. */
+				bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+											  maxbucket,
+											  highmask,
+											  lowmask);
+				/* mark the item for deletion */
+				if (bucket != cur_bucket)
+				{
+					/*
+					 * We expect tuples to either belong to current bucket or
+					 * new_bucket.  This is ensured because we don't allow
+					 * further splits from bucket that contains garbage. See
+					 * comments in _hash_expandtable.
+					 */
+					Assert(bucket == new_bucket);
+					deletable[ndeletable++] = offno;
+				}
+				else if (num_index_tuples)
+					*num_index_tuples += 1;
+			}
+			else if (num_index_tuples)
+				*num_index_tuples += 1;
+		}
+
+		/* retain the pin on primary bucket page till end of bucket scan */
+		if (blkno == bucket_blkno)
+			retain_pin = true;
+		else
+			retain_pin = false;
+
+		blkno = opaque->hasho_nextblkno;
+
+		/*
+		 * Apply deletions, advance to next page and write page if needed.
+		 */
+		if (ndeletable > 0)
+		{
+			PageIndexMultiDelete(page, deletable, ndeletable);
+			bucket_dirty = true;
+			curr_page_dirty = true;
+		}
+
+		/* bail out if there are no more pages to scan. */
+		if (!BlockNumberIsValid(blkno))
+			break;
+
+		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											  LH_OVERFLOW_PAGE,
+											  bstrategy);
+
+		/*
+		 * release the lock on previous page after acquiring the lock on next
+		 * page
+		 */
+		if (curr_page_dirty)
+		{
+			if (retain_pin)
+				_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+			else
+				_hash_wrtbuf(rel, buf);
+			curr_page_dirty = false;
+		}
+		else if (retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
+
+		buf = next_buf;
+	}
+
+	/*
+	 * lock the bucket page to clear the garbage flag and squeeze the bucket.
+	 * if the current buffer is same as bucket buffer, then we already have
+	 * lock on bucket page.
+	 */
+	if (buf != bucket_buf)
+	{
+		_hash_relbuf(rel, buf);
+		_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+	}
+
+	/*
+	 * Clear the garbage flag from bucket after deleting the tuples that are
+	 * moved by split.  We purposefully clear the flag before squeezing the
+	 * bucket, so that after a restart, vacuum shouldn't again try to delete
+	 * the moved-by-split tuples.
+	 */
+	if (bucket_has_garbage)
+	{
+		HashPageOpaque bucket_opaque;
+
+		page = BufferGetPage(bucket_buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+	}
+
+	/*
+	 * If we deleted anything, try to compact free space.  For squeezing the
+	 * bucket, we must have a cleanup lock, else it can impact the ordering of
+	 * tuples for a scan that has started before it.
+	 */
+	if (bucket_dirty && IsBufferCleanupOK(bucket_buf))
+		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
+							bstrategy);
+}
 
 void
 hash_redo(XLogReaderState *record)
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index acd2e64..bd39333 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -28,7 +28,8 @@
 void
 _hash_doinsert(Relation rel, IndexTuple itup)
 {
-	Buffer		buf;
+	Buffer		buf = InvalidBuffer;
+	Buffer		bucket_buf;
 	Buffer		metabuf;
 	HashMetaPage metap;
 	BlockNumber blkno;
@@ -40,6 +41,9 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	bool		do_expand;
 	uint32		hashkey;
 	Bucket		bucket;
+	uint32		maxbucket;
+	uint32		highmask;
+	uint32		lowmask;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -96,9 +100,11 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		{
 			if (oldblkno == blkno)
 				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
+			_hash_relbuf(rel, buf);
 		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+
+		/* Fetch the primary bucket page for the bucket */
+		buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
 
 		/*
 		 * Reacquire metapage lock and check that no bucket split has taken
@@ -109,12 +115,55 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		retry = true;
 	}
 
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+	/*
+	 * Copy bucket mapping info now; the comment in _hash_expandtable where
+	 * we copy this information and call _hash_splitbucket explains why this
+	 * is OK.
+	 */
+	maxbucket = metap->hashm_maxbucket;
+	highmask = metap->hashm_highmask;
+	lowmask = metap->hashm_lowmask;
+
+	/* remember the primary bucket buffer to release the pin on it at end. */
+	bucket_buf = buf;
+
 	page = BufferGetPage(buf);
 	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(pageopaque->hasho_bucket == bucket);
 
+	/*
+	 * If there is any pending split, try to finish it before proceeding for
+	 * If there is any pending split, try to finish it before proceeding with
+	 * the insertion.  We try to finish the split when inserting into the old
+	 * bucket, as that will allow us to remove the tuples from the old bucket
+	 * and reuse the space.  There is no such apparent benefit from finishing
+	 * the split during insertion into the new bucket.
+	 * In future, if we want to finish the splits during insertion in new
+	 * bucket, we must ensure the locking order such that old bucket is locked
+	 * before new bucket.
+	 */
+	if (H_OLD_INCOMPLETE_SPLIT(pageopaque) && IsBufferCleanupOK(buf))
+	{
+		BlockNumber nblkno;
+		Buffer		nbuf;
+
+		nblkno = _hash_get_newblk(rel, pageopaque);
+
+		/* Fetch the primary bucket page for the new bucket */
+		nbuf = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+		if (nbuf)
+		{
+			_hash_finish_split(rel, metabuf, buf, nbuf, maxbucket,
+							   highmask, lowmask);
+
+			/*
+			 * release the buffer here as the insertion will happen in old
+			 * bucket.
+			 */
+			_hash_relbuf(rel, nbuf);
+		}
+	}
+
 	/* Do the insertion */
 	while (PageGetFreeSpace(page) < itemsz)
 	{
@@ -127,14 +176,23 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		{
 			/*
 			 * ovfl page exists; go get it.  if it doesn't have room, we'll
-			 * find out next pass through the loop test above.
+			 * find out next pass through the loop test above.  Retain the
+			 * pin, if it is a primary bucket page.
 			 */
-			_hash_relbuf(rel, buf);
+			if (pageopaque->hasho_flag & LH_BUCKET_PAGE)
+				_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+			else
+				_hash_relbuf(rel, buf);
 			buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 			page = BufferGetPage(buf);
 		}
 		else
 		{
+			bool		retain_pin = false;
+
+			/* page flags must be accessed before releasing lock on a page. */
+			retain_pin = pageopaque->hasho_flag & LH_BUCKET_PAGE;
+
 			/*
 			 * we're at the end of the bucket chain and we haven't found a
 			 * page with enough room.  allocate a new overflow page.
@@ -144,7 +202,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 
 			/* chain to a new overflow page */
-			buf = _hash_addovflpage(rel, metabuf, buf);
+			buf = _hash_addovflpage(rel, metabuf, buf, retain_pin);
 			page = BufferGetPage(buf);
 
 			/* should fit now, given test above */
@@ -158,11 +216,13 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	/* found page with enough space, so add the item here */
 	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
 
-	/* write and release the modified page */
+	/*
+	 * write and release the modified page, and make sure to release the pin
+	 * on the primary bucket page.
+	 */
 	_hash_wrtbuf(rel, buf);
-
-	/* We can drop the bucket lock now */
-	_hash_droplock(rel, blkno, HASH_SHARE);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
 
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index db3e268..58e15f3 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -82,23 +82,20 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *
  *	On entry, the caller must hold a pin but no lock on 'buf'.  The pin is
  *	dropped before exiting (we assume the caller is not interested in 'buf'
- *	anymore).  The returned overflow page will be pinned and write-locked;
- *	it is guaranteed to be empty.
+ *	anymore) if not asked to retain.  The pin will be retained only for the
+ *	primary bucket.  The returned overflow page will be pinned and
+ *	write-locked; it is guaranteed to be empty.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
  *
- *	The caller must hold at least share lock on the bucket, to ensure that
- *	no one else tries to compact the bucket meanwhile.  This guarantees that
- *	'buf' won't stop being part of the bucket while it's unlocked.
- *
  * NB: since this could be executed concurrently by multiple processes,
  * one should not assume that the returned overflow page will be the
  * immediate successor of the originally passed 'buf'.  Additional overflow
  * pages might have been added to the bucket chain in between.
  */
 Buffer
-_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
+_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 {
 	Buffer		ovflbuf;
 	Page		page;
@@ -131,7 +128,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 			break;
 
 		/* we assume we do not need to write the unmodified page */
-		_hash_relbuf(rel, buf);
+		if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
 
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
@@ -149,7 +149,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 
 	/* logically chain overflow page to previous page */
 	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	_hash_wrtbuf(rel, buf);
+	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+		_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_wrtbuf(rel, buf);
 
 	return ovflbuf;
 }
@@ -370,11 +373,11 @@ _hash_firstfreebit(uint32 map)
  *	in the bucket, or InvalidBlockNumber if no following page.
  *
  *	NB: caller must not hold lock on metapage, nor on either page that's
- *	adjacent in the bucket chain.  The caller had better hold exclusive lock
- *	on the bucket, too.
+ *	adjacent in the bucket chain, except for the primary bucket.  The caller had
+ *	better hold cleanup lock on the primary bucket page.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf,
+_hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
@@ -413,22 +416,41 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf,
 	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
-	 * deleted.  No concurrency issues since we hold exclusive lock on the
-	 * entire bucket.
+	 * deleted.  No concurrency issues since we hold the cleanup lock on the
+	 * primary bucket.  We don't need to acquire a buffer lock to fix the
+	 * primary bucket, as we already have that lock.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
-														 prevblkno,
-														 HASH_WRITE,
+		if (prevblkno == bucket_blkno)
+		{
+			Buffer		prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
+													 prevblkno,
+													 RBM_NORMAL,
+													 bstrategy);
+
+			Page		prevpage = BufferGetPage(prevbuf);
+			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+			Assert(prevopaque->hasho_bucket == bucket);
+			prevopaque->hasho_nextblkno = nextblkno;
+			MarkBufferDirty(prevbuf);
+			ReleaseBuffer(prevbuf);
+		}
+		else
+		{
+			Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
+															 prevblkno,
+															 HASH_WRITE,
 										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		prevpage = BufferGetPage(prevbuf);
-		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+															 bstrategy);
+			Page		prevpage = BufferGetPage(prevbuf);
+			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
 
-		Assert(prevopaque->hasho_bucket == bucket);
-		prevopaque->hasho_nextblkno = nextblkno;
-		_hash_wrtbuf(rel, prevbuf);
+			Assert(prevopaque->hasho_bucket == bucket);
+			prevopaque->hasho_nextblkno = nextblkno;
+			_hash_wrtbuf(rel, prevbuf);
+		}
 	}
 	if (BlockNumberIsValid(nextblkno))
 	{
@@ -570,7 +592,7 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
  *	required that to be true on entry as well, but it's a lot easier for
  *	callers to leave empty overflow pages and let this guy clean it up.
  *
- *	Caller must hold exclusive lock on the target bucket.  This allows
+ *	Caller must hold cleanup lock on the target bucket.  This allows
  *	us to safely lock multiple pages in the bucket.
  *
  *	Since this function is invoked in VACUUM, we provide an access strategy
@@ -580,6 +602,7 @@ void
 _hash_squeezebucket(Relation rel,
 					Bucket bucket,
 					BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy)
 {
 	BlockNumber wblkno;
@@ -591,27 +614,22 @@ _hash_squeezebucket(Relation rel,
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
 	bool		wbuf_dirty;
+	bool		release_buf = false;
 
 	/*
-	 * start squeezing into the base bucket page.
+	 * start squeezing into the primary bucket page.
 	 */
 	wblkno = bucket_blkno;
-	wbuf = _hash_getbuf_with_strategy(rel,
-									  wblkno,
-									  HASH_WRITE,
-									  LH_BUCKET_PAGE,
-									  bstrategy);
+	wbuf = bucket_buf;
 	wpage = BufferGetPage(wbuf);
 	wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 
 	/*
-	 * if there aren't any overflow pages, there's nothing to squeeze.
+	 * if there aren't any overflow pages, there's nothing to squeeze.  The
+	 * caller is responsible for releasing the lock on the primary bucket page.
 	 */
 	if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
-	{
-		_hash_relbuf(rel, wbuf);
 		return;
-	}
 
 	/*
 	 * Find the last page in the bucket chain by starting at the base bucket
@@ -656,6 +674,10 @@ _hash_squeezebucket(Relation rel,
 			IndexTuple	itup;
 			Size		itemsz;
 
+			/* skip dead tuples */
+			if (ItemIdIsDead(PageGetItemId(rpage, roffnum)))
+				continue;
+
 			itup = (IndexTuple) PageGetItem(rpage,
 											PageGetItemId(rpage, roffnum));
 			itemsz = IndexTupleDSize(*itup);
@@ -669,12 +691,17 @@ _hash_squeezebucket(Relation rel,
 			{
 				Assert(!PageIsEmpty(wpage));
 
+				if (wblkno != bucket_blkno)
+					release_buf = true;
+
 				wblkno = wopaque->hasho_nextblkno;
 				Assert(BlockNumberIsValid(wblkno));
 
-				if (wbuf_dirty)
+				if (wbuf_dirty && release_buf)
 					_hash_wrtbuf(rel, wbuf);
-				else
+				else if (wbuf_dirty)
+					MarkBufferDirty(wbuf);
+				else if (release_buf)
 					_hash_relbuf(rel, wbuf);
 
 				/* nothing more to do if we reached the read page */
@@ -700,6 +727,7 @@ _hash_squeezebucket(Relation rel,
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
 				wbuf_dirty = false;
+				release_buf = false;
 			}
 
 			/*
@@ -733,19 +761,25 @@ _hash_squeezebucket(Relation rel,
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
 		{
-			/* yes, so release wbuf lock first */
-			if (wbuf_dirty)
+			if (wblkno != bucket_blkno)
+				release_buf = true;
+
+			/* yes, so release wbuf lock first if needed */
+			if (wbuf_dirty && release_buf)
 				_hash_wrtbuf(rel, wbuf);
-			else
+			else if (wbuf_dirty)
+				MarkBufferDirty(wbuf);
+			else if (release_buf)
 				_hash_relbuf(rel, wbuf);
+
 			/* free this overflow page (releases rbuf) */
-			_hash_freeovflpage(rel, rbuf, bstrategy);
+			_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
 			/* done */
 			return;
 		}
 
 		/* free this overflow page, then get the previous one */
-		_hash_freeovflpage(rel, rbuf, bstrategy);
+		_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
 
 		rbuf = _hash_getbuf_with_strategy(rel,
 										  rblkno,
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 178463f..36cacc8 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -38,10 +38,14 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
 					uint32 nblocks);
 static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
+static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
+					   Bucket obucket, Bucket nbucket, Buffer obuf,
+					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
+					   uint32 highmask, uint32 lowmask);
 
 
 /*
@@ -55,46 +59,6 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 
 
 /*
- * _hash_getlock() -- Acquire an lmgr lock.
- *
- * 'whichlock' should the block number of a bucket's primary bucket page to
- * acquire the per-bucket lock.  (See README for details of the use of these
- * locks.)
- *
- * 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
- */
-void
-_hash_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		LockPage(rel, whichlock, access);
-}
-
-/*
- * _hash_try_getlock() -- Acquire an lmgr lock, but only if it's free.
- *
- * Same as above except we return FALSE without blocking if lock isn't free.
- */
-bool
-_hash_try_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		return ConditionalLockPage(rel, whichlock, access);
-	else
-		return true;
-}
-
-/*
- * _hash_droplock() -- Release an lmgr lock.
- */
-void
-_hash_droplock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		UnlockPage(rel, whichlock, access);
-}
-
-/*
  *	_hash_getbuf() -- Get a buffer by block number for read or write.
  *
  *		'access' must be HASH_READ, HASH_WRITE, or HASH_NOLOCK.
@@ -132,6 +96,35 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access, int flags)
 }
 
 /*
+ * _hash_getbuf_with_condlock_cleanup() -- as above, but get the buffer for write.
+ *
+ *		We try to take the conditional cleanup lock and if we get it then
+ *		return the buffer, else return InvalidBuffer.
+ */
+Buffer
+_hash_getbuf_with_condlock_cleanup(Relation rel, BlockNumber blkno, int flags)
+{
+	Buffer		buf;
+
+	if (blkno == P_NEW)
+		elog(ERROR, "hash AM does not use P_NEW");
+
+	buf = ReadBuffer(rel, blkno);
+
+	if (!ConditionalLockBufferForCleanup(buf))
+	{
+		ReleaseBuffer(buf);
+		return InvalidBuffer;
+	}
+
+	/* ref count and lock type are correct */
+
+	_hash_checkpage(rel, buf, flags);
+
+	return buf;
+}
+
+/*
  *	_hash_getinitbuf() -- Get and initialize a buffer by block number.
  *
  *		This must be used only to fetch pages that are known to be before
@@ -266,6 +259,33 @@ _hash_dropbuf(Relation rel, Buffer buf)
 }
 
 /*
+ *	_hash_dropscanbuf() -- release buffers used in scan.
+ *
+ * This routine unpins the buffers used during scan on which we
+ * hold no lock.
+ */
+void
+_hash_dropscanbuf(Relation rel, HashScanOpaque so)
+{
+	/* release pin we hold on primary bucket */
+	if (BufferIsValid(so->hashso_bucket_buf) &&
+		so->hashso_bucket_buf != so->hashso_curbuf)
+		_hash_dropbuf(rel, so->hashso_bucket_buf);
+	so->hashso_bucket_buf = InvalidBuffer;
+
+	/* release pin we hold on old primary bucket */
+	if (BufferIsValid(so->hashso_old_bucket_buf) &&
+		so->hashso_old_bucket_buf != so->hashso_curbuf)
+		_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+	so->hashso_old_bucket_buf = InvalidBuffer;
+
+	/* release any pin we still hold */
+	if (BufferIsValid(so->hashso_curbuf))
+		_hash_dropbuf(rel, so->hashso_curbuf);
+	so->hashso_curbuf = InvalidBuffer;
+}
+
+/*
  *	_hash_wrtbuf() -- write a hash page to disk.
  *
  *		This routine releases the lock held on the buffer and our refcount
@@ -489,9 +509,11 @@ _hash_pageinit(Page page, Size size)
 /*
  * Attempt to expand the hash table by creating one new bucket.
  *
- * This will silently do nothing if it cannot get the needed locks.
+ * This will silently do nothing if we don't get a cleanup lock on the old
+ * or new bucket.
  *
- * The caller should hold no locks on the hash index.
+ * Complete any pending split and remove tuples from the old bucket,
+ * if there are any left over from a previous split.
  *
  * The caller must hold a pin, but no lock, on the metapage buffer.
  * The buffer is returned in the same state.
@@ -506,10 +528,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	BlockNumber start_oblkno;
 	BlockNumber start_nblkno;
 	Buffer		buf_nblkno;
+	Buffer		buf_oblkno;
+	Page		opage;
+	HashPageOpaque oopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
 
+restart_expand:
+
 	/*
 	 * Write-lock the meta page.  It used to be necessary to acquire a
 	 * heavyweight lock to begin a split, but that is no longer required.
@@ -548,11 +575,16 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		goto fail;
 
 	/*
-	 * Determine which bucket is to be split, and attempt to lock the old
-	 * bucket.  If we can't get the lock, give up.
+	 * Determine which bucket is to be split, and attempt to take cleanup lock
+	 * on the old bucket.  If we can't get the lock, give up.
 	 *
-	 * The lock protects us against other backends, but not against our own
-	 * backend.  Must check for active scans separately.
+	 * The cleanup lock protects us not only against other backends, but
+	 * against our own backend as well.
+	 *
+	 * The cleanup lock is mainly to protect the split from concurrent
+	 * inserts. See src/backend/access/hash/README, Lock Definitions for
+	 * further details.  Due to this locking restriction, if there is any
+	 * pending scan, the split will give up, which is not good, but harmless.
 	 */
 	new_bucket = metap->hashm_maxbucket + 1;
 
@@ -560,14 +592,90 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 
 	start_oblkno = BUCKET_TO_BLKNO(metap, old_bucket);
 
-	if (_hash_has_active_scan(rel, old_bucket))
+	buf_oblkno = _hash_getbuf_with_condlock_cleanup(rel, start_oblkno, LH_BUCKET_PAGE);
+	if (!buf_oblkno)
 		goto fail;
 
-	if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
-		goto fail;
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 
 	/*
-	 * Likewise lock the new bucket (should never fail).
+	 * We want to finish any pending split from this bucket, as there is no
+	 * apparent benefit in not doing so, and finishing a split that involves
+	 * multiple buckets (e.g. if the new split also fails) would complicate
+	 * the code.  We don't need to consider the new bucket for completing the
+	 * split here, as it is not possible for a re-split of the new bucket to
+	 * start while there is still a pending split from the old bucket.
+	 */
+	if (H_OLD_INCOMPLETE_SPLIT(oopaque))
+	{
+		BlockNumber nblkno;
+		Buffer		buf_nblkno;
+
+		/*
+		 * Copy bucket mapping info now; the comment in the code below where we
+		 * copy this information and call _hash_splitbucket explains why this
+		 * is OK.
+		 */
+		maxbucket = metap->hashm_maxbucket;
+		highmask = metap->hashm_highmask;
+		lowmask = metap->hashm_lowmask;
+
+		/* Release the metapage lock, before completing the split. */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+		nblkno = _hash_get_newblk(rel, oopaque);
+
+		/* Fetch the primary bucket page for the new bucket */
+		buf_nblkno = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+		if (!buf_nblkno)
+		{
+			_hash_relbuf(rel, buf_oblkno);
+			return;
+		}
+
+		_hash_finish_split(rel, metabuf, buf_oblkno, buf_nblkno, maxbucket,
+						   highmask, lowmask);
+
+		/*
+		 * release the buffers and retry the expansion.
+		 */
+		_hash_relbuf(rel, buf_oblkno);
+		_hash_relbuf(rel, buf_nblkno);
+
+		goto restart_expand;
+	}
+
+	/*
+	 * Clean up the tuples remaining from the previous split.  This operation
+	 * requires a cleanup lock and we already have one on the old bucket, so
+	 * let's do it.  We also don't want to allow further splits from the
+	 * bucket until the garbage of the previous split is cleaned.  This has
+	 * two advantages: first, it helps avoid bloat due to garbage; second,
+	 * during cleanup of a bucket we are always sure that the garbage tuples
+	 * belong to the most recently split bucket.  On the contrary, if we
+	 * allowed cleanup of a bucket after the metapage is updated to indicate
+	 * the new split but before the actual split, the cleanup operation would
+	 * not be able to decide whether a tuple had been moved to the newly
+	 * created bucket, and could end up deleting such tuples.
+	 */
+	if (H_HAS_GARBAGE(oopaque))
+	{
+		/* Release the metapage lock. */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+		hashbucketcleanup(rel, buf_oblkno, start_oblkno, NULL,
+						  metap->hashm_maxbucket, metap->hashm_highmask,
+						  metap->hashm_lowmask, NULL,
+						  NULL, true, false, NULL, NULL);
+
+		_hash_relbuf(rel, buf_oblkno);
+
+		goto restart_expand;
+	}
+
+	/*
+	 * There shouldn't be any active scan on the new bucket.
 	 *
 	 * Note: it is safe to compute the new bucket's blkno here, even though we
 	 * may still need to update the BUCKET_TO_BLKNO mapping.  This is because
@@ -576,12 +684,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	 */
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
-	if (_hash_has_active_scan(rel, new_bucket))
-		elog(ERROR, "scan in progress on supposedly new bucket");
-
-	if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
-		elog(ERROR, "could not get lock on supposedly new bucket");
-
 	/*
 	 * If the split point is increasing (hashm_maxbucket's log base 2
 	 * increases), we need to allocate a new batch of bucket pages.
@@ -600,8 +702,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
 			/* can't split due to BlockNumber overflow */
-			_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-			_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
+			_hash_relbuf(rel, buf_oblkno);
 			goto fail;
 		}
 	}
@@ -609,9 +710,18 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/*
 	 * Physically allocate the new bucket's primary page.  We want to do this
 	 * before changing the metapage's mapping info, in case we can't get the
-	 * disk space.
+	 * disk space.  Ideally, we wouldn't need to check for a cleanup lock on the
+	 * new bucket, as no other backend can find this bucket until the metapage
+	 * is updated.  However, it is good to be consistent with old bucket locking.
 	 */
 	buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM);
+	if (!IsBufferCleanupOK(buf_nblkno))
+	{
+		_hash_relbuf(rel, buf_oblkno);
+		_hash_relbuf(rel, buf_nblkno);
+		goto fail;
+	}
+
 
 	/*
 	 * Okay to proceed with split.  Update the metapage bucket mapping info.
@@ -665,13 +775,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  start_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno,
 					  maxbucket, highmask, lowmask);
 
-	/* Release bucket locks, allowing others to access them */
-	_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-	_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
-
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -738,13 +844,17 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
  * belong in the new bucket, and compress out any free space in the old
  * bucket.
  *
- * The caller must hold exclusive locks on both buckets to ensure that
+ * The caller must hold cleanup locks on both buckets to ensure that
  * no one else is trying to access them (see README).
  *
  * The caller must hold a pin, but no lock, on the metapage buffer.
  * The buffer is returned in the same state.  (The metapage is only
  * touched if it becomes necessary to add or remove overflow pages.)
  *
+ * The split needs to retain a pin on the primary bucket pages of both the
+ * old and new buckets till the end of the operation.  This is to prevent
+ * vacuum from starting while the split is in progress.
+ *
  * In addition, the caller must have created the new bucket's base page,
  * which is passed in buffer nbuf, pinned and write-locked.  That lock and
  * pin are released here.  (The API is set up this way because we must do
@@ -756,37 +866,87 @@ _hash_splitbucket(Relation rel,
 				  Buffer metabuf,
 				  Bucket obucket,
 				  Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Buffer		obuf;
 	Page		opage;
 	Page		npage;
 	HashPageOpaque oopaque;
 	HashPageOpaque nopaque;
 
-	/*
-	 * It should be okay to simultaneously write-lock pages from each bucket,
-	 * since no one else can be trying to acquire buffer lock on pages of
-	 * either bucket.
-	 */
-	obuf = _hash_getbuf(rel, start_oblkno, HASH_WRITE, LH_BUCKET_PAGE);
 	opage = BufferGetPage(obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 
+	/*
+	 * Mark the old bucket to indicate that a split is in progress and that it has
+	 * deletable tuples.  At the end of the operation we clear the split-in-progress
+	 * flag, and vacuum will clear the page-has-garbage flag after deleting such tuples.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
+
 	npage = BufferGetPage(nbuf);
 
-	/* initialize the new bucket's primary page */
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 	nopaque->hasho_prevblkno = InvalidBlockNumber;
 	nopaque->hasho_nextblkno = InvalidBlockNumber;
 	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
 	nopaque->hasho_page_id = HASHO_PAGE_ID;
 
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, nbuf, NULL,
+						   maxbucket, highmask, lowmask);
+
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, obuf);
+	_hash_relbuf(rel, nbuf);
+}
+
+/*
+ * _hash_splitbucket_guts -- Helper function to perform the split operation
+ *
+ * This routine partitions the tuples between the old and new bucket and is
+ * also used to finish incomplete split operations.  To finish a previously
+ * interrupted split, the caller needs to fill htab.  If htab is set, we skip
+ * moving tuples that exist in htab; a NULL htab means that all tuples
+ * belonging to the new bucket are moved.
+ *
+ * Caller needs to lock and unlock the old and new primary buckets.
+ */
+static void
+_hash_splitbucket_guts(Relation rel,
+					   Buffer metabuf,
+					   Bucket obucket,
+					   Bucket nbucket,
+					   Buffer obuf,
+					   Buffer nbuf,
+					   HTAB *htab,
+					   uint32 maxbucket,
+					   uint32 highmask,
+					   uint32 lowmask)
+{
+	Buffer		bucket_obuf;
+	Buffer		bucket_nbuf;
+	Page		opage;
+	Page		npage;
+	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
+
+	bucket_obuf = obuf;
+	opage = BufferGetPage(obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	bucket_nbuf = nbuf;
+	npage = BufferGetPage(nbuf);
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
 	/*
 	 * Partition the tuples in the old bucket between the old bucket and the
 	 * new bucket, advancing along the old bucket's overflow bucket chain and
@@ -798,8 +958,6 @@ _hash_splitbucket(Relation rel,
 		BlockNumber oblkno;
 		OffsetNumber ooffnum;
 		OffsetNumber omaxoffnum;
-		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
 
 		/* Scan each tuple in old page */
 		omaxoffnum = PageGetMaxOffsetNumber(opage);
@@ -810,39 +968,73 @@ _hash_splitbucket(Relation rel,
 			IndexTuple	itup;
 			Size		itemsz;
 			Bucket		bucket;
+			bool		found = false;
+
+			/* skip dead tuples */
+			if (ItemIdIsDead(PageGetItemId(opage, ooffnum)))
+				continue;
 
 			/*
-			 * Fetch the item's hash key (conveniently stored in the item) and
-			 * determine which bucket it now belongs in.
+			 * Before inserting a tuple, probe the hash table containing TIDs
+			 * of tuples belonging to the new bucket; if we find a match, skip
+			 * that tuple, else fetch the item's hash key (conveniently stored
+			 * in the item) and determine which bucket it now belongs in.
 			 */
 			itup = (IndexTuple) PageGetItem(opage,
 											PageGetItemId(opage, ooffnum));
+
+			if (htab)
+				(void) hash_search(htab, &itup->t_tid, HASH_FIND, &found);
+
+			if (found)
+				continue;
+
 			bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
 										  maxbucket, highmask, lowmask);
 
 			if (bucket == nbucket)
 			{
+				Size		itupsize = 0;
+				IndexTuple	new_itup;
+
+				/*
+				 * make a copy of the index tuple, as we have to scribble on it.
+				 */
+				new_itup = CopyIndexTuple(itup);
+
+				/*
+				 * mark the index tuple as moved by split; such tuples are
+				 * skipped by scans while a split is in progress for the bucket.
+				 */
+				itupsize = new_itup->t_info & INDEX_SIZE_MASK;
+				new_itup->t_info &= ~INDEX_SIZE_MASK;
+				new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
+				new_itup->t_info |= itupsize;
+
 				/*
 				 * insert the tuple into the new bucket.  if it doesn't fit on
 				 * the current page in the new bucket, we must allocate a new
 				 * overflow page and place the tuple on that page instead.
-				 *
-				 * XXX we have a problem here if we fail to get space for a
-				 * new overflow page: we'll error out leaving the bucket split
-				 * only partially complete, meaning the index is corrupt,
-				 * since searches may fail to find entries they should find.
 				 */
-				itemsz = IndexTupleDSize(*itup);
+				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
 				if (PageGetFreeSpace(npage) < itemsz)
 				{
+					bool		retain_pin = false;
+
+					/*
+					 * page flags must be accessed before releasing lock on a
+					 * page.
+					 */
+					retain_pin = nopaque->hasho_flag & LH_BUCKET_PAGE;
+
 					/* write out nbuf and drop lock, but keep pin */
 					_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
 					/* chain to a new overflow page */
-					nbuf = _hash_addovflpage(rel, metabuf, nbuf);
+					nbuf = _hash_addovflpage(rel, metabuf, nbuf, retain_pin);
 					npage = BufferGetPage(nbuf);
-					/* we don't need nopaque within the loop */
+					nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 				}
 
 				/*
@@ -852,12 +1044,10 @@ _hash_splitbucket(Relation rel,
 				 * Possible future improvement: accumulate all the items for
 				 * the new page and qsort them before insertion.
 				 */
-				(void) _hash_pgaddtup(rel, nbuf, itemsz, itup);
+				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
-				/*
-				 * Mark tuple for deletion from old page.
-				 */
-				deletable[ndeletable++] = ooffnum;
+				/* be tidy */
+				pfree(new_itup);
 			}
 			else
 			{
@@ -870,15 +1060,9 @@ _hash_splitbucket(Relation rel,
 
 		oblkno = oopaque->hasho_nextblkno;
 
-		/*
-		 * Done scanning this old page.  If we moved any tuples, delete them
-		 * from the old page.
-		 */
-		if (ndeletable > 0)
-		{
-			PageIndexMultiDelete(opage, deletable, ndeletable);
-			_hash_wrtbuf(rel, obuf);
-		}
+		/* retain the pin on the old primary bucket */
+		if (obuf == bucket_obuf)
+			_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
 		else
 			_hash_relbuf(rel, obuf);
 
@@ -887,18 +1071,153 @@ _hash_splitbucket(Relation rel,
 			break;
 
 		/* Else, advance to next old page */
-		obuf = _hash_getbuf(rel, oblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
+		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
 		opage = BufferGetPage(obuf);
 		oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 	}
 
 	/*
 	 * We're at the end of the old bucket chain, so we're done partitioning
-	 * the tuples.  Before quitting, call _hash_squeezebucket to ensure the
-	 * tuples remaining in the old bucket (including the overflow pages) are
-	 * packed as tightly as possible.  The new bucket is already tight.
+	 * the tuples.  Mark the old and new buckets to indicate split is
+	 * finished.
+	 *
+	 * To avoid deadlocks due to locking order of buckets, first lock the old
+	 * bucket and then the new bucket.
+	 */
+	if (nopaque->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_wrtbuf(rel, nbuf);
+
+	/*
+	 * Acquiring the cleanup lock to clear the split-in-progress flag ensures that
+	 * no scan that saw the flag set is still in progress when it is cleared.
 	 */
-	_hash_wrtbuf(rel, nbuf);
+	_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+	opage = BufferGetPage(bucket_obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	_hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);
+	npage = BufferGetPage(bucket_nbuf);
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
+	/* indicate that split is finished */
+	oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
+	nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;
+
+	/*
+	 * now mark the buffers dirty; we don't release the locks here, as the
+	 * caller is responsible for releasing them.
+	 */
+	MarkBufferDirty(bucket_obuf);
+	MarkBufferDirty(bucket_nbuf);
+}
+
+/*
+ *	_hash_finish_split() -- Finish the previously interrupted split operation
+ *
+ * To complete the split operation, we build a hash table of the TIDs already
+ * present in the new bucket; the split operation then uses it to skip tuples
+ * that were already moved before the split was interrupted.
+ *
+ * The caller must hold a pin, but no lock, on the metapage buffer.
+ * The buffer is returned in the same state.  (The metapage is only
+ * touched if it becomes necessary to add or remove overflow pages.)
+ *
+ * 'obuf' and 'nbuf' must be locked by the caller, which is also responsible
+ * for unlocking them.
+ */
+void
+_hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Buffer nbuf,
+				   uint32 maxbucket, uint32 highmask, uint32 lowmask)
+{
+	HASHCTL		hash_ctl;
+	HTAB	   *tidhtab;
+	Buffer		bucket_nbuf;
+	Page		opage;
+	Page		npage;
+	HashPageOpaque opageopaque;
+	HashPageOpaque npageopaque;
+	Bucket		obucket;
+	Bucket		nbucket;
+	bool		found;
+
+	/* Initialize hash tables used to track TIDs */
+	memset(&hash_ctl, 0, sizeof(hash_ctl));
+	hash_ctl.keysize = sizeof(ItemPointerData);
+	hash_ctl.entrysize = sizeof(ItemPointerData);
+	hash_ctl.hcxt = CurrentMemoryContext;
+
+	tidhtab =
+		hash_create("bucket ctids",
+					256,		/* arbitrary initial size */
+					&hash_ctl,
+					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+	/*
+	 * Scan the new bucket and build hash table of TIDs
+	 */
+	bucket_nbuf = nbuf;
+	npage = BufferGetPage(nbuf);
+	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	for (;;)
+	{
+		BlockNumber nblkno;
+		OffsetNumber noffnum;
+		OffsetNumber nmaxoffnum;
+
+		/* Scan each tuple in new page */
+		nmaxoffnum = PageGetMaxOffsetNumber(npage);
+		for (noffnum = FirstOffsetNumber;
+			 noffnum <= nmaxoffnum;
+			 noffnum = OffsetNumberNext(noffnum))
+		{
+			IndexTuple	itup;
+
+			/* Fetch the item's TID and insert it in hash table. */
+			itup = (IndexTuple) PageGetItem(npage,
+											PageGetItemId(npage, noffnum));
+
+			(void) hash_search(tidhtab, &itup->t_tid, HASH_ENTER, &found);
+
+			Assert(!found);
+		}
+
+		nblkno = npageopaque->hasho_nextblkno;
+
+		/*
+		 * release our lock without modifying the buffer, and retain the pin
+		 * only if this is the primary bucket page.
+		 */
+		if (nbuf == bucket_nbuf)
+			_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, nbuf);
+
+		/* Exit loop if no more overflow pages in new bucket */
+		if (!BlockNumberIsValid(nblkno))
+			break;
+
+		/* Else, advance to next page */
+		nbuf = _hash_getbuf(rel, nblkno, HASH_READ, LH_OVERFLOW_PAGE);
+		npage = BufferGetPage(nbuf);
+		npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	}
+
+	/* Need a cleanup lock to perform split operation. */
+	LockBufferForCleanup(bucket_nbuf);
+
+	npage = BufferGetPage(bucket_nbuf);
+	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nbucket = npageopaque->hasho_bucket;
+
+	opage = BufferGetPage(obuf);
+	opageopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+	obucket = opageopaque->hasho_bucket;
+
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, bucket_nbuf, tidhtab,
+						   maxbucket, highmask, lowmask);
 
-	_hash_squeezebucket(rel, obucket, start_oblkno, NULL);
+	hash_destroy(tidhtab);
 }
diff --git a/src/backend/access/hash/hashscan.c b/src/backend/access/hash/hashscan.c
deleted file mode 100644
index fe97ef2..0000000
--- a/src/backend/access/hash/hashscan.c
+++ /dev/null
@@ -1,153 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * hashscan.c
- *	  manage scans on hash tables
- *
- * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
- * Portions Copyright (c) 1994, Regents of the University of California
- *
- *
- * IDENTIFICATION
- *	  src/backend/access/hash/hashscan.c
- *
- *-------------------------------------------------------------------------
- */
-
-#include "postgres.h"
-
-#include "access/hash.h"
-#include "access/relscan.h"
-#include "utils/memutils.h"
-#include "utils/rel.h"
-#include "utils/resowner.h"
-
-
-/*
- * We track all of a backend's active scans on hash indexes using a list
- * of HashScanListData structs, which are allocated in TopMemoryContext.
- * It's okay to use a long-lived context because we rely on the ResourceOwner
- * mechanism to clean up unused entries after transaction or subtransaction
- * abort.  We can't safely keep the entries in the executor's per-query
- * context, because that might be already freed before we get a chance to
- * clean up the list.  (XXX seems like there should be a better way to
- * manage this...)
- */
-typedef struct HashScanListData
-{
-	IndexScanDesc hashsl_scan;
-	ResourceOwner hashsl_owner;
-	struct HashScanListData *hashsl_next;
-} HashScanListData;
-
-typedef HashScanListData *HashScanList;
-
-static HashScanList HashScans = NULL;
-
-
-/*
- * ReleaseResources_hash() --- clean up hash subsystem resources.
- *
- * This is here because it needs to touch this module's static var HashScans.
- */
-void
-ReleaseResources_hash(void)
-{
-	HashScanList l;
-	HashScanList prev;
-	HashScanList next;
-
-	/*
-	 * Release all HashScanList items belonging to the current ResourceOwner.
-	 * Note that we do not release the underlying IndexScanDesc; that's in
-	 * executor memory and will go away on its own (in fact quite possibly has
-	 * gone away already, so we mustn't try to touch it here).
-	 *
-	 * Note: this should be a no-op during normal query shutdown. However, in
-	 * an abort situation ExecutorEnd is not called and so there may be open
-	 * index scans to clean up.
-	 */
-	prev = NULL;
-
-	for (l = HashScans; l != NULL; l = next)
-	{
-		next = l->hashsl_next;
-		if (l->hashsl_owner == CurrentResourceOwner)
-		{
-			if (prev == NULL)
-				HashScans = next;
-			else
-				prev->hashsl_next = next;
-
-			pfree(l);
-			/* prev does not change */
-		}
-		else
-			prev = l;
-	}
-}
-
-/*
- *	_hash_regscan() -- register a new scan.
- */
-void
-_hash_regscan(IndexScanDesc scan)
-{
-	HashScanList new_el;
-
-	new_el = (HashScanList) MemoryContextAlloc(TopMemoryContext,
-											   sizeof(HashScanListData));
-	new_el->hashsl_scan = scan;
-	new_el->hashsl_owner = CurrentResourceOwner;
-	new_el->hashsl_next = HashScans;
-	HashScans = new_el;
-}
-
-/*
- *	_hash_dropscan() -- drop a scan from the scan list
- */
-void
-_hash_dropscan(IndexScanDesc scan)
-{
-	HashScanList chk,
-				last;
-
-	last = NULL;
-	for (chk = HashScans;
-		 chk != NULL && chk->hashsl_scan != scan;
-		 chk = chk->hashsl_next)
-		last = chk;
-
-	if (chk == NULL)
-		elog(ERROR, "hash scan list trashed; cannot find 0x%p", (void *) scan);
-
-	if (last == NULL)
-		HashScans = chk->hashsl_next;
-	else
-		last->hashsl_next = chk->hashsl_next;
-
-	pfree(chk);
-}
-
-/*
- * Is there an active scan in this bucket?
- */
-bool
-_hash_has_active_scan(Relation rel, Bucket bucket)
-{
-	Oid			relid = RelationGetRelid(rel);
-	HashScanList l;
-
-	for (l = HashScans; l != NULL; l = l->hashsl_next)
-	{
-		if (relid == l->hashsl_scan->indexRelation->rd_id)
-		{
-			HashScanOpaque so = (HashScanOpaque) l->hashsl_scan->opaque;
-
-			if (so->hashso_bucket_valid &&
-				so->hashso_bucket == bucket)
-				return true;
-		}
-	}
-
-	return false;
-}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..cd5d3f2 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -72,7 +72,19 @@ _hash_readnext(Relation rel,
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_nextblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on the primary bucket page till the end of the scan to
+	 * ensure that vacuum can't delete the tuples that were moved by a split
+	 * to the new bucket.  Such tuples are required by scans that started on
+	 * the split buckets before the new bucket's split-in-progress flag
+	 * (LH_BUCKET_NEW_PAGE_SPLIT) was cleared.
+	 */
+	if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -94,7 +106,16 @@ _hash_readprev(Relation rel,
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_prevblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on the primary bucket page till the end of the scan.
+	 * See the comments in _hash_readnext for why the pin is retained.
+	 */
+	if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -104,6 +125,13 @@ _hash_readprev(Relation rel,
 							 LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 		*pagep = BufferGetPage(*bufp);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
+
+		/*
+		 * We always maintain the pin on the bucket page for the whole scan,
+		 * so release the additional pin we have just acquired here.
+		 */
+		if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+			_hash_dropbuf(rel, *bufp);
 	}
 }
 
@@ -218,9 +246,11 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 		{
 			if (oldblkno == blkno)
 				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
+			_hash_relbuf(rel, buf);
 		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+
+		/* Fetch the primary bucket page for the bucket */
+		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
 
 		/*
 		 * Reacquire metapage lock and check that no bucket split has taken
@@ -234,17 +264,58 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	/* done with the metapage */
 	_hash_dropbuf(rel, metabuf);
 
-	/* Update scan opaque state to show we have lock on the bucket */
-	so->hashso_bucket = bucket;
-	so->hashso_bucket_valid = true;
-	so->hashso_bucket_blkno = blkno;
-
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
 	page = BufferGetPage(buf);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(opaque->hasho_bucket == bucket);
 
+	so->hashso_bucket_buf = buf;
+
+	/*
+	 * If a bucket split is in progress, then we need to skip tuples that were
+	 * moved from the old bucket.  To ensure that vacuum doesn't clean any
+	 * tuples from the old or new bucket while this scan is in progress,
+	 * maintain a pin on both buckets.  Here we have to be cautious about lock
+	 * ordering: first acquire the lock on the old bucket, release that lock
+	 * (but not the pin), then acquire the lock on the new bucket and
+	 * re-verify whether the bucket split is still in progress.  Acquiring the
+	 * lock on the old bucket first ensures that vacuum waits for this scan to
+	 * finish.
+	 */
+	if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+	{
+		BlockNumber old_blkno;
+		Buffer		old_buf;
+
+		old_blkno = _hash_get_oldblk(rel, opaque);
+
+		/*
+		 * release the lock on new bucket and re-acquire it after acquiring
+		 * the lock on old bucket.
+		 */
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+
+		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+
+		/*
+		 * remember the old bucket buffer so we can use it later for scanning.
+		 */
+		so->hashso_old_bucket_buf = old_buf;
+		_hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK);
+
+		_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		Assert(opaque->hasho_bucket == bucket);
+
+		if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+			so->hashso_skip_moved_tuples = true;
+		else
+		{
+			_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+			so->hashso_old_bucket_buf = InvalidBuffer;
+		}
+	}
+
 	/* If a backwards scan is requested, move to the end of the chain */
 	if (ScanDirectionIsBackward(dir))
 	{
@@ -273,6 +344,13 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
  *		false.  Else, return true and set the hashso_curpos for the
  *		scan to the right thing.
  *
+ *		Here we also scan the old bucket if a split of the current bucket
+ *		was in progress at the start of the scan.  The basic idea is to
+ *		skip the tuples that were moved by the split while scanning the
+ *		current bucket, and then to scan the old bucket to cover all such
+ *		tuples.  This ensures that we don't miss any tuples in scans that
+ *		started during the split.
+ *
  *		'bufP' points to the current buffer, which is pinned and read-locked.
  *		On success exit, we have pin and read-lock on whichever page
  *		contains the right item; on failure, we have released all buffers.
@@ -338,6 +416,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum >= FirstOffsetNumber);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip tuples that were moved by the split operation,
+						 * for scans that started while the split was in
+						 * progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberNext(offnum);	/* move forward */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -353,9 +444,42 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							buf = so->hashso_old_bucket_buf;
+
+							/*
+							 * the old bucket buffer must be valid, as we
+							 * acquire the pin on it before the start of the
+							 * scan and retain it till the end of the scan.
+							 */
+							Assert(BufferIsValid(buf));
+
+							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+							page = BufferGetPage(buf);
+							opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for tuples moved by
+							 * the split in the old bucket, and also that we
+							 * won't try to scan the old bucket again once its
+							 * scan is finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
@@ -379,6 +503,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum <= maxoff);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip tuples that were moved by the split operation,
+						 * for scans that started while the split was in
+						 * progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberPrev(offnum);	/* move back */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -394,9 +531,42 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							buf = so->hashso_old_bucket_buf;
+
+							/*
+							 * the old bucket buffer must be valid, as we
+							 * acquire the pin on it before the start of the
+							 * scan and retain it till the end of the scan.
+							 */
+							Assert(BufferIsValid(buf));
+
+							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+							page = BufferGetPage(buf);
+							opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for tuples moved by
+							 * the split in the old bucket, and also that we
+							 * won't try to scan the old bucket again once its
+							 * scan is finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
@@ -410,9 +580,16 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 
 		if (itup == NULL)
 		{
-			/* we ran off the end of the bucket without finding a match */
+			/*
+			 * We ran off the end of the bucket without finding a match.
+			 * Release the pin on bucket buffers.  Normally, such pins are
+			 * released at the end of the scan; however, scrolling cursors can
+			 * reacquire the bucket lock and pin in the same scan multiple
+			 * times.
+			 */
 			*bufP = so->hashso_curbuf = InvalidBuffer;
 			ItemPointerSetInvalid(current);
+			_hash_dropscanbuf(rel, so);
 			return false;
 		}
 
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 822862d..b5164d7 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -147,6 +147,23 @@ _hash_log2(uint32 num)
 }
 
 /*
+ * _hash_msb() -- returns the position of the most significant set bit.
+ */
+static uint32
+_hash_msb(uint32 num)
+{
+	uint32		i = 0;
+
+	while (num)
+	{
+		num = num >> 1;
+		++i;
+	}
+
+	return i - 1;
+}
+
+/*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
  * If flags is not zero, it is a bitwise OR of the acceptable values of
@@ -352,3 +369,123 @@ _hash_binsearch_last(Page page, uint32 hash_value)
 
 	return lower;
 }
+
+/*
+ *	_hash_get_oldblk() -- get the block number of the bucket from which the
+ *			current bucket is being split.
+ */
+BlockNumber
+_hash_get_oldblk(Relation rel, HashPageOpaque opaque)
+{
+	Bucket		curr_bucket;
+	Bucket		old_bucket;
+	uint32		mask;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	/*
+	 * To get the old bucket from the current bucket, we need a mask to modulo
+	 * into the lower half of the table.  This mask is stored in the metapage
+	 * as hashm_lowmask, but here we can't rely on it, because we need the
+	 * value of lowmask that was in effect at the time the bucket split
+	 * started.  Masking off the most significant bit of the new bucket gives
+	 * us the old bucket.
+	 */
+	curr_bucket = opaque->hasho_bucket;
+	mask = (((uint32) 1) << _hash_msb(curr_bucket)) - 1;
+	old_bucket = curr_bucket & mask;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	blkno = BUCKET_TO_BLKNO(metap, old_bucket);
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_newblk() -- get the block number of the new bucket that will be
+ *			generated after a split of the current bucket.
+ *
+ * This is used to find the new bucket from the old bucket based on the
+ * current table half.  It is mainly required to finish incomplete splits,
+ * where we are sure that no more than one split can be in progress from the
+ * old bucket.
+ */
+BlockNumber
+_hash_get_newblk(Relation rel, HashPageOpaque opaque)
+{
+	Bucket		curr_bucket;
+	Bucket		new_bucket;
+	uint32		lowmask;
+	uint32		mask;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	curr_bucket = opaque->hasho_bucket;
+
+	/*
+	 * The new bucket can be obtained by OR'ing the old bucket with the most
+	 * significant bit of the current table half.  There could be multiple
+	 * buckets that have split from the current bucket.  We need the first
+	 * such bucket that exists based on the current table half.
+	 */
+	lowmask = metap->hashm_lowmask;
+
+	for (;;)
+	{
+		mask = lowmask + 1;
+		new_bucket = curr_bucket | mask;
+		if (new_bucket > metap->hashm_maxbucket)
+		{
+			lowmask = lowmask >> 1;
+			continue;
+		}
+		blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+		break;
+	}
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_newbucket() -- get the new bucket that will be generated after
+ *			split from current bucket.
+ *
+ * This is used to find the new bucket from the old bucket.  The new bucket
+ * can be obtained by OR'ing the old bucket with the most significant bit of
+ * the table half for the lowmask passed to this function.  There could be
+ * multiple buckets that have split from the current bucket.  We need the
+ * first such bucket that exists.  The caller must ensure that no more than
+ * one split has happened from the old bucket.
+ */
+Bucket
+_hash_get_newbucket(Relation rel, Bucket curr_bucket,
+					uint32 lowmask, uint32 maxbucket)
+{
+	Bucket		new_bucket;
+	uint32		mask;
+
+	for (;;)
+	{
+		mask = lowmask + 1;
+		new_bucket = curr_bucket | mask;
+		if (new_bucket > maxbucket)
+		{
+			lowmask = lowmask >> 1;
+			continue;
+		}
+		break;
+	}
+
+	return new_bucket;
+}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 07075ce..cdc460b 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -668,9 +668,6 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintFileLeakWarning(res);
 			FileClose(res);
 		}
-
-		/* Clean up index scans too */
-		ReleaseResources_hash();
 	}
 
 	/* Let add-on modules get a chance too */
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 725e2f2..c7ad10b 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -24,6 +24,7 @@
 #include "lib/stringinfo.h"
 #include "storage/bufmgr.h"
 #include "storage/lockdefs.h"
+#include "utils/hsearch.h"
 #include "utils/relcache.h"
 
 /*
@@ -32,6 +33,8 @@
  */
 typedef uint32 Bucket;
 
+#define InvalidBucket	((Bucket) 0xFFFFFFFF)
+
 #define BUCKET_TO_BLKNO(metap,B) \
 		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
 
@@ -51,6 +54,9 @@ typedef uint32 Bucket;
 #define LH_BUCKET_PAGE			(1 << 1)
 #define LH_BITMAP_PAGE			(1 << 2)
 #define LH_META_PAGE			(1 << 3)
+#define LH_BUCKET_NEW_PAGE_SPLIT	(1 << 4)
+#define LH_BUCKET_OLD_PAGE_SPLIT	(1 << 5)
+#define LH_BUCKET_PAGE_HAS_GARBAGE	(1 << 6)
 
 typedef struct HashPageOpaqueData
 {
@@ -63,6 +69,12 @@ typedef struct HashPageOpaqueData
 
 typedef HashPageOpaqueData *HashPageOpaque;
 
+#define H_HAS_GARBAGE(opaque)			((opaque)->hasho_flag & LH_BUCKET_PAGE_HAS_GARBAGE)
+#define H_OLD_INCOMPLETE_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
+#define H_NEW_INCOMPLETE_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+#define H_INCOMPLETE_SPLIT(opaque)		(((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT) || \
+										 ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT))
+
 /*
  * The page ID is for the convenience of pg_filedump and similar utilities,
  * which otherwise would have a hard time telling pages of different index
@@ -80,19 +92,6 @@ typedef struct HashScanOpaqueData
 	uint32		hashso_sk_hash;
 
 	/*
-	 * By definition, a hash scan should be examining only one bucket. We
-	 * record the bucket number here as soon as it is known.
-	 */
-	Bucket		hashso_bucket;
-	bool		hashso_bucket_valid;
-
-	/*
-	 * If we have a share lock on the bucket, we record it here.  When
-	 * hashso_bucket_blkno is zero, we have no such lock.
-	 */
-	BlockNumber hashso_bucket_blkno;
-
-	/*
 	 * We also want to remember which buffer we're currently examining in the
 	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
 	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
@@ -100,11 +99,23 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_curbuf;
 
+	/* remember the buffer associated with primary bucket */
+	Buffer		hashso_bucket_buf;
+
+	/*
+	 * remember the buffer associated with the old primary bucket, which is
+	 * required during a scan of a bucket for which a split is in progress.
+	 */
+	Buffer		hashso_old_bucket_buf;
+
 	/* Current position of the scan, as an index TID */
 	ItemPointerData hashso_curpos;
 
 	/* Current position of the scan, as a heap TID */
 	ItemPointerData hashso_heappos;
+
+	/* Whether scan needs to skip tuples that are moved by split */
+	bool		hashso_skip_moved_tuples;
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -175,6 +186,8 @@ typedef HashMetaPageData *HashMetaPage;
 				  sizeof(ItemIdData) - \
 				  MAXALIGN(sizeof(HashPageOpaqueData)))
 
+#define INDEX_MOVED_BY_SPLIT_MASK	0x2000
+
 #define HASH_MIN_FILLFACTOR			10
 #define HASH_DEFAULT_FILLFACTOR		75
 
@@ -223,9 +236,6 @@ typedef HashMetaPageData *HashMetaPage;
 #define HASH_WRITE		BUFFER_LOCK_EXCLUSIVE
 #define HASH_NOLOCK		(-1)
 
-#define HASH_SHARE		ShareLock
-#define HASH_EXCLUSIVE	ExclusiveLock
-
 /*
  *	Strategy number. There's only one valid strategy for hashing: equality.
  */
@@ -297,21 +307,21 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
 
 /* hashovfl.c */
-extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf);
+extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
 extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
-				   BufferAccessStrategy bstrategy);
+				   BlockNumber bucket_blkno, BufferAccessStrategy bstrategy);
 extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
 				 BlockNumber blkno, ForkNumber forkNum);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy);
 
 /* hashpage.c */
-extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
-extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
-extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
 extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
 			 int access, int flags);
+extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
+								   BlockNumber blkno, int flags);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
@@ -320,6 +330,7 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 						   BufferAccessStrategy bstrategy);
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
+extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
 extern void _hash_wrtbuf(Relation rel, Buffer buf);
 extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
 				   int to_access);
@@ -327,12 +338,9 @@ extern uint32 _hash_metapinit(Relation rel, double num_tuples,
 				ForkNumber forkNum);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
-
-/* hashscan.c */
-extern void _hash_regscan(IndexScanDesc scan);
-extern void _hash_dropscan(IndexScanDesc scan);
-extern bool _hash_has_active_scan(Relation rel, Bucket bucket);
-extern void ReleaseResources_hash(void);
+extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
+				   Buffer nbuf, uint32 maxbucket, uint32 highmask,
+				   uint32 lowmask);
 
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
@@ -362,5 +370,17 @@ extern bool _hash_convert_tuple(Relation index,
 					Datum *index_values, bool *index_isnull);
 extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
 extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
+extern BlockNumber _hash_get_oldblk(Relation rel, HashPageOpaque opaque);
+extern BlockNumber _hash_get_newblk(Relation rel, HashPageOpaque opaque);
+extern Bucket _hash_get_newbucket(Relation rel, Bucket curr_bucket,
+					uint32 lowmask, uint32 maxbucket);
+
+/* hash.c */
+extern void hashbucketcleanup(Relation rel, Buffer bucket_buf,
+				  BlockNumber bucket_blkno, BufferAccessStrategy bstrategy,
+				  uint32 maxbucket, uint32 highmask, uint32 lowmask,
+				  double *tuples_removed, double *num_index_tuples,
+				  bool bucket_has_garbage, bool delay,
+				  IndexBulkDeleteCallback callback, void *callback_state);
 
 #endif   /* HASH_H */
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 8350fa0..788ba9f 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -63,7 +63,7 @@ typedef IndexAttributeBitMapData *IndexAttributeBitMap;
  * t_info manipulation macros
  */
 #define INDEX_SIZE_MASK 0x1FFF
-/* bit 0x2000 is not used at present */
+/* bit 0x2000 is reserved for index-AM specific usage */
 #define INDEX_VAR_MASK	0x4000
 #define INDEX_NULL_MASK 0x8000
 
#124Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#123)
Re: Hash Indexes

On Mon, Oct 24, 2016 at 8:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Thanks for the valuable feedback.

Forgot to mention that, in addition to fixing the review comments, I
had made an additional change to skip dead tuples while copying
tuples from the old bucket to the new bucket during a split. This was
previously not possible because split and scan were blocking
operations (a split used to take an Exclusive lock on the bucket, and
a scan used to hold a Share lock on the bucket until the operation
ended), but now it is possible, and during a scan some of the tuples
can be marked as dead. Similarly, the squeeze operation now skips
dead tuples while moving tuples across buckets.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#125Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#123)
Re: Hash Indexes

On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Amit, can you please split the buffer manager changes in this patch
into a separate patch?

Sure, attached patch extend_bufmgr_api_for_hash_index_v1.patch does that.

The additional argument to ConditionalLockBuffer() doesn't seem to be
used anywhere in the main patch. Do we actually need it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#126Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#125)
1 attachment(s)
Re: Hash Indexes

On Fri, Oct 28, 2016 at 2:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Amit, can you please split the buffer manager changes in this patch
into a separate patch?

Sure, attached patch extend_bufmgr_api_for_hash_index_v1.patch does that.

The additional argument to ConditionalLockBuffer() doesn't seem to be
used anywhere in the main patch. Do we actually need it?

No, with the latest patch of the concurrent hash index, we don't need
it. I had forgotten to remove it. Please find the updated patch
attached. The use of the second parameter for ConditionalLockBuffer()
has been removed, as we don't want to allow I/O across content locks,
so the patch now falls back to locking the metapage twice.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

extend_bufmgr_api_for_hash_index_v2.patch (application/octet-stream)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index df4c9d7..fa84426 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3745,6 +3745,53 @@ ConditionalLockBufferForCleanup(Buffer buffer)
 	return false;
 }
 
+/*
+ * IsBufferCleanupOK - as above, but don't attempt to take lock
+ *
+ * We won't loop, but just check once to see if the pin count is OK.  If
+ * not, return FALSE.
+ */
+bool
+IsBufferCleanupOK(Buffer buffer)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(BufferIsValid(buffer));
+
+	if (BufferIsLocal(buffer))
+	{
+		/* There should be exactly one pin */
+		if (LocalRefCount[-buffer - 1] != 1)
+			return false;
+		/* Nobody else to wait for */
+		return true;
+	}
+
+	/* There should be exactly one local pin */
+	if (GetPrivateRefCount(buffer) != 1)
+		return false;
+
+	bufHdr = GetBufferDescriptor(buffer - 1);
+
+	/* caller must hold exclusive lock on buffer */
+	Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr),
+								LW_EXCLUSIVE));
+
+	buf_state = LockBufHdr(bufHdr);
+
+	Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+	if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+	{
+		/* pincount is OK. */
+		UnlockBufHdr(bufHdr, buf_state);
+		return true;
+	}
+
+	UnlockBufHdr(bufHdr, buf_state);
+	return false;
+}
+
 
 /*
  *	Functions for buffer I/O handling
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7b6ba96..821bee5 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -227,6 +227,7 @@ extern void LockBuffer(Buffer buffer, int mode);
 extern bool ConditionalLockBuffer(Buffer buffer);
 extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
+extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern void AbortBufferIO(void);
#127Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#123)
1 attachment(s)
Re: Hash Indexes

On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ new patches ]

I looked over parts of this today, mostly the hashinsert.c changes.

+    /*
+     * Copy bucket mapping info now;  The comment in _hash_expandtable where
+     * we copy this information and calls _hash_splitbucket explains why this
+     * is OK.
+     */

So, I went and tried to find the comments to which this comment is
referring and didn't have much luck. At the point this code is
running, we have a pin but no lock on the metapage, so this is only
safe if changing any of these fields requires a cleanup lock on the
metapage. If that's true, it seems like you could just make the
comment say that; if it's false, you've got problems.

This code seems rather pointless anyway, the way it's written. All of
these local variables are used in exactly one place, which is here:

+            _hash_finish_split(rel, metabuf, buf, nbuf, maxbucket,
+                               highmask, lowmask);

But you hold the same locks at the point where you copy those values
into local variables and the point where that code runs. So if the
code is safe as written, you could instead just pass
metap->hashm_maxbucket, metap->hashm_highmask, and
metap->hashm_lowmask to that function instead of having these local
variables. Or, for that matter, you could just let that function read
the data itself: it's got metabuf, after all.

+     * In future, if we want to finish the splits during insertion in new
+     * bucket, we must ensure the locking order such that old bucket is locked
+     * before new bucket.

Not if the locks are conditional anyway.

+ nblkno = _hash_get_newblk(rel, pageopaque);

I think this is not a great name for this function. It's not clear
what "new blocks" refers to, exactly. I suggest
FIND_SPLIT_BUCKET(metap, bucket) or OLD_BUCKET_TO_NEW_BUCKET(metap,
bucket) returning a new bucket number. I think that macro can be
defined as something like this: bucket + (1 <<
(fls(metap->hashm_maxbucket) - 1)). Then do nblkno =
BUCKET_TO_BLKNO(metap, newbucket) to get the block number. That'd all
be considerably simpler than what you have for _hash_get_newblk().

Here's some test code I wrote, which seems to work:

#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include <assert.h>

int
newbucket(int bucket, int nbuckets)
{
    assert(bucket < nbuckets);
    return bucket + (1 << (fls(nbuckets) - 1));
}

int
main(int argc, char **argv)
{
    int nbuckets = 1;
    int restartat = 1;
    int splitbucket = 0;

    while (splitbucket < 32)
    {
        printf("old bucket %d splits to new bucket %d\n", splitbucket,
               newbucket(splitbucket, nbuckets));
        if (++splitbucket >= restartat)
        {
            splitbucket = 0;
            restartat *= 2;
        }
        ++nbuckets;
    }

    exit(0);
}

Moving on ...

             /*
              * ovfl page exists; go get it.  if it doesn't have room, we'll
-             * find out next pass through the loop test above.
+             * find out next pass through the loop test above.  Retain the
+             * pin, if it is a primary bucket page.
              */
-            _hash_relbuf(rel, buf);
+            if (pageopaque->hasho_flag & LH_BUCKET_PAGE)
+                _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+            else
+                _hash_relbuf(rel, buf);

It seems like it would be cheaper, safer, and clearer to test whether
buf != bucket_buf here, rather than examining the page opaque data.
That's what you do down at the bottom of the function when you ensure
that the pin on the primary bucket page gets released, and it seems
like it should work up here, too.

+            bool        retain_pin = false;
+
+            /* page flags must be accessed before releasing lock on a page. */
+            retain_pin = pageopaque->hasho_flag & LH_BUCKET_PAGE;

Similarly.
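
A minimal sketch of that suggested simplification, reusing the
variable names from _hash_doinsert() (buf, bucket_buf) rather than the
page opaque flags; this is an illustration, not the actual patch code:

            if (buf != bucket_buf)
                _hash_relbuf(rel, buf);   /* overflow page: drop lock and pin */
            else
                _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);  /* keep pin */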

I have also attached a patch with some suggested comment changes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

hashinsert-comments.patch (application/x-download)
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index bd39333..4291fde 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -92,9 +92,9 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
 
 		/*
-		 * If the previous iteration of this loop locked what is still the
-		 * correct target bucket, we are done.  Otherwise, drop any old lock
-		 * and lock what now appears to be the correct bucket.
+		 * If the previous iteration of this loop locked the primary page of
+		 * what is still the correct target bucket, we are done.  Otherwise,
+		 * drop any old lock before acquiring the new one.
 		 */
 		if (retry)
 		{
@@ -103,7 +103,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			_hash_relbuf(rel, buf);
 		}
 
-		/* Fetch the primary bucket page for the bucket */
+		/* Fetch and lock the primary bucket page for the target bucket */
 		buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
 
 		/*
@@ -132,15 +132,13 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	Assert(pageopaque->hasho_bucket == bucket);
 
 	/*
-	 * If there is any pending split, try to finish it before proceeding for
-	 * the insertion.  We try to finish the split for the insertion in old
-	 * bucket, as that will allow us to remove the tuples from old bucket and
-	 * reuse the space.  There is no such apparent benefit from finishing the
-	 * split during insertion in new bucket.
-	 *
-	 * In future, if we want to finish the splits during insertion in new
-	 * bucket, we must ensure the locking order such that old bucket is locked
-	 * before new bucket.
+	 * If this bucket is in the process of being split, try to finish the
+	 * split before inserting, because that might create room for the
+	 * insertion to proceed without allocating an additional overflow page.
+	 * It's only interesting to finish the split if we're trying to insert
+	 * into the bucket from which we're removing tuples (the "old" bucket),
+	 * not if we're trying to insert into the bucket into which tuples are
+	 * being moved (the "new" bucket).
 	 */
 	if (H_OLD_INCOMPLETE_SPLIT(pageopaque) && IsBufferCleanupOK(buf))
 	{
@@ -176,8 +174,10 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		{
 			/*
 			 * ovfl page exists; go get it.  if it doesn't have room, we'll
-			 * find out next pass through the loop test above.  Retain the
-			 * pin, if it is a primary bucket page.
+			 * find out next pass through the loop test above.  we always
+			 * release both the lock and pin if this is an overflow page, but
+			 * only the lock if this is the primary bucket page, since the pin
+			 * on the primary bucket must be retained throughout the scan.
 			 */
 			if (pageopaque->hasho_flag & LH_BUCKET_PAGE)
 				_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
@@ -217,8 +217,9 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
 
 	/*
-	 * write and release the modified page and ensure to release the pin on
-	 * primary page.
+	 * write and release the modified page.  if the page we modified was an
+	 * overflow page, we also need to separately drop the pin we retained on
+	 * the primary bucket page.
 	 */
 	_hash_wrtbuf(rel, buf);
 	if (buf != bucket_buf)
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 36cacc8..106ffca 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -96,10 +96,10 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access, int flags)
 }
 
 /*
- * _hash_getbuf_with_condlock_cleanup() -- as above, but get the buffer for write.
+ * _hash_getbuf_with_condlock_cleanup() -- Try to get a buffer for cleanup.
  *
- *		We try to take the conditional cleanup lock and if we get it then
- *		return the buffer, else return InvalidBuffer.
+ *		We read the page and try to acquire a cleanup lock.  If we get it,
+ *		we return the buffer; otherwise, we return InvalidBuffer.
  */
 Buffer
 _hash_getbuf_with_condlock_cleanup(Relation rel, BlockNumber blkno, int flags)
#128Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#127)
Re: Hash Indexes

On Wed, Nov 2, 2016 at 6:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ new patches ]

I looked over parts of this today, mostly the hashinsert.c changes.

+    /*
+     * Copy bucket mapping info now;  The comment in _hash_expandtable where
+     * we copy this information and calls _hash_splitbucket explains why this
+     * is OK.
+     */

So, I went and tried to find the comments to which this comment is
referring and didn't have much luck.

I guess you have just tried to find it in the patch. However, the
comment I am referring to above is an existing comment in
_hash_expandtable(). Refer to the comment below:
/*
* Copy bucket mapping info now; this saves re-accessing the meta page
* inside _hash_splitbucket's inner loop. ...

At the point this code is
running, we have a pin but no lock on the metapage, so this is only
safe if changing any of these fields requires a cleanup lock on the
metapage. If that's true,

No, that's not true; we need just an Exclusive content lock to update
those fields, and they should be copied while we hold a Share content
lock on the metapage. In version 8 of the patch this was correct, but
in the last version it seems I moved it during code rearrangement. I
will change it so that these values are copied under the metapage's
Share content lock. I think moving it just before the preceding for
loop should be okay; let me know if you think otherwise.

+ nblkno = _hash_get_newblk(rel, pageopaque);

I think this is not a great name for this function. It's not clear
what "new blocks" refers to, exactly. I suggest
FIND_SPLIT_BUCKET(metap, bucket) or OLD_BUCKET_TO_NEW_BUCKET(metap,
bucket) returning a new bucket number. I think that macro can be
defined as something like this: bucket + (1 <<
(fls(metap->hashm_maxbucket) - 1)).

I think such a macro would not work when completing incomplete
splits. The reason is that by the time we try to complete the split
of the current old bucket, the table half (lowmask, highmask,
maxbucket) would have changed, and it could give you a bucket in the
new table half.

Then do nblkno =
BUCKET_TO_BLKNO(metap, newbucket) to get the block number. That'd all
be considerably simpler than what you have for hash_get_newblk().

I think that to use BUCKET_TO_BLKNO we need either a share or an
exclusive lock on the metapage, and since we need a lock on the
metapage anyway to find the new block from the old block, I thought it
better to do it inside _hash_get_newblk().

Moving on ...

/*
* ovfl page exists; go get it.  if it doesn't have room, we'll
-             * find out next pass through the loop test above.
+             * find out next pass through the loop test above.  Retain the
+             * pin, if it is a primary bucket page.
*/
-            _hash_relbuf(rel, buf);
+            if (pageopaque->hasho_flag & LH_BUCKET_PAGE)
+                _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+            else
+                _hash_relbuf(rel, buf);

It seems like it would be cheaper, safer, and clearer to test whether
buf != bucket_buf here, rather than examining the page opaque data.
That's what you do down at the bottom of the function when you ensure
that the pin on the primary bucket page gets released, and it seems
like it should work up here, too.

+            bool        retain_pin = false;
+
+            /* page flags must be accessed before releasing lock on a page. */
+            retain_pin = pageopaque->hasho_flag & LH_BUCKET_PAGE;

Similarly.

Agreed, will change the usage as per your suggestion.

I have also attached a patch with some suggested comment changes.

I will include it in next version of patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#129Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#128)
Re: Hash Indexes

On Thu, Nov 3, 2016 at 6:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

+ nblkno = _hash_get_newblk(rel, pageopaque);

I think this is not a great name for this function. It's not clear
what "new blocks" refers to, exactly. I suggest
FIND_SPLIT_BUCKET(metap, bucket) or OLD_BUCKET_TO_NEW_BUCKET(metap,
bucket) returning a new bucket number. I think that macro can be
defined as something like this: bucket + (1 <<
(fls(metap->hashm_maxbucket) - 1)).

I think such a macro would not work when completing incomplete
splits. The reason is that by the time we try to complete the split
of the current old bucket, the table half (lowmask, highmask,
maxbucket) would have changed, and it could give you a bucket in the
new table half.

Can you provide an example of the scenario you are talking about here?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#130Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#126)
Re: Hash Indexes

On Fri, Oct 28, 2016 at 12:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Oct 28, 2016 at 2:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Amit, can you please split the buffer manager changes in this patch
into a separate patch?

Sure, attached patch extend_bufmgr_api_for_hash_index_v1.patch does that.

The additional argument to ConditionalLockBuffer() doesn't seem to be
used anywhere in the main patch. Do we actually need it?

No, with the latest patch of the concurrent hash index, we don't need
it. I had forgotten to remove it. Please find the updated patch
attached. The use of the second parameter for ConditionalLockBuffer()
has been removed, as we don't want to allow I/O across content locks,
so the patch now falls back to locking the metapage twice.

Committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#131Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#127)
1 attachment(s)
Re: Hash Indexes

On Tue, Nov 1, 2016 at 9:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ new patches ]

I looked over parts of this today, mostly the hashinsert.c changes.

Some more review...

@@ -656,6 +678,10 @@ _hash_squeezebucket(Relation rel,
IndexTuple itup;
Size itemsz;

+            /* skip dead tuples */
+            if (ItemIdIsDead(PageGetItemId(rpage, roffnum)))
+                continue;

Is this an optimization independent of the rest of the patch, or is
there something in this patch that necessitates it? i.e. Could this
small change be committed independently? If not, then I think it
needs a better comment explaining why it is now mandatory.

- *  Caller must hold exclusive lock on the target bucket.  This allows
+ *  Caller must hold cleanup lock on the target bucket.  This allows
  *  us to safely lock multiple pages in the bucket.

The notion of a lock on a bucket no longer really exists; with this
patch, we'll now properly speak of a lock on a primary bucket page.
Also, I think the bit about safely locking multiple pages is bizarre
-- that's not the issue at all: the problem is that reorganizing a
bucket might confuse concurrent scans into returning wrong answers.

I've included a broader updating of that comment, and some other
comment changes, in the attached incremental patch, which also
refactors your changes to _hash_freeovflpage() a bit to avoid some
code duplication. Please consider this for inclusion in your next
version.

In hashutil.c, I think that _hash_msb() is just a reimplementation of
fls(), which you can rely on being present because we have our own
implementation in src/port. It's quite similar to yours but slightly
shorter. :-) Also, some systems have a builtin fls() function which
actually optimizes down to a single machine instruction, and which is
therefore much faster than either version.
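
A small standalone illustration of that equivalence, under the
assumption that _hash_msb(n) is meant to return floor(log2(n)); the
helper below mimics the fls() semantics described above:

#include <stdio.h>

/* Portable stand-in for fls(): 1-based position of the most significant
 * set bit, or 0 for an input of 0. */
static int
my_fls(unsigned int n)
{
    int pos = 0;

    while (n)
    {
        n >>= 1;
        pos++;
    }
    return pos;
}

int
main(void)
{
    unsigned int n;

    /* If _hash_msb(n) means floor(log2(n)), it is simply fls(n) - 1. */
    for (n = 1; n <= 1024; n <<= 1)
        printf("n = %4u  fls(n) - 1 = %d\n", n, my_fls(n) - 1);
    return 0;
}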

I don't like the fact that _hash_get_newblk() and _hash_get_oldblk()
are working out the bucket number by using the HashOpaque structure
within the bucket page they're examining. First, it seems weird to
pass the whole structure when you only need the bucket number out of
it. More importantly, the caller really ought to know what bucket
they care about without having to discover it.

For example, in _hash_doinsert(), we figure out the bucket into which
we need to insert, and we store that in a variable called "bucket".
Then from there we work out the primary bucket page's block number,
which we store in "blkno". We read the page into "buf" and put a
pointer to that buffer's contents into "page" from which we discover
the HashOpaque, a pointer to which we store into "pageopaque". Then
we pass that to _hash_get_newblk() which will go look into that
structure to find the bucket number ... but couldn't we have just
passed "bucket" instead? Similarly, _hash_expandtable() has the value
available in the variable "old_bucket".

The only caller of _hash_get_oldblk() is _hash_first(), which has the
bucket number available in a variable called "bucket".

So it seems to me that these functions could be simplified to take the
bucket number as an argument directly instead of the HashOpaque.

Generally, this pattern recurs throughout the patch. Every time you
use the data in the page to figure something out which the caller
already knew, you're introducing a risk of bugs: what if the answers
don't match? I think you should try to root out as much of that from
this code as you can.

As you may be able to tell, I'm working my way into this patch
gradually, starting with peripheral parts and working toward the core
of it. Generally, I think it's in pretty good shape, but I still have
quite a bit left to study.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

hashovfl-tweaks.patch (application/x-download)
commit 5c8b7bb4074ccfccb8c6e6e968e68cb03938028a
Author: Robert Haas <rhaas@postgresql.org>
Date:   Fri Nov 4 10:31:18 2016 -0400

    more me

diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 58e15f3..c00d6f5 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -372,9 +372,10 @@ _hash_firstfreebit(uint32 map)
  *	Returns the block number of the page that followed the given page
  *	in the bucket, or InvalidBlockNumber if no following page.
  *
- *	NB: caller must not hold lock on metapage, nor on either page that's
- *	adjacent in the bucket chain except from primary bucket.  The caller had
- *	better hold cleanup lock on the primary bucket page.
+ *	NB: caller must hold a cleanup lock on the primary bucket page, so that
+ *	concurrent scans can't get confused.  caller must not hold a lock on either
+ *	page adjacent to this one in the bucket chain (except when it's the primary
+ *	bucket page). caller must not hold a lock on the metapage, either.
  */
 BlockNumber
 _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
@@ -416,41 +417,42 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
-	 * deleted.  No concurrency issues since we hold the cleanup lock on
-	 * primary bucket.  We don't need to aqcuire buffer lock to fix the
-	 * primary bucket, as we already have that lock.
+	 * deleted.  No concurrency issues since we hold a cleanup lock on primary
+	 * bucket.  We don't need to acquire a buffer lock to fix the primary
+	 * bucket, as we already have that lock.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
+		Buffer		prevbuf;
+		Page		prevpage;
+		HashPageOpaque prevopaque;
+
 		if (prevblkno == bucket_blkno)
-		{
-			Buffer		prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
-													 prevblkno,
-													 RBM_NORMAL,
-													 bstrategy);
+			prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
+										 prevblkno,
+										 RBM_NORMAL,
+										 bstrategy);
+		else
+			prevbuf = _hash_getbuf_with_strategy(rel,
+												 prevblkno,
+												 HASH_WRITE,
+										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
+												 bstrategy);
+
+		prevpage = BufferGetPage(prevbuf);
+		prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		Assert(prevopaque->hasho_bucket == bucket);
+		prevopaque->hasho_nextblkno = nextblkno;
 
-			Page		prevpage = BufferGetPage(prevbuf);
-			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
 
-			Assert(prevopaque->hasho_bucket == bucket);
-			prevopaque->hasho_nextblkno = nextblkno;
+		if (prevblkno == bucket_blkno)
+		{
 			MarkBufferDirty(prevbuf);
 			ReleaseBuffer(prevbuf);
 		}
 		else
-		{
-			Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
-															 prevblkno,
-															 HASH_WRITE,
-										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-															 bstrategy);
-			Page		prevpage = BufferGetPage(prevbuf);
-			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
-
-			Assert(prevopaque->hasho_bucket == bucket);
-			prevopaque->hasho_nextblkno = nextblkno;
 			_hash_wrtbuf(rel, prevbuf);
-		}
 	}
 	if (BlockNumberIsValid(nextblkno))
 	{
@@ -592,8 +594,10 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
  *	required that to be true on entry as well, but it's a lot easier for
  *	callers to leave empty overflow pages and let this guy clean it up.
  *
- *	Caller must hold cleanup lock on the target bucket.  This allows
- *	us to safely lock multiple pages in the bucket.
+ *	Caller must hold cleanup lock on the primary page of the target bucket
+ *	to exclude any concurrent scans, which could easily be confused into
+ *	returning the same tuple more than once or some tuples not at all by
+ *	the rearrangement we are performing here.
  *
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
@@ -626,7 +630,7 @@ _hash_squeezebucket(Relation rel,
 
 	/*
 	 * if there aren't any overflow pages, there's nothing to squeeze. caller
-	 * is responsible to release the lock on primary bucket page.
+	 * is responsible for releasing the lock on primary bucket page.
 	 */
 	if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
 		return;
#132Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#129)
Re: Hash Indexes

On Fri, Nov 4, 2016 at 6:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Nov 3, 2016 at 6:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

+ nblkno = _hash_get_newblk(rel, pageopaque);

I think this is not a great name for this function. It's not clear
what "new blocks" refers to, exactly. I suggest
FIND_SPLIT_BUCKET(metap, bucket) or OLD_BUCKET_TO_NEW_BUCKET(metap,
bucket) returning a new bucket number. I think that macro can be
defined as something like this: bucket + (1 <<
(fls(metap->hashm_maxbucket) - 1)).

I think such a macro would not work when completing incomplete
splits. The reason is that by the time we try to complete the split
of the current old bucket, the table half (lowmask, highmask,
maxbucket) would have changed, and it could give you a bucket in the
new table half.

Can you provide an example of the scenario you are talking about here?

Consider a case as below:

First half of table
0 1 2 3
Second half of table
4 5 6 7

Now suppose that while the split of bucket 2 (whose corresponding new
bucket will be 6) is in progress, the system crashes, and after restart
it splits bucket 3 (whose corresponding new bucket will be 7). After
that, it will try to form a new table half with buckets ranging from 8
to 15. Assume it creates bucket 8 by splitting bucket 0, and that it
next tries to split bucket 2: it will find an incomplete split and will
attempt to finish it. At that point, if it tries to calculate the new
bucket from the old bucket (2), it will calculate it as 10
(metap->hashm_maxbucket will be 8 for the third table half, so the
above macro yields 10), whereas we need 6. That is why you will see a
check (if (new_bucket > metap->hashm_maxbucket)) in _hash_get_newblk(),
which ensures that it returns the bucket number from the previous half.
The basic idea is that if there is an incomplete split from the current
bucket, we can't start a new split from that bucket, so the check in
_hash_get_newblk() will give us the correct value.

I can try to explain again if the above is not clear enough.
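
A small standalone sketch of the arithmetic in this example (not the
actual _hash_get_newblk() code): the naive mapping computed against the
current maxbucket overshoots, and the guard described above falls back
to the previous table half:

#include <stdio.h>

/* 1-based most-significant-bit position, matching fls() semantics. */
static int
msb_pos(unsigned int n)
{
    int pos = 0;

    while (n)
    {
        n >>= 1;
        pos++;
    }
    return pos;
}

int
main(void)
{
    unsigned int old_bucket = 2;    /* its split was interrupted by the crash */
    unsigned int maxbucket = 8;     /* the table has since grown a third half */
    unsigned int naive = old_bucket + (1u << (msb_pos(maxbucket) - 1));
    unsigned int new_bucket = naive;

    /* The guard described above: a result beyond maxbucket means the
     * interrupted split belongs to the previous table half, so step back
     * to the next smaller power of two. */
    if (new_bucket > maxbucket)
        new_bucket = old_bucket + (1u << (msb_pos(maxbucket) - 2));

    printf("naive mapping gives %u, corrected mapping gives %u\n",
           naive, new_bucket);      /* prints 10 and 6 */
    return 0;
}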

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#133Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#131)
Re: Hash Indexes

On Fri, Nov 4, 2016 at 9:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Nov 1, 2016 at 9:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ new patches ]

I looked over parts of this today, mostly the hashinsert.c changes.

Some more review...

@@ -656,6 +678,10 @@ _hash_squeezebucket(Relation rel,
IndexTuple itup;
Size itemsz;

+            /* skip dead tuples */
+            if (ItemIdIsDead(PageGetItemId(rpage, roffnum)))
+                continue;

Is this an optimization independent of the rest of the patch, or is
there something in this patch that necessitates it?

This specific case is independent of the rest of the patch, but the
same optimization is used in _hash_splitbucket_guts(), where it is
mandatory, because otherwise the split would make a copy of the tuple
without copying its dead flag.

i.e. Could this
small change be committed independently?

Both places, _hash_squeezebucket() and _hash_splitbucket(), can use
this optimization irrespective of the rest of the patch. I will
prepare a separate patch for these and send it along with the main
patch after some testing.

If not, then I think it
needs a better comment explaining why it is now mandatory.

- *  Caller must hold exclusive lock on the target bucket.  This allows
+ *  Caller must hold cleanup lock on the target bucket.  This allows
*  us to safely lock multiple pages in the bucket.

The notion of a lock on a bucket no longer really exists; with this
patch, we'll now properly speak of a lock on a primary bucket page.
Also, I think the bit about safely locking multiple pages is bizarre
-- that's not the issue at all: the problem is that reorganizing a
bucket might confuse concurrent scans into returning wrong answers.

I've included a broader updating of that comment, and some other
comment changes, in the attached incremental patch, which also
refactors your changes to _hash_freeovflpage() a bit to avoid some
code duplication. Please consider this for inclusion in your next
version.

Your modifications look good to me, so I will include them in the next version.

In hashutil.c, I think that _hash_msb() is just a reimplementation of
fls(), which you can rely on being present because we have our own
implementation in src/port. It's quite similar to yours but slightly
shorter. :-) Also, some systems have a builtin fls() function which
actually optimizes down to a single machine instruction, and which is
therefore much faster than either version.

Agreed, will change as per suggestion.

I don't like the fact that _hash_get_newblk() and _hash_get_oldblk()
are working out the bucket number by using the HashOpaque structure
within the bucket page they're examining. First, it seems weird to
pass the whole structure when you only need the bucket number out of
it. More importantly, the caller really ought to know what bucket
they care about without having to discover it.

For example, in _hash_doinsert(), we figure out the bucket into which
we need to insert, and we store that in a variable called "bucket".
Then from there we work out the primary bucket page's block number,
which we store in "blkno". We read the page into "buf" and put a
pointer to that buffer's contents into "page" from which we discover
the HashOpaque, a pointer to which we store into "pageopaque". Then
we pass that to _hash_get_newblk() which will go look into that
structure to find the bucket number ... but couldn't we have just
passed "bucket" instead?

Yes, it can be done. However, note that pageopaque is not retrieved
only for passing to _hash_get_newblk(); it is also used in the code
below, so we can't remove it.

Similarly, _hash_expandtable() has the value
available in the variable "old_bucket".

The only caller of _hash_get_oldblk() is _hash_first(), which has the
bucket number available in a variable called "bucket".

So it seems to me that these functions could be simplified to take the
bucket number as an argument directly instead of the HashOpaque.

Okay, I agree that it is better to use the bucket number in both
APIs, so I will change it accordingly.

Generally, this pattern recurs throughout the patch. Every time you
use the data in the page to figure something out which the caller
already knew, you're introducing a risk of bugs: what if the answers
don't match? I think you should try to root out as much of that from
this code as you can.

Okay, I will review the patch once more from this angle and see if I can improve it.

As you may be able to tell, I'm working my way into this patch
gradually, starting with peripheral parts and working toward the core
of it. Generally, I think it's in pretty good shape, but I still have
quite a bit left to study.

Thanks.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#134Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#128)
2 attachment(s)
Re: Hash Indexes

On Thu, Nov 3, 2016 at 3:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Nov 2, 2016 at 6:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ new patches ]

I looked over parts of this today, mostly the hashinsert.c changes.

At the point this code is
running, we have a pin but no lock on the metapage, so this is only
safe if changing any of these fields requires a cleanup lock on the
metapage. If that's true,

No, that's not true; we need just an Exclusive content lock to update
those fields, and they should be copied while we hold a Share content
lock on the metapage. In version 8 of the patch this was correct, but
in the last version it seems I moved it during code rearrangement. I
will change it so that these values are copied under the metapage's
Share content lock.

Fixed as mentioned.

+ nblkno = _hash_get_newblk(rel, pageopaque);

I think this is not a great name for this function. It's not clear
what "new blocks" refers to, exactly. I suggest
FIND_SPLIT_BUCKET(metap, bucket) or OLD_BUCKET_TO_NEW_BUCKET(metap,
bucket) returning a new bucket number. I think that macro can be
defined as something like this: bucket + (1 <<
(fls(metap->hashm_maxbucket) - 1)).

I think such a macro would not work when completing incomplete
splits. The reason is that by the time we try to complete the split
of the current old bucket, the table half (lowmask, highmask,
maxbucket) would have changed, and it could give you a bucket in the
new table half.

I have changed the function name to _hash_get_oldbucket_newblock() and
passed the Bucket as a second parameter.

Moving on ...

/*
* ovfl page exists; go get it.  if it doesn't have room, we'll
-             * find out next pass through the loop test above.
+             * find out next pass through the loop test above.  Retain the
+             * pin, if it is a primary bucket page.
*/
-            _hash_relbuf(rel, buf);
+            if (pageopaque->hasho_flag & LH_BUCKET_PAGE)
+                _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+            else
+                _hash_relbuf(rel, buf);

It seems like it would be cheaper, safer, and clearer to test whether
buf != bucket_buf here, rather than examining the page opaque data.
That's what you do down at the bottom of the function when you ensure
that the pin on the primary bucket page gets released, and it seems
like it should work up here, too.

+            bool        retain_pin = false;
+
+            /* page flags must be accessed before releasing lock on a page. */
+            retain_pin = pageopaque->hasho_flag & LH_BUCKET_PAGE;

Similarly.

Agreed, will change the usage as per your suggestion.

Changed as discussed. I have also changed similar usages at a few
other places in the patch.

I have also attached a patch with some suggested comment changes.

I will include it in next version of patch.

Included in new version of patch.

Some more review...

@@ -656,6 +678,10 @@ _hash_squeezebucket(Relation rel,
IndexTuple itup;
Size itemsz;

+            /* skip dead tuples */
+            if (ItemIdIsDead(PageGetItemId(rpage, roffnum)))
+                continue;

Is this an optimization independent of the rest of the patch, or is
there something in this patch that necessitates it?

This specific case is independent of the rest of the patch, but the
same optimization is used in _hash_splitbucket_guts(), where it is
mandatory, because otherwise the split would make a copy of the tuple
without copying its dead flag.

i.e. Could this
small change be committed independently?

Both places, _hash_squeezebucket() and _hash_splitbucket(), can use
this optimization irrespective of the rest of the patch. I will
prepare a separate patch for these and send it along with the main
patch after some testing.

Done as a separate patch skip_dead_tups_hash_index-v1.patch.

If not, then I think it
needs a better comment explaining why it is now mandatory.

- *  Caller must hold exclusive lock on the target bucket.  This allows
+ *  Caller must hold cleanup lock on the target bucket.  This allows
*  us to safely lock multiple pages in the bucket.

The notion of a lock on a bucket no longer really exists; with this
patch, we'll now properly speak of a lock on a primary bucket page.
Also, I think the bit about safely locking multiple pages is bizarre
-- that's not the issue at all: the problem is that reorganizing a
bucket might confuse concurrent scans into returning wrong answers.

I've included a broader updating of that comment, and some other
comment changes, in the attached incremental patch, which also
refactors your changes to _hash_freeovflpage() a bit to avoid some
code duplication. Please consider this for inclusion in your next
version.

Your modifications look good to me, so I will include them in the next version.

Included in new version of patch.

In hashutil.c, I think that _hash_msb() is just a reimplementation of
fls(), which you can rely on being present because we have our own
implementation in src/port. It's quite similar to yours but slightly
shorter. :-) Also, some systems have a builtin fls() function which
actually optimizes down to a single machine instruction, and which is
therefore much faster than either version.

Agreed, will change as per suggestion.

Changed as per suggestion.

I don't like the fact that _hash_get_newblk() and _hash_get_oldblk()
are working out the bucket number by using the HashOpaque structure
within the bucket page they're examining. First, it seems weird to
pass the whole structure when you only need the bucket number out of
it. More importantly, the caller really ought to know what bucket
they care about without having to discover it.

So it seems to me that these functions could be simplified to take the
bucket number as an argument directly instead of the HashOpaque.

Okay, I agree that it is better to use the bucket number in both
APIs, so I will change it accordingly.

Changed as per suggestion.

Generally, this pattern recurs throughout the patch. Every time you
use the data in the page to figure something out which the caller
already knew, you're introducing a risk of bugs: what if the answers
don't match? I think you should try to root out as much of that from
this code as you can.

Okay, I will review the patch once more from this angle and see if I can improve it.

I have reviewed the patch with this in mind and found multiple places,
such as hashbucketcleanup(), _hash_readnext(), and _hash_readprev(),
where such a pattern was used. I have changed all such places to
ensure that the caller passes the information it already has.

Thanks to Ashutosh Sharma, who helped me ensure that the latest
patches don't introduce any concurrency hazards (by testing with
pgbench at high client counts).

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

skip_dead_tups_hash_index-v1.patch (application/octet-stream)
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index db3e268..df7af3e 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -656,6 +656,10 @@ _hash_squeezebucket(Relation rel,
 			IndexTuple	itup;
 			Size		itemsz;
 
+			/* skip dead tuples */
+			if (ItemIdIsDead(PageGetItemId(rpage, roffnum)))
+				continue;
+
 			itup = (IndexTuple) PageGetItem(rpage,
 											PageGetItemId(rpage, roffnum));
 			itemsz = IndexTupleDSize(*itup);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 178463f..a5e9d17 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -811,6 +811,10 @@ _hash_splitbucket(Relation rel,
 			Size		itemsz;
 			Bucket		bucket;
 
+			/* skip dead tuples */
+			if (ItemIdIsDead(PageGetItemId(opage, ooffnum)))
+				continue;
+
 			/*
 			 * Fetch the item's hash key (conveniently stored in the item) and
 			 * determine which bucket it now belongs in.
concurrent_hash_index_v10.patch (application/octet-stream)
diff --git a/src/backend/access/hash/Makefile b/src/backend/access/hash/Makefile
index 5d3bd94..e2e7e91 100644
--- a/src/backend/access/hash/Makefile
+++ b/src/backend/access/hash/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/access/hash
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashscan.o \
-       hashsearch.o hashsort.o hashutil.o hashvalidate.o
+OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashsearch.o \
+       hashsort.o hashutil.o hashvalidate.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 0a7da89..7972d9d 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -125,54 +125,59 @@ the initially created buckets.
 
 Lock Definitions
 ----------------
-
-We use both lmgr locks ("heavyweight" locks) and buffer context locks
-(LWLocks) to control access to a hash index.  lmgr locks are needed for
-long-term locking since there is a (small) risk of deadlock, which we must
-be able to detect.  Buffer context locks are used for short-term access
-control to individual pages of the index.
-
-LockPage(rel, page), where page is the page number of a hash bucket page,
-represents the right to split or compact an individual bucket.  A process
-splitting a bucket must exclusive-lock both old and new halves of the
-bucket until it is done.  A process doing VACUUM must exclusive-lock the
-bucket it is currently purging tuples from.  Processes doing scans or
-insertions must share-lock the bucket they are scanning or inserting into.
-(It is okay to allow concurrent scans and insertions.)
-
-The lmgr lock IDs corresponding to overflow pages are currently unused.
-These are available for possible future refinements.  LockPage(rel, 0)
-is also currently undefined (it was previously used to represent the right
-to modify the hash-code-to-bucket mapping, but it is no longer needed for
-that purpose).
-
-Note that these lock definitions are conceptually distinct from any sort
-of lock on the pages whose numbers they share.  A process must also obtain
-read or write buffer lock on the metapage or bucket page before accessing
-said page.
-
-Processes performing hash index scans must hold share lock on the bucket
-they are scanning throughout the scan.  This seems to be essential, since
-there is no reasonable way for a scan to cope with its bucket being split
-underneath it.  This creates a possibility of deadlock external to the
-hash index code, since a process holding one of these locks could block
-waiting for an unrelated lock held by another process.  If that process
-then does something that requires exclusive lock on the bucket, we have
-deadlock.  Therefore the bucket locks must be lmgr locks so that deadlock
-can be detected and recovered from.
-
-Processes must obtain read (share) buffer context lock on any hash index
-page while reading it, and write (exclusive) lock while modifying it.
-To prevent deadlock we enforce these coding rules: no buffer lock may be
-held long term (across index AM calls), nor may any buffer lock be held
-while waiting for an lmgr lock, nor may more than one buffer lock
-be held at a time by any one process.  (The third restriction is probably
-stronger than necessary, but it makes the proof of no deadlock obvious.)
+Concurrency control for hash indexes is provided using buffer content
+locks, buffer pins, and cleanup locks.   Here as elsewhere in PostgreSQL,
+cleanup lock means that we hold an exclusive lock on the buffer and have
+observed at some point after acquiring the lock that we hold the only pin
+on that buffer.  For hash indexes, a cleanup lock on a primary bucket page
+represents the right to perform an arbitrary reorganization of the entire
+bucket.  Therefore, scans retain a pin on the primary bucket page for the
+bucket they are currently scanning.  Splitting a bucket requires a cleanup
+lock on both the old and new primary bucket pages.  VACUUM therefore takes
+a cleanup lock on every bucket page in order to remove tuples.  It can also
+remove tuples copied to a new bucket by any previous split operation, because
+the cleanup lock taken on the primary bucket page guarantees that no scans
+which started prior to the most recent split can still be in progress.  After
+cleaning each page individually, it attempts to take a cleanup lock on the
+primary bucket page in order to "squeeze" the bucket down to the minimum
+possible number of pages.
+
+To avoid deadlocks, we must be consistent about the lock order in which we
+lock the buckets for operations that require locks on two different buckets.
+We use the rule "first lock the old bucket and then the new bucket"; in other
+words, lock the lower-numbered bucket first.
+
+To avoid deadlock in operations that require locking the metapage and other
+buckets, we always take the lock on the other bucket first and then on the
+metapage.
 
 
 Pseudocode Algorithms
 ---------------------
 
+Various flags that are used in hash index operations are described as below:
+
+split-in-progress flag indicates that split operation is in progress for a
+bucket.  During split operation, this flag is set on both old and new buckets.
+This flag is cleared once the split operation is finished.
+
+moved-by-split flag on a tuple indicates that the tuple was moved from the old
+to the new bucket.  Concurrent scans can skip such tuples until the split
+operation is finished.  Once a tuple is marked as moved-by-split, it will
+remain so forever, but that does no harm.  We intentionally do not clear it,
+as clearing it would generate additional I/O that is not necessary.
+
+has_garbage flag indicates that the bucket contains tuples that were moved due
+to a split.  It is set only on the old bucket.  We need it in addition to the
+split-in-progress flag to distinguish the case where the split is already over
+(i.e. the split-in-progress flag has been cleared).  It is used both by vacuum
+and during re-split operations.  Vacuum uses it to decide whether it needs to
+clear the moved-by-split tuples from the bucket along with the dead tuples.
+Re-split of a bucket uses it to ensure that it doesn't start a new split from a
+bucket without first clearing the previously moved tuples from the old bucket.
+This usage by re-split helps keep bloat under control and makes the design
+somewhat simpler, as we never have to handle the situation where a bucket
+contains dead tuples from multiple splits.
+
 The operations we need to support are: readers scanning the index for
 entries of a particular hash code (which by definition are all in the same
 bucket); insertion of a new tuple into the correct bucket; enlarging the
@@ -193,38 +198,51 @@ The reader algorithm is:
 		release meta page buffer content lock
 		if (correct bucket page is already locked)
 			break
-		release any existing bucket page lock (if a concurrent split happened)
-		take heavyweight bucket lock
+		release any existing bucket page buffer content lock (if a concurrent split happened)
+		take the buffer content lock on bucket page in shared mode
 		retake meta page buffer content lock in shared mode
--- then, per read request:
 	release pin on metapage
-	read current page of bucket and take shared buffer content lock
-		step to next page if necessary (no chaining of locks)
+	if the split is in progress for current bucket and this is a new bucket
+		release the buffer content lock on current bucket page
+		pin and acquire the buffer content lock on old bucket in shared mode
+		release the buffer content lock on old bucket, but not pin
+		retake the buffer content lock on new bucket
+		mark the scan such that it skips the tuples that are marked as moved by split
+-- then, per read request:
+	step to next page if necessary (no chaining of locks)
+		if the scan indicates moved by split, then move to old bucket after the scan
+		of current bucket is finished
 	get tuple
 	release buffer content lock and pin on current page
 -- at scan shutdown:
-	release bucket share-lock
-
-We can't hold the metapage lock while acquiring a lock on the target bucket,
-because that might result in an undetected deadlock (lwlocks do not participate
-in deadlock detection).  Instead, we relock the metapage after acquiring the
-bucket page lock and check whether the bucket has been split.  If not, we're
-done.  If so, we release our previously-acquired lock and repeat the process
-using the new bucket number.  Holding the bucket sharelock for
-the remainder of the scan prevents the reader's current-tuple pointer from
-being invalidated by splits or compactions.  Notice that the reader's lock
-does not prevent other buckets from being split or compacted.
+	release any pin we hold on current buffer, old bucket buffer, new bucket buffer
+
+We don't want to hold the meta page lock while acquiring the content lock on
+bucket page, because that might result in poor concurrency.  Instead, we relock
+the metapage after acquiring the bucket page content lock and check whether the
+bucket has been split.  If not, we're done.  If so, we release our
+previously-acquired content lock (but not the pin) and repeat the process
+using the new bucket number.  Holding the buffer pin on the bucket page for
+the remainder of
+the scan prevents the reader's current-tuple pointer from being invalidated by
+splits or compactions.  Notice that the reader's pin does not prevent other
+buckets from being split or compacted.
 
 To keep concurrency reasonably good, we require readers to cope with
 concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the page sharelock.  Since
-deletion is not possible while a reader holds the bucket sharelock, and
-we assume that heap tuple TIDs are unique, this can be implemented by
+their current scan position after re-acquiring the buffer content lock on
+page.  Since deletion is not possible while a reader holds the pin on bucket,
+and we assume that heap tuple TIDs are unique, this can be implemented by
 searching for the same heap tuple TID previously returned.  Insertion does
 not move index entries across pages, so the previously-returned index entry
 should always be on the same page, at the same or higher offset number,
 as it was before.
 
+To allow scans during a bucket split, if at the start of the scan the bucket is
+marked as split-in-progress, the scan returns all the tuples in that bucket
+except for those that are marked as moved-by-split.  Once it finishes scanning
+all the tuples in the current bucket, it scans the old bucket from which this
+bucket was formed by the split.  This applies only to the new half of the split.
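
A minimal sketch of the scan-side handling (illustrative only, using the flag,
mask, and field names introduced elsewhere in this patch):

	/* at the start of the scan of a bucket */
	opaque = (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(buf));
	if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
		so->hashso_skip_moved_tuples = true;

	/* while matching tuples in the new bucket */
	if (so->hashso_skip_moved_tuples &&
		(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
		continue;				/* moved here by split; the old bucket covers it */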
+
 The insertion algorithm is rather similar:
 
 	pin meta page and take buffer content lock in shared mode
@@ -233,18 +251,24 @@ The insertion algorithm is rather similar:
 		release meta page buffer content lock
 		if (correct bucket page is already locked)
 			break
-		release any existing bucket page lock (if a concurrent split happened)
-		take heavyweight bucket lock in shared mode
+		release any existing bucket page buffer content lock (if a concurrent split happened)
+		take the buffer content lock on bucket page in exclusive mode
 		retake meta page buffer content lock in shared mode
--- (so far same as reader)
 	release pin on metapage
-	pin current page of bucket and take exclusive buffer content lock
-	if full, release, read/exclusive-lock next page; repeat as needed
+-- (so far same as reader, except for acquisition of the buffer content lock
+	in exclusive mode on the primary bucket page)
+	if the split-in-progress flag is set for the bucket in the old half of a
+	split and our pin count on it is one, then finish the split
+		we already have a buffer content lock on the old bucket; conditionally
+		get the content lock on the new bucket
+		if we get the lock on the new bucket
+			finish the split using the algorithm mentioned below for split
+			release the buffer content lock and pin on the new bucket
+	if the current page is full, release the lock but not the pin,
+	read/exclusive-lock next page; repeat as needed
 	>> see below if no space in any page of bucket
 	insert tuple at appropriate place in page
 	mark current page dirty and release buffer content lock and pin
-	release heavyweight share-lock
-	pin meta page and take buffer content lock in shared mode
+	if the current page is not a bucket page, release the pin on bucket page
+	pin meta page and take buffer content lock in exclusive mode
 	increment tuple count, decide if split needed
 	mark meta page dirty and release buffer content lock and pin
 	done if no split needed, else enter Split algorithm below
@@ -256,11 +280,13 @@ bucket that is being actively scanned, because readers can cope with this
 as explained above.  We only need the short-term buffer locks to ensure
 that readers do not see a partially-updated page.
 
-It is clearly impossible for readers and inserters to deadlock, and in
-fact this algorithm allows them a very high degree of concurrency.
-(The exclusive metapage lock taken to update the tuple count is stronger
-than necessary, since readers do not care about the tuple count, but the
-lock is held for such a short time that this is probably not an issue.)
+To avoid deadlock between readers and inserters, whenever there is a need
+to lock multiple buckets, we always take the locks in the order suggested in
+the Lock Definitions above.  This algorithm allows them a very high degree of
+concurrency.  (The exclusive metapage lock taken to update the tuple count
+is stronger than necessary, since readers do not care about the tuple count,
+but the lock is held for such a short time that this is probably not an
+issue.)
 
 When an inserter cannot find space in any existing page of a bucket, it
 must obtain an overflow page and add that page to the bucket's chain.
@@ -271,46 +297,66 @@ index is overfull (has a higher-than-wanted ratio of tuples to buckets).
 The algorithm attempts, but does not necessarily succeed, to split one
 existing bucket in two, thereby lowering the fill ratio:
 
-	pin meta page and take buffer content lock in exclusive mode
-	check split still needed
-	if split not needed anymore, drop buffer content lock and pin and exit
-	decide which bucket to split
-	Attempt to X-lock old bucket number (definitely could fail)
-	Attempt to X-lock new bucket number (shouldn't fail, but...)
-	if above fail, drop locks and pin and exit
+	expand:
+		take buffer content lock in exclusive mode on meta page
+		check split still needed
+		if split not needed anymore, drop buffer content lock and exit
+		decide which bucket to split
+		Attempt to acquire cleanup lock on old bucket number (definitely could fail)
+		if the above fails, release lock and pin and exit
+		if the split-in-progress flag is set, then finish the split
+			conditionally get the content lock on the new bucket that was involved in the split
+			if we got the lock on the new bucket
+				finish the split using the algorithm mentioned below for split
+				release the buffer content lock and pin on the old and new buckets
+				try to expand from the start
+			else
+				release the buffer content lock and pin on the old bucket and exit
+		if the garbage flag (indicating tuples moved by split) is set on the bucket
+			release the buffer content lock on the meta page
+			remove the tuples that don't belong to this bucket; see bucket cleanup below
+	Attempt to acquire cleanup lock on new bucket number (shouldn't fail, but...)
 	update meta page to reflect new number of buckets
-	mark meta page dirty and release buffer content lock and pin
+	mark meta page dirty and release buffer content lock
 	-- now, accesses to all other buckets can proceed.
 	Perform actual split of bucket, moving tuples as needed
 	>> see below about acquiring needed extra space
-	Release X-locks of old and new buckets
+
+	split guts
+	mark the old and new buckets to indicate split-in-progress
+	mark the old bucket to indicate has-garbage
+	copy the tuples that belong to the new bucket from the old bucket,
+	marking such tuples as moved-by-split during the copy
+	release lock but not pin for primary bucket page of old bucket,
+	read/shared-lock next page; repeat as needed
+	>> see below if no space in bucket page of new bucket
+	ensure we have exclusive locks on both the old and new buckets, in that order
+	clear the split-in-progress flag from both buckets (see the sketch below)
+	mark buffers dirty and release the locks and pins on both old and new buckets
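
For illustration, clearing the flags at the end of the split might look roughly
like this (a sketch only; it omits the handling of overflow pages and uses the
flag names and helpers defined elsewhere in this patch):

	/* re-take exclusive content locks: old bucket first, then new bucket */
	_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
	_hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);

	oopaque = (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(bucket_obuf));
	nopaque = (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(bucket_nbuf));

	/* the split is complete; scans no longer need to visit the old bucket */
	oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
	nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;

	MarkBufferDirty(bucket_obuf);
	MarkBufferDirty(bucket_nbuf);
	_hash_relbuf(rel, bucket_obuf);
	_hash_relbuf(rel, bucket_nbuf);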
 
 Note the metapage lock is not held while the actual tuple rearrangement is
 performed, so accesses to other buckets can proceed in parallel; in fact,
 it's possible for multiple bucket splits to proceed in parallel.
 
-Split's attempt to X-lock the old bucket number could fail if another
-process holds S-lock on it.  We do not want to wait if that happens, first
-because we don't want to wait while holding the metapage exclusive-lock,
-and second because it could very easily result in deadlock.  (The other
-process might be out of the hash AM altogether, and could do something
-that blocks on another lock this process holds; so even if the hash
-algorithm itself is deadlock-free, a user-induced deadlock could occur.)
-So, this is a conditional LockAcquire operation, and if it fails we just
-abandon the attempt to split.  This is all right since the index is
-overfull but perfectly functional.  Every subsequent inserter will try to
-split, and eventually one will succeed.  If multiple inserters failed to
-split, the index might still be overfull, but eventually, the index will
+The split operation's attempt to acquire a cleanup lock on the old bucket number
+could fail if another process holds any lock or pin on it.  We do not want to
+wait if that happens, because we don't want to wait while holding the metapage
+exclusive-lock.  So, this is a conditional cleanup-lock acquisition, and if it
+fails we just abandon the attempt to split.  This is all right since the
+index is overfull but perfectly functional.  Every subsequent inserter will
+try to split, and eventually one will succeed.  If multiple inserters failed
+to split, the index might still be overfull, but eventually, the index will
 not be overfull and split attempts will stop.  (We could make a successful
 splitter loop to see if the index is still overfull, but it seems better to
 distribute the split overhead across successive insertions.)
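
A minimal sketch of the conditional acquisition (mirroring what the patch's
_hash_getbuf_with_condlock_cleanup helper does):

	buf = ReadBuffer(rel, start_oblkno);
	if (!ConditionalLockBufferForCleanup(buf))
	{
		/* someone else holds a lock or pin; give up on this split attempt */
		ReleaseBuffer(buf);
		goto fail;
	}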
 
 A problem is that if a split fails partway through (eg due to insufficient
-disk space) the index is left corrupt.  The probability of that could be
-made quite low if we grab a free page or two before we update the meta
-page, but the only real solution is to treat a split as a WAL-loggable,
+disk space or crash) the index is left corrupt.  The probability of that
+could be made quite low if we grab a free page or two before we update the
+meta page, but the only real solution is to treat a split as a WAL-loggable,
 must-complete action.  I'm not planning to teach hash about WAL in this
-go-round.
+go-round.  However, we do try to finish any incomplete split during subsequent
+insert and split operations.
 
 The fourth operation is garbage collection (bulk deletion):
 
@@ -319,9 +365,13 @@ The fourth operation is garbage collection (bulk deletion):
 	fetch current max bucket number
 	release meta page buffer content lock and pin
 	while next bucket <= max bucket do
-		Acquire X lock on target bucket
-		Scan and remove tuples, compact free space as needed
-		Release X lock
+		Acquire cleanup lock on target bucket
+		Scan and remove tuples
+		For each overflow page, first lock the next page and only then
+		release the lock on the current bucket or overflow page
+		Ensure we have the buffer content lock in exclusive mode on the bucket page
+		If the buffer pin count is one, then compact free space as needed
+		Release lock
 		next bucket ++
 	end loop
 	pin metapage and take buffer content lock in exclusive mode
@@ -330,20 +380,24 @@ The fourth operation is garbage collection (bulk deletion):
 	else update metapage tuple count
 	mark meta page dirty and release buffer content lock and pin
 
-Note that this is designed to allow concurrent splits.  If a split occurs,
-tuples relocated into the new bucket will be visited twice by the scan,
-but that does no harm.  (We must however be careful about the statistics
-reported by the VACUUM operation.  What we can do is count the number of
-tuples scanned, and believe this in preference to the stored tuple count
-if the stored tuple count and number of buckets did *not* change at any
-time during the scan.  This provides a way of correcting the stored tuple
-count if it gets out of sync for some reason.  But if a split or insertion
-does occur concurrently, the scan count is untrustworthy; instead,
-subtract the number of tuples deleted from the stored tuple count and
-use that.)
-
-The exclusive lock request could deadlock in some strange scenarios, but
-we can just error out without any great harm being done.
+Note that this is designed to allow concurrent splits and scans.  If a
+split occurs, tuples relocated into the new bucket will be visited twice
+by the scan, but that does no harm.  Because we release the lock on the bucket
+page during the cleanup scan of a bucket, a concurrent scan can start on the
+bucket, but the scan is guaranteed always to be behind the cleanup.  It is
+essential to keep scans behind cleanup; otherwise vacuum could remove tuples
+that are still required to complete the scan, since a scan that returns
+multiple tuples from the same bucket page always restarts from the offset
+number at which it returned the last tuple.  This holds true for backward
+scans as well (backward scans first traverse each bucket starting from the
+first bucket to the last overflow page in the chain).  We must be careful
+about the statistics reported by the VACUUM operation.  What we can do is
+count the number of tuples scanned, and believe this in preference to the
+stored tuple count if the stored tuple count and number of buckets did *not*
+change at any time during the scan.  This provides a way of correcting the
+stored tuple count if it gets out of sync for some reason.  But if a split or
+insertion does occur concurrently, the scan count is untrustworthy; instead,
+subtract the number of tuples deleted from the stored tuple count and use that.
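
For illustration only, the hand-over-hand step that keeps scans behind the
cleanup looks roughly like this (simplified from the patch's hashbucketcleanup):

	/* lock the next overflow page before letting go of the current one */
	next_buf = _hash_getbuf_with_strategy(rel, next_blkno, HASH_WRITE,
										  LH_OVERFLOW_PAGE, bstrategy);
	_hash_relbuf(rel, buf);		/* a scan can now enter the page we just left,
								 * but it can never overtake the cleanup */
	buf = next_buf;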
 
 
 Free Space Management
@@ -417,13 +471,11 @@ free page; there can be no other process holding lock on it.
 
 Bucket splitting uses a similar algorithm if it has to extend the new
 bucket, but it need not worry about concurrent extension since it has
-exclusive lock on the new bucket.
+buffer content lock in exclusive mode on the new bucket.
 
-Freeing an overflow page is done by garbage collection and by bucket
-splitting (the old bucket may contain no-longer-needed overflow pages).
-In both cases, the process holds exclusive lock on the containing bucket,
-so need not worry about other accessors of pages in the bucket.  The
-algorithm is:
+Freeing an overflow page requires the process to hold a buffer content lock in
+exclusive mode on the containing bucket, so it need not worry about other
+accessors of pages in the bucket.  The algorithm is:
 
 	delink overflow page from bucket chain
 	(this requires read/update/write/release of fore and aft siblings)
@@ -454,14 +506,6 @@ locks.  Since they need no lmgr locks, deadlock is not possible.
 Other Notes
 -----------
 
-All the shenanigans with locking prevent a split occurring while *another*
-process is stopped in a given bucket.  They do not ensure that one of
-our *own* backend's scans is not stopped in the bucket, because lmgr
-doesn't consider a process's own locks to conflict.  So the Split
-algorithm must check for that case separately before deciding it can go
-ahead with the split.  VACUUM does not have this problem since nothing
-else can be happening within the vacuuming backend.
-
-Should we instead try to fix the state of any conflicting local scan?
-Seems mighty ugly --- got to move the held bucket S-lock as well as lots
-of other messiness.  For now, just punt and don't split.
+Cleanup locks prevent a split from occurring while *another* process is stopped
+in a given bucket.  They also ensure that none of our *own* backend's scans is
+stopped in the bucket.
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..7612e5b 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -287,10 +287,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		/*
 		 * An insertion into the current index page could have happened while
 		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold share lock on
-		 * the bucket, no deletions or splits could have occurred; therefore
-		 * we can expect that the TID still exists in the current index page,
-		 * at an offset >= where we were.)
+		 * for the TID we previously returned.  (Because we hold pin on the
+		 * bucket, no deletions or splits could have occurred; therefore we
+		 * can expect that the TID still exists in the current index page, at
+		 * an offset >= where we were.)
 		 */
 		OffsetNumber maxoffnum;
 
@@ -424,17 +424,16 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_bucket_valid = false;
-	so->hashso_bucket_blkno = 0;
 	so->hashso_curbuf = InvalidBuffer;
+	so->hashso_bucket_buf = InvalidBuffer;
+	so->hashso_old_bucket_buf = InvalidBuffer;
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
 	ItemPointerSetInvalid(&(so->hashso_heappos));
 
-	scan->opaque = so;
+	so->hashso_skip_moved_tuples = false;
 
-	/* register scan in case we change pages it's using */
-	_hash_regscan(scan);
+	scan->opaque = so;
 
 	return scan;
 }
@@ -449,15 +448,7 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
-
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	_hash_dropscanbuf(rel, so);
 
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
@@ -469,8 +460,9 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 		memmove(scan->keyData,
 				scankey,
 				scan->numberOfKeys * sizeof(ScanKeyData));
-		so->hashso_bucket_valid = false;
 	}
+
+	so->hashso_skip_moved_tuples = false;
 }
 
 /*
@@ -482,18 +474,7 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/* don't need scan registered anymore */
-	_hash_dropscan(scan);
-
-	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
-
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	_hash_dropscanbuf(rel, so);
 
 	pfree(so);
 	scan->opaque = NULL;
@@ -504,6 +485,9 @@ hashendscan(IndexScanDesc scan)
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
  *
+ * This function also deletes the tuples that are moved by split to another
+ * bucket.
+ *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 IndexBulkDeleteResult *
@@ -548,83 +532,47 @@ loop_top:
 	{
 		BlockNumber bucket_blkno;
 		BlockNumber blkno;
-		bool		bucket_dirty = false;
+		Buffer		bucket_buf;
+		Buffer		buf;
+		HashPageOpaque bucket_opaque;
+		Page		page;
+		bool		bucket_has_garbage = false;
 
 		/* Get address of bucket's start page */
 		bucket_blkno = BUCKET_TO_BLKNO(&local_metapage, cur_bucket);
 
-		/* Exclusive-lock the bucket so we can shrink it */
-		_hash_getlock(rel, bucket_blkno, HASH_EXCLUSIVE);
-
-		/* Shouldn't have any active scans locally, either */
-		if (_hash_has_active_scan(rel, cur_bucket))
-			elog(ERROR, "hash index has active scan during VACUUM");
-
-		/* Scan each page in bucket */
 		blkno = bucket_blkno;
-		while (BlockNumberIsValid(blkno))
-		{
-			Buffer		buf;
-			Page		page;
-			HashPageOpaque opaque;
-			OffsetNumber offno;
-			OffsetNumber maxoffno;
-			OffsetNumber deletable[MaxOffsetNumber];
-			int			ndeletable = 0;
-
-			vacuum_delay_point();
 
-			buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-											 info->strategy);
-			page = BufferGetPage(buf);
-			opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-			Assert(opaque->hasho_bucket == cur_bucket);
-
-			/* Scan each tuple in page */
-			maxoffno = PageGetMaxOffsetNumber(page);
-			for (offno = FirstOffsetNumber;
-				 offno <= maxoffno;
-				 offno = OffsetNumberNext(offno))
-			{
-				IndexTuple	itup;
-				ItemPointer htup;
+		/*
+		 * We need to acquire a cleanup lock on the primary bucket page to
+		 * wait out concurrent scans before deleting the dead tuples.
+		 */
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy);
+		LockBufferForCleanup(buf);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
 
-				itup = (IndexTuple) PageGetItem(page,
-												PageGetItemId(page, offno));
-				htup = &(itup->t_tid);
-				if (callback(htup, callback_state))
-				{
-					/* mark the item for deletion */
-					deletable[ndeletable++] = offno;
-					tuples_removed += 1;
-				}
-				else
-					num_index_tuples += 1;
-			}
+		page = BufferGetPage(buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
-			/*
-			 * Apply deletions and write page if needed, advance to next page.
-			 */
-			blkno = opaque->hasho_nextblkno;
+		/*
+		 * If the bucket contains tuples that are moved by split, then we need
+		 * to delete such tuples.  We can't delete such tuples if the split
+		 * operation on bucket is not finished as those are needed by scans.
+		 */
+		if (H_HAS_GARBAGE(bucket_opaque) &&
+			!H_INCOMPLETE_SPLIT(bucket_opaque))
+			bucket_has_garbage = true;
 
-			if (ndeletable > 0)
-			{
-				PageIndexMultiDelete(page, deletable, ndeletable);
-				_hash_wrtbuf(rel, buf);
-				bucket_dirty = true;
-			}
-			else
-				_hash_relbuf(rel, buf);
-		}
+		bucket_buf = buf;
 
-		/* If we deleted anything, try to compact free space */
-		if (bucket_dirty)
-			_hash_squeezebucket(rel, cur_bucket, bucket_blkno,
-								info->strategy);
+		hashbucketcleanup(rel, cur_bucket, bucket_buf, blkno, info->strategy,
+						  local_metapage.hashm_maxbucket,
+						  local_metapage.hashm_highmask,
+						  local_metapage.hashm_lowmask, &tuples_removed,
+						  &num_index_tuples, bucket_has_garbage, true,
+						  callback, callback_state);
 
-		/* Release bucket lock */
-		_hash_droplock(rel, bucket_blkno, HASH_EXCLUSIVE);
+		_hash_relbuf(rel, bucket_buf);
 
 		/* Advance to next bucket */
 		cur_bucket++;
@@ -705,6 +653,190 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	return stats;
 }
 
+/*
+ * Helper function to perform deletion of index entries from a bucket.
+ *
+ * This expects that the caller has acquired a cleanup lock on the target
+ * bucket (primary page of the bucket), and it is the responsibility of the
+ * caller to release that lock.
+ *
+ * During the scan of overflow pages, we first lock the next page and only
+ * then release the lock on the current page.  This ensures that any
+ * concurrent scan started after we begin cleaning the bucket will always be
+ * behind the cleanup.  Allowing scans to overtake vacuum would let vacuum
+ * remove tuples that are still required to complete those scans.
+ *
+ * We need to retain a pin on the primary bucket to ensure that no concurrent
+ * split can start.
+ */
+void
+hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
+				  BlockNumber bucket_blkno, BufferAccessStrategy bstrategy,
+				  uint32 maxbucket, uint32 highmask, uint32 lowmask,
+				  double *tuples_removed, double *num_index_tuples,
+				  bool bucket_has_garbage, bool delay,
+				  IndexBulkDeleteCallback callback, void *callback_state)
+{
+	BlockNumber blkno;
+	Buffer		buf;
+	Bucket new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
+	bool		bucket_dirty = false;
+
+	blkno = bucket_blkno;
+	buf = bucket_buf;
+
+	if (bucket_has_garbage)
+		new_bucket = _hash_get_oldbucket_newbucket(rel, cur_bucket,
+												   lowmask, maxbucket);
+
+	/* Scan each page in bucket */
+	for (;;)
+	{
+		HashPageOpaque opaque;
+		OffsetNumber offno;
+		OffsetNumber maxoffno;
+		Buffer		next_buf;
+		Page		page;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+		bool		retain_pin = false;
+		bool		curr_page_dirty = false;
+
+		if (delay)
+			vacuum_delay_point();
+
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		/* Scan each tuple in page */
+		maxoffno = PageGetMaxOffsetNumber(page);
+		for (offno = FirstOffsetNumber;
+			 offno <= maxoffno;
+			 offno = OffsetNumberNext(offno))
+		{
+			IndexTuple	itup;
+			ItemPointer htup;
+			Bucket		bucket;
+
+			itup = (IndexTuple) PageGetItem(page,
+											PageGetItemId(page, offno));
+			htup = &(itup->t_tid);
+			if (callback && callback(htup, callback_state))
+			{
+				/* mark the item for deletion */
+				deletable[ndeletable++] = offno;
+				if (tuples_removed)
+					*tuples_removed += 1;
+			}
+			else if (bucket_has_garbage)
+			{
+				/* delete the tuples that are moved by split. */
+				bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+											  maxbucket,
+											  highmask,
+											  lowmask);
+				/* mark the item for deletion */
+				if (bucket != cur_bucket)
+				{
+					/*
+					 * We expect tuples to belong to either the current bucket
+					 * or new_bucket.  This is ensured because we don't allow
+					 * further splits from a bucket that contains garbage. See
+					 * comments in _hash_expandtable.
+					 */
+					Assert(bucket == new_bucket);
+					deletable[ndeletable++] = offno;
+				}
+				else if (num_index_tuples)
+					*num_index_tuples += 1;
+			}
+			else if (num_index_tuples)
+				*num_index_tuples += 1;
+		}
+
+		/* retain the pin on primary bucket page till end of bucket scan */
+		if (blkno == bucket_blkno)
+			retain_pin = true;
+		else
+			retain_pin = false;
+
+		blkno = opaque->hasho_nextblkno;
+
+		/*
+		 * Apply deletions, advance to next page and write page if needed.
+		 */
+		if (ndeletable > 0)
+		{
+			PageIndexMultiDelete(page, deletable, ndeletable);
+			bucket_dirty = true;
+			curr_page_dirty = true;
+		}
+
+		/* bail out if there are no more pages to scan. */
+		if (!BlockNumberIsValid(blkno))
+			break;
+
+		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											  LH_OVERFLOW_PAGE,
+											  bstrategy);
+
+		/*
+		 * release the lock on previous page after acquiring the lock on next
+		 * page
+		 */
+		if (curr_page_dirty)
+		{
+			if (retain_pin)
+				_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+			else
+				_hash_wrtbuf(rel, buf);
+			curr_page_dirty = false;
+		}
+		else if (retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
+
+		buf = next_buf;
+	}
+
+	/*
+	 * Lock the bucket page to clear the garbage flag and squeeze the bucket.
+	 * If the current buffer is the same as the bucket buffer, then we already
+	 * have a lock on the bucket page.
+	 */
+	if (buf != bucket_buf)
+	{
+		_hash_relbuf(rel, buf);
+		_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+	}
+
+	/*
+	 * Clear the garbage flag from the bucket after deleting the tuples that
+	 * were moved by split.  We purposely clear the flag before squeezing the
+	 * bucket, so that after a restart, vacuum won't again try to delete the
+	 * moved-by-split tuples.
+	 */
+	if (bucket_has_garbage)
+	{
+		HashPageOpaque bucket_opaque;
+		Page		page;
+
+		page = BufferGetPage(bucket_buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+	}
+
+	/*
+	 * If we deleted anything, try to compact free space.  For squeezing the
+	 * bucket, we must have a cleanup lock; otherwise the squeeze could disturb
+	 * the ordering of tuples seen by a scan that started before it.
+	 */
+	if (bucket_dirty && IsBufferCleanupOK(bucket_buf))
+		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
+							bstrategy);
+}
 
 void
 hash_redo(XLogReaderState *record)
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index acd2e64..c2c2a95 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -28,7 +28,8 @@
 void
 _hash_doinsert(Relation rel, IndexTuple itup)
 {
-	Buffer		buf;
+	Buffer		buf = InvalidBuffer;
+	Buffer		bucket_buf;
 	Buffer		metabuf;
 	HashMetaPage metap;
 	BlockNumber blkno;
@@ -40,6 +41,9 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	bool		do_expand;
 	uint32		hashkey;
 	Bucket		bucket;
+	uint32		maxbucket;
+	uint32		highmask;
+	uint32		lowmask;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -84,21 +88,32 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 
 		blkno = BUCKET_TO_BLKNO(metap, bucket);
 
+		/*
+		 * Copy bucket mapping info now; the comment in _hash_expandtable
+		 * where we copy this information and call _hash_splitbucket explains
+		 * why this is OK.
+		 */
+		maxbucket = metap->hashm_maxbucket;
+		highmask = metap->hashm_highmask;
+		lowmask = metap->hashm_lowmask;
+
 		/* Release metapage lock, but keep pin. */
 		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
 
 		/*
-		 * If the previous iteration of this loop locked what is still the
-		 * correct target bucket, we are done.  Otherwise, drop any old lock
-		 * and lock what now appears to be the correct bucket.
+		 * If the previous iteration of this loop locked the primary page of
+		 * what is still the correct target bucket, we are done.  Otherwise,
+		 * drop any old lock before acquiring the new one.
 		 */
 		if (retry)
 		{
 			if (oldblkno == blkno)
 				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
+			_hash_relbuf(rel, buf);
 		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+
+		/* Fetch and lock the primary bucket page for the target bucket */
+		buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
 
 		/*
 		 * Reacquire metapage lock and check that no bucket split has taken
@@ -109,12 +124,44 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		retry = true;
 	}
 
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+	/* remember the primary bucket buffer to release the pin on it at end. */
+	bucket_buf = buf;
+
 	page = BufferGetPage(buf);
 	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(pageopaque->hasho_bucket == bucket);
 
+	/*
+	 * If this bucket is in the process of being split, try to finish the
+	 * split before inserting, because that might create room for the
+	 * insertion to proceed without allocating an additional overflow page.
+	 * It's only interesting to finish the split if we're trying to insert
+	 * into the bucket from which we're removing tuples (the "old" bucket),
+	 * not if we're trying to insert into the bucket into which tuples are
+	 * being moved (the "new" bucket).
+	 */
+	if (H_OLD_INCOMPLETE_SPLIT(pageopaque) && IsBufferCleanupOK(buf))
+	{
+		BlockNumber nblkno;
+		Buffer		nbuf;
+
+		nblkno = _hash_get_oldbucket_newblock(rel, pageopaque->hasho_bucket);
+
+		/* Fetch the primary bucket page for the new bucket */
+		nbuf = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+		if (nbuf)
+		{
+			_hash_finish_split(rel, metabuf, buf, nbuf, maxbucket,
+							   highmask, lowmask);
+
+			/*
+			 * release the buffer here as the insertion will happen in old
+			 * bucket.
+			 */
+			_hash_relbuf(rel, nbuf);
+		}
+	}
+
 	/* Do the insertion */
 	while (PageGetFreeSpace(page) < itemsz)
 	{
@@ -127,9 +174,15 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		{
 			/*
 			 * ovfl page exists; go get it.  if it doesn't have room, we'll
-			 * find out next pass through the loop test above.
+			 * find out next pass through the loop test above.  we always
+			 * release both the lock and pin if this is an overflow page, but
+			 * only the lock if this is the primary bucket page, since the pin
+			 * on the primary bucket must be retained throughout the scan.
 			 */
-			_hash_relbuf(rel, buf);
+			if (buf != bucket_buf)
+				_hash_relbuf(rel, buf);
+			else
+				_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 			buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 			page = BufferGetPage(buf);
 		}
@@ -144,7 +197,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 
 			/* chain to a new overflow page */
-			buf = _hash_addovflpage(rel, metabuf, buf);
+			buf = _hash_addovflpage(rel, metabuf, buf, (buf == bucket_buf));
 			page = BufferGetPage(buf);
 
 			/* should fit now, given test above */
@@ -158,11 +211,14 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	/* found page with enough space, so add the item here */
 	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
 
-	/* write and release the modified page */
+	/*
+	 * write and release the modified page.  if the page we modified was an
+	 * overflow page, we also need to separately drop the pin we retained on
+	 * the primary bucket page.
+	 */
 	_hash_wrtbuf(rel, buf);
-
-	/* We can drop the bucket lock now */
-	_hash_droplock(rel, blkno, HASH_SHARE);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
 
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index df7af3e..c00d6f5 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -82,23 +82,20 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *
  *	On entry, the caller must hold a pin but no lock on 'buf'.  The pin is
  *	dropped before exiting (we assume the caller is not interested in 'buf'
- *	anymore).  The returned overflow page will be pinned and write-locked;
- *	it is guaranteed to be empty.
+ *	anymore), unless asked to retain it.  The pin will be retained only for
+ *	the primary bucket page.  The returned overflow page will be pinned and
+ *	write-locked; it is guaranteed to be empty.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
  *
- *	The caller must hold at least share lock on the bucket, to ensure that
- *	no one else tries to compact the bucket meanwhile.  This guarantees that
- *	'buf' won't stop being part of the bucket while it's unlocked.
- *
  * NB: since this could be executed concurrently by multiple processes,
  * one should not assume that the returned overflow page will be the
  * immediate successor of the originally passed 'buf'.  Additional overflow
  * pages might have been added to the bucket chain in between.
  */
 Buffer
-_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
+_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 {
 	Buffer		ovflbuf;
 	Page		page;
@@ -131,7 +128,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 			break;
 
 		/* we assume we do not need to write the unmodified page */
-		_hash_relbuf(rel, buf);
+		if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
 
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
@@ -149,7 +149,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 
 	/* logically chain overflow page to previous page */
 	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	_hash_wrtbuf(rel, buf);
+	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+		_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_wrtbuf(rel, buf);
 
 	return ovflbuf;
 }
@@ -369,12 +372,13 @@ _hash_firstfreebit(uint32 map)
  *	Returns the block number of the page that followed the given page
  *	in the bucket, or InvalidBlockNumber if no following page.
  *
- *	NB: caller must not hold lock on metapage, nor on either page that's
- *	adjacent in the bucket chain.  The caller had better hold exclusive lock
- *	on the bucket, too.
+ *	NB: caller must hold a cleanup lock on the primary bucket page, so that
+ *	concurrent scans can't get confused.  caller must not hold a lock on either
+ *	page adjacent to this one in the bucket chain (except when it's the primary
+ *	bucket page). caller must not hold a lock on the metapage, either.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf,
+_hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
@@ -413,22 +417,42 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf,
 	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
-	 * deleted.  No concurrency issues since we hold exclusive lock on the
-	 * entire bucket.
+	 * deleted.  No concurrency issues since we hold a cleanup lock on primary
+	 * bucket.  We don't need to acquire a buffer lock to fix the primary
+	 * bucket, as we already have that lock.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
-														 prevblkno,
-														 HASH_WRITE,
+		Buffer		prevbuf;
+		Page		prevpage;
+		HashPageOpaque prevopaque;
+
+		if (prevblkno == bucket_blkno)
+			prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
+										 prevblkno,
+										 RBM_NORMAL,
+										 bstrategy);
+		else
+			prevbuf = _hash_getbuf_with_strategy(rel,
+												 prevblkno,
+												 HASH_WRITE,
 										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		prevpage = BufferGetPage(prevbuf);
-		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+												 bstrategy);
+
+		prevpage = BufferGetPage(prevbuf);
+		prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
 
 		Assert(prevopaque->hasho_bucket == bucket);
 		prevopaque->hasho_nextblkno = nextblkno;
-		_hash_wrtbuf(rel, prevbuf);
+
+		if (prevblkno == bucket_blkno)
+		{
+			MarkBufferDirty(prevbuf);
+			ReleaseBuffer(prevbuf);
+		}
+		else
+			_hash_wrtbuf(rel, prevbuf);
 	}
 	if (BlockNumberIsValid(nextblkno))
 	{
@@ -570,8 +594,10 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
  *	required that to be true on entry as well, but it's a lot easier for
  *	callers to leave empty overflow pages and let this guy clean it up.
  *
- *	Caller must hold exclusive lock on the target bucket.  This allows
- *	us to safely lock multiple pages in the bucket.
+ *	Caller must hold cleanup lock on the primary page of the target bucket
+ *	to exclude any concurrent scans, which could easily be confused into
+ *	returning the same tuple more than once or some tuples not at all by
+ *	the rearrangement we are performing here.
  *
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
@@ -580,6 +606,7 @@ void
 _hash_squeezebucket(Relation rel,
 					Bucket bucket,
 					BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy)
 {
 	BlockNumber wblkno;
@@ -591,27 +618,22 @@ _hash_squeezebucket(Relation rel,
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
 	bool		wbuf_dirty;
+	bool		release_buf = false;
 
 	/*
-	 * start squeezing into the base bucket page.
+	 * start squeezing into the primary bucket page.
 	 */
 	wblkno = bucket_blkno;
-	wbuf = _hash_getbuf_with_strategy(rel,
-									  wblkno,
-									  HASH_WRITE,
-									  LH_BUCKET_PAGE,
-									  bstrategy);
+	wbuf = bucket_buf;
 	wpage = BufferGetPage(wbuf);
 	wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 
 	/*
-	 * if there aren't any overflow pages, there's nothing to squeeze.
+	 * if there aren't any overflow pages, there's nothing to squeeze. caller
+	 * is responsible for releasing the lock on primary bucket page.
 	 */
 	if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
-	{
-		_hash_relbuf(rel, wbuf);
 		return;
-	}
 
 	/*
 	 * Find the last page in the bucket chain by starting at the base bucket
@@ -673,12 +695,17 @@ _hash_squeezebucket(Relation rel,
 			{
 				Assert(!PageIsEmpty(wpage));
 
+				if (wblkno != bucket_blkno)
+					release_buf = true;
+
 				wblkno = wopaque->hasho_nextblkno;
 				Assert(BlockNumberIsValid(wblkno));
 
-				if (wbuf_dirty)
+				if (wbuf_dirty && release_buf)
 					_hash_wrtbuf(rel, wbuf);
-				else
+				else if (wbuf_dirty)
+					MarkBufferDirty(wbuf);
+				else if (release_buf)
 					_hash_relbuf(rel, wbuf);
 
 				/* nothing more to do if we reached the read page */
@@ -704,6 +731,7 @@ _hash_squeezebucket(Relation rel,
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
 				wbuf_dirty = false;
+				release_buf = false;
 			}
 
 			/*
@@ -737,19 +765,25 @@ _hash_squeezebucket(Relation rel,
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
 		{
-			/* yes, so release wbuf lock first */
-			if (wbuf_dirty)
+			if (wblkno != bucket_blkno)
+				release_buf = true;
+
+			/* yes, so release wbuf lock first if needed */
+			if (wbuf_dirty && release_buf)
 				_hash_wrtbuf(rel, wbuf);
-			else
+			else if (wbuf_dirty)
+				MarkBufferDirty(wbuf);
+			else if (release_buf)
 				_hash_relbuf(rel, wbuf);
+
 			/* free this overflow page (releases rbuf) */
-			_hash_freeovflpage(rel, rbuf, bstrategy);
+			_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
 			/* done */
 			return;
 		}
 
 		/* free this overflow page, then get the previous one */
-		_hash_freeovflpage(rel, rbuf, bstrategy);
+		_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
 
 		rbuf = _hash_getbuf_with_strategy(rel,
 										  rblkno,
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index a5e9d17..7bc6b26 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -38,10 +38,14 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
 					uint32 nblocks);
 static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
+static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
+					   Bucket obucket, Bucket nbucket, Buffer obuf,
+					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
+					   uint32 highmask, uint32 lowmask);
 
 
 /*
@@ -55,46 +59,6 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 
 
 /*
- * _hash_getlock() -- Acquire an lmgr lock.
- *
- * 'whichlock' should the block number of a bucket's primary bucket page to
- * acquire the per-bucket lock.  (See README for details of the use of these
- * locks.)
- *
- * 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
- */
-void
-_hash_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		LockPage(rel, whichlock, access);
-}
-
-/*
- * _hash_try_getlock() -- Acquire an lmgr lock, but only if it's free.
- *
- * Same as above except we return FALSE without blocking if lock isn't free.
- */
-bool
-_hash_try_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		return ConditionalLockPage(rel, whichlock, access);
-	else
-		return true;
-}
-
-/*
- * _hash_droplock() -- Release an lmgr lock.
- */
-void
-_hash_droplock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		UnlockPage(rel, whichlock, access);
-}
-
-/*
  *	_hash_getbuf() -- Get a buffer by block number for read or write.
  *
  *		'access' must be HASH_READ, HASH_WRITE, or HASH_NOLOCK.
@@ -132,6 +96,35 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access, int flags)
 }
 
 /*
+ * _hash_getbuf_with_condlock_cleanup() -- Try to get a buffer for cleanup.
+ *
+ *		We read the page and try to acquire a cleanup lock.  If we get it,
+ *		we return the buffer; otherwise, we return InvalidBuffer.
+ */
+Buffer
+_hash_getbuf_with_condlock_cleanup(Relation rel, BlockNumber blkno, int flags)
+{
+	Buffer		buf;
+
+	if (blkno == P_NEW)
+		elog(ERROR, "hash AM does not use P_NEW");
+
+	buf = ReadBuffer(rel, blkno);
+
+	if (!ConditionalLockBufferForCleanup(buf))
+	{
+		ReleaseBuffer(buf);
+		return InvalidBuffer;
+	}
+
+	/* ref count and lock type are correct */
+
+	_hash_checkpage(rel, buf, flags);
+
+	return buf;
+}
+
+/*
  *	_hash_getinitbuf() -- Get and initialize a buffer by block number.
  *
  *		This must be used only to fetch pages that are known to be before
@@ -266,6 +259,33 @@ _hash_dropbuf(Relation rel, Buffer buf)
 }
 
 /*
+ *	_hash_dropscanbuf() -- release buffers used in scan.
+ *
+ * This routine unpins the buffers used during scan on which we
+ * hold no lock.
+ */
+void
+_hash_dropscanbuf(Relation rel, HashScanOpaque so)
+{
+	/* release pin we hold on primary bucket page */
+	if (BufferIsValid(so->hashso_bucket_buf) &&
+		so->hashso_bucket_buf != so->hashso_curbuf)
+		_hash_dropbuf(rel, so->hashso_bucket_buf);
+	so->hashso_bucket_buf = InvalidBuffer;
+
+	/* release pin we hold on old primary bucket page */
+	if (BufferIsValid(so->hashso_old_bucket_buf) &&
+		so->hashso_old_bucket_buf != so->hashso_curbuf)
+		_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+	so->hashso_old_bucket_buf = InvalidBuffer;
+
+	/* release any pin we still hold */
+	if (BufferIsValid(so->hashso_curbuf))
+		_hash_dropbuf(rel, so->hashso_curbuf);
+	so->hashso_curbuf = InvalidBuffer;
+}
+
+/*
  *	_hash_wrtbuf() -- write a hash page to disk.
  *
  *		This routine releases the lock held on the buffer and our refcount
@@ -489,9 +509,11 @@ _hash_pageinit(Page page, Size size)
 /*
  * Attempt to expand the hash table by creating one new bucket.
  *
- * This will silently do nothing if it cannot get the needed locks.
+ * This will silently do nothing if we can't get a cleanup lock on the old
+ * or new bucket.
  *
- * The caller should hold no locks on the hash index.
+ * This also completes any pending split and removes tuples left over in the
+ * old bucket from a previous split.
  *
  * The caller must hold a pin, but no lock, on the metapage buffer.
  * The buffer is returned in the same state.
@@ -506,10 +528,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	BlockNumber start_oblkno;
 	BlockNumber start_nblkno;
 	Buffer		buf_nblkno;
+	Buffer		buf_oblkno;
+	Page		opage;
+	HashPageOpaque oopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
 
+restart_expand:
+
 	/*
 	 * Write-lock the meta page.  It used to be necessary to acquire a
 	 * heavyweight lock to begin a split, but that is no longer required.
@@ -548,11 +575,16 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		goto fail;
 
 	/*
-	 * Determine which bucket is to be split, and attempt to lock the old
-	 * bucket.  If we can't get the lock, give up.
+	 * Determine which bucket is to be split, and attempt to take cleanup lock
+	 * on the old bucket.  If we can't get the lock, give up.
 	 *
-	 * The lock protects us against other backends, but not against our own
-	 * backend.  Must check for active scans separately.
+	 * The cleanup lock protects us not only against other backends, but
+	 * against our own backend as well.
+	 *
+	 * The cleanup lock is mainly to protect the split from concurrent
+	 * inserts. See src/backend/access/hash/README, Lock Definitions for
+	 * further details.  Due to this locking restriction, if there is any
+	 * pending scan, the split will give up, which is not good but harmless.
 	 */
 	new_bucket = metap->hashm_maxbucket + 1;
 
@@ -560,14 +592,90 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 
 	start_oblkno = BUCKET_TO_BLKNO(metap, old_bucket);
 
-	if (_hash_has_active_scan(rel, old_bucket))
+	buf_oblkno = _hash_getbuf_with_condlock_cleanup(rel, start_oblkno, LH_BUCKET_PAGE);
+	if (!buf_oblkno)
 		goto fail;
 
-	if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
-		goto fail;
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	/*
+	 * We want to finish any pending split of the old bucket before starting a
+	 * new one: there is no apparent benefit in deferring it, and tracking
+	 * splits that involve multiple buckets (in case the new split also fails)
+	 * would complicate the code.  We don't need to consider the new bucket for
+	 * completing a split here, as a re-split of the new bucket cannot start
+	 * while there is still a pending split from the old bucket.
+	 */
+	if (H_OLD_INCOMPLETE_SPLIT(oopaque))
+	{
+		BlockNumber nblkno;
+		Buffer		buf_nblkno;
+
+		/*
+		 * Copy bucket mapping info now; the comment in the code below where
+		 * we copy this information and call _hash_splitbucket explains why
+		 * this is OK.
+		 */
+		maxbucket = metap->hashm_maxbucket;
+		highmask = metap->hashm_highmask;
+		lowmask = metap->hashm_lowmask;
+
+		/* Release the metapage lock, before completing the split. */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+		nblkno = _hash_get_oldbucket_newblock(rel, oopaque->hasho_bucket);
+
+		/* Fetch the primary bucket page for the new bucket */
+		buf_nblkno = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+		if (!buf_nblkno)
+		{
+			_hash_relbuf(rel, buf_oblkno);
+			return;
+		}
+
+		_hash_finish_split(rel, metabuf, buf_oblkno, buf_nblkno, maxbucket,
+						   highmask, lowmask);
+
+		/*
+		 * release the buffers and retry for expand.
+		 */
+		_hash_relbuf(rel, buf_oblkno);
+		_hash_relbuf(rel, buf_nblkno);
+
+		goto restart_expand;
+	}
+
+	/*
+	 * Clean up any tuples left over from a previous split.  This operation
+	 * requires a cleanup lock, and we already have one on the old bucket, so
+	 * let's do it.  We also don't want to allow further splits from the
+	 * bucket until the garbage of the previous split is cleaned.  This has
+	 * two advantages: first, it helps avoid bloat due to garbage, and second,
+	 * during cleanup of the bucket we can always be sure that the garbage
+	 * tuples belong to the most recently split bucket.  By contrast, if we
+	 * allowed cleanup of the bucket after the meta page is updated to
+	 * indicate a new split but before the actual split, the cleanup operation
+	 * would not be able to decide whether a tuple has been moved to the newly
+	 * created bucket, and could end up deleting such tuples.
+	 */
+	if (H_HAS_GARBAGE(oopaque))
+	{
+		/* Release the metapage lock. */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+		hashbucketcleanup(rel, old_bucket, buf_oblkno, start_oblkno, NULL,
+						  metap->hashm_maxbucket, metap->hashm_highmask,
+						  metap->hashm_lowmask, NULL,
+						  NULL, true, false, NULL, NULL);
+
+		_hash_relbuf(rel, buf_oblkno);
+
+		goto restart_expand;
+	}
 
 	/*
-	 * Likewise lock the new bucket (should never fail).
+	 * There shouldn't be any active scan on the new bucket.
 	 *
 	 * Note: it is safe to compute the new bucket's blkno here, even though we
 	 * may still need to update the BUCKET_TO_BLKNO mapping.  This is because
@@ -576,12 +684,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	 */
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
-	if (_hash_has_active_scan(rel, new_bucket))
-		elog(ERROR, "scan in progress on supposedly new bucket");
-
-	if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
-		elog(ERROR, "could not get lock on supposedly new bucket");
-
 	/*
 	 * If the split point is increasing (hashm_maxbucket's log base 2
 	 * increases), we need to allocate a new batch of bucket pages.
@@ -600,8 +702,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
 			/* can't split due to BlockNumber overflow */
-			_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-			_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
+			_hash_relbuf(rel, buf_oblkno);
 			goto fail;
 		}
 	}
@@ -609,9 +710,18 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/*
 	 * Physically allocate the new bucket's primary page.  We want to do this
 	 * before changing the metapage's mapping info, in case we can't get the
-	 * disk space.
+	 * disk space.  Ideally, we don't need to check for a cleanup lock on the
+	 * new bucket, since no other backend can find this bucket until the meta
+	 * page is updated.  However, it is good to be consistent with the old
+	 * bucket locking.
 	 */
 	buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM);
+	if (!IsBufferCleanupOK(buf_nblkno))
+	{
+		_hash_relbuf(rel, buf_oblkno);
+		_hash_relbuf(rel, buf_nblkno);
+		goto fail;
+	}
+
 
 	/*
 	 * Okay to proceed with split.  Update the metapage bucket mapping info.
@@ -665,13 +775,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  start_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno,
 					  maxbucket, highmask, lowmask);
 
-	/* Release bucket locks, allowing others to access them */
-	_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-	_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
-
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -738,13 +844,17 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
  * belong in the new bucket, and compress out any free space in the old
  * bucket.
  *
- * The caller must hold exclusive locks on both buckets to ensure that
+ * The caller must hold cleanup locks on both buckets to ensure that
  * no one else is trying to access them (see README).
  *
  * The caller must hold a pin, but no lock, on the metapage buffer.
  * The buffer is returned in the same state.  (The metapage is only
  * touched if it becomes necessary to add or remove overflow pages.)
  *
+ * Split needs to retain pins on the primary bucket pages of both the old and
+ * new buckets till the end of the operation.  This is to prevent vacuum from
+ * starting while a split is in progress.
+ *
  * In addition, the caller must have created the new bucket's base page,
  * which is passed in buffer nbuf, pinned and write-locked.  That lock and
  * pin are released here.  (The API is set up this way because we must do
@@ -756,37 +866,87 @@ _hash_splitbucket(Relation rel,
 				  Buffer metabuf,
 				  Bucket obucket,
 				  Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Buffer		obuf;
 	Page		opage;
 	Page		npage;
 	HashPageOpaque oopaque;
 	HashPageOpaque nopaque;
 
-	/*
-	 * It should be okay to simultaneously write-lock pages from each bucket,
-	 * since no one else can be trying to acquire buffer lock on pages of
-	 * either bucket.
-	 */
-	obuf = _hash_getbuf(rel, start_oblkno, HASH_WRITE, LH_BUCKET_PAGE);
 	opage = BufferGetPage(obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 
+	/*
+	 * Mark the old bucket to indicate that a split is in progress and that it
+	 * has deletable tuples.  At operation end we clear the split-in-progress
+	 * flag, and vacuum will clear the page-has-garbage flag after deleting
+	 * such tuples.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
+
 	npage = BufferGetPage(nbuf);
 
-	/* initialize the new bucket's primary page */
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 	nopaque->hasho_prevblkno = InvalidBlockNumber;
 	nopaque->hasho_nextblkno = InvalidBlockNumber;
 	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
 	nopaque->hasho_page_id = HASHO_PAGE_ID;
 
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, nbuf, NULL,
+						   maxbucket, highmask, lowmask);
+
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, obuf);
+	_hash_relbuf(rel, nbuf);
+}
+
+/*
+ * _hash_splitbucket_guts -- Helper function to perform the split operation
+ *
+ * This routine is used to partition the tuples between the old and new
+ * buckets and also to finish incomplete split operations.  To finish a
+ * previously interrupted split operation, the caller needs to fill htab.  If
+ * htab is set, we skip moving tuples that are already present in htab;
+ * otherwise a NULL htab indicates that all tuples belonging to the new bucket
+ * are to be moved.
+ *
+ * Caller needs to lock and unlock the old and new primary buckets.
+ */
+static void
+_hash_splitbucket_guts(Relation rel,
+					   Buffer metabuf,
+					   Bucket obucket,
+					   Bucket nbucket,
+					   Buffer obuf,
+					   Buffer nbuf,
+					   HTAB *htab,
+					   uint32 maxbucket,
+					   uint32 highmask,
+					   uint32 lowmask)
+{
+	Buffer		bucket_obuf;
+	Buffer		bucket_nbuf;
+	Page		opage;
+	Page		npage;
+	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
+
+	bucket_obuf = obuf;
+	opage = BufferGetPage(obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	bucket_nbuf = nbuf;
+	npage = BufferGetPage(nbuf);
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
 	/*
 	 * Partition the tuples in the old bucket between the old bucket and the
 	 * new bucket, advancing along the old bucket's overflow bucket chain and
@@ -798,8 +958,6 @@ _hash_splitbucket(Relation rel,
 		BlockNumber oblkno;
 		OffsetNumber ooffnum;
 		OffsetNumber omaxoffnum;
-		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
 
 		/* Scan each tuple in old page */
 		omaxoffnum = PageGetMaxOffsetNumber(opage);
@@ -810,33 +968,56 @@ _hash_splitbucket(Relation rel,
 			IndexTuple	itup;
 			Size		itemsz;
 			Bucket		bucket;
+			bool		found = false;
 
 			/* skip dead tuples */
 			if (ItemIdIsDead(PageGetItemId(opage, ooffnum)))
 				continue;
 
 			/*
-			 * Fetch the item's hash key (conveniently stored in the item) and
-			 * determine which bucket it now belongs in.
+			 * Before inserting a tuple, probe the hash table containing TIDs
+			 * of tuples belonging to the new bucket; if we find a match, skip
+			 * that tuple.  Otherwise, fetch the item's hash key (conveniently
+			 * stored in the item) and determine which bucket it now belongs
+			 * in.
 			 */
 			itup = (IndexTuple) PageGetItem(opage,
 											PageGetItemId(opage, ooffnum));
+
+			if (htab)
+				(void) hash_search(htab, &itup->t_tid, HASH_FIND, &found);
+
+			if (found)
+				continue;
+
 			bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
 										  maxbucket, highmask, lowmask);
 
 			if (bucket == nbucket)
 			{
+				Size		itupsize = 0;
+				IndexTuple	new_itup;
+
+				/*
+				 * make a copy of index tuple as we have to scribble on it.
+				 */
+				new_itup = CopyIndexTuple(itup);
+
+				/*
+				 * mark the index tuple as moved by split, such tuples are
+				 * skipped by scan if there is split in progress for a bucket.
+				 */
+				itupsize = new_itup->t_info & INDEX_SIZE_MASK;
+				new_itup->t_info &= ~INDEX_SIZE_MASK;
+				new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
+				new_itup->t_info |= itupsize;
+
 				/*
 				 * insert the tuple into the new bucket.  if it doesn't fit on
 				 * the current page in the new bucket, we must allocate a new
 				 * overflow page and place the tuple on that page instead.
-				 *
-				 * XXX we have a problem here if we fail to get space for a
-				 * new overflow page: we'll error out leaving the bucket split
-				 * only partially complete, meaning the index is corrupt,
-				 * since searches may fail to find entries they should find.
 				 */
-				itemsz = IndexTupleDSize(*itup);
+				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
 				if (PageGetFreeSpace(npage) < itemsz)
@@ -844,9 +1025,9 @@ _hash_splitbucket(Relation rel,
 					/* write out nbuf and drop lock, but keep pin */
 					_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
 					/* chain to a new overflow page */
-					nbuf = _hash_addovflpage(rel, metabuf, nbuf);
+					nbuf = _hash_addovflpage(rel, metabuf, nbuf, (nbuf == bucket_nbuf) ? true : false);
 					npage = BufferGetPage(nbuf);
-					/* we don't need nopaque within the loop */
+					nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 				}
 
 				/*
@@ -856,12 +1037,10 @@ _hash_splitbucket(Relation rel,
 				 * Possible future improvement: accumulate all the items for
 				 * the new page and qsort them before insertion.
 				 */
-				(void) _hash_pgaddtup(rel, nbuf, itemsz, itup);
+				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
-				/*
-				 * Mark tuple for deletion from old page.
-				 */
-				deletable[ndeletable++] = ooffnum;
+				/* be tidy */
+				pfree(new_itup);
 			}
 			else
 			{
@@ -874,15 +1053,9 @@ _hash_splitbucket(Relation rel,
 
 		oblkno = oopaque->hasho_nextblkno;
 
-		/*
-		 * Done scanning this old page.  If we moved any tuples, delete them
-		 * from the old page.
-		 */
-		if (ndeletable > 0)
-		{
-			PageIndexMultiDelete(opage, deletable, ndeletable);
-			_hash_wrtbuf(rel, obuf);
-		}
+		/* retain the pin on the old primary bucket */
+		if (obuf == bucket_obuf)
+			_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
 		else
 			_hash_relbuf(rel, obuf);
 
@@ -891,18 +1064,153 @@ _hash_splitbucket(Relation rel,
 			break;
 
 		/* Else, advance to next old page */
-		obuf = _hash_getbuf(rel, oblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
+		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
 		opage = BufferGetPage(obuf);
 		oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 	}
 
 	/*
 	 * We're at the end of the old bucket chain, so we're done partitioning
-	 * the tuples.  Before quitting, call _hash_squeezebucket to ensure the
-	 * tuples remaining in the old bucket (including the overflow pages) are
-	 * packed as tightly as possible.  The new bucket is already tight.
+	 * the tuples.  Mark the old and new buckets to indicate split is
+	 * finished.
+	 *
+	 * To avoid deadlocks due to locking order of buckets, first lock the old
+	 * bucket and then the new bucket.
 	 */
-	_hash_wrtbuf(rel, nbuf);
+	if (nbuf == bucket_nbuf)
+		_hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_wrtbuf(rel, nbuf);
+
+	/*
+	 * Acquiring cleanup lock to clear the split-in-progress flag ensures that
+	 * there is no pending scan that has seen the flag after it is cleared.
+	 */
+	_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+	opage = BufferGetPage(bucket_obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	_hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);
+	npage = BufferGetPage(bucket_nbuf);
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
+	/* indicate that split is finished */
+	oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
+	nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;
+
+	/*
+	 * now write the buffers, here we don't release the locks as caller is
+	 * responsible to release locks.
+	 */
+	MarkBufferDirty(bucket_obuf);
+	MarkBufferDirty(bucket_nbuf);
+}
+
+/*
+ *	_hash_finish_split() -- Finish the previously interrupted split operation
+ *
+ * To complete the split operation, we form a hash table of TIDs that are
+ * already present in the new bucket; the split operation then uses it to
+ * skip tuples that were moved before it was interrupted.
+ *
+ * The caller must hold a pin, but no lock, on the metapage buffer.
+ * The buffer is returned in the same state.  (The metapage is only
+ * touched if it becomes necessary to add or remove overflow pages.)
+ *
+ * 'obuf' and 'nbuf' must be locked by the caller which is also responsible
+ * for unlocking them.
+ */
+void
+_hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Buffer nbuf,
+				   uint32 maxbucket, uint32 highmask, uint32 lowmask)
+{
+	HASHCTL		hash_ctl;
+	HTAB	   *tidhtab;
+	Buffer		bucket_nbuf;
+	Page		opage;
+	Page		npage;
+	HashPageOpaque opageopaque;
+	HashPageOpaque npageopaque;
+	Bucket		obucket;
+	Bucket		nbucket;
+	bool		found;
+
+	/* Initialize hash tables used to track TIDs */
+	memset(&hash_ctl, 0, sizeof(hash_ctl));
+	hash_ctl.keysize = sizeof(ItemPointerData);
+	hash_ctl.entrysize = sizeof(ItemPointerData);
+	hash_ctl.hcxt = CurrentMemoryContext;
+
+	tidhtab =
+		hash_create("bucket ctids",
+					256,		/* arbitrary initial size */
+					&hash_ctl,
+					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+	/*
+	 * Scan the new bucket and build hash table of TIDs
+	 */
+	bucket_nbuf = nbuf;
+	npage = BufferGetPage(nbuf);
+	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	for (;;)
+	{
+		BlockNumber nblkno;
+		OffsetNumber noffnum;
+		OffsetNumber nmaxoffnum;
+
+		/* Scan each tuple in new page */
+		nmaxoffnum = PageGetMaxOffsetNumber(npage);
+		for (noffnum = FirstOffsetNumber;
+			 noffnum <= nmaxoffnum;
+			 noffnum = OffsetNumberNext(noffnum))
+		{
+			IndexTuple	itup;
+
+			/* Fetch the item's TID and insert it in hash table. */
+			itup = (IndexTuple) PageGetItem(npage,
+											PageGetItemId(npage, noffnum));
+
+			(void) hash_search(tidhtab, &itup->t_tid, HASH_ENTER, &found);
+
+			Assert(!found);
+		}
+
+		nblkno = npageopaque->hasho_nextblkno;
+
+		/*
+	 * release our write lock without modifying the buffer, and retain
+	 * the pin on the primary bucket.
+		 */
+		if (nbuf == bucket_nbuf)
+			_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, nbuf);
+
+		/* Exit loop if no more overflow pages in new bucket */
+		if (!BlockNumberIsValid(nblkno))
+			break;
+
+		/* Else, advance to next page */
+		nbuf = _hash_getbuf(rel, nblkno, HASH_READ, LH_OVERFLOW_PAGE);
+		npage = BufferGetPage(nbuf);
+		npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	}
+
+	/* Need a cleanup lock to perform split operation. */
+	LockBufferForCleanup(bucket_nbuf);
+
+	npage = BufferGetPage(bucket_nbuf);
+	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nbucket = npageopaque->hasho_bucket;
+
+	opage = BufferGetPage(obuf);
+	opageopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+	obucket = opageopaque->hasho_bucket;
+
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, bucket_nbuf, tidhtab,
+						   maxbucket, highmask, lowmask);
 
-	_hash_squeezebucket(rel, obucket, start_oblkno, NULL);
+	hash_destroy(tidhtab);
 }
diff --git a/src/backend/access/hash/hashscan.c b/src/backend/access/hash/hashscan.c
deleted file mode 100644
index fe97ef2..0000000
--- a/src/backend/access/hash/hashscan.c
+++ /dev/null
@@ -1,153 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * hashscan.c
- *	  manage scans on hash tables
- *
- * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
- * Portions Copyright (c) 1994, Regents of the University of California
- *
- *
- * IDENTIFICATION
- *	  src/backend/access/hash/hashscan.c
- *
- *-------------------------------------------------------------------------
- */
-
-#include "postgres.h"
-
-#include "access/hash.h"
-#include "access/relscan.h"
-#include "utils/memutils.h"
-#include "utils/rel.h"
-#include "utils/resowner.h"
-
-
-/*
- * We track all of a backend's active scans on hash indexes using a list
- * of HashScanListData structs, which are allocated in TopMemoryContext.
- * It's okay to use a long-lived context because we rely on the ResourceOwner
- * mechanism to clean up unused entries after transaction or subtransaction
- * abort.  We can't safely keep the entries in the executor's per-query
- * context, because that might be already freed before we get a chance to
- * clean up the list.  (XXX seems like there should be a better way to
- * manage this...)
- */
-typedef struct HashScanListData
-{
-	IndexScanDesc hashsl_scan;
-	ResourceOwner hashsl_owner;
-	struct HashScanListData *hashsl_next;
-} HashScanListData;
-
-typedef HashScanListData *HashScanList;
-
-static HashScanList HashScans = NULL;
-
-
-/*
- * ReleaseResources_hash() --- clean up hash subsystem resources.
- *
- * This is here because it needs to touch this module's static var HashScans.
- */
-void
-ReleaseResources_hash(void)
-{
-	HashScanList l;
-	HashScanList prev;
-	HashScanList next;
-
-	/*
-	 * Release all HashScanList items belonging to the current ResourceOwner.
-	 * Note that we do not release the underlying IndexScanDesc; that's in
-	 * executor memory and will go away on its own (in fact quite possibly has
-	 * gone away already, so we mustn't try to touch it here).
-	 *
-	 * Note: this should be a no-op during normal query shutdown. However, in
-	 * an abort situation ExecutorEnd is not called and so there may be open
-	 * index scans to clean up.
-	 */
-	prev = NULL;
-
-	for (l = HashScans; l != NULL; l = next)
-	{
-		next = l->hashsl_next;
-		if (l->hashsl_owner == CurrentResourceOwner)
-		{
-			if (prev == NULL)
-				HashScans = next;
-			else
-				prev->hashsl_next = next;
-
-			pfree(l);
-			/* prev does not change */
-		}
-		else
-			prev = l;
-	}
-}
-
-/*
- *	_hash_regscan() -- register a new scan.
- */
-void
-_hash_regscan(IndexScanDesc scan)
-{
-	HashScanList new_el;
-
-	new_el = (HashScanList) MemoryContextAlloc(TopMemoryContext,
-											   sizeof(HashScanListData));
-	new_el->hashsl_scan = scan;
-	new_el->hashsl_owner = CurrentResourceOwner;
-	new_el->hashsl_next = HashScans;
-	HashScans = new_el;
-}
-
-/*
- *	_hash_dropscan() -- drop a scan from the scan list
- */
-void
-_hash_dropscan(IndexScanDesc scan)
-{
-	HashScanList chk,
-				last;
-
-	last = NULL;
-	for (chk = HashScans;
-		 chk != NULL && chk->hashsl_scan != scan;
-		 chk = chk->hashsl_next)
-		last = chk;
-
-	if (chk == NULL)
-		elog(ERROR, "hash scan list trashed; cannot find 0x%p", (void *) scan);
-
-	if (last == NULL)
-		HashScans = chk->hashsl_next;
-	else
-		last->hashsl_next = chk->hashsl_next;
-
-	pfree(chk);
-}
-
-/*
- * Is there an active scan in this bucket?
- */
-bool
-_hash_has_active_scan(Relation rel, Bucket bucket)
-{
-	Oid			relid = RelationGetRelid(rel);
-	HashScanList l;
-
-	for (l = HashScans; l != NULL; l = l->hashsl_next)
-	{
-		if (relid == l->hashsl_scan->indexRelation->rd_id)
-		{
-			HashScanOpaque so = (HashScanOpaque) l->hashsl_scan->opaque;
-
-			if (so->hashso_bucket_valid &&
-				so->hashso_bucket == bucket)
-				return true;
-		}
-	}
-
-	return false;
-}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..21954c2 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -67,12 +67,25 @@ _hash_next(IndexScanDesc scan, ScanDirection dir)
  */
 static void
 _hash_readnext(Relation rel,
-			   Buffer *bufp, Page *pagep, HashPageOpaque *opaquep)
+			   Buffer *bufp, Page *pagep, HashPageOpaque *opaquep,
+			   bool primary_buc_page)
 {
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_nextblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on primary bucket page till the end of scan to ensure
+	 * that vacuum can't delete the tuples that are moved by split to new
+	 * bucket.  Such tuples are required by the scans that are started on
+	 * buckets where split is in progress, before a new bucket's split in
+	 * progress flag (LH_BUCKET_NEW_PAGE_SPLIT) is cleared.
+	 */
+	if (primary_buc_page)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -89,12 +102,22 @@ _hash_readnext(Relation rel,
  */
 static void
 _hash_readprev(Relation rel,
-			   Buffer *bufp, Page *pagep, HashPageOpaque *opaquep)
+			   Buffer *bufp, Page *pagep, HashPageOpaque *opaquep,
+			   bool primary_buc_page)
 {
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_prevblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on primary bucket page till the end of scan. See
+	 * comments in _hash_readnext to know the reason of retaining pin.
+	 */
+	if (primary_buc_page)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -104,6 +127,13 @@ _hash_readprev(Relation rel,
 							 LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 		*pagep = BufferGetPage(*bufp);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
+
+		/*
+		 * We always maintain the pin on bucket page for whole scan operation,
+		 * so releasing the additional pin we have acquired here.
+		 */
+		if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+			_hash_dropbuf(rel, *bufp);
 	}
 }
 
@@ -218,9 +248,11 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 		{
 			if (oldblkno == blkno)
 				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
+			_hash_relbuf(rel, buf);
 		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+
+		/* Fetch the primary bucket page for the bucket */
+		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
 
 		/*
 		 * Reacquire metapage lock and check that no bucket split has taken
@@ -234,22 +266,64 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	/* done with the metapage */
 	_hash_dropbuf(rel, metabuf);
 
-	/* Update scan opaque state to show we have lock on the bucket */
-	so->hashso_bucket = bucket;
-	so->hashso_bucket_valid = true;
-	so->hashso_bucket_blkno = blkno;
-
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
 	page = BufferGetPage(buf);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(opaque->hasho_bucket == bucket);
 
+	so->hashso_bucket_buf = buf;
+
+	/*
+	 * If the bucket split is in progress, then we need to skip tuples that
+	 * are moved from the old bucket.  To ensure that vacuum doesn't clean
+	 * any tuples from the old or new buckets while this scan is in progress,
+	 * maintain a pin on both of the buckets.  Here, we have to be cautious
+	 * about the locking order: release the lock on the new bucket, acquire
+	 * and release the lock on the old bucket (but keep its pin), then
+	 * re-acquire the lock on the new bucket and re-verify whether the bucket
+	 * split is still in progress.  Acquiring the lock (and pin) on the old
+	 * bucket first ensures that vacuum waits for this scan to finish.
+	 */
+	if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+	{
+		BlockNumber old_blkno;
+		Buffer		old_buf;
+
+		old_blkno = _hash_get_newbucket_oldblock(rel, opaque->hasho_bucket);
+
+		/*
+		 * release the lock on new bucket and re-acquire it after acquiring
+		 * the lock on old bucket.
+		 */
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+
+		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+
+		/*
+		 * remember the old bucket buffer so as to use it later for scanning.
+		 */
+		so->hashso_old_bucket_buf = old_buf;
+		_hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK);
+
+		_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		Assert(opaque->hasho_bucket == bucket);
+
+		if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+			so->hashso_skip_moved_tuples = true;
+		else
+		{
+			_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+			so->hashso_old_bucket_buf = InvalidBuffer;
+		}
+	}
+
 	/* If a backwards scan is requested, move to the end of the chain */
 	if (ScanDirectionIsBackward(dir))
 	{
 		while (BlockNumberIsValid(opaque->hasho_nextblkno))
-			_hash_readnext(rel, &buf, &page, &opaque);
+			_hash_readnext(rel, &buf, &page, &opaque,
+					   (opaque->hasho_flag & LH_BUCKET_PAGE) ? true : false);
 	}
 
 	/* Now find the first tuple satisfying the qualification */
@@ -273,6 +347,13 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
  *		false.  Else, return true and set the hashso_curpos for the
  *		scan to the right thing.
  *
+ *		Here we also scan the old bucket if a split of the current bucket
+ *		was in progress at the start of the scan.  The basic idea is to
+ *		skip the tuples that were moved by the split while scanning the
+ *		current bucket and then scan the old bucket to cover all such
+ *		tuples.  This ensures that we don't miss any tuples in scans that
+ *		started during the split.
+ *
  *		'bufP' points to the current buffer, which is pinned and read-locked.
  *		On success exit, we have pin and read-lock on whichever page
  *		contains the right item; on failure, we have released all buffers.
@@ -338,6 +419,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum >= FirstOffsetNumber);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip the tuples that were moved by the split
+						 * operation, for a scan that started while the split
+						 * was in progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberNext(offnum);	/* move forward */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -345,7 +439,8 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					/*
 					 * ran off the end of this page, try the next
 					 */
-					_hash_readnext(rel, &buf, &page, &opaque);
+					_hash_readnext(rel, &buf, &page, &opaque,
+					   (opaque->hasho_flag & LH_BUCKET_PAGE) ? true : false);
 					if (BufferIsValid(buf))
 					{
 						maxoff = PageGetMaxOffsetNumber(page);
@@ -353,9 +448,42 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							buf = so->hashso_old_bucket_buf;
+
+							/*
+							 * old bucket buffer must be valid as we acquire
+							 * the pin on it before the start of the scan and
+							 * retain it till the end of the scan.
+							 */
+							Assert(BufferIsValid(buf));
+
+							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+							page = BufferGetPage(buf);
+							opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * Setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for moved-by-split
+							 * tuples in the old bucket and that we won't
+							 * retry the scan of the old bucket once that
+							 * scan is finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
@@ -379,6 +507,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum <= maxoff);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip the tuples that were moved by the split
+						 * operation, for a scan that started while the split
+						 * was in progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberPrev(offnum);	/* move back */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -386,7 +527,8 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					/*
 					 * ran off the end of this page, try the next
 					 */
-					_hash_readprev(rel, &buf, &page, &opaque);
+					_hash_readprev(rel, &buf, &page, &opaque,
+					   (opaque->hasho_flag & LH_BUCKET_PAGE) ? true : false);
 					if (BufferIsValid(buf))
 					{
 						maxoff = PageGetMaxOffsetNumber(page);
@@ -394,9 +536,42 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							buf = so->hashso_old_bucket_buf;
+
+							/*
+							 * old bucket buffer must be valid as we acquire
+							 * the pin on it before the start of the scan and
+							 * retain it till the end of the scan.
+							 */
+							Assert(BufferIsValid(buf));
+
+							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+							page = BufferGetPage(buf);
+							opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * Setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for moved-by-split
+							 * tuples in the old bucket and that we won't
+							 * retry the scan of the old bucket once that
+							 * scan is finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
@@ -410,9 +585,16 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 
 		if (itup == NULL)
 		{
-			/* we ran off the end of the bucket without finding a match */
+			/*
+			 * We ran off the end of the bucket without finding a match.
+			 * Release the pin on bucket buffers.  Normally, such pins are
+			 * released at the end of the scan; however, scrollable cursors
+			 * can reacquire the bucket lock and pin multiple times within
+			 * the same scan.
+			 */
 			*bufP = so->hashso_curbuf = InvalidBuffer;
 			ItemPointerSetInvalid(current);
+			_hash_dropscanbuf(rel, so);
 			return false;
 		}
 
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 822862d..74c50db 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -352,3 +352,118 @@ _hash_binsearch_last(Page page, uint32 hash_value)
 
 	return lower;
 }
+
+/*
+ *	_hash_get_newbucket_oldblock() -- get the block number of a bucket from which
+ *			current (new) bucket is being split.
+ */
+BlockNumber
+_hash_get_newbucket_oldblock(Relation rel, Bucket new_bucket)
+{
+	Bucket		old_bucket;
+	uint32		mask;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	/*
+	 * To get the old bucket from the current bucket, we need a mask to modulo
+	 * into lower half of table.  This mask is stored in meta page as
+	 * hashm_lowmask, but here we can't rely on the same, because we need a
+	 * value of lowmask that was prevalent at the time when bucket split was
+	 * started.  Masking the most significant bit of new bucket would give us
+	 * old bucket.
+	 */
+	mask = (((uint32) 1) << (fls(new_bucket) - 1)) - 1;
+	old_bucket = new_bucket & mask;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	blkno = BUCKET_TO_BLKNO(metap, old_bucket);
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_oldbucket_newblock() -- get the block number of a bucket that
+ *			will be generated after split from old bucket.
+ *
+ * This is used to find the new bucket from old bucket based on current table
+ * half.  It is mainly required to finish the incomplete splits where we are
+ * sure that not more than one bucket could have split in progress from old
+ * bucket.
+ */
+BlockNumber
+_hash_get_oldbucket_newblock(Relation rel, Bucket old_bucket)
+{
+	Bucket		new_bucket;
+	uint32		lowmask;
+	uint32		mask;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	/*
+	 * The new bucket can be obtained by OR'ing the old bucket with the most
+	 * significant bit of the current table half.  Multiple buckets may have
+	 * split from the current (old) bucket; we need the first such bucket
+	 * that exists based on the current table half.
+	 */
+	lowmask = metap->hashm_lowmask;
+
+	for (;;)
+	{
+		mask = lowmask + 1;
+		new_bucket = old_bucket | mask;
+		if (new_bucket > metap->hashm_maxbucket)
+		{
+			lowmask = lowmask >> 1;
+			continue;
+		}
+		blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+		break;
+	}
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_oldbucket_newbucket() -- get the new bucket that will be
+ *			generated after split from current (old) bucket.
+ *
+ * This is used to find the new bucket from the old bucket.  The new bucket
+ * can be obtained by OR'ing the old bucket with the most significant bit of
+ * the table half for the lowmask passed to this function.  Multiple buckets
+ * may have split from the current (old) bucket; we need the first such
+ * bucket that exists.  The caller must ensure that no more than one split
+ * has happened from the old bucket.
+ */
+Bucket
+_hash_get_oldbucket_newbucket(Relation rel, Bucket old_bucket,
+							  uint32 lowmask, uint32 maxbucket)
+{
+	Bucket		new_bucket;
+	uint32		mask;
+
+	for (;;)
+	{
+		mask = lowmask + 1;
+		new_bucket = old_bucket | mask;
+		if (new_bucket > maxbucket)
+		{
+			lowmask = lowmask >> 1;
+			continue;
+		}
+		break;
+	}
+
+	return new_bucket;
+}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 07075ce..cdc460b 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -668,9 +668,6 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintFileLeakWarning(res);
 			FileClose(res);
 		}
-
-		/* Clean up index scans too */
-		ReleaseResources_hash();
 	}
 
 	/* Let add-on modules get a chance too */
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 725e2f2..9a5e983 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -24,6 +24,7 @@
 #include "lib/stringinfo.h"
 #include "storage/bufmgr.h"
 #include "storage/lockdefs.h"
+#include "utils/hsearch.h"
 #include "utils/relcache.h"
 
 /*
@@ -32,6 +33,8 @@
  */
 typedef uint32 Bucket;
 
+#define InvalidBucket	((Bucket) 0xFFFFFFFF)
+
 #define BUCKET_TO_BLKNO(metap,B) \
 		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
 
@@ -51,6 +54,9 @@ typedef uint32 Bucket;
 #define LH_BUCKET_PAGE			(1 << 1)
 #define LH_BITMAP_PAGE			(1 << 2)
 #define LH_META_PAGE			(1 << 3)
+#define LH_BUCKET_NEW_PAGE_SPLIT	(1 << 4)
+#define LH_BUCKET_OLD_PAGE_SPLIT	(1 << 5)
+#define LH_BUCKET_PAGE_HAS_GARBAGE	(1 << 6)
 
 typedef struct HashPageOpaqueData
 {
@@ -63,6 +69,12 @@ typedef struct HashPageOpaqueData
 
 typedef HashPageOpaqueData *HashPageOpaque;
 
+#define H_HAS_GARBAGE(opaque)			((opaque)->hasho_flag & LH_BUCKET_PAGE_HAS_GARBAGE)
+#define H_OLD_INCOMPLETE_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
+#define H_NEW_INCOMPLETE_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+#define H_INCOMPLETE_SPLIT(opaque)		(((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT) || \
+										 ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT))
+
 /*
  * The page ID is for the convenience of pg_filedump and similar utilities,
  * which otherwise would have a hard time telling pages of different index
@@ -80,19 +92,6 @@ typedef struct HashScanOpaqueData
 	uint32		hashso_sk_hash;
 
 	/*
-	 * By definition, a hash scan should be examining only one bucket. We
-	 * record the bucket number here as soon as it is known.
-	 */
-	Bucket		hashso_bucket;
-	bool		hashso_bucket_valid;
-
-	/*
-	 * If we have a share lock on the bucket, we record it here.  When
-	 * hashso_bucket_blkno is zero, we have no such lock.
-	 */
-	BlockNumber hashso_bucket_blkno;
-
-	/*
 	 * We also want to remember which buffer we're currently examining in the
 	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
 	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
@@ -100,11 +99,23 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_curbuf;
 
+	/* remember the buffer associated with primary bucket */
+	Buffer		hashso_bucket_buf;
+
+	/*
+	 * remember the buffer associated with old primary bucket which is
+	 * required during the scan of the bucket for which split is in progress.
+	 */
+	Buffer		hashso_old_bucket_buf;
+
 	/* Current position of the scan, as an index TID */
 	ItemPointerData hashso_curpos;
 
 	/* Current position of the scan, as a heap TID */
 	ItemPointerData hashso_heappos;
+
+	/* Whether scan needs to skip tuples that are moved by split */
+	bool		hashso_skip_moved_tuples;
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -175,6 +186,8 @@ typedef HashMetaPageData *HashMetaPage;
 				  sizeof(ItemIdData) - \
 				  MAXALIGN(sizeof(HashPageOpaqueData)))
 
+#define INDEX_MOVED_BY_SPLIT_MASK	0x2000
+
 #define HASH_MIN_FILLFACTOR			10
 #define HASH_DEFAULT_FILLFACTOR		75
 
@@ -223,9 +236,6 @@ typedef HashMetaPageData *HashMetaPage;
 #define HASH_WRITE		BUFFER_LOCK_EXCLUSIVE
 #define HASH_NOLOCK		(-1)
 
-#define HASH_SHARE		ShareLock
-#define HASH_EXCLUSIVE	ExclusiveLock
-
 /*
  *	Strategy number. There's only one valid strategy for hashing: equality.
  */
@@ -297,21 +307,21 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
 
 /* hashovfl.c */
-extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf);
+extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
 extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
-				   BufferAccessStrategy bstrategy);
+				   BlockNumber bucket_blkno, BufferAccessStrategy bstrategy);
 extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
 				 BlockNumber blkno, ForkNumber forkNum);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy);
 
 /* hashpage.c */
-extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
-extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
-extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
 extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
 			 int access, int flags);
+extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
+								   BlockNumber blkno, int flags);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
@@ -320,6 +330,7 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 						   BufferAccessStrategy bstrategy);
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
+extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
 extern void _hash_wrtbuf(Relation rel, Buffer buf);
 extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
 				   int to_access);
@@ -327,12 +338,9 @@ extern uint32 _hash_metapinit(Relation rel, double num_tuples,
 				ForkNumber forkNum);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
-
-/* hashscan.c */
-extern void _hash_regscan(IndexScanDesc scan);
-extern void _hash_dropscan(IndexScanDesc scan);
-extern bool _hash_has_active_scan(Relation rel, Bucket bucket);
-extern void ReleaseResources_hash(void);
+extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
+				   Buffer nbuf, uint32 maxbucket, uint32 highmask,
+				   uint32 lowmask);
 
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
@@ -362,5 +370,17 @@ extern bool _hash_convert_tuple(Relation index,
 					Datum *index_values, bool *index_isnull);
 extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
 extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
+extern BlockNumber _hash_get_newbucket_oldblock(Relation rel, Bucket new_bucket);
+extern BlockNumber _hash_get_oldbucket_newblock(Relation rel, Bucket old_bucket);
+extern Bucket _hash_get_oldbucket_newbucket(Relation rel, Bucket old_bucket,
+							  uint32 lowmask, uint32 maxbucket);
+
+/* hash.c */
+extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
+ Buffer bucket_buf, BlockNumber bucket_blkno, BufferAccessStrategy bstrategy,
+				  uint32 maxbucket, uint32 highmask, uint32 lowmask,
+				  double *tuples_removed, double *num_index_tuples,
+				  bool bucket_has_garbage, bool delay,
+				  IndexBulkDeleteCallback callback, void *callback_state);
 
 #endif   /* HASH_H */
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 8350fa0..788ba9f 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -63,7 +63,7 @@ typedef IndexAttributeBitMapData *IndexAttributeBitMap;
  * t_info manipulation macros
  */
 #define INDEX_SIZE_MASK 0x1FFF
-/* bit 0x2000 is not used at present */
+/* bit 0x2000 is reserved for index-AM specific usage */
 #define INDEX_VAR_MASK	0x4000
 #define INDEX_NULL_MASK 0x8000
 
#135Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#134)
Re: Hash Indexes

On Mon, Nov 7, 2016 at 9:51 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Both _hash_squeezebucket() and _hash_splitbucket() can use this
optimization irrespective of the rest of the patch. I will prepare a
separate patch for these and send it along with the main patch after
some testing.

Done as a separate patch skip_dead_tups_hash_index-v1.patch.

Thanks. Committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#136Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#134)
1 attachment(s)
Re: Hash Indexes

On Mon, Nov 7, 2016 at 9:51 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ new patches ]

Attached is yet another incremental patch with some suggested changes.

+ * This expects that the caller has acquired a cleanup lock on the target
+ * bucket (primary page of a bucket) and it is reponsibility of caller to
+ * release that lock.

This is confusing, because it makes it sound like we retain the lock
through the entire execution of the function, which isn't always true.
I would say that caller must acquire a cleanup lock on the target
primary bucket page before calling this function, and that on return
that page will again be write-locked. However, the lock might be
temporarily released in the meantime, which visiting overflow pages.
(Attached patch has a suggested rewrite.)

+ * During scan of overflow pages, first we need to lock the next bucket and
+ * then release the lock on current bucket.  This ensures that any concurrent
+ * scan started after we start cleaning the bucket will always be behind the
+ * cleanup.  Allowing scans to cross vacuum will allow it to remove tuples
+ * required for sanctity of scan.

This comment says that it's bad if other scans can pass our cleanup
scan, but it doesn't explain why. I think it's because we don't have
page-at-a-time mode yet, and cleanup might decrease the TIDs for
existing index entries. (Attached patch has a suggested rewrite, but
might need further adjustment if my understanding of the reasons is
incomplete.)
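
To make the ordering concrete, here is a minimal sketch of the lock
chaining being described (simplified, not the patch's actual code;
it assumes rel, buf and opaque from the surrounding page loop, and
omits the end-of-chain check and error handling):

    BlockNumber next_blkno;
    Buffer      next_buf;

    /* lock the next overflow page before giving up the current one */
    next_blkno = opaque->hasho_nextblkno;
    next_buf = _hash_getbuf(rel, next_blkno, HASH_WRITE, LH_OVERFLOW_PAGE);

    /* only now can a concurrent scan enter the page we are leaving */
    _hash_relbuf(rel, buf);
    buf = next_buf;

A scan that is behind the cleanup stays behind it, because by the time
it can acquire the page the cleanup just left, the cleanup already
holds the next page in the chain.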

+ if (delay)
+ vacuum_delay_point();

You don't really need "delay". If we're not in a cost-accounted
VACUUM, vacuum_delay_point() just turns into CHECK_FOR_INTERRUPTS(),
which should be safe (and a good idea) regardless. (Fixed in
attached.)

+            if (callback && callback(htup, callback_state))
+            {
+                /* mark the item for deletion */
+                deletable[ndeletable++] = offno;
+                if (tuples_removed)
+                    *tuples_removed += 1;
+            }
+            else if (bucket_has_garbage)
+            {
+                /* delete the tuples that are moved by split. */
+                bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+                                              maxbucket,
+                                              highmask,
+                                              lowmask);
+                /* mark the item for deletion */
+                if (bucket != cur_bucket)
+                {
+                    /*
+                     * We expect tuples to either belong to curent bucket or
+                     * new_bucket.  This is ensured because we don't allow
+                     * further splits from bucket that contains garbage. See
+                     * comments in _hash_expandtable.
+                     */
+                    Assert(bucket == new_bucket);
+                    deletable[ndeletable++] = offno;
+                }
+                else if (num_index_tuples)
+                    *num_index_tuples += 1;
+            }
+            else if (num_index_tuples)
+                *num_index_tuples += 1;
+        }

OK, a couple things here. First, it seems like we could also delete
any tuples where ItemIdIsDead, and that seems worth doing. In fact, I
think we should check it prior to invoking the callback, because it's
probably quite substantially cheaper than the callback. Second,
repeating deletable[ndeletable++] = offno and *num_index_tuples += 1
doesn't seem very clean to me; I think we should introduce a new bool
to track whether we're keeping the tuple or killing it, and then use
that to drive which of those things we do. (Fixed in attached.)

+        if (H_HAS_GARBAGE(bucket_opaque) &&
+            !H_INCOMPLETE_SPLIT(bucket_opaque))

This is the only place in the entire patch that use
H_INCOMPLETE_SPLIT(), and I'm wondering if that's really correct even
here. Don't you really want H_OLD_INCOMPLETE_SPLIT() here? (And
couldn't we then remove H_INCOMPLETE_SPLIT() itself?) There's no
garbage to be removed from the "new" bucket until the next split, when
it will take on the role of the "old" bucket.

I think it would be a good idea to change this so that
LH_BUCKET_PAGE_HAS_GARBAGE doesn't get set until
LH_BUCKET_OLD_PAGE_SPLIT is cleared. The current way is confusing,
because those tuples are NOT garbage until the split is completed!
Moreover, both of the places that care about
LH_BUCKET_PAGE_HAS_GARBAGE need to make sure that
LH_BUCKET_OLD_PAGE_SPLIT is clear before they do anything about
LH_BUCKET_PAGE_HAS_GARBAGE, so the change I am proposing would
actually simplify the code very slightly.
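
Concretely, with the current arrangement a call site has to test both
flags, whereas under the proposed ordering the garbage flag alone would
suffice; a rough sketch using the patch's macros (hypothetical call
sites, not committed code):

    /* today: garbage flag is only meaningful once the old bucket's
     * split-in-progress flag has been cleared */
    if (H_HAS_GARBAGE(bucket_opaque) && !H_OLD_INCOMPLETE_SPLIT(bucket_opaque))
    {
        /* clean up moved-by-split tuples */
    }

    /* proposed: the flag is not set until the split completes,
     * so a single test suffices */
    if (H_HAS_GARBAGE(bucket_opaque))
    {
        /* clean up moved-by-split tuples */
    }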

+#define H_OLD_INCOMPLETE_SPLIT(opaque)  ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
+#define H_NEW_INCOMPLETE_SPLIT(opaque)  ((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)

The code isn't consistent about the use of these macros, or at least
not in a good way. When you care about LH_BUCKET_OLD_PAGE_SPLIT, you
test it using the macro; when you care about H_NEW_INCOMPLETE_SPLIT,
you ignore the macro and test it directly. Either get rid of both
macros and always test directly, or keep both macros and use both of
them. Using a macro for one but not the other is strange.

I wonder if we should rename these flags and macros. Maybe
LH_BUCKET_OLD_PAGE_SPLIT -> LH_BEING_SPLIT and
LH_BUCKET_NEW_PAGE_SPLIT -> LH_BEING_POPULATED. I think that might be
clearer. When LH_BEING_POPULATED is set, the bucket is being filled -
that is, populated - from the old bucket. And maybe
LH_BUCKET_PAGE_HAS_GARBAGE -> LH_NEEDS_SPLIT_CLEANUP, too.

+         * Copy bucket mapping info now;  The comment in _hash_expandtable
+         * where we copy this information and calls _hash_splitbucket explains
+         * why this is OK.

After a semicolon, the next word should not be capitalized. There
shouldn't be two spaces after a semicolon, either. Also,
_hash_splitbucket appears to have a verb before it (calls) and a verb
after it (explains) and I have no idea what that means.

+    for (;;)
+    {
+        mask = lowmask + 1;
+        new_bucket = old_bucket | mask;
+        if (new_bucket > metap->hashm_maxbucket)
+        {
+            lowmask = lowmask >> 1;
+            continue;
+        }
+        blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+        break;
+    }

I can't help feeling that it should be possible to do this without
looping. Can we ever loop more than once? How? Can we just use an
if-then instead of a for-loop?
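
For what it's worth, a hypothetical set of values (invented for
illustration, not taken from the patch) suggests the loop can iterate
more than once.  Take old_bucket = 2, lowmask = 7, hashm_maxbucket = 9:

    first pass:  mask = 8, new_bucket = 2 | 8 = 10  > 9, so halve lowmask
    second pass: mask = 4, new_bucket = 2 | 4 = 6  <= 9, done

i.e. bucket 10 does not exist yet, so we fall back to bucket 6, the
bucket that split from bucket 2 in the previous table half.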

Can't _hash_get_oldbucket_newblock call _hash_get_oldbucket_newbucket
instead of duplicating the logic?

I still don't like the names of these functions very much. If you
said "get X from Y", it would be clear that you put in Y and you get
out X. If you say "X 2 Y", it would be clear that you put in X and
you get out Y. As it is, it's not very clear which is the input and
which is the output.

+ bool primary_buc_page)

I think we could just go with "primary_page" here. (Fixed in attached.)

+    /*
+     * Acquiring cleanup lock to clear the split-in-progress flag ensures that
+     * there is no pending scan that has seen the flag after it is cleared.
+     */
+    _hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+    opage = BufferGetPage(bucket_obuf);
+    oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);

I don't understand the comment, because the code *isn't* acquiring a
cleanup lock.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

more-hash-tweaks.patch
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 7612e5b..d7d21b5 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -287,10 +287,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		/*
 		 * An insertion into the current index page could have happened while
 		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold pin on the
-		 * bucket, no deletions or splits could have occurred; therefore we
-		 * can expect that the TID still exists in the current index page, at
-		 * an offset >= where we were.)
+		 * for the TID we previously returned.  (Because we hold a pin on the
+		 * primary bucket page, no deletions or splits could have occurred;
+		 * therefore we can expect that the TID still exists in the current
+		 * index page, at an offset >= where we were.)
 		 */
 		OffsetNumber maxoffnum;
 
@@ -569,7 +569,7 @@ loop_top:
 						  local_metapage.hashm_maxbucket,
 						  local_metapage.hashm_highmask,
 						  local_metapage.hashm_lowmask, &tuples_removed,
-						  &num_index_tuples, bucket_has_garbage, true,
+						  &num_index_tuples, bucket_has_garbage,
 						  callback, callback_state);
 
 		_hash_relbuf(rel, bucket_buf);
@@ -656,15 +656,21 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 /*
  * Helper function to perform deletion of index entries from a bucket.
  *
- * This expects that the caller has acquired a cleanup lock on the target
- * bucket (primary page of a bucket) and it is reponsibility of caller to
- * release that lock.
+ * This function expects that the caller has acquired a cleanup lock on the
+ * primary bucket page, and will with a write lock again held on the primary
+ * bucket page.  The lock won't necessarily be held continuously, though,
+ * because we'll release it when visiting overflow pages.
  *
- * During scan of overflow pages, first we need to lock the next bucket and
- * then release the lock on current bucket.  This ensures that any concurrent
- * scan started after we start cleaning the bucket will always be behind the
- * cleanup.  Allowing scans to cross vacuum will allow it to remove tuples
- * required for sanctity of scan.
+ * It would be very bad if this function cleaned a page while some other
+ * backend was in the midst of scanning it, because hashgettuple assumes
+ * that the next valid TID will be greater than or equal to the current
+ * valid TID.  There can't be any concurrent scans in progress when we first
+ * enter this function because of the cleanup lock we hold on the primary
+ * bucket page, but as soon as we release that lock, there might be.  We
+ * handle that by conspiring to prevent those scans from passing our cleanup
+ * scan.  To do that, we lock the next page in the bucket chain before
+ * releasing the lock on the previous page.  (This type of lock chaining is
+ * not ideal, so we might want to look for a better solution at some point.)
  *
  * We need to retain a pin on the primary bucket to ensure that no concurrent
  * split can start.
@@ -674,7 +680,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 				  BlockNumber bucket_blkno, BufferAccessStrategy bstrategy,
 				  uint32 maxbucket, uint32 highmask, uint32 lowmask,
 				  double *tuples_removed, double *num_index_tuples,
-				  bool bucket_has_garbage, bool delay,
+				  bool bucket_has_garbage,
 				  IndexBulkDeleteCallback callback, void *callback_state)
 {
 	BlockNumber blkno;
@@ -702,8 +708,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		bool		retain_pin = false;
 		bool		curr_page_dirty = false;
 
-		if (delay)
-			vacuum_delay_point();
+		vacuum_delay_point();
 
 		page = BufferGetPage(buf);
 		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
@@ -714,17 +719,20 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			 offno <= maxoffno;
 			 offno = OffsetNumberNext(offno))
 		{
-			IndexTuple	itup;
 			ItemPointer htup;
+			ItemId		itemid;
+			IndexTuple	itup;
 			Bucket		bucket;
+			bool		kill_tuple = false;
 
-			itup = (IndexTuple) PageGetItem(page,
-											PageGetItemId(page, offno));
+			itemid = PageGetItemId(page, offno);
+			itup = (IndexTuple) PageGetItem(page, itemid);
 			htup = &(itup->t_tid);
-			if (callback && callback(htup, callback_state))
+			if (ItemIdIsDead(itemid))
+				kill_tuple = true;
+			else if (callback && callback(htup, callback_state))
 			{
-				/* mark the item for deletion */
-				deletable[ndeletable++] = offno;
+				kill_tuple = true;
 				if (tuples_removed)
 					*tuples_removed += 1;
 			}
@@ -745,13 +753,21 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 					 * comments in _hash_expandtable.
 					 */
 					Assert(bucket == new_bucket);
-					deletable[ndeletable++] = offno;
+					kill_tuple = true;
 				}
-				else if (num_index_tuples)
+			}
+
+			if (kill_tuple)
+			{
+				/* mark the item for deletion */
+				deletable[ndeletable++] = offno;
+			}
+			else
+			{
+				/* we're keeping it, so count it */
+				if (num_index_tuples)
 					*num_index_tuples += 1;
 			}
-			else if (num_index_tuples)
-				*num_index_tuples += 1;
 		}
 
 		/* retain the pin on primary bucket page till end of bucket scan */
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 7bc6b26..83eba9f 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -667,7 +667,7 @@ restart_expand:
 		hashbucketcleanup(rel, old_bucket, buf_oblkno, start_oblkno, NULL,
 						  metap->hashm_maxbucket, metap->hashm_highmask,
 						  metap->hashm_lowmask, NULL,
-						  NULL, true, false, NULL, NULL);
+						  NULL, true, NULL, NULL);
 
 		_hash_relbuf(rel, buf_oblkno);
 
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 21954c2..a1e3dbc 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -68,7 +68,7 @@ _hash_next(IndexScanDesc scan, ScanDirection dir)
 static void
 _hash_readnext(Relation rel,
 			   Buffer *bufp, Page *pagep, HashPageOpaque *opaquep,
-			   bool primary_buc_page)
+			   bool primary_page)
 {
 	BlockNumber blkno;
 
@@ -81,7 +81,7 @@ _hash_readnext(Relation rel,
 	 * buckets where split is in progress, before a new bucket's split in
 	 * progress flag (LH_BUCKET_NEW_PAGE_SPLIT) is cleared.
 	 */
-	if (primary_buc_page)
+	if (primary_page)
 		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
 	else
 		_hash_relbuf(rel, *bufp);
@@ -103,7 +103,7 @@ _hash_readnext(Relation rel,
 static void
 _hash_readprev(Relation rel,
 			   Buffer *bufp, Page *pagep, HashPageOpaque *opaquep,
-			   bool primary_buc_page)
+			   bool primary_page)
 {
 	BlockNumber blkno;
 
@@ -113,7 +113,7 @@ _hash_readprev(Relation rel,
 	 * Retain the pin on primary bucket page till the end of scan. See
 	 * comments in _hash_readnext to know the reason of retaining pin.
 	 */
-	if (primary_buc_page)
+	if (primary_page)
 		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
 	else
 		_hash_relbuf(rel, *bufp);
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 9a5e983..26d539b 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -377,10 +377,11 @@ extern Bucket _hash_get_oldbucket_newbucket(Relation rel, Bucket old_bucket,
 
 /* hash.c */
 extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
- Buffer bucket_buf, BlockNumber bucket_blkno, BufferAccessStrategy bstrategy,
+				  Buffer bucket_buf, BlockNumber bucket_blkno,
+				  BufferAccessStrategy bstrategy,
 				  uint32 maxbucket, uint32 highmask, uint32 lowmask,
 				  double *tuples_removed, double *num_index_tuples,
-				  bool bucket_has_garbage, bool delay,
+				  bool bucket_has_garbage,
 				  IndexBulkDeleteCallback callback, void *callback_state);
 
 #endif   /* HASH_H */
#137Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#136)
Re: Hash Indexes

On Wed, Nov 9, 2016 at 1:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Nov 7, 2016 at 9:51 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ new patches ]

Attached is yet another incremental patch with some suggested changes.

+ * This expects that the caller has acquired a cleanup lock on the target
+ * bucket (primary page of a bucket) and it is reponsibility of caller to
+ * release that lock.

This is confusing, because it makes it sound like we retain the lock
through the entire execution of the function, which isn't always true.
I would say that caller must acquire a cleanup lock on the target
primary bucket page before calling this function, and that on return
that page will again be write-locked. However, the lock might be
temporarily released in the meantime, which visiting overflow pages.
(Attached patch has a suggested rewrite.)

+ * This function expects that the caller has acquired a cleanup lock on the
+ * primary bucket page, and will with a write lock again held on the primary
+ * bucket page.  The lock won't necessarily be held continuously, though,
+ * because we'll release it when visiting overflow pages.

Looks like a typo in the above comment. /will with a write lock/will
return with a write lock

+ * During scan of overflow pages, first we need to lock the next bucket and
+ * then release the lock on current bucket.  This ensures that any concurrent
+ * scan started after we start cleaning the bucket will always be behind the
+ * cleanup.  Allowing scans to cross vacuum will allow it to remove tuples
+ * required for sanctity of scan.

This comment says that it's bad if other scans can pass our cleanup
scan, but it doesn't explain why. I think it's because we don't have
page-at-a-time mode yet,

Right.

and cleanup might decrease the TIDs for
existing index entries.

I think the reason is that cleanup might move tuples around, and in
doing so it might move a previously returned TID to a position earlier
than its current one. This is a problem because the scan restarts from
the previously returned offset and tries to re-find the previously
returned tuple's TID; for example, if the scan last returned the tuple
at offset 12 and cleanup compacts the page so that the same TID now
sits at offset 7, searching from offset 12 onwards would miss it. This
has been mentioned in the README as below:

+ It is must to
+keep scans behind cleanup, else vacuum could remove tuples that are required
+to complete the scan as the scan that returns multiple tuples from the same
+bucket page always restart the scan from the previous offset number from which
+it has returned last tuple.

We might want to slightly improve the README so that the reason is
clearer and then have the comments refer to the README, but I am open
either way; let me know which way you prefer.

+ if (delay)
+ vacuum_delay_point();

You don't really need "delay". If we're not in a cost-accounted
VACUUM, vacuum_delay_point() just turns into CHECK_FOR_INTERRUPTS(),
which should be safe (and a good idea) regardless. (Fixed in
attached.)

Okay, that makes sense.

+            if (callback && callback(htup, callback_state))
+            {
+                /* mark the item for deletion */
+                deletable[ndeletable++] = offno;
+                if (tuples_removed)
+                    *tuples_removed += 1;
+            }
+            else if (bucket_has_garbage)
+            {
+                /* delete the tuples that are moved by split. */
+                bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+                                              maxbucket,
+                                              highmask,
+                                              lowmask);
+                /* mark the item for deletion */
+                if (bucket != cur_bucket)
+                {
+                    /*
+                     * We expect tuples to either belong to curent bucket or
+                     * new_bucket.  This is ensured because we don't allow
+                     * further splits from bucket that contains garbage. See
+                     * comments in _hash_expandtable.
+                     */
+                    Assert(bucket == new_bucket);
+                    deletable[ndeletable++] = offno;
+                }
+                else if (num_index_tuples)
+                    *num_index_tuples += 1;
+            }
+            else if (num_index_tuples)
+                *num_index_tuples += 1;
+        }

OK, a couple things here. First, it seems like we could also delete
any tuples where ItemIdIsDead, and that seems worth doing.

I think we can't do that, because here we want to rely strictly on the
callback function for vacuum, as btree does. The reason is explained in
the comment below from btvacuumpage().

/*
* During Hot Standby we currently assume that
* XLOG_BTREE_VACUUM records do not produce conflicts. That is
* only true as long as the callback function depends only
* upon whether the index tuple refers to heap tuples removed
* in the initial heap scan. ...
..

In fact, I
think we should check it prior to invoking the callback, because it's
probably quite substantially cheaper than the callback. Second,
repeating deletable[ndeletable++] = offno and *num_index_tuples += 1
doesn't seem very clean to me; I think we should introduce a new bool
to track whether we're keeping the tuple or killing it, and then use
that to drive which of those things we do. (Fixed in attached.)

This looks okay to me. So if you agree with my reasoning for not
including the first part, then I can take that out and keep this part in
the next patch.
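
For reference, a rough sketch of the restructuring being agreed to here,
reusing the variable names from the quoted hunk and leaving out the
ItemIdIsDead check per the reasoning above (not necessarily the exact code
in the attached patch):

            bool        kill_tuple = false;

            if (callback && callback(htup, callback_state))
            {
                kill_tuple = true;
                if (tuples_removed)
                    *tuples_removed += 1;
            }
            else if (bucket_has_garbage)
            {
                /* tuples moved by a completed split are garbage here */
                bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
                                              maxbucket, highmask, lowmask);
                if (bucket != cur_bucket)
                {
                    /* see comments in _hash_expandtable */
                    Assert(bucket == new_bucket);
                    kill_tuple = true;
                }
            }

            if (kill_tuple)
            {
                /* mark the item for deletion */
                deletable[ndeletable++] = offno;
            }
            else if (num_index_tuples)
                *num_index_tuples += 1;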

+        if (H_HAS_GARBAGE(bucket_opaque) &&
+            !H_INCOMPLETE_SPLIT(bucket_opaque))

This is the only place in the entire patch that use
H_INCOMPLETE_SPLIT(), and I'm wondering if that's really correct even
here. Don't you really want H_OLD_INCOMPLETE_SPLIT() here? (And
couldn't we then remove H_INCOMPLETE_SPLIT() itself?)

You are right. Will remove it in next version.

I think it would be a good idea to change this so that
LH_BUCKET_PAGE_HAS_GARBAGE doesn't get set until
LH_BUCKET_OLD_PAGE_SPLIT is cleared. The current way is confusing,
because those tuples are NOT garbage until the split is completed!
Moreover, both of the places that care about
LH_BUCKET_PAGE_HAS_GARBAGE need to make sure that
LH_BUCKET_OLD_PAGE_SPLIT is clear before they do anything about
LH_BUCKET_PAGE_HAS_GARBAGE, so the change I am proposing would
actually simplify the code very slightly.

Not an issue. We can do it that way.
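
In other words, a minimal sketch using the patch's current flag names
(oopaque/nopaque standing for the old and new buckets' special-space
pointers, as in the quoted code elsewhere in this thread):

    /* on the old bucket's primary page, once all tuples have been copied */
    oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;   /* split is done */
    oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE;  /* moved tuples are now garbage */

    /* on the new bucket's primary page */
    nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;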

+#define H_OLD_INCOMPLETE_SPLIT(opaque)  ((opaque)->hasho_flag &
LH_BUCKET_OLD_PAGE_SPLIT)
+#define H_NEW_INCOMPLETE_SPLIT(opaque)  ((opaque)->hasho_flag &
LH_BUCKET_NEW_PAGE_SPLIT)

The code isn't consistent about the use of these macros, or at least
not in a good way. When you care about LH_BUCKET_OLD_PAGE_SPLIT, you
test it using the macro; when you care about H_NEW_INCOMPLETE_SPLIT,
you ignore the macro and test it directly. Either get rid of both
macros and always test directly, or keep both macros and use both of
them. Using a macro for one but not the other is strange.

I would like to use the macro in both places.

I wonder if we should rename these flags and macros. Maybe
LH_BUCKET_OLD_PAGE_SPLIT -> LH_BEING_SPLIT and
LH_BUCKET_NEW_PAGE_SPLIT -> LH_BEING_POPULATED.

I think keeping BUCKET (LH_BUCKET_*) in the define indicates in some way
the type of page being split. Hash indexes have multiple types of
pages. That seems to be taken care of in the existing defines below.
#define LH_OVERFLOW_PAGE (1 << 0)
#define LH_BUCKET_PAGE (1 << 1)
#define LH_BITMAP_PAGE (1 << 2)
#define LH_META_PAGE (1 << 3)

I think that might be
clearer. When LH_BEING_POPULATED is set, the bucket is being filled -
that is, populated - from the old bucket.

How about LH_BUCKET_BEING_POPULATED or maybe LH_BP_BEING_SPLIT, where BP
indicates Bucket Page?

I think keeping the word Split in these defines might make more sense,
e.g. LH_BP_SPLIT_OLD/LH_BP_SPLIT_FORM_NEW.

And maybe
LH_BUCKET_PAGE_HAS_GARBAGE -> LH_NEEDS_SPLIT_CLEANUP, too.

How about LH_BUCKET_NEEDS_SPLIT_CLEANUP or LH_BP_NEEDS_SPLIT_CLEANUP?
I am slightly inclined to keep the word Bucket, but let me know if you
think it makes the name too long.

+         * Copy bucket mapping info now;  The comment in _hash_expandtable
+         * where we copy this information and calls _hash_splitbucket explains
+         * why this is OK.

After a semicolon, the next word should not be capitalized. There
shouldn't be two spaces after a semicolon, either.

Will fix.

Also,
_hash_splitbucket appears to have a verb before it (calls) and a verb
after it (explains) and I have no idea what that means.

I think a conjunction is required there. Let me try to rewrite it as below:
refer to the comment in _hash_expandtable where we copy this information
before calling _hash_splitbucket to see why this is OK.

If you have better words to explain it, then let me know.

+    for (;;)
+    {
+        mask = lowmask + 1;
+        new_bucket = old_bucket | mask;
+        if (new_bucket > metap->hashm_maxbucket)
+        {
+            lowmask = lowmask >> 1;
+            continue;
+        }
+        blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+        break;
+    }

I can't help feeling that it should be possible to do this without
looping. Can we ever loop more than once?

No.

How? Can we just use an
if-then instead of a for-loop?

I can see the two possibilities below:
First way -

retry:
mask = lowmask + 1;
new_bucket = old_bucket | mask;
if (new_bucket > maxbucket)
{
lowmask = lowmask >> 1;
goto retry;
}

Second way -
new_bucket = CALC_NEW_BUCKET(old_bucket,lowmask);
if (new_bucket > maxbucket)
{
lowmask = lowmask >> 1;
new_bucket = CALC_NEW_BUCKET(old_bucket, lowmask);
}

#define CALC_NEW_BUCKET(old_bucket, lowmask) \
((old_bucket) | ((lowmask) + 1))

Do you have something else in mind?
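
For what it's worth, the second form wrapped into a self-contained sketch
(the function name calc_new_bucket_sketch is purely illustrative, and the
single retry relies on the point above that the loop can never run more
than once):

    /* sketch: find the bucket that old_bucket is being split into */
    static uint32
    calc_new_bucket_sketch(uint32 old_bucket, uint32 lowmask, uint32 maxbucket)
    {
        uint32      new_bucket = CALC_NEW_BUCKET(old_bucket, lowmask);

        if (new_bucket > maxbucket)
        {
            /* the pending split must have happened at the previous, smaller mask */
            lowmask = lowmask >> 1;
            new_bucket = CALC_NEW_BUCKET(old_bucket, lowmask);
        }
        return new_bucket;
    }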

Can't _hash_get_oldbucket_newblock call _hash_get_oldbucket_newbucket
instead of duplicating the logic?

Will change it in the next version of the patch.

I still don't like the names of these functions very much. If you
said "get X from Y", it would be clear that you put in Y and you get
out X. If you say "X 2 Y", it would be clear that you put in X and
you get out Y. As it is, it's not very clear which is the input and
which is the output.

Whatever comes earlier is the input and the later one is the output, as
in the existing function _hash_get_indextuple_hashkey(). However,
feel free to suggest better names here. How about
_hash_get_oldbucket2newblock() or _hash_get_newblock_from_oldbucket()
or simply _hash_get_newblock()?

+    /*
+     * Acquiring cleanup lock to clear the split-in-progress flag ensures that
+     * there is no pending scan that has seen the flag after it is cleared.
+     */
+    _hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+    opage = BufferGetPage(bucket_obuf);
+    oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);

I don't understand the comment, because the code *isn't* acquiring a
cleanup lock.

Oops, this is a remnant from one of the design approaches for clearing
these flags, which was later discarded due to issues. I will change this
to indicate an exclusive lock.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#138Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#137)
Re: Hash Indexes

On Wed, Nov 9, 2016 at 9:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

+ * This function expects that the caller has acquired a cleanup lock on the
+ * primary bucket page, and will with a write lock again held on the primary
+ * bucket page.  The lock won't necessarily be held continuously, though,
+ * because we'll release it when visiting overflow pages.

Looks like typo in above comment. /will with a write lock/will
return with a write lock

Oh, yes. Thanks.

+ * During scan of overflow pages, first we need to lock the next bucket and
+ * then release the lock on current bucket.  This ensures that any concurrent
+ * scan started after we start cleaning the bucket will always be behind the
+ * cleanup.  Allowing scans to cross vacuum will allow it to remove tuples
+ * required for sanctity of scan.

This comment says that it's bad if other scans can pass our cleanup
scan, but it doesn't explain why. I think it's because we don't have
page-at-a-time mode yet,

Right.

and cleanup might decrease the TIDs for
existing index entries.

I think the reason is that cleanup might move tuples around, and in doing
so it might move a previously returned TID to a position earlier than
its current one. This is a problem because the scan restarts from the
previously returned offset and tries to find the previously returned
tuple's TID. This has been mentioned in the README as below:

+ It is must to
+keep scans behind cleanup, else vacuum could remove tuples that are required
+to complete the scan as the scan that returns multiple tuples from the same
+bucket page always restart the scan from the previous offset number from which
+it has returned last tuple.

We might want to slightly improve the README so that the reason is
clearer and then have the code comments refer to the README, but I am open
either way; let me know which way you prefer.

I think we can give a brief explanation right in the code comment. I
referred to "decreasing the TIDs"; you are referring to "moving tuples
around". But I think that moving the tuples to different locations is
not the problem. I think the problem is that a tuple might be
assigned a lower spot in the item pointer array - i.e. the TID
decreases.

OK, a couple things here. First, it seems like we could also delete
any tuples where ItemIdIsDead, and that seems worth doing.

I think we can't do that, because here we want to rely strictly on the
callback function for vacuum, similar to btree. The reason is explained
in the below comment from btvacuumpage().

OK, I see. It would probably be good to comment this, then, so that
someone later doesn't get confused as I did.

This looks okay to me. So if you agree with my reasoning for not
including the first part, then I can take that out and keep this part in
the next patch.

Cool.

I think that might be
clearer. When LH_BEING_POPULATED is set, the bucket is being filled -
that is, populated - from the old bucket.

How about LH_BUCKET_BEING_POPULATED or maybe LH_BP_BEING_SPLIT, where BP
indicates Bucket Page?

LH_BUCKET_BEING_POPULATED seems good to me.

And maybe
LH_BUCKET_PAGE_HAS_GARBAGE -> LH_NEEDS_SPLIT_CLEANUP, too.

How about LH_BUCKET_NEEDS_SPLIT_CLEANUP or LH_BP_NEEDS_SPLIT_CLEANUP?
I am slightly inclined to keep the word Bucket, but let me know if you
think it makes the name too long.

LH_BUCKET_NEEDS_SPLIT_CLEANUP seems good to me.

How? Can we just use an
if-then instead of a for-loop?

I can see the two possibilities below:
First way -

retry:
mask = lowmask + 1;
new_bucket = old_bucket | mask;
if (new_bucket > maxbucket)
{
lowmask = lowmask >> 1;
goto retry;
}

Second way -
new_bucket = CALC_NEW_BUCKET(old_bucket,lowmask);
if (new_bucket > maxbucket)
{
lowmask = lowmask >> 1;
new_bucket = CALC_NEW_BUCKET(old_bucket, lowmask);
}

#define CALC_NEW_BUCKET(old_bucket, lowmask) \
((old_bucket) | ((lowmask) + 1))

Do you have something else in mind?

Second one would be my preference.

I still don't like the names of these functions very much. If you
said "get X from Y", it would be clear that you put in Y and you get
out X. If you say "X 2 Y", it would be clear that you put in X and
you get out Y. As it is, it's not very clear which is the input and
which is the output.

Whatever comes earlier is the input and the later one is the output, as
in the existing function _hash_get_indextuple_hashkey(). However,
feel free to suggest better names here. How about
_hash_get_oldbucket2newblock() or _hash_get_newblock_from_oldbucket()
or simply _hash_get_newblock()?

The problem with _hash_get_newblock() is that it sounds like you are
getting a new block in the relation, not the new bucket (or
corresponding block) for some old bucket. The name isn't specific
enough to know what "new" means.

In general, I think "new" and "old" are not very good terminology
here. It's not entirely intuitive what they mean, and as soon as it
becomes unclear that you are speaking of something happening *in the
context of a bucket split* then it becomes much less clear. I don't
really have any ideas here that are altogether good; either of your
other two suggestions (not _hash_get_newblock()) seem OK.

+    /*
+     * Acquiring cleanup lock to clear the split-in-progress flag ensures that
+     * there is no pending scan that has seen the flag after it is cleared.
+     */
+    _hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+    opage = BufferGetPage(bucket_obuf);
+    oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);

I don't understand the comment, because the code *isn't* acquiring a
cleanup lock.

Oops, this is a remnant from one of the design approaches for clearing
these flags, which was later discarded due to issues. I will change this
to indicate an exclusive lock.

Of course, an exclusive lock doesn't guarantee anything like that...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#139Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#138)
Re: Hash Indexes

On Wed, Nov 9, 2016 at 9:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Nov 9, 2016 at 9:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think we can give a brief explanation right in the code comment. I
referred to "decreasing the TIDs"; you are referring to "moving tuples
around". But I think that moving the tuples to different locations is
not the problem. I think the problem is that a tuple might be
assigned a lower spot in the item pointer array

I think we both understand the problem and it is just a matter of using
different words. I will go with your suggestion and try to
slightly adjust the README as well so that both places use the same
terminology.

+    /*
+     * Acquiring cleanup lock to clear the split-in-progress flag ensures that
+     * there is no pending scan that has seen the flag after it is cleared.
+     */
+    _hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+    opage = BufferGetPage(bucket_obuf);
+    oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);

I don't understand the comment, because the code *isn't* acquiring a
cleanup lock.

Oops, this is a remnant from one of the design approaches for clearing
these flags, which was later discarded due to issues. I will change this
to indicate an exclusive lock.

Of course, an exclusive lock doesn't guarantee anything like that...

Right, but we don't need that guarantee (that there is no pending scan
that has seen the flag after it is cleared) in order to clear the flags.
It was written in one of the previous patches where I was exploring the
idea of using a cleanup lock to clear the flags and then not using it
during vacuum. However, there were some problems with that design and I
have changed the code, but forgot to update the comment.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#140Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#139)
Re: Hash Indexes

On Wed, Nov 9, 2016 at 11:41 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Nov 9, 2016 at 9:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Nov 9, 2016 at 9:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I think we can give a brief explanation right in the code comment. I
referred to "decreasing the TIDs"; you are referring to "moving tuples
around". But I think that moving the tuples to different locations is
not the problem. I think the problem is that a tuple might be
assigned a lower spot in the item pointer array

I think we both understand the problem and it is just a matter of using
different words. I will go with your suggestion and try to
slightly adjust the README as well so that both places use the same
terminology.

Yes, I think we're on the same page.

Right, but we don't need that guarantee (that there is no pending scan
that has seen the flag after it is cleared) in order to clear the flags.
It was written in one of the previous patches where I was exploring the
idea of using a cleanup lock to clear the flags and then not using it
during vacuum. However, there were some problems with that design and I
have changed the code, but forgot to update the comment.

OK, got it, thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#141Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#140)
Re: Hash Indexes

On Wed, Nov 9, 2016 at 12:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Nov 9, 2016 at 11:41 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Nov 9, 2016 at 9:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Nov 9, 2016 at 9:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I think we can give a brief explanation right in the code comment. I
referred to "decreasing the TIDs"; you are referring to "moving tuples
around". But I think that moving the tuples to different locations is
not the problem. I think the problem is that a tuple might be
assigned a lower spot in the item pointer array

I think we both understand the problem and it is just a matter of using
different words. I will go with your suggestion and try to
slightly adjust the README as well so that both places use the same
terminology.

Yes, I think we're on the same page.

Some more review:

The API contract of _hash_finish_split seems a bit unfortunate. The
caller is supposed to have obtained a cleanup lock on both the old and
new buffers, but the first thing it does is walk the entire new bucket
chain, completely ignoring the old one. That means holding a cleanup
lock on the old buffer across an unbounded number of I/O operations --
which also means that you can't interrupt the query by pressing ^C,
because an LWLock (on the old buffer) is held. Moreover, the
requirement to hold a lock on the new buffer isn't convenient for
either caller; they both have to go do it, so why not move it into the
function? Perhaps the function should be changed so that it
guarantees that a pin is held on the primary page of the existing
bucket, but no locks are held.

Where _hash_finish_split does LockBufferForCleanup(bucket_nbuf),
should it instead be trying to get the lock conditionally and
returning immediately if it doesn't get the lock? Seems like a good
idea...

* We're at the end of the old bucket chain, so we're done partitioning
* the tuples. Mark the old and new buckets to indicate split is
* finished.
*
* To avoid deadlocks due to locking order of buckets, first lock the old
* bucket and then the new bucket.

These comments have drifted too far from the code to which they refer.
The first part is basically making the same point as the
slightly-later comment /* indicate that split is finished */.

The use of _hash_relbuf, _hash_wrtbuf, and _hash_chgbufaccess is
coming to seem like a horrible idea to me. That's not your fault - it
was like this before - but maybe in a followup patch we should
consider ripping all of that out and just calling MarkBufferDirty(),
ReleaseBuffer(), LockBuffer(), UnlockBuffer(), and/or
UnlockReleaseBuffer() as appropriate. As far as I can see, the
current style is just obfuscating the code.

itupsize = new_itup->t_info & INDEX_SIZE_MASK;
new_itup->t_info &= ~INDEX_SIZE_MASK;
new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
new_itup->t_info |= itupsize;

If I'm not mistaken, you could omit the first, second, and fourth
lines here and keep only the third one, and it would do exactly the
same thing. The first line saves the bits in INDEX_SIZE_MASK. The
second line clears the bits in INDEX_SIZE_MASK. The fourth line
re-sets the bits that were originally saved.
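
A minimal sketch of the simplification, assuming (as the reasoning above
does) that the moved-by-split bit lies outside INDEX_SIZE_MASK:

    /*
     * OR-ing in a flag bit outside INDEX_SIZE_MASK never touches the size
     * bits, so saving, clearing, and restoring them is a no-op and the
     * four lines reduce to:
     */
    new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;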

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#142Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#141)
Re: Hash Indexes

On Thu, Nov 10, 2016 at 2:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Some more review:

The API contract of _hash_finish_split seems a bit unfortunate. The
caller is supposed to have obtained a cleanup lock on both the old and
new buffers, but the first thing it does is walk the entire new bucket
chain, completely ignoring the old one. That means holding a cleanup
lock on the old buffer across an unbounded number of I/O operations --
which also means that you can't interrupt the query by pressing ^C,
because an LWLock (on the old buffer) is held.

I see the problem you are talking about, but it was done to ensure the
locking order (old bucket first and then new bucket), else there could
be a deadlock risk. However, I think we can avoid holding the cleanup
lock on the old bucket till we scan the new bucket to form a hash table
of TIDs.

Moreover, the
requirement to hold a lock on the new buffer isn't convenient for
either caller; they both have to go do it, so why not move it into the
function?

Yeah, we can move the locking of the new bucket entirely into the new function.

Perhaps the function should be changed so that it
guarantees that a pin is held on the primary page of the existing
bucket, but no locks are held.

Okay, so we can change the locking order as follows:
a. ensure a cleanup lock on the old bucket and check if the bucket (old)
has a pending split.
b. if there is a pending split, release the lock on the old bucket, but
not the pin.

below steps will be performed by _hash_finish_split():

c. acquire the read content lock on the new bucket and form the hash table
of TIDs; in the process of forming the hash table, we need to traverse the
whole bucket chain.  While traversing the bucket chain, release the lock
on the previous bucket (both lock and pin if it is not a primary bucket page).
d. After the hash table is formed, acquire the cleanup lock on the old and
new buckets conditionally; if we are not able to get the cleanup lock on
either, then just return from there.
e. Perform the split operation.
f. release the lock and pin on the new bucket
g. release the lock on the old bucket.  We don't want to release the pin
on the old bucket, as the caller has ensured it before passing it to
_hash_finish_split(), so releasing the pin should be the responsibility of
the caller.

Now, both the callers need to ensure that they restart the operation
from the beginning, as after we release the lock on the old bucket,
somebody might have split the bucket.

Does the above change in locking strategy sound okay? (A rough sketch of
steps c-g appears below.)
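
A rough C sketch of steps c-g; hash_finish_split_sketch, build_tid_table,
and complete_split are placeholder names, hand-over-hand locking of the
bucket chain and error/cleanup paths are elided, and this is only meant to
show the locking sequence, not the patch's actual _hash_finish_split:

    static void
    hash_finish_split_sketch(Relation rel, Buffer obuf, Buffer nbuf)
    {
        HTAB       *tidhtab;

        /* caller holds only a pin (no lock) on obuf, the old primary page */

        /* (c) read-lock the new bucket and collect the TIDs already moved */
        LockBuffer(nbuf, BUFFER_LOCK_SHARE);
        tidhtab = build_tid_table(rel, nbuf);   /* walks the whole bucket chain */
        LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);

        /* (d) conditionally take cleanup locks, old bucket first */
        if (!ConditionalLockBufferForCleanup(obuf))
            return;             /* give up; caller restarts from the beginning */
        if (!ConditionalLockBufferForCleanup(nbuf))
        {
            LockBuffer(obuf, BUFFER_LOCK_UNLOCK);
            return;
        }

        /* (e) copy the remaining tuples, skipping any TID found in tidhtab */
        complete_split(rel, obuf, nbuf, tidhtab);

        /* (f) release lock and pin on the new bucket */
        UnlockReleaseBuffer(nbuf);

        /* (g) release the lock, but not the pin, on the old bucket */
        LockBuffer(obuf, BUFFER_LOCK_UNLOCK);
    }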

Where _hash_finish_split does LockBufferForCleanup(bucket_nbuf),
should it instead be trying to get the lock conditionally and
returning immediately if it doesn't get the lock? Seems like a good
idea...

Yeah, we can take a cleanup lock conditionally, but it would waste the
effort of forming the hash table if we don't get the cleanup lock
immediately. Considering incomplete splits to be a rare operation,
maybe the wasted effort is okay, but I am not sure. Don't you
think we should avoid that effort?

* We're at the end of the old bucket chain, so we're done partitioning
* the tuples. Mark the old and new buckets to indicate split is
* finished.
*
* To avoid deadlocks due to locking order of buckets, first lock the old
* bucket and then the new bucket.

These comments have drifted too far from the code to which they refer.
The first part is basically making the same point as the
slightly-later comment /* indicate that split is finished */.

I think we can remove the second comment /* indicate that split is
finished */. Apart from that, I think the above comment you have
quoted seems to be in line with the current code. At that point, we have
finished partitioning the tuples, so I don't understand what makes you
think that it has drifted from the code. Is it because of the second part
of the comment (To avoid deadlocks ...)? If so, I think we can move it a
few lines down to where we actually perform the locking on the old and
new buckets.

The use of _hash_relbuf, _hash_wrtbuf, and _hash_chgbufaccess is
coming to seem like a horrible idea to me. That's not your fault - it
was like this before - but maybe in a followup patch we should
consider ripping all of that out and just calling MarkBufferDirty(),
ReleaseBuffer(), LockBuffer(), UnlockBuffer(), and/or
UnlockReleaseBuffer() as appropriate. As far as I can see, the
current style is just obfuscating the code.

Okay, we can do some study and try to change it in the way you are
suggesting. It seems this has been partially derived from the btree code,
where we have the function _bt_relbuf(). I am sure that we won't need
_hash_wrtbuf after the WAL patch.
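
For reference, these wrappers roughly amount to the following plain
buffer-manager calls (a sketch; treat it as an approximation rather than
exact expansions of the current hash code):

    /* _hash_relbuf(rel, buf) is essentially */
    UnlockReleaseBuffer(buf);

    /* _hash_wrtbuf(rel, buf) is essentially */
    MarkBufferDirty(buf);
    UnlockReleaseBuffer(buf);

    /* _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK) is essentially */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);

    /* ... and _hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_WRITE) */
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);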

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#143Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#136)
1 attachment(s)
Re: Hash Indexes

On Wed, Nov 9, 2016 at 1:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Nov 7, 2016 at 9:51 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ new patches ]

Attached is yet another incremental patch with some suggested changes.

+ * This expects that the caller has acquired a cleanup lock on the target
+ * bucket (primary page of a bucket) and it is reponsibility of caller to
+ * release that lock.

This is confusing, because it makes it sound like we retain the lock
through the entire execution of the function, which isn't always true.
I would say that caller must acquire a cleanup lock on the target
primary bucket page before calling this function, and that on return
that page will again be write-locked. However, the lock might be
temporarily released in the meantime, while visiting overflow pages.
(Attached patch has a suggested rewrite.)

+ * During scan of overflow pages, first we need to lock the next bucket and
+ * then release the lock on current bucket.  This ensures that any concurrent
+ * scan started after we start cleaning the bucket will always be behind the
+ * cleanup.  Allowing scans to cross vacuum will allow it to remove tuples
+ * required for sanctity of scan.

This comment says that it's bad if other scans can pass our cleanup
scan, but it doesn't explain why. I think it's because we don't have
page-at-a-time mode yet, and cleanup might decrease the TIDs for
existing index entries. (Attached patch has a suggested rewrite, but
might need further adjustment if my understanding of the reasons is
incomplete.)

Okay, I have included your changes with a minor typo fix and updated the
README to use similar language.

+ if (delay)
+ vacuum_delay_point();

You don't really need "delay". If we're not in a cost-accounted
VACUUM, vacuum_delay_point() just turns into CHECK_FOR_INTERRUPTS(),
which should be safe (and a good idea) regardless. (Fixed in
attached.)

The new patch contains this fix.

+            if (callback && callback(htup, callback_state))
+            {
+                /* mark the item for deletion */
+                deletable[ndeletable++] = offno;
+                if (tuples_removed)
+                    *tuples_removed += 1;
+            }
+            else if (bucket_has_garbage)
+            {
+                /* delete the tuples that are moved by split. */
+                bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup
),
+                                              maxbucket,
+                                              highmask,
+                                              lowmask);
+                /* mark the item for deletion */
+                if (bucket != cur_bucket)
+                {
+                    /*
+                     * We expect tuples to either belong to curent bucket or
+                     * new_bucket.  This is ensured because we don't allow
+                     * further splits from bucket that contains garbage. See
+                     * comments in _hash_expandtable.
+                     */
+                    Assert(bucket == new_bucket);
+                    deletable[ndeletable++] = offno;
+                }
+                else if (num_index_tuples)
+                    *num_index_tuples += 1;
+            }
+            else if (num_index_tuples)
+                *num_index_tuples += 1;
+        }

OK, a couple things here. First, it seems like we could also delete
any tuples where ItemIdIsDead, and that seems worth doing. In fact, I
think we should check it prior to invoking the callback, because it's
probably quite substantially cheaper than the callback. Second,
repeating deletable[ndeletable++] = offno and *num_index_tuples += 1
doesn't seem very clean to me; I think we should introduce a new bool
to track whether we're keeping the tuple or killing it, and then use
that to drive which of those things we do. (Fixed in attached.)

As discussed upthread, I have included your changes apart from the
change related to ItemIdIsDead.

+        if (H_HAS_GARBAGE(bucket_opaque) &&
+            !H_INCOMPLETE_SPLIT(bucket_opaque))

This is the only place in the entire patch that use
H_INCOMPLETE_SPLIT(), and I'm wondering if that's really correct even
here. Don't you really want H_OLD_INCOMPLETE_SPLIT() here? (And
couldn't we then remove H_INCOMPLETE_SPLIT() itself?) There's no
garbage to be removed from the "new" bucket until the next split, when
it will take on the role of the "old" bucket.

Fixed.

I think it would be a good idea to change this so that
LH_BUCKET_PAGE_HAS_GARBAGE doesn't get set until
LH_BUCKET_OLD_PAGE_SPLIT is cleared. The current way is confusing,
because those tuples are NOT garbage until the split is completed!
Moreover, both of the places that care about
LH_BUCKET_PAGE_HAS_GARBAGE need to make sure that
LH_BUCKET_OLD_PAGE_SPLIT is clear before they do anything about
LH_BUCKET_PAGE_HAS_GARBAGE, so the change I am proposing would
actually simplify the code very slightly.

Yeah, I have changed it as per the above suggestion. However, I think with
this change we only need to check the garbage flag during vacuum. For
now, I am checking both the incomplete-split and garbage flags in
vacuum just to be extra sure, but if you also feel that we can remove
the incomplete-split check, then I will do so.

+#define H_OLD_INCOMPLETE_SPLIT(opaque)  ((opaque)->hasho_flag &
LH_BUCKET_OLD_PAGE_SPLIT)
+#define H_NEW_INCOMPLETE_SPLIT(opaque)  ((opaque)->hasho_flag &
LH_BUCKET_NEW_PAGE_SPLIT)

The code isn't consistent about the use of these macros, or at least
not in a good way. When you care about LH_BUCKET_OLD_PAGE_SPLIT, you
test it using the macro; when you care about H_NEW_INCOMPLETE_SPLIT,
you ignore the macro and test it directly. Either get rid of both
macros and always test directly, or keep both macros and use both of
them. Using a macro for one but not the other is strange.

Used the macro for both.

I wonder if we should rename these flags and macros. Maybe
LH_BUCKET_OLD_PAGE_SPLIT -> LH_BEING_SPLIT and
LH_BUCKET_NEW_PAGE_SPLIT -> LH_BEING_POPULATED. I think that might be
clearer. When LH_BEING_POPULATED is set, the bucket is being filled -
that is, populated - from the old bucket. And maybe
LH_BUCKET_PAGE_HAS_GARBAGE -> LH_NEEDS_SPLIT_CLEANUP, too.

Changed the names as per discussion up thread.

+         * Copy bucket mapping info now;  The comment in _hash_expandtable
+         * where we copy this information and calls _hash_splitbucket explains
+         * why this is OK.

After a semicolon, the next word should not be capitalized. There
shouldn't be two spaces after a semicolon, either. Also,
_hash_splitbucket appears to have a verb before it (calls) and a verb
after it (explains) and I have no idea what that means.

Fixed.

+    for (;;)
+    {
+        mask = lowmask + 1;
+        new_bucket = old_bucket | mask;
+        if (new_bucket > metap->hashm_maxbucket)
+        {
+            lowmask = lowmask >> 1;
+            continue;
+        }
+        blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+        break;
+    }

I can't help feeling that it should be possible to do this without
looping. Can we ever loop more than once? How? Can we just use an
if-then instead of a for-loop?

Can't _hash_get_oldbucket_newblock call _hash_get_oldbucket_newbucket
instead of duplicating the logic?

Changed as per discussion up thread.

I still don't like the names of these functions very much. If you
said "get X from Y", it would be clear that you put in Y and you get
out X. If you say "X 2 Y", it would be clear that you put in X and
you get out Y. As it is, it's not very clear which is the input and
which is the output.

Changed as per discussion up thread.

+ bool primary_buc_page)

I think we could just go with "primary_page" here. (Fixed in attached.)

Included the change in the attached version of the patch.

+    /*
+     * Acquiring cleanup lock to clear the split-in-progress flag ensures that
+     * there is no pending scan that has seen the flag after it is cleared.
+     */
+    _hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+    opage = BufferGetPage(bucket_obuf);
+    oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);

I don't understand the comment, because the code *isn't* acquiring a
cleanup lock.

Removed this comment.

Some more review:

The API contract of _hash_finish_split seems a bit unfortunate. The
caller is supposed to have obtained a cleanup lock on both the old and
new buffers, but the first thing it does is walk the entire new bucket
chain, completely ignoring the old one. That means holding a cleanup
lock on the old buffer across an unbounded number of I/O operations --
which also means that you can't interrupt the query by pressing ^C,
because an LWLock (on the old buffer) is held.

Fixed in the attached patch as per the algorithm proposed a few lines down in this mail.

I see the problem you are talking about, but it was done to ensure the
locking order (old bucket first and then new bucket), else there could
be a deadlock risk. However, I think we can avoid holding the cleanup
lock on the old bucket till we scan the new bucket to form a hash table
of TIDs.

Moreover, the
requirement to hold a lock on the new buffer isn't convenient for
either caller; they both have to go do it, so why not move it into the
function?

Yeah, we can move the locking of the new bucket entirely into the new function.

Done.

Perhaps the function should be changed so that it
guarantees that a pin is held on the primary page of the existing
bucket, but no locks are held.

Okay, so we can change the locking order as follows:
a. ensure a cleanup lock on the old bucket and check if the bucket (old)
has a pending split.
b. if there is a pending split, release the lock on the old bucket, but
not the pin.

below steps will be performed by _hash_finish_split():

c. acquire the read content lock on the new bucket and form the hash table
of TIDs; in the process of forming the hash table, we need to traverse the
whole bucket chain.  While traversing the bucket chain, release the lock
on the previous bucket (both lock and pin if it is not a primary bucket page).
d. After the hash table is formed, acquire the cleanup lock on the old and
new buckets conditionally; if we are not able to get the cleanup lock on
either, then just return from there.
e. Perform the split operation.
f. release the lock and pin on the new bucket
g. release the lock on the old bucket.  We don't want to release the pin
on the old bucket, as the caller has ensured it before passing it to
_hash_finish_split(), so releasing the pin should be the responsibility of
the caller.

Now, both the callers need to ensure that they restart the operation
from the beginning, as after we release the lock on the old bucket,
somebody might have split the bucket.

Does the above change in locking strategy sound okay?

I have changed the locking strategy as per my description above and
accordingly changed the prototype of _hash_finish_split.

Where _hash_finish_split does LockBufferForCleanup(bucket_nbuf),
should it instead be trying to get the lock conditionally and
returning immediately if it doesn't get the lock? Seems like a good
idea...

Yeah, we can take a cleanup lock conditionally, but it would waste the
effort of forming the hash table if we don't get the cleanup lock
immediately. Considering incomplete splits to be a rare operation,
maybe the wasted effort is okay, but I am not sure. Don't you
think we should avoid that effort?

Changed it to conditional lock.

* We're at the end of the old bucket chain, so we're done partitioning
* the tuples. Mark the old and new buckets to indicate split is
* finished.
*
* To avoid deadlocks due to locking order of buckets, first lock the old
* bucket and then the new bucket.

These comments have drifted too far from the code to which they refer.
The first part is basically making the same point as the
slightly-later comment /* indicate that split is finished */.

I think we can remove the second comment /* indicate that split is
finished */.

Removed this comment.

itupsize = new_itup->t_info & INDEX_SIZE_MASK;
new_itup->t_info &= ~INDEX_SIZE_MASK;
new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
new_itup->t_info |= itupsize;

If I'm not mistaken, you could omit the first, second, and fourth
lines here and keep only the third one, and it would do exactly the
same thing. The first line saves the bits in INDEX_SIZE_MASK. The
second line clears the bits in INDEX_SIZE_MASK. The fourth line
re-sets the bits that were originally saved.

You are right and I have changed the code as per your suggestion.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

concurrent_hash_index_v11.patch (application/octet-stream)
diff --git a/src/backend/access/hash/Makefile b/src/backend/access/hash/Makefile
index 5d3bd94..e2e7e91 100644
--- a/src/backend/access/hash/Makefile
+++ b/src/backend/access/hash/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/access/hash
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashscan.o \
-       hashsearch.o hashsort.o hashutil.o hashvalidate.o
+OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashsearch.o \
+       hashsort.o hashutil.o hashvalidate.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 0a7da89..386730b 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -125,54 +125,59 @@ the initially created buckets.
 
 Lock Definitions
 ----------------
-
-We use both lmgr locks ("heavyweight" locks) and buffer context locks
-(LWLocks) to control access to a hash index.  lmgr locks are needed for
-long-term locking since there is a (small) risk of deadlock, which we must
-be able to detect.  Buffer context locks are used for short-term access
-control to individual pages of the index.
-
-LockPage(rel, page), where page is the page number of a hash bucket page,
-represents the right to split or compact an individual bucket.  A process
-splitting a bucket must exclusive-lock both old and new halves of the
-bucket until it is done.  A process doing VACUUM must exclusive-lock the
-bucket it is currently purging tuples from.  Processes doing scans or
-insertions must share-lock the bucket they are scanning or inserting into.
-(It is okay to allow concurrent scans and insertions.)
-
-The lmgr lock IDs corresponding to overflow pages are currently unused.
-These are available for possible future refinements.  LockPage(rel, 0)
-is also currently undefined (it was previously used to represent the right
-to modify the hash-code-to-bucket mapping, but it is no longer needed for
-that purpose).
-
-Note that these lock definitions are conceptually distinct from any sort
-of lock on the pages whose numbers they share.  A process must also obtain
-read or write buffer lock on the metapage or bucket page before accessing
-said page.
-
-Processes performing hash index scans must hold share lock on the bucket
-they are scanning throughout the scan.  This seems to be essential, since
-there is no reasonable way for a scan to cope with its bucket being split
-underneath it.  This creates a possibility of deadlock external to the
-hash index code, since a process holding one of these locks could block
-waiting for an unrelated lock held by another process.  If that process
-then does something that requires exclusive lock on the bucket, we have
-deadlock.  Therefore the bucket locks must be lmgr locks so that deadlock
-can be detected and recovered from.
-
-Processes must obtain read (share) buffer context lock on any hash index
-page while reading it, and write (exclusive) lock while modifying it.
-To prevent deadlock we enforce these coding rules: no buffer lock may be
-held long term (across index AM calls), nor may any buffer lock be held
-while waiting for an lmgr lock, nor may more than one buffer lock
-be held at a time by any one process.  (The third restriction is probably
-stronger than necessary, but it makes the proof of no deadlock obvious.)
+Concurrency control for hash indexes is provided using buffer content
+locks, buffer pins, and cleanup locks.   Here as elsewhere in PostgreSQL,
+cleanup lock means that we hold an exclusive lock on the buffer and have
+observed at some point after acquiring the lock that we hold the only pin
+on that buffer.  For hash indexes, a cleanup lock on a primary bucket page
+represents the right to perform an arbitrary reorganization of the entire
+bucket.  Therefore, scans retain a pin on the primary bucket page for the
+bucket they are currently scanning.  Splitting a bucket requires a cleanup
+lock on both the old and new primary bucket pages.  VACUUM therefore takes
+a cleanup lock on every bucket page in order to remove tuples.  It can also
+remove tuples copied to a new bucket by any previous split operation, because
+the cleanup lock taken on the primary bucket page guarantees that no scans
+which started prior to the most recent split can still be in progress.  After
+cleaning each page individually, it attempts to take a cleanup lock on the
+primary bucket page in order to "squeeze" the bucket down to the minimum
+possible number of pages.
+
+To avoid deadlocks, we must be consistent about the lock order in which we
+lock the buckets for operations that require locks on two different buckets.
+We use the rule "first lock the old bucket and then the new bucket"; basically,
+lock the lower-numbered bucket first.
+
+To avoid deadlock in operations that require locking the metapage and other
+buckets, we always take the lock on the other bucket first and then on the metapage.
 
 
 Pseudocode Algorithms
 ---------------------
 
+Various flags that are used in hash index operations are described as below:
+
+The split-in-progress flag indicates that a split operation is in progress for
+a bucket.  During the split operation, this flag is set on both the old and new
+buckets.  This flag is cleared once the split operation is finished.
+
+The moved-by-split flag on a tuple indicates that the tuple was moved from the
+old to the new bucket.  Concurrent scans can skip such tuples till the split
+operation is finished.  Once a tuple is marked as moved-by-split, it will remain
+so forever, but that does no harm.  We intentionally do not clear the flag, as
+clearing it would generate additional I/O which is not necessary.
+
+The split-cleanup flag indicates that the bucket contains tuples that were
+moved due to a split.  It is set only on the old bucket.  The reason we need it
+in addition to the split-in-progress flag is to distinguish the case when the
+split is over (i.e. the split-in-progress flag is cleared).  It is used both by
+vacuum and during a re-split operation.  Vacuum uses it to decide whether it
+needs to clear the tuples that were moved-by-split from the bucket along with
+dead tuples.  A re-split of the bucket uses it to ensure that it doesn't start
+a new split from a bucket without first clearing the previously moved tuples
+from the old bucket.  The usage by re-split helps to keep bloat under control
+and makes the design somewhat simpler, as we never have to handle the situation
+where a bucket contains dead tuples from multiple splits.
+
 The operations we need to support are: readers scanning the index for
 entries of a particular hash code (which by definition are all in the same
 bucket); insertion of a new tuple into the correct bucket; enlarging the
@@ -193,38 +198,51 @@ The reader algorithm is:
 		release meta page buffer content lock
 		if (correct bucket page is already locked)
 			break
-		release any existing bucket page lock (if a concurrent split happened)
-		take heavyweight bucket lock
+		release any existing bucket page buffer content lock (if a concurrent split happened)
+		take the buffer content lock on bucket page in shared mode
 		retake meta page buffer content lock in shared mode
--- then, per read request:
 	release pin on metapage
-	read current page of bucket and take shared buffer content lock
-		step to next page if necessary (no chaining of locks)
+	if the split is in progress for current bucket and this is a new bucket
+		release the buffer content lock on current bucket page
+		pin and acquire the buffer content lock on old bucket in shared mode
+		release the buffer content lock on old bucket, but not pin
+		retake the buffer content lock on new bucket
+		mark the scan such that it skips the tuples that are marked as moved by split
+-- then, per read request:
+	step to next page if necessary (no chaining of locks)
+		if the scan indicates moved by split, then move to old bucket after the scan
+		of current bucket is finished
 	get tuple
 	release buffer content lock and pin on current page
 -- at scan shutdown:
-	release bucket share-lock
-
-We can't hold the metapage lock while acquiring a lock on the target bucket,
-because that might result in an undetected deadlock (lwlocks do not participate
-in deadlock detection).  Instead, we relock the metapage after acquiring the
-bucket page lock and check whether the bucket has been split.  If not, we're
-done.  If so, we release our previously-acquired lock and repeat the process
-using the new bucket number.  Holding the bucket sharelock for
-the remainder of the scan prevents the reader's current-tuple pointer from
-being invalidated by splits or compactions.  Notice that the reader's lock
-does not prevent other buckets from being split or compacted.
+	release any pin we hold on current buffer, old bucket buffer, new bucket buffer
+
+We don't want to hold the meta page lock while acquiring the content lock on
+bucket page, because that might result in poor concurrency.  Instead, we relock
+the metapage after acquiring the bucket page content lock and check whether the
+bucket has been split.  If not, we're done.  If so, we release our
+previously-acquired content lock, but not pin and repeat the process using the
+new bucket number.  Holding the buffer pin on bucket page for the remainder of
+the scan prevents the reader's current-tuple pointer from being invalidated by
+splits or compactions.  Notice that the reader's pin does not prevent other
+buckets from being split or compacted.
 
 To keep concurrency reasonably good, we require readers to cope with
 concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the page sharelock.  Since
-deletion is not possible while a reader holds the bucket sharelock, and
-we assume that heap tuple TIDs are unique, this can be implemented by
+their current scan position after re-acquiring the buffer content lock on
+page.  Since deletion is not possible while a reader holds the pin on bucket,
+and we assume that heap tuple TIDs are unique, this can be implemented by
 searching for the same heap tuple TID previously returned.  Insertion does
 not move index entries across pages, so the previously-returned index entry
 should always be on the same page, at the same or higher offset number,
 as it was before.
 
+To allow scans during a bucket split, if at the start of the scan the bucket is
+marked as split-in-progress, the scan reads all the tuples in that bucket except
+for those that are marked as moved-by-split.  Once it finishes scanning all the
+tuples in the current bucket, it scans the old bucket from which this bucket
+was formed by the split.  This happens only for the new half of the split.
+
 The insertion algorithm is rather similar:
 
 	pin meta page and take buffer content lock in shared mode
@@ -233,18 +251,27 @@ The insertion algorithm is rather similar:
 		release meta page buffer content lock
 		if (correct bucket page is already locked)
 			break
-		release any existing bucket page lock (if a concurrent split happened)
-		take heavyweight bucket lock in shared mode
+		release any existing bucket page buffer content lock (if a concurrent split happened)
+		take the buffer content lock on bucket page in exclusive mode
 		retake meta page buffer content lock in shared mode
--- (so far same as reader)
 	release pin on metapage
-	pin current page of bucket and take exclusive buffer content lock
-	if full, release, read/exclusive-lock next page; repeat as needed
+-- (so far same as reader, except for acquisition of buffer content lock in
+	exclusive mode on primary bucket page)
+	if the split-in-progress flag is set for bucket in old half of split
+	and pin count on it is one, then finish the split
+		release the buffer content lock on current bucket
+		get the new bucket (bucket which was in process of split from current bucket) using current bucket
+		scan the new bucket and form the hash table of TIDs
+		conditionally get the cleanup lock on old and new buckets
+		if we get the lock on both the buckets
+			finish the split using algorithm mentioned below for split
+		release the pin on old bucket and restart the insert from beginning.
+	if current page is full, release lock but not pin, read/exclusive-lock next page; repeat as needed
 	>> see below if no space in any page of bucket
 	insert tuple at appropriate place in page
 	mark current page dirty and release buffer content lock and pin
-	release heavyweight share-lock
-	pin meta page and take buffer content lock in shared mode
+	if the current page is not a bucket page, release the pin on bucket page
+	pin meta page and take buffer content lock in exclusive mode
 	increment tuple count, decide if split needed
 	mark meta page dirty and release buffer content lock and pin
 	done if no split needed, else enter Split algorithm below
@@ -256,11 +283,13 @@ bucket that is being actively scanned, because readers can cope with this
 as explained above.  We only need the short-term buffer locks to ensure
 that readers do not see a partially-updated page.
 
-It is clearly impossible for readers and inserters to deadlock, and in
-fact this algorithm allows them a very high degree of concurrency.
-(The exclusive metapage lock taken to update the tuple count is stronger
-than necessary, since readers do not care about the tuple count, but the
-lock is held for such a short time that this is probably not an issue.)
+To avoid deadlock between readers and inserters, whenever there is a need
+to lock multiple buckets, we always take them in the order suggested in Lock
+Definitions above.  This algorithm allows them a very high degree of
+concurrency.  (The exclusive metapage lock taken to update the tuple count
+is stronger than necessary, since readers do not care about the tuple count,
+but the lock is held for such a short time that this is probably not an
+issue.)
 
 When an inserter cannot find space in any existing page of a bucket, it
 must obtain an overflow page and add that page to the bucket's chain.
@@ -271,46 +300,68 @@ index is overfull (has a higher-than-wanted ratio of tuples to buckets).
 The algorithm attempts, but does not necessarily succeed, to split one
 existing bucket in two, thereby lowering the fill ratio:
 
-	pin meta page and take buffer content lock in exclusive mode
-	check split still needed
-	if split not needed anymore, drop buffer content lock and pin and exit
-	decide which bucket to split
-	Attempt to X-lock old bucket number (definitely could fail)
-	Attempt to X-lock new bucket number (shouldn't fail, but...)
-	if above fail, drop locks and pin and exit
+	expand:
+		take buffer content lock in exclusive mode on meta page
+		check split still needed
+		if split not needed anymore, drop buffer content lock and exit
+		decide which bucket to split
+		Attempt to acquire cleanup lock on old bucket number (definitely could fail)
+		if above fail, release lock and pin and exit
+		if the split-in-progress flag is set, then finish the split
+			conditionally get the content lock on new bucket which was involved in split
+			if got the lock on new bucket
+				finish the split using algorithm mentioned below for split
+				release the buffer content lock and pin on old and new buckets
+				try to expand from start
+			else
+				release the buffer content lock and pin on old bucket and exit
+		if the split-cleanup flag (indicates that tuples are moved by split) is set on bucket
+			release the buffer content lock on meta page
+			remove the tuples that don't belong to this bucket; see bucket cleanup below
+	Attempt to acquire cleanup lock on new bucket number (shouldn't fail, but...)
 	update meta page to reflect new number of buckets
-	mark meta page dirty and release buffer content lock and pin
+	mark meta page dirty and release buffer content lock
 	-- now, accesses to all other buckets can proceed.
 	Perform actual split of bucket, moving tuples as needed
 	>> see below about acquiring needed extra space
-	Release X-locks of old and new buckets
+
+	split guts
+	mark the old and new buckets indicating split-in-progress
+	if we are finishing the incomplete split
+		probe the temporary hash table to check if the value already exists in new bucket
+	copy the tuples that belongs to new bucket from old bucket
+	during copy mark such tuples as move-by-split
+	release lock but not pin for primary bucket page of old bucket,
+	read/shared-lock next page; repeat as needed
+	>> see below if no space in bucket page of new bucket
+	ensure to have exclusive-lock on both old and new buckets in that order
+	clear the split-in-progress flag from both the buckets
+	mark the old bucket indicating split-cleanup
+	mark buffers dirty and release the locks and pins on both old and new buckets
 
 Note the metapage lock is not held while the actual tuple rearrangement is
 performed, so accesses to other buckets can proceed in parallel; in fact,
 it's possible for multiple bucket splits to proceed in parallel.
 
-Split's attempt to X-lock the old bucket number could fail if another
-process holds S-lock on it.  We do not want to wait if that happens, first
-because we don't want to wait while holding the metapage exclusive-lock,
-and second because it could very easily result in deadlock.  (The other
-process might be out of the hash AM altogether, and could do something
-that blocks on another lock this process holds; so even if the hash
-algorithm itself is deadlock-free, a user-induced deadlock could occur.)
-So, this is a conditional LockAcquire operation, and if it fails we just
-abandon the attempt to split.  This is all right since the index is
-overfull but perfectly functional.  Every subsequent inserter will try to
-split, and eventually one will succeed.  If multiple inserters failed to
-split, the index might still be overfull, but eventually, the index will
+The split operation's attempt to acquire cleanup-lock on the old bucket number
+could fail if another process holds any lock or pin on it.  We do not want to
+wait if that happens, because we don't want to wait while holding the metapage
+exclusive-lock.  So, this is a conditional LWLockAcquire operation, and if
+it fails we just abandon the attempt to split.  This is all right since the
+index is overfull but perfectly functional.  Every subsequent inserter will
+try to split, and eventually one will succeed.  If multiple inserters failed
+to split, the index might still be overfull, but eventually, the index will
 not be overfull and split attempts will stop.  (We could make a successful
 splitter loop to see if the index is still overfull, but it seems better to
 distribute the split overhead across successive insertions.)
 
 A problem is that if a split fails partway through (eg due to insufficient
-disk space) the index is left corrupt.  The probability of that could be
-made quite low if we grab a free page or two before we update the meta
-page, but the only real solution is to treat a split as a WAL-loggable,
+disk space or crash) the index is left corrupt.  The probability of that
+could be made quite low if we grab a free page or two before we update the
+meta page, but the only real solution is to treat a split as a WAL-loggable,
 must-complete action.  I'm not planning to teach hash about WAL in this
-go-round.
+go-round.  However, we do try to finish the incomplete splits during insert
+and split.
 
 The fourth operation is garbage collection (bulk deletion):
 
@@ -319,9 +370,13 @@ The fourth operation is garbage collection (bulk deletion):
 	fetch current max bucket number
 	release meta page buffer content lock and pin
 	while next bucket <= max bucket do
-		Acquire X lock on target bucket
-		Scan and remove tuples, compact free space as needed
-		Release X lock
+		Acquire cleanup lock on primary page of target bucket
+		Scan and remove tuples
+		For each overflow page, lock the next page in the chain before
+		releasing the lock on the current page (see the lock-chaining
+		sketch after this procedure)
+		Reacquire the buffer content lock in exclusive mode on the primary
+		bucket page
+		If the buffer pin count is one, compact free space as needed
+		Release lock
 		next bucket ++
 	end loop
 	pin metapage and take buffer content lock in exclusive mode
@@ -330,20 +385,24 @@ The fourth operation is garbage collection (bulk deletion):
 	else update metapage tuple count
 	mark meta page dirty and release buffer content lock and pin
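
The lock-chaining step referenced above is the usual hand-over-hand pattern.
Here is a standalone sketch (illustrative types and names only, with pthreads
mutexes standing in for buffer content locks) showing why a concurrent scan
can never overtake the cleanup:

    #include <pthread.h>
    #include <stddef.h>

    typedef struct Page
    {
        pthread_mutex_t lock;
        struct Page    *next;
        /* ... index tuples ... */
    } Page;

    /* Walk the chain, never releasing a page before its successor is locked. */
    static void
    cleanup_bucket_chain(Page *primary)
    {
        Page   *cur = primary;

        pthread_mutex_lock(&cur->lock);         /* the cleanup lock in the real code */
        for (;;)
        {
            Page   *next = cur->next;

            /* ... delete dead tuples on cur here ... */

            if (next == NULL)
                break;
            pthread_mutex_lock(&next->lock);    /* lock the successor first ...     */
            pthread_mutex_unlock(&cur->lock);   /* ... only then let scans onto cur */
            cur = next;
        }
        pthread_mutex_unlock(&cur->lock);
    }

    int
    main(void)
    {
        static Page overflow = {PTHREAD_MUTEX_INITIALIZER, NULL};
        static Page bucket = {PTHREAD_MUTEX_INITIALIZER, &overflow};

        cleanup_bucket_chain(&bucket);
        return 0;
    }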
 
-Note that this is designed to allow concurrent splits.  If a split occurs,
-tuples relocated into the new bucket will be visited twice by the scan,
-but that does no harm.  (We must however be careful about the statistics
-reported by the VACUUM operation.  What we can do is count the number of
-tuples scanned, and believe this in preference to the stored tuple count
-if the stored tuple count and number of buckets did *not* change at any
-time during the scan.  This provides a way of correcting the stored tuple
-count if it gets out of sync for some reason.  But if a split or insertion
-does occur concurrently, the scan count is untrustworthy; instead,
-subtract the number of tuples deleted from the stored tuple count and
-use that.)
-
-The exclusive lock request could deadlock in some strange scenarios, but
-we can just error out without any great harm being done.
+Note that this is designed to allow concurrent splits and scans.  If a
+split occurs, tuples relocated into the new bucket will be visited twice
+by the scan, but that does no harm.  Because we release the lock on the
+bucket page during the cleanup scan of a bucket, a concurrent scan may
+start on the bucket, but it will always stay behind the cleanup.  Scans
+must be kept behind cleanup, because otherwise vacuum could remove TIDs
+that are still required to complete the scan: a scan that returns multiple
+tuples from the same bucket page expects the next valid TID to be greater
+than or equal to the current TID, so it could miss tuples if the page were
+cleaned underneath it.  This holds for backward scans as well, since a
+backward scan also traverses each bucket from its primary page to the last
+overflow page in the chain.  We must be careful about the statistics
+reported by the VACUUM operation.  What we can do is count the number of tuples
+scanned, and believe this in preference to the stored tuple count if the stored
+tuple count and number of buckets did *not* change at any time during the scan.
+This provides a way of correcting the stored tuple count if it gets out of sync
+for some reason.  But if a split or insertion does occur concurrently, the scan
+count is untrustworthy; instead, subtract the number of tuples deleted from the
+stored tuple count and use that.
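
In code form, the choice of estimate described above boils down to something
like the following sketch (names are illustrative, not the patch's):

    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Pick the tuple-count estimate for VACUUM: trust the freshly scanned
     * count only if the metapage's tuple count and bucket count never
     * changed during the scan; otherwise adjust the stored count.
     */
    static double
    vacuum_tuple_estimate(double num_scanned, double num_deleted,
                          double stored_count, bool meta_changed_during_scan)
    {
        if (!meta_changed_during_scan)
            return num_scanned;             /* also repairs any drift */
        return stored_count - num_deleted;  /* scan count untrustworthy */
    }

    int
    main(void)
    {
        printf("%.0f\n", vacuum_tuple_estimate(990, 10, 1000, false)); /* 990: trust the scan */
        printf("%.0f\n", vacuum_tuple_estimate(500, 10, 1000, true));  /* 990: stored - deleted */
        return 0;
    }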
 
 
 Free Space Management
@@ -417,13 +476,11 @@ free page; there can be no other process holding lock on it.
 
 Bucket splitting uses a similar algorithm if it has to extend the new
 bucket, but it need not worry about concurrent extension since it has
-exclusive lock on the new bucket.
+buffer content lock in exclusive mode on the new bucket.
 
-Freeing an overflow page is done by garbage collection and by bucket
-splitting (the old bucket may contain no-longer-needed overflow pages).
-In both cases, the process holds exclusive lock on the containing bucket,
-so need not worry about other accessors of pages in the bucket.  The
-algorithm is:
+Freeing an overflow page requires the process to hold a buffer content lock
+in exclusive mode on the containing bucket, so it need not worry about other
+accessors of pages in the bucket.  The algorithm is:
 
 	delink overflow page from bucket chain
 	(this requires read/update/write/release of fore and aft siblings)
@@ -454,14 +511,6 @@ locks.  Since they need no lmgr locks, deadlock is not possible.
 Other Notes
 -----------
 
-All the shenanigans with locking prevent a split occurring while *another*
-process is stopped in a given bucket.  They do not ensure that one of
-our *own* backend's scans is not stopped in the bucket, because lmgr
-doesn't consider a process's own locks to conflict.  So the Split
-algorithm must check for that case separately before deciding it can go
-ahead with the split.  VACUUM does not have this problem since nothing
-else can be happening within the vacuuming backend.
-
-Should we instead try to fix the state of any conflicting local scan?
-Seems mighty ugly --- got to move the held bucket S-lock as well as lots
-of other messiness.  For now, just punt and don't split.
+Cleanup locks prevent a split from occurring while *another* process is
+stopped in a given bucket.  They also ensure that none of our *own*
+backend's scans is stopped in the bucket.
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..15b65f9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -287,10 +287,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		/*
 		 * An insertion into the current index page could have happened while
 		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold share lock on
-		 * the bucket, no deletions or splits could have occurred; therefore
-		 * we can expect that the TID still exists in the current index page,
-		 * at an offset >= where we were.)
+		 * for the TID we previously returned.  (Because we hold a pin on the
+		 * primary bucket page, no deletions or splits could have occurred;
+		 * therefore we can expect that the TID still exists in the current
+		 * index page, at an offset >= where we were.)
 		 */
 		OffsetNumber maxoffnum;
 
@@ -424,17 +424,16 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_bucket_valid = false;
-	so->hashso_bucket_blkno = 0;
 	so->hashso_curbuf = InvalidBuffer;
+	so->hashso_bucket_buf = InvalidBuffer;
+	so->hashso_old_bucket_buf = InvalidBuffer;
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
 	ItemPointerSetInvalid(&(so->hashso_heappos));
 
-	scan->opaque = so;
+	so->hashso_skip_moved_tuples = false;
 
-	/* register scan in case we change pages it's using */
-	_hash_regscan(scan);
+	scan->opaque = so;
 
 	return scan;
 }
@@ -449,15 +448,7 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
-
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	_hash_dropscanbuf(rel, so);
 
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
@@ -469,8 +460,9 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 		memmove(scan->keyData,
 				scankey,
 				scan->numberOfKeys * sizeof(ScanKeyData));
-		so->hashso_bucket_valid = false;
 	}
+
+	so->hashso_skip_moved_tuples = false;
 }
 
 /*
@@ -482,18 +474,7 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/* don't need scan registered anymore */
-	_hash_dropscan(scan);
-
-	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
-
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	_hash_dropscanbuf(rel, so);
 
 	pfree(so);
 	scan->opaque = NULL;
@@ -504,6 +485,9 @@ hashendscan(IndexScanDesc scan)
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
  *
+ * This function also deletes tuples that were moved by a split to another
+ * bucket.
+ *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 IndexBulkDeleteResult *
@@ -548,83 +532,47 @@ loop_top:
 	{
 		BlockNumber bucket_blkno;
 		BlockNumber blkno;
-		bool		bucket_dirty = false;
+		Buffer		bucket_buf;
+		Buffer		buf;
+		HashPageOpaque bucket_opaque;
+		Page		page;
+		bool		split_cleanup = false;
 
 		/* Get address of bucket's start page */
 		bucket_blkno = BUCKET_TO_BLKNO(&local_metapage, cur_bucket);
 
-		/* Exclusive-lock the bucket so we can shrink it */
-		_hash_getlock(rel, bucket_blkno, HASH_EXCLUSIVE);
-
-		/* Shouldn't have any active scans locally, either */
-		if (_hash_has_active_scan(rel, cur_bucket))
-			elog(ERROR, "hash index has active scan during VACUUM");
-
-		/* Scan each page in bucket */
 		blkno = bucket_blkno;
-		while (BlockNumberIsValid(blkno))
-		{
-			Buffer		buf;
-			Page		page;
-			HashPageOpaque opaque;
-			OffsetNumber offno;
-			OffsetNumber maxoffno;
-			OffsetNumber deletable[MaxOffsetNumber];
-			int			ndeletable = 0;
-
-			vacuum_delay_point();
-
-			buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-											 info->strategy);
-			page = BufferGetPage(buf);
-			opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-			Assert(opaque->hasho_bucket == cur_bucket);
-
-			/* Scan each tuple in page */
-			maxoffno = PageGetMaxOffsetNumber(page);
-			for (offno = FirstOffsetNumber;
-				 offno <= maxoffno;
-				 offno = OffsetNumberNext(offno))
-			{
-				IndexTuple	itup;
-				ItemPointer htup;
 
-				itup = (IndexTuple) PageGetItem(page,
-												PageGetItemId(page, offno));
-				htup = &(itup->t_tid);
-				if (callback(htup, callback_state))
-				{
-					/* mark the item for deletion */
-					deletable[ndeletable++] = offno;
-					tuples_removed += 1;
-				}
-				else
-					num_index_tuples += 1;
-			}
+		/*
+		 * We need to acquire a cleanup lock on the primary bucket page to
+		 * wait out concurrent scans before deleting the dead tuples.
+		 */
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy);
+		LockBufferForCleanup(buf);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
 
-			/*
-			 * Apply deletions and write page if needed, advance to next page.
-			 */
-			blkno = opaque->hasho_nextblkno;
+		page = BufferGetPage(buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
-			if (ndeletable > 0)
-			{
-				PageIndexMultiDelete(page, deletable, ndeletable);
-				_hash_wrtbuf(rel, buf);
-				bucket_dirty = true;
-			}
-			else
-				_hash_relbuf(rel, buf);
-		}
+		/*
+		 * If the bucket contains tuples that were moved by a split, then we
+		 * need to delete them.  We can't delete such tuples while the split
+		 * operation on the bucket is unfinished, as they are still needed by
+		 * scans.
+		 */
+		if (!H_BUCKET_BEING_SPLIT(bucket_opaque) &&
+			H_NEEDS_SPLIT_CLEANUP(bucket_opaque))
+			split_cleanup = true;
+
+		bucket_buf = buf;
 
-		/* If we deleted anything, try to compact free space */
-		if (bucket_dirty)
-			_hash_squeezebucket(rel, cur_bucket, bucket_blkno,
-								info->strategy);
+		hashbucketcleanup(rel, cur_bucket, bucket_buf, blkno, info->strategy,
+						  local_metapage.hashm_maxbucket,
+						  local_metapage.hashm_highmask,
+						  local_metapage.hashm_lowmask, &tuples_removed,
+						  &num_index_tuples, split_cleanup,
+						  callback, callback_state);
 
-		/* Release bucket lock */
-		_hash_droplock(rel, bucket_blkno, HASH_EXCLUSIVE);
+		_hash_relbuf(rel, bucket_buf);
 
 		/* Advance to next bucket */
 		cur_bucket++;
@@ -705,6 +653,208 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	return stats;
 }
 
+/*
+ * Helper function to perform deletion of index entries from a bucket.
+ *
+ * This function expects that the caller has acquired a cleanup lock on the
+ * primary bucket page, and will return with a write lock again held on the
+ * primary bucket page.  The lock won't necessarily be held continuously,
+ * though, because we'll release it when visiting overflow pages.
+ *
+ * It would be very bad if this function cleaned a page while some other
+ * backend was in the midst of scanning it, because hashgettuple assumes
+ * that the next valid TID will be greater than or equal to the current
+ * valid TID.  There can't be any concurrent scans in progress when we first
+ * enter this function because of the cleanup lock we hold on the primary
+ * bucket page, but as soon as we release that lock, there might be.  We
+ * handle that by conspiring to prevent those scans from passing our cleanup
+ * scan.  To do that, we lock the next page in the bucket chain before
+ * releasing the lock on the previous page.  (This type of lock chaining is
+ * not ideal, so we might want to look for a better solution at some point.)
+ *
+ * We need to retain a pin on the primary bucket to ensure that no concurrent
+ * split can start.
+ */
+void
+hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
+				  BlockNumber bucket_blkno, BufferAccessStrategy bstrategy,
+				  uint32 maxbucket, uint32 highmask, uint32 lowmask,
+				  double *tuples_removed, double *num_index_tuples,
+				  bool split_cleanup,
+				  IndexBulkDeleteCallback callback, void *callback_state)
+{
+	BlockNumber blkno;
+	Buffer		buf;
+	Bucket new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
+	bool		bucket_dirty = false;
+
+	blkno = bucket_blkno;
+	buf = bucket_buf;
+
+	if (split_cleanup)
+		new_bucket = _hash_get_newbucket_from_oldbucket(rel, cur_bucket,
+														lowmask, maxbucket);
+
+	/* Scan each page in bucket */
+	for (;;)
+	{
+		HashPageOpaque opaque;
+		OffsetNumber offno;
+		OffsetNumber maxoffno;
+		Buffer		next_buf;
+		Page		page;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+		bool		retain_pin = false;
+		bool		curr_page_dirty = false;
+
+		vacuum_delay_point();
+
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		/* Scan each tuple in page */
+		maxoffno = PageGetMaxOffsetNumber(page);
+		for (offno = FirstOffsetNumber;
+			 offno <= maxoffno;
+			 offno = OffsetNumberNext(offno))
+		{
+			ItemPointer htup;
+			IndexTuple	itup;
+			Bucket		bucket;
+			bool		kill_tuple = false;
+
+			itup = (IndexTuple) PageGetItem(page,
+											PageGetItemId(page, offno));
+			htup = &(itup->t_tid);
+
+			/*
+			 * To remove the dead tuples, we strictly rely on the results of
+			 * the callback function; refer to btvacuumpage for the detailed
+			 * reason.
+			 */
+			if (callback && callback(htup, callback_state))
+			{
+				kill_tuple = true;
+				if (tuples_removed)
+					*tuples_removed += 1;
+			}
+			else if (split_cleanup)
+			{
+				/* delete the tuples that are moved by split. */
+				bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+											  maxbucket,
+											  highmask,
+											  lowmask);
+				/* mark the item for deletion */
+				if (bucket != cur_bucket)
+				{
+					/*
+					 * We expect tuples to belong either to the current bucket
+					 * or to new_bucket.  This is ensured because we don't
+					 * allow further splits from a bucket that contains
+					 * garbage.  See
+					 * comments in _hash_expandtable.
+					 */
+					Assert(bucket == new_bucket);
+					kill_tuple = true;
+				}
+			}
+
+			if (kill_tuple)
+			{
+				/* mark the item for deletion */
+				deletable[ndeletable++] = offno;
+			}
+			else
+			{
+				/* we're keeping it, so count it */
+				if (num_index_tuples)
+					*num_index_tuples += 1;
+			}
+		}
+
+		/* retain the pin on primary bucket page till end of bucket scan */
+		if (blkno == bucket_blkno)
+			retain_pin = true;
+		else
+			retain_pin = false;
+
+		blkno = opaque->hasho_nextblkno;
+
+		/*
+		 * Apply deletions, advance to next page and write page if needed.
+		 */
+		if (ndeletable > 0)
+		{
+			PageIndexMultiDelete(page, deletable, ndeletable);
+			bucket_dirty = true;
+			curr_page_dirty = true;
+		}
+
+		/* bail out if there are no more pages to scan. */
+		if (!BlockNumberIsValid(blkno))
+			break;
+
+		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											  LH_OVERFLOW_PAGE,
+											  bstrategy);
+
+		/*
+		 * release the lock on previous page after acquiring the lock on next
+		 * page
+		 */
+		if (curr_page_dirty)
+		{
+			if (retain_pin)
+				_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+			else
+				_hash_wrtbuf(rel, buf);
+			curr_page_dirty = false;
+		}
+		else if (retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
+
+		buf = next_buf;
+	}
+
+	/*
+	 * Lock the bucket page to clear the garbage flag and squeeze the bucket.
+	 * If the current buffer is the same as the bucket buffer, we already hold
+	 * a lock on the bucket page.
+	 */
+	if (buf != bucket_buf)
+	{
+		_hash_relbuf(rel, buf);
+		_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+	}
+
+	/*
+	 * Clear the garbage flag from the bucket after deleting the tuples that
+	 * were moved by split.  We purposely clear the flag before squeezing the
+	 * bucket so that, after a restart, vacuum doesn't again try to delete the
+	 * moved-by-split tuples.
+	 */
+	if (split_cleanup)
+	{
+		HashPageOpaque bucket_opaque;
+		Page		page;
+
+		page = BufferGetPage(bucket_buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
+	}
+
+	/*
+	 * If we deleted anything, try to compact free space.  For squeezing the
+	 * bucket, we must have a cleanup lock; otherwise it could disturb the
+	 * ordering of tuples seen by a scan that started before the squeeze.
+	 */
+	if (bucket_dirty && IsBufferCleanupOK(bucket_buf))
+		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
+							bstrategy);
+}
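
For readers following the split-cleanup logic above: whether a tuple was
moved by a split is decided by re-deriving its bucket from the hash key,
using the masks captured from the metapage.  The standalone sketch below is
believed to match the masking scheme _hash_hashkey2bucket uses, but treat it
as an illustration rather than a copy of that function:

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t Bucket;

    /* Map a hash key to a bucket: mask to the current power of two, and
     * wrap keys that land in a not-yet-created bucket back with lowmask. */
    static Bucket
    hashkey_to_bucket(uint32_t hashkey, uint32_t maxbucket,
                      uint32_t highmask, uint32_t lowmask)
    {
        Bucket  bucket = hashkey & highmask;

        if (bucket > maxbucket)
            bucket = bucket & lowmask;  /* that bucket doesn't exist yet */
        return bucket;
    }

    int
    main(void)
    {
        /* 5 buckets (0..4): highmask = 7, lowmask = 3 */
        for (uint32_t key = 0; key < 8; key++)
            printf("hashkey %u -> bucket %u\n",
                   key, hashkey_to_bucket(key, 4, 7, 3));
        return 0;
    }

A tuple found in the old bucket whose recomputed bucket differs from the
current one must therefore be a moved-by-split leftover.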
 
 void
 hash_redo(XLogReaderState *record)
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index acd2e64..ddf4681 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -28,18 +28,22 @@
 void
 _hash_doinsert(Relation rel, IndexTuple itup)
 {
-	Buffer		buf;
+	Buffer		buf = InvalidBuffer;
+	Buffer		bucket_buf;
 	Buffer		metabuf;
 	HashMetaPage metap;
 	BlockNumber blkno;
-	BlockNumber oldblkno = InvalidBlockNumber;
-	bool		retry = false;
+	BlockNumber oldblkno;
+	bool		retry;
 	Page		page;
 	HashPageOpaque pageopaque;
 	Size		itemsz;
 	bool		do_expand;
 	uint32		hashkey;
 	Bucket		bucket;
+	uint32		maxbucket;
+	uint32		highmask;
+	uint32		lowmask;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -51,6 +55,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+restart_insert:
 	/* Read the metapage */
 	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
 	metap = HashPageGetMeta(BufferGetPage(metabuf));
@@ -69,6 +74,9 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 						itemsz, HashMaxItemSize((Page) metap)),
 			errhint("Values larger than a buffer page cannot be indexed.")));
 
+	oldblkno = InvalidBlockNumber;
+	retry = false;
+
 	/*
 	 * Loop until we get a lock on the correct target bucket.
 	 */
@@ -84,21 +92,32 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 
 		blkno = BUCKET_TO_BLKNO(metap, bucket);
 
+		/*
+		 * Copy bucket mapping info now; see the comment in
+		 * _hash_expandtable where we copy this information before calling
+		 * _hash_splitbucket to see why this is okay.
+		 */
+		maxbucket = metap->hashm_maxbucket;
+		highmask = metap->hashm_highmask;
+		lowmask = metap->hashm_lowmask;
+
 		/* Release metapage lock, but keep pin. */
 		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
 
 		/*
-		 * If the previous iteration of this loop locked what is still the
-		 * correct target bucket, we are done.  Otherwise, drop any old lock
-		 * and lock what now appears to be the correct bucket.
+		 * If the previous iteration of this loop locked the primary page of
+		 * what is still the correct target bucket, we are done.  Otherwise,
+		 * drop any old lock before acquiring the new one.
 		 */
 		if (retry)
 		{
 			if (oldblkno == blkno)
 				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
+			_hash_relbuf(rel, buf);
 		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+
+		/* Fetch and lock the primary bucket page for the target bucket */
+		buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
 
 		/*
 		 * Reacquire metapage lock and check that no bucket split has taken
@@ -109,12 +128,37 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		retry = true;
 	}
 
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+	/* remember the primary bucket buffer to release the pin on it at end. */
+	bucket_buf = buf;
+
 	page = BufferGetPage(buf);
 	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(pageopaque->hasho_bucket == bucket);
 
+	/*
+	 * If this bucket is in the process of being split, try to finish the
+	 * split before inserting, because that might create room for the
+	 * insertion to proceed without allocating an additional overflow page.
+	 * It's only interesting to finish the split if we're trying to insert
+	 * into the bucket from which we're removing tuples (the "old" bucket),
+	 * not if we're trying to insert into the bucket into which tuples are
+	 * being moved (the "new" bucket).
+	 */
+	if (H_BUCKET_BEING_SPLIT(pageopaque) && IsBufferCleanupOK(buf))
+	{
+
+		/* release the lock on bucket buffer, before completing the split. */
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+
+		_hash_finish_split(rel, metabuf, buf, pageopaque->hasho_bucket,
+						   maxbucket, highmask, lowmask);
+
+		/* release the pin on old and meta buffer.  retry for insert. */
+		_hash_dropbuf(rel, buf);
+		_hash_dropbuf(rel, metabuf);
+		goto restart_insert;
+	}
+
 	/* Do the insertion */
 	while (PageGetFreeSpace(page) < itemsz)
 	{
@@ -127,9 +171,15 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		{
 			/*
 			 * ovfl page exists; go get it.  if it doesn't have room, we'll
-			 * find out next pass through the loop test above.
+			 * find out next pass through the loop test above.  we always
+			 * release both the lock and pin if this is an overflow page, but
+			 * only the lock if this is the primary bucket page, since the pin
+			 * on the primary bucket must be retained throughout the scan.
 			 */
-			_hash_relbuf(rel, buf);
+			if (buf != bucket_buf)
+				_hash_relbuf(rel, buf);
+			else
+				_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 			buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 			page = BufferGetPage(buf);
 		}
@@ -144,7 +194,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 
 			/* chain to a new overflow page */
-			buf = _hash_addovflpage(rel, metabuf, buf);
+			buf = _hash_addovflpage(rel, metabuf, buf, (buf == bucket_buf) ? true : false);
 			page = BufferGetPage(buf);
 
 			/* should fit now, given test above */
@@ -158,11 +208,14 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	/* found page with enough space, so add the item here */
 	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
 
-	/* write and release the modified page */
+	/*
+	 * write and release the modified page.  if the page we modified was an
+	 * overflow page, we also need to separately drop the pin we retained on
+	 * the primary bucket page.
+	 */
 	_hash_wrtbuf(rel, buf);
-
-	/* We can drop the bucket lock now */
-	_hash_droplock(rel, blkno, HASH_SHARE);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
 
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
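
The "compute target bucket, lock it, then re-check the metapage" loop near
the top of _hash_doinsert follows a common optimistic pattern.  The
standalone sketch below shows its shape with toy data structures (none of
PostgreSQL's buffer or lock APIs are used; all names are invented for the
example):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t BlockNumber;

    /* toy metapage: just a bucket count that a concurrent split could bump */
    static uint32_t n_buckets = 4;

    static BlockNumber
    target_block_for(uint32_t hashkey)
    {
        return 1 + hashkey % n_buckets;     /* block 0 plays the "metapage" */
    }

    static void lock_bucket(BlockNumber blkno)   { (void) blkno; /* lock page */ }
    static void unlock_bucket(BlockNumber blkno) { (void) blkno; /* unlock page */ }

    /* Loop until the bucket we have locked is still the correct target. */
    static BlockNumber
    lock_correct_bucket(uint32_t hashkey)
    {
        BlockNumber oldblkno = 0;
        bool        retry = false;

        for (;;)
        {
            /* read the mapping (metapage share-locked in the real code) */
            BlockNumber blkno = target_block_for(hashkey);

            if (retry)
            {
                if (blkno == oldblkno)
                    return blkno;           /* mapping unchanged: done */
                unlock_bucket(oldblkno);    /* a split moved us: drop stale lock */
            }
            lock_bucket(blkno);             /* lock what now looks correct */

            oldblkno = blkno;
            retry = true;
            /* loop back and re-check the mapping under the metapage lock */
        }
    }

    int
    main(void)
    {
        printf("locked block %u\n", lock_correct_bucket(42));
        return 0;
    }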
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index df7af3e..c00d6f5 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -82,23 +82,20 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *
  *	On entry, the caller must hold a pin but no lock on 'buf'.  The pin is
  *	dropped before exiting (we assume the caller is not interested in 'buf'
- *	anymore).  The returned overflow page will be pinned and write-locked;
- *	it is guaranteed to be empty.
+ *	anymore) if not asked to retain.  The pin will be retained only for the
+ *	primary bucket.  The returned overflow page will be pinned and
+ *	write-locked; it is guaranteed to be empty.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
  *
- *	The caller must hold at least share lock on the bucket, to ensure that
- *	no one else tries to compact the bucket meanwhile.  This guarantees that
- *	'buf' won't stop being part of the bucket while it's unlocked.
- *
  * NB: since this could be executed concurrently by multiple processes,
  * one should not assume that the returned overflow page will be the
  * immediate successor of the originally passed 'buf'.  Additional overflow
  * pages might have been added to the bucket chain in between.
  */
 Buffer
-_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
+_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 {
 	Buffer		ovflbuf;
 	Page		page;
@@ -131,7 +128,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 			break;
 
 		/* we assume we do not need to write the unmodified page */
-		_hash_relbuf(rel, buf);
+		if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
 
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
@@ -149,7 +149,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 
 	/* logically chain overflow page to previous page */
 	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	_hash_wrtbuf(rel, buf);
+	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+		_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_wrtbuf(rel, buf);
 
 	return ovflbuf;
 }
@@ -369,12 +372,13 @@ _hash_firstfreebit(uint32 map)
  *	Returns the block number of the page that followed the given page
  *	in the bucket, or InvalidBlockNumber if no following page.
  *
- *	NB: caller must not hold lock on metapage, nor on either page that's
- *	adjacent in the bucket chain.  The caller had better hold exclusive lock
- *	on the bucket, too.
+ *	NB: caller must hold a cleanup lock on the primary bucket page, so that
+ *	concurrent scans can't get confused.  The caller must not hold a lock on
+ *	either page adjacent to this one in the bucket chain (except when it is
+ *	the primary bucket page), nor a lock on the metapage.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf,
+_hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
@@ -413,22 +417,42 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf,
 	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
-	 * deleted.  No concurrency issues since we hold exclusive lock on the
-	 * entire bucket.
+	 * deleted.  No concurrency issues since we hold a cleanup lock on primary
+	 * bucket.  We don't need to acquire a buffer lock to fix the primary
+	 * bucket, as we already have that lock.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
-														 prevblkno,
-														 HASH_WRITE,
+		Buffer		prevbuf;
+		Page		prevpage;
+		HashPageOpaque prevopaque;
+
+		if (prevblkno == bucket_blkno)
+			prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
+										 prevblkno,
+										 RBM_NORMAL,
+										 bstrategy);
+		else
+			prevbuf = _hash_getbuf_with_strategy(rel,
+												 prevblkno,
+												 HASH_WRITE,
 										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		prevpage = BufferGetPage(prevbuf);
-		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+												 bstrategy);
+
+		prevpage = BufferGetPage(prevbuf);
+		prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
 
 		Assert(prevopaque->hasho_bucket == bucket);
 		prevopaque->hasho_nextblkno = nextblkno;
-		_hash_wrtbuf(rel, prevbuf);
+
+
+		if (prevblkno == bucket_blkno)
+		{
+			MarkBufferDirty(prevbuf);
+			ReleaseBuffer(prevbuf);
+		}
+		else
+			_hash_wrtbuf(rel, prevbuf);
 	}
 	if (BlockNumberIsValid(nextblkno))
 	{
@@ -570,8 +594,10 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
  *	required that to be true on entry as well, but it's a lot easier for
  *	callers to leave empty overflow pages and let this guy clean it up.
  *
- *	Caller must hold exclusive lock on the target bucket.  This allows
- *	us to safely lock multiple pages in the bucket.
+ *	Caller must hold cleanup lock on the primary page of the target bucket
+ *	to exclude any concurrent scans, which could easily be confused into
+ *	returning the same tuple more than once or some tuples not at all by
+ *	the rearrangement we are performing here.
  *
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
@@ -580,6 +606,7 @@ void
 _hash_squeezebucket(Relation rel,
 					Bucket bucket,
 					BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy)
 {
 	BlockNumber wblkno;
@@ -591,27 +618,22 @@ _hash_squeezebucket(Relation rel,
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
 	bool		wbuf_dirty;
+	bool		release_buf = false;
 
 	/*
-	 * start squeezing into the base bucket page.
+	 * start squeezing into the primary bucket page.
 	 */
 	wblkno = bucket_blkno;
-	wbuf = _hash_getbuf_with_strategy(rel,
-									  wblkno,
-									  HASH_WRITE,
-									  LH_BUCKET_PAGE,
-									  bstrategy);
+	wbuf = bucket_buf;
 	wpage = BufferGetPage(wbuf);
 	wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 
 	/*
-	 * if there aren't any overflow pages, there's nothing to squeeze.
+	 * if there aren't any overflow pages, there's nothing to squeeze. caller
+	 * is responsible for releasing the lock on primary bucket page.
 	 */
 	if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
-	{
-		_hash_relbuf(rel, wbuf);
 		return;
-	}
 
 	/*
 	 * Find the last page in the bucket chain by starting at the base bucket
@@ -673,12 +695,17 @@ _hash_squeezebucket(Relation rel,
 			{
 				Assert(!PageIsEmpty(wpage));
 
+				if (wblkno != bucket_blkno)
+					release_buf = true;
+
 				wblkno = wopaque->hasho_nextblkno;
 				Assert(BlockNumberIsValid(wblkno));
 
-				if (wbuf_dirty)
+				if (wbuf_dirty && release_buf)
 					_hash_wrtbuf(rel, wbuf);
-				else
+				else if (wbuf_dirty)
+					MarkBufferDirty(wbuf);
+				else if (release_buf)
 					_hash_relbuf(rel, wbuf);
 
 				/* nothing more to do if we reached the read page */
@@ -704,6 +731,7 @@ _hash_squeezebucket(Relation rel,
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
 				wbuf_dirty = false;
+				release_buf = false;
 			}
 
 			/*
@@ -737,19 +765,25 @@ _hash_squeezebucket(Relation rel,
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
 		{
-			/* yes, so release wbuf lock first */
-			if (wbuf_dirty)
+			if (wblkno != bucket_blkno)
+				release_buf = true;
+
+			/* yes, so release wbuf lock first if needed */
+			if (wbuf_dirty && release_buf)
 				_hash_wrtbuf(rel, wbuf);
-			else
+			else if (wbuf_dirty)
+				MarkBufferDirty(wbuf);
+			else if (release_buf)
 				_hash_relbuf(rel, wbuf);
+
 			/* free this overflow page (releases rbuf) */
-			_hash_freeovflpage(rel, rbuf, bstrategy);
+			_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
 			/* done */
 			return;
 		}
 
 		/* free this overflow page, then get the previous one */
-		_hash_freeovflpage(rel, rbuf, bstrategy);
+		_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
 
 		rbuf = _hash_getbuf_with_strategy(rel,
 										  rblkno,
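
_hash_freeovflpage's chain surgery above is ordinary doubly-linked-list
delinking, performed on block numbers stored in the page opaques rather than
on C pointers.  A pointer-based standalone sketch of the same fix-up of the
fore and aft siblings (illustrative types only):

    #include <stddef.h>
    #include <stdio.h>

    typedef struct PageChainNode
    {
        int                     blkno;
        struct PageChainNode   *prev;
        struct PageChainNode   *next;
    } PageChainNode;

    /* Remove one overflow page from the bucket chain by relinking both
     * neighbours, which is why the caller must be able to lock them safely. */
    static void
    delink_overflow_page(PageChainNode *ovfl)
    {
        if (ovfl->prev != NULL)
            ovfl->prev->next = ovfl->next;  /* fix forward link of fore sibling */
        if (ovfl->next != NULL)
            ovfl->next->prev = ovfl->prev;  /* fix backward link of aft sibling */
        ovfl->prev = ovfl->next = NULL;     /* page is now free for reuse */
    }

    int
    main(void)
    {
        PageChainNode a = {1, NULL, NULL}, b = {2, NULL, NULL}, c = {3, NULL, NULL};

        a.next = &b; b.prev = &a; b.next = &c; c.prev = &b;
        delink_overflow_page(&b);
        printf("bucket chain: %d -> %d\n", a.blkno, a.next->blkno);  /* 1 -> 3 */
        return 0;
    }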
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index a5e9d17..c858bde 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -38,10 +38,14 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
 					uint32 nblocks);
 static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
+static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
+					   Bucket obucket, Bucket nbucket, Buffer obuf,
+					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
+					   uint32 highmask, uint32 lowmask);
 
 
 /*
@@ -55,46 +59,6 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 
 
 /*
- * _hash_getlock() -- Acquire an lmgr lock.
- *
- * 'whichlock' should the block number of a bucket's primary bucket page to
- * acquire the per-bucket lock.  (See README for details of the use of these
- * locks.)
- *
- * 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
- */
-void
-_hash_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		LockPage(rel, whichlock, access);
-}
-
-/*
- * _hash_try_getlock() -- Acquire an lmgr lock, but only if it's free.
- *
- * Same as above except we return FALSE without blocking if lock isn't free.
- */
-bool
-_hash_try_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		return ConditionalLockPage(rel, whichlock, access);
-	else
-		return true;
-}
-
-/*
- * _hash_droplock() -- Release an lmgr lock.
- */
-void
-_hash_droplock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		UnlockPage(rel, whichlock, access);
-}
-
-/*
  *	_hash_getbuf() -- Get a buffer by block number for read or write.
  *
  *		'access' must be HASH_READ, HASH_WRITE, or HASH_NOLOCK.
@@ -132,6 +96,35 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access, int flags)
 }
 
 /*
+ * _hash_getbuf_with_condlock_cleanup() -- Try to get a buffer for cleanup.
+ *
+ *		We read the page and try to acquire a cleanup lock.  If we get it,
+ *		we return the buffer; otherwise, we return InvalidBuffer.
+ */
+Buffer
+_hash_getbuf_with_condlock_cleanup(Relation rel, BlockNumber blkno, int flags)
+{
+	Buffer		buf;
+
+	if (blkno == P_NEW)
+		elog(ERROR, "hash AM does not use P_NEW");
+
+	buf = ReadBuffer(rel, blkno);
+
+	if (!ConditionalLockBufferForCleanup(buf))
+	{
+		ReleaseBuffer(buf);
+		return InvalidBuffer;
+	}
+
+	/* ref count and lock type are correct */
+
+	_hash_checkpage(rel, buf, flags);
+
+	return buf;
+}
+
+/*
  *	_hash_getinitbuf() -- Get and initialize a buffer by block number.
  *
  *		This must be used only to fetch pages that are known to be before
@@ -266,6 +259,33 @@ _hash_dropbuf(Relation rel, Buffer buf)
 }
 
 /*
+ *	_hash_dropscanbuf() -- release buffers used in scan.
+ *
+ * This routine unpins the buffers used during scan on which we
+ * hold no lock.
+ */
+void
+_hash_dropscanbuf(Relation rel, HashScanOpaque so)
+{
+	/* release pin we hold on primary bucket page */
+	if (BufferIsValid(so->hashso_bucket_buf) &&
+		so->hashso_bucket_buf != so->hashso_curbuf)
+		_hash_dropbuf(rel, so->hashso_bucket_buf);
+	so->hashso_bucket_buf = InvalidBuffer;
+
+	/* release pin we hold on old primary bucket page */
+	if (BufferIsValid(so->hashso_old_bucket_buf) &&
+		so->hashso_old_bucket_buf != so->hashso_curbuf)
+		_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+	so->hashso_old_bucket_buf = InvalidBuffer;
+
+	/* release any pin we still hold */
+	if (BufferIsValid(so->hashso_curbuf))
+		_hash_dropbuf(rel, so->hashso_curbuf);
+	so->hashso_curbuf = InvalidBuffer;
+}
+
+/*
  *	_hash_wrtbuf() -- write a hash page to disk.
  *
  *		This routine releases the lock held on the buffer and our refcount
@@ -489,9 +509,11 @@ _hash_pageinit(Page page, Size size)
 /*
  * Attempt to expand the hash table by creating one new bucket.
  *
- * This will silently do nothing if it cannot get the needed locks.
+ * This will silently do nothing if we can't get a cleanup lock on the old
+ * or new bucket.
  *
- * The caller should hold no locks on the hash index.
+ * Before attempting a new split, this completes any pending split and
+ * removes tuples left over in the old bucket from a previous split.
  *
  * The caller must hold a pin, but no lock, on the metapage buffer.
  * The buffer is returned in the same state.
@@ -506,10 +528,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	BlockNumber start_oblkno;
 	BlockNumber start_nblkno;
 	Buffer		buf_nblkno;
+	Buffer		buf_oblkno;
+	Page		opage;
+	HashPageOpaque oopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
 
+restart_expand:
+
 	/*
 	 * Write-lock the meta page.  It used to be necessary to acquire a
 	 * heavyweight lock to begin a split, but that is no longer required.
@@ -548,11 +575,16 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		goto fail;
 
 	/*
-	 * Determine which bucket is to be split, and attempt to lock the old
-	 * bucket.  If we can't get the lock, give up.
+	 * Determine which bucket is to be split, and attempt to take cleanup lock
+	 * on the old bucket.  If we can't get the lock, give up.
+	 *
+	 * The cleanup lock protects us not only against other backends, but
+	 * against our own backend as well.
 	 *
-	 * The lock protects us against other backends, but not against our own
-	 * backend.  Must check for active scans separately.
+	 * The cleanup lock is mainly to protect the split from concurrent
+	 * inserts. See src/backend/access/hash/README, Lock Definitions for
+	 * further details.  Due to this locking restriction, if there is any
+	 * pending scan, the split will give up, which is not good but is harmless.
 	 */
 	new_bucket = metap->hashm_maxbucket + 1;
 
@@ -560,14 +592,78 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 
 	start_oblkno = BUCKET_TO_BLKNO(metap, old_bucket);
 
-	if (_hash_has_active_scan(rel, old_bucket))
+	buf_oblkno = _hash_getbuf_with_condlock_cleanup(rel, start_oblkno, LH_BUCKET_PAGE);
+	if (!buf_oblkno)
 		goto fail;
 
-	if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
-		goto fail;
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	/*
+	 * We want to finish any pending split from the old bucket before starting
+	 * a new one: there is no apparent benefit in postponing it, and handling
+	 * splits that involve multiple buckets (consider the case where the new
+	 * split also fails) would complicate the code.  We don't need to consider
+	 * the new bucket for completing the split here, because a re-split of the
+	 * new bucket cannot start while there is still a pending split from the
+	 * old bucket.
+	 */
+	if (H_BUCKET_BEING_SPLIT(oopaque))
+	{
+		/*
+		 * Copy bucket mapping info now; refer the comment in code below where
+		 * we copy this information before calling _hash_splitbucket to see
+		 * why this is okay.
+		 */
+		maxbucket = metap->hashm_maxbucket;
+		highmask = metap->hashm_highmask;
+		lowmask = metap->hashm_lowmask;
+
+		/*
+		 * Release the lock on metapage and old_bucket, before completing the
+		 * split.
+		 */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		_hash_chgbufaccess(rel, buf_oblkno, HASH_READ, HASH_NOLOCK);
+
+		_hash_finish_split(rel, metabuf, buf_oblkno, old_bucket, maxbucket,
+						   highmask, lowmask);
+
+		/* release the pin on old buffer and retry for expand. */
+		_hash_dropbuf(rel, buf_oblkno);
+
+		goto restart_expand;
+	}
 
 	/*
-	 * Likewise lock the new bucket (should never fail).
+	 * Clean up any tuples left over from a previous split.  This operation
+	 * requires a cleanup lock, and we already have one on the old bucket, so
+	 * let's do it.  We also don't want to allow further splits from a bucket
+	 * until the garbage of the previous split has been cleaned.  This has two
+	 * advantages: first, it helps avoid bloat due to garbage, and second,
+	 * during cleanup of a bucket we can always be sure that the garbage
+	 * tuples belong to the most recently split bucket.  By contrast, if we
+	 * allowed cleanup of a bucket after the meta page had been updated to
+	 * indicate a new split but before the actual split, the cleanup operation
+	 * would not be able to decide whether a tuple had been moved to the newly
+	 * created bucket, and could end up deleting such tuples.
+	 */
+	if (H_NEEDS_SPLIT_CLEANUP(oopaque))
+	{
+		/* Release the metapage lock. */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+		hashbucketcleanup(rel, old_bucket, buf_oblkno, start_oblkno, NULL,
+						  metap->hashm_maxbucket, metap->hashm_highmask,
+						  metap->hashm_lowmask, NULL,
+						  NULL, true, NULL, NULL);
+
+		_hash_relbuf(rel, buf_oblkno);
+
+		goto restart_expand;
+	}
+
+	/*
+	 * There shouldn't be any active scan on new bucket.
 	 *
 	 * Note: it is safe to compute the new bucket's blkno here, even though we
 	 * may still need to update the BUCKET_TO_BLKNO mapping.  This is because
@@ -576,12 +672,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	 */
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
-	if (_hash_has_active_scan(rel, new_bucket))
-		elog(ERROR, "scan in progress on supposedly new bucket");
-
-	if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
-		elog(ERROR, "could not get lock on supposedly new bucket");
-
 	/*
 	 * If the split point is increasing (hashm_maxbucket's log base 2
 	 * increases), we need to allocate a new batch of bucket pages.
@@ -600,8 +690,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
 			/* can't split due to BlockNumber overflow */
-			_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-			_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
+			_hash_relbuf(rel, buf_oblkno);
 			goto fail;
 		}
 	}
@@ -609,9 +698,18 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/*
 	 * Physically allocate the new bucket's primary page.  We want to do this
 	 * before changing the metapage's mapping info, in case we can't get the
-	 * disk space.
+	 * disk space.  In principle we don't need to check for a cleanup lock on
+	 * the new bucket, as no other backend can find this bucket until the meta
+	 * page is updated; but checking keeps us consistent with the old bucket's
+	 * locking.
 	 */
 	buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM);
+	if (!IsBufferCleanupOK(buf_nblkno))
+	{
+		_hash_relbuf(rel, buf_oblkno);
+		_hash_relbuf(rel, buf_nblkno);
+		goto fail;
+	}
+
 
 	/*
 	 * Okay to proceed with split.  Update the metapage bucket mapping info.
@@ -665,13 +763,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  start_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno,
 					  maxbucket, highmask, lowmask);
 
-	/* Release bucket locks, allowing others to access them */
-	_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-	_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
-
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -738,13 +832,17 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
  * belong in the new bucket, and compress out any free space in the old
  * bucket.
  *
- * The caller must hold exclusive locks on both buckets to ensure that
+ * The caller must hold cleanup locks on both buckets to ensure that
  * no one else is trying to access them (see README).
  *
  * The caller must hold a pin, but no lock, on the metapage buffer.
  * The buffer is returned in the same state.  (The metapage is only
  * touched if it becomes necessary to add or remove overflow pages.)
  *
+ * The split needs to retain pins on the primary bucket pages of both the old
+ * and new buckets until the end of the operation, to prevent vacuum from
+ * starting while the split is in progress.
+ *
  * In addition, the caller must have created the new bucket's base page,
  * which is passed in buffer nbuf, pinned and write-locked.  That lock and
  * pin are released here.  (The API is set up this way because we must do
@@ -756,37 +854,86 @@ _hash_splitbucket(Relation rel,
 				  Buffer metabuf,
 				  Bucket obucket,
 				  Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Buffer		obuf;
 	Page		opage;
 	Page		npage;
 	HashPageOpaque oopaque;
 	HashPageOpaque nopaque;
 
-	/*
-	 * It should be okay to simultaneously write-lock pages from each bucket,
-	 * since no one else can be trying to acquire buffer lock on pages of
-	 * either bucket.
-	 */
-	obuf = _hash_getbuf(rel, start_oblkno, HASH_WRITE, LH_BUCKET_PAGE);
 	opage = BufferGetPage(obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 
+	/*
+	 * Mark the old bucket to indicate that split is in progress.  At
+	 * Mark the old bucket to indicate that a split is in progress.  At the
+	 * end of the operation, we clear the split-in-progress flag.
+	oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT;
+
 	npage = BufferGetPage(nbuf);
 
-	/* initialize the new bucket's primary page */
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 	nopaque->hasho_prevblkno = InvalidBlockNumber;
 	nopaque->hasho_nextblkno = InvalidBlockNumber;
 	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_BEING_POPULATED;
 	nopaque->hasho_page_id = HASHO_PAGE_ID;
 
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, nbuf, NULL,
+						   maxbucket, highmask, lowmask);
+
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, obuf);
+	_hash_relbuf(rel, nbuf);
+}
+
+/*
+ * _hash_splitbucket_guts -- Helper function to perform the split operation
+ *
+ * This routine partitions the tuples between the old and new buckets and is
+ * also used to finish incomplete split operations.  To finish a previously
+ * interrupted split, the caller needs to fill htab.  If htab is set, we skip
+ * moving tuples whose TIDs are present in htab; a NULL htab means that all
+ * tuples belonging to the new bucket are moved.
+ *
+ * Caller needs to lock and unlock the old and new primary buckets.
+ */
+static void
+_hash_splitbucket_guts(Relation rel,
+					   Buffer metabuf,
+					   Bucket obucket,
+					   Bucket nbucket,
+					   Buffer obuf,
+					   Buffer nbuf,
+					   HTAB *htab,
+					   uint32 maxbucket,
+					   uint32 highmask,
+					   uint32 lowmask)
+{
+	Buffer		bucket_obuf;
+	Buffer		bucket_nbuf;
+	Page		opage;
+	Page		npage;
+	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
+
+	bucket_obuf = obuf;
+	opage = BufferGetPage(obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	bucket_nbuf = nbuf;
+	npage = BufferGetPage(nbuf);
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
 	/*
 	 * Partition the tuples in the old bucket between the old bucket and the
 	 * new bucket, advancing along the old bucket's overflow bucket chain and
@@ -798,8 +945,6 @@ _hash_splitbucket(Relation rel,
 		BlockNumber oblkno;
 		OffsetNumber ooffnum;
 		OffsetNumber omaxoffnum;
-		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
 
 		/* Scan each tuple in old page */
 		omaxoffnum = PageGetMaxOffsetNumber(opage);
@@ -810,33 +955,52 @@ _hash_splitbucket(Relation rel,
 			IndexTuple	itup;
 			Size		itemsz;
 			Bucket		bucket;
+			bool		found = false;
 
 			/* skip dead tuples */
 			if (ItemIdIsDead(PageGetItemId(opage, ooffnum)))
 				continue;
 
 			/*
-			 * Fetch the item's hash key (conveniently stored in the item) and
-			 * determine which bucket it now belongs in.
+			 * Before inserting a tuple, probe the hash table containing the
+			 * TIDs of tuples already present in the new bucket; if we find a
+			 * match, skip that tuple.  Otherwise, fetch the item's hash key
+			 * (conveniently stored in the item) and determine which bucket it
+			 * now belongs in.
 			 */
 			itup = (IndexTuple) PageGetItem(opage,
 											PageGetItemId(opage, ooffnum));
+
+			if (htab)
+				(void) hash_search(htab, &itup->t_tid, HASH_FIND, &found);
+
+			if (found)
+				continue;
+
 			bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
 										  maxbucket, highmask, lowmask);
 
 			if (bucket == nbucket)
 			{
+				IndexTuple	new_itup;
+
+				/*
+				 * make a copy of index tuple as we have to scribble on it.
+				 */
+				new_itup = CopyIndexTuple(itup);
+
+				/*
+				 * Mark the index tuple as moved by split; such tuples are
+				 * skipped by scans while a split is in progress for the bucket.
+				 */
+				new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
+
 				/*
 				 * insert the tuple into the new bucket.  if it doesn't fit on
 				 * the current page in the new bucket, we must allocate a new
 				 * overflow page and place the tuple on that page instead.
-				 *
-				 * XXX we have a problem here if we fail to get space for a
-				 * new overflow page: we'll error out leaving the bucket split
-				 * only partially complete, meaning the index is corrupt,
-				 * since searches may fail to find entries they should find.
 				 */
-				itemsz = IndexTupleDSize(*itup);
+				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
 				if (PageGetFreeSpace(npage) < itemsz)
@@ -844,9 +1008,9 @@ _hash_splitbucket(Relation rel,
 					/* write out nbuf and drop lock, but keep pin */
 					_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
 					/* chain to a new overflow page */
-					nbuf = _hash_addovflpage(rel, metabuf, nbuf);
+					nbuf = _hash_addovflpage(rel, metabuf, nbuf, (nbuf == bucket_nbuf) ? true : false);
 					npage = BufferGetPage(nbuf);
-					/* we don't need nopaque within the loop */
+					nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 				}
 
 				/*
@@ -856,12 +1020,10 @@ _hash_splitbucket(Relation rel,
 				 * Possible future improvement: accumulate all the items for
 				 * the new page and qsort them before insertion.
 				 */
-				(void) _hash_pgaddtup(rel, nbuf, itemsz, itup);
+				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
-				/*
-				 * Mark tuple for deletion from old page.
-				 */
-				deletable[ndeletable++] = ooffnum;
+				/* be tidy */
+				pfree(new_itup);
 			}
 			else
 			{
@@ -874,15 +1036,9 @@ _hash_splitbucket(Relation rel,
 
 		oblkno = oopaque->hasho_nextblkno;
 
-		/*
-		 * Done scanning this old page.  If we moved any tuples, delete them
-		 * from the old page.
-		 */
-		if (ndeletable > 0)
-		{
-			PageIndexMultiDelete(opage, deletable, ndeletable);
-			_hash_wrtbuf(rel, obuf);
-		}
+		/* retain the pin on the old primary bucket */
+		if (obuf == bucket_obuf)
+			_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
 		else
 			_hash_relbuf(rel, obuf);
 
@@ -891,18 +1047,169 @@ _hash_splitbucket(Relation rel,
 			break;
 
 		/* Else, advance to next old page */
-		obuf = _hash_getbuf(rel, oblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
+		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
 		opage = BufferGetPage(obuf);
 		oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 	}
 
 	/*
 	 * We're at the end of the old bucket chain, so we're done partitioning
-	 * the tuples.  Before quitting, call _hash_squeezebucket to ensure the
-	 * tuples remaining in the old bucket (including the overflow pages) are
-	 * packed as tightly as possible.  The new bucket is already tight.
+	 * the tuples.  Mark the old and new buckets to indicate split is
+	 * finished.
+	 *
+	 * To avoid deadlocks due to locking order of buckets, first lock the old
+	 * bucket and then the new bucket.
 	 */
-	_hash_wrtbuf(rel, nbuf);
+	if (nbuf == bucket_nbuf)
+		_hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_wrtbuf(rel, nbuf);
+
+	_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+	opage = BufferGetPage(bucket_obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	_hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);
+	npage = BufferGetPage(bucket_nbuf);
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
+	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
+	nopaque->hasho_flag &= ~LH_BUCKET_BEING_POPULATED;
+
+	/*
+	 * After the split is finished, mark the old bucket to indicate that it
+	 * contains deletable tuples.  Vacuum will clear the split-cleanup flag
+	 * after deleting such tuples.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_NEEDS_SPLIT_CLEANUP;
+
+	/*
+	 * Mark the buffers dirty.  We don't release the locks here, as the caller
+	 * is responsible for releasing them.
+	 */
+	MarkBufferDirty(bucket_obuf);
+	MarkBufferDirty(bucket_nbuf);
+}
+
+/*
+ *	_hash_finish_split() -- Finish the previously interrupted split operation
+ *
+ * To complete the split operation, we build a hash table of the TIDs already
+ * present in the new bucket; the split then uses it to skip tuples that were
+ * moved before the split operation was interrupted.
+ *
+ * The caller must hold a pin, but no lock, on the metapage and the old
+ * bucket's primary page buffer.  The buffers are returned in the same
+ * state.  (The
+ * metapage is only touched if it becomes necessary to add or remove overflow
+ * pages.)
+ */
+void
+_hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
+				   uint32 maxbucket, uint32 highmask, uint32 lowmask)
+{
+	HASHCTL		hash_ctl;
+	HTAB	   *tidhtab;
+	Buffer		bucket_nbuf = InvalidBuffer;
+	Buffer		nbuf;
+	Page		npage;
+	BlockNumber nblkno;
+	BlockNumber bucket_nblkno;
+	HashPageOpaque npageopaque;
+	Bucket		nbucket;
+	bool		found;
+
+	/* Initialize hash tables used to track TIDs */
+	memset(&hash_ctl, 0, sizeof(hash_ctl));
+	hash_ctl.keysize = sizeof(ItemPointerData);
+	hash_ctl.entrysize = sizeof(ItemPointerData);
+	hash_ctl.hcxt = CurrentMemoryContext;
+
+	tidhtab =
+		hash_create("bucket ctids",
+					256,		/* arbitrary initial size */
+					&hash_ctl,
+					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+	bucket_nblkno = nblkno = _hash_get_newblock_from_oldbucket(rel, obucket);
+
+	/*
+	 * Scan the new bucket and build hash table of TIDs
+	 */
+	for (;;)
+	{
+		OffsetNumber noffnum;
+		OffsetNumber nmaxoffnum;
+
+		nbuf = _hash_getbuf(rel, nblkno, HASH_READ,
+							LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+
+		/* remember the primary bucket buffer to acquire cleanup lock on it. */
+		if (nblkno == bucket_nblkno)
+			bucket_nbuf = nbuf;
+
+		npage = BufferGetPage(nbuf);
+		npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
+		/* Scan each tuple in new page */
+		nmaxoffnum = PageGetMaxOffsetNumber(npage);
+		for (noffnum = FirstOffsetNumber;
+			 noffnum <= nmaxoffnum;
+			 noffnum = OffsetNumberNext(noffnum))
+		{
+			IndexTuple	itup;
+
+			/* Fetch the item's TID and insert it in hash table. */
+			itup = (IndexTuple) PageGetItem(npage,
+											PageGetItemId(npage, noffnum));
+
+			(void) hash_search(tidhtab, &itup->t_tid, HASH_ENTER, &found);
+
+			Assert(!found);
+		}
+
+		nblkno = npageopaque->hasho_nextblkno;
+
+		/*
+		 * Release our lock without modifying the buffer, and make sure to
+		 * retain the pin on the primary bucket page.
+		 */
+		if (nbuf == bucket_nbuf)
+			_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, nbuf);
+
+		/* Exit loop if no more overflow pages in new bucket */
+		if (!BlockNumberIsValid(nblkno))
+			break;
+	}
+
+	/*
+	 * Conditionally get the cleanup locks on the old and new buckets to
+	 * perform the split operation.  If we can't get the cleanup locks, give
+	 * up silently; the next insertion into the old bucket will try again to
+	 * complete the split.
+	 */
+	if (!ConditionalLockBufferForCleanup(obuf))
+	{
+		hash_destroy(tidhtab);
+		return;
+	}
+	if (!ConditionalLockBufferForCleanup(bucket_nbuf))
+	{
+		_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
+		hash_destroy(tidhtab);
+		return;
+	}
+
+	npage = BufferGetPage(bucket_nbuf);
+	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nbucket = npageopaque->hasho_bucket;
+
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, bucket_nbuf, tidhtab,
+						   maxbucket, highmask, lowmask);
 
-	_hash_squeezebucket(rel, obucket, start_oblkno, NULL);
+	_hash_relbuf(rel, bucket_nbuf);
+	_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
+	hash_destroy(tidhtab);
 }
diff --git a/src/backend/access/hash/hashscan.c b/src/backend/access/hash/hashscan.c
deleted file mode 100644
index fe97ef2..0000000
--- a/src/backend/access/hash/hashscan.c
+++ /dev/null
@@ -1,153 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * hashscan.c
- *	  manage scans on hash tables
- *
- * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
- * Portions Copyright (c) 1994, Regents of the University of California
- *
- *
- * IDENTIFICATION
- *	  src/backend/access/hash/hashscan.c
- *
- *-------------------------------------------------------------------------
- */
-
-#include "postgres.h"
-
-#include "access/hash.h"
-#include "access/relscan.h"
-#include "utils/memutils.h"
-#include "utils/rel.h"
-#include "utils/resowner.h"
-
-
-/*
- * We track all of a backend's active scans on hash indexes using a list
- * of HashScanListData structs, which are allocated in TopMemoryContext.
- * It's okay to use a long-lived context because we rely on the ResourceOwner
- * mechanism to clean up unused entries after transaction or subtransaction
- * abort.  We can't safely keep the entries in the executor's per-query
- * context, because that might be already freed before we get a chance to
- * clean up the list.  (XXX seems like there should be a better way to
- * manage this...)
- */
-typedef struct HashScanListData
-{
-	IndexScanDesc hashsl_scan;
-	ResourceOwner hashsl_owner;
-	struct HashScanListData *hashsl_next;
-} HashScanListData;
-
-typedef HashScanListData *HashScanList;
-
-static HashScanList HashScans = NULL;
-
-
-/*
- * ReleaseResources_hash() --- clean up hash subsystem resources.
- *
- * This is here because it needs to touch this module's static var HashScans.
- */
-void
-ReleaseResources_hash(void)
-{
-	HashScanList l;
-	HashScanList prev;
-	HashScanList next;
-
-	/*
-	 * Release all HashScanList items belonging to the current ResourceOwner.
-	 * Note that we do not release the underlying IndexScanDesc; that's in
-	 * executor memory and will go away on its own (in fact quite possibly has
-	 * gone away already, so we mustn't try to touch it here).
-	 *
-	 * Note: this should be a no-op during normal query shutdown. However, in
-	 * an abort situation ExecutorEnd is not called and so there may be open
-	 * index scans to clean up.
-	 */
-	prev = NULL;
-
-	for (l = HashScans; l != NULL; l = next)
-	{
-		next = l->hashsl_next;
-		if (l->hashsl_owner == CurrentResourceOwner)
-		{
-			if (prev == NULL)
-				HashScans = next;
-			else
-				prev->hashsl_next = next;
-
-			pfree(l);
-			/* prev does not change */
-		}
-		else
-			prev = l;
-	}
-}
-
-/*
- *	_hash_regscan() -- register a new scan.
- */
-void
-_hash_regscan(IndexScanDesc scan)
-{
-	HashScanList new_el;
-
-	new_el = (HashScanList) MemoryContextAlloc(TopMemoryContext,
-											   sizeof(HashScanListData));
-	new_el->hashsl_scan = scan;
-	new_el->hashsl_owner = CurrentResourceOwner;
-	new_el->hashsl_next = HashScans;
-	HashScans = new_el;
-}
-
-/*
- *	_hash_dropscan() -- drop a scan from the scan list
- */
-void
-_hash_dropscan(IndexScanDesc scan)
-{
-	HashScanList chk,
-				last;
-
-	last = NULL;
-	for (chk = HashScans;
-		 chk != NULL && chk->hashsl_scan != scan;
-		 chk = chk->hashsl_next)
-		last = chk;
-
-	if (chk == NULL)
-		elog(ERROR, "hash scan list trashed; cannot find 0x%p", (void *) scan);
-
-	if (last == NULL)
-		HashScans = chk->hashsl_next;
-	else
-		last->hashsl_next = chk->hashsl_next;
-
-	pfree(chk);
-}
-
-/*
- * Is there an active scan in this bucket?
- */
-bool
-_hash_has_active_scan(Relation rel, Bucket bucket)
-{
-	Oid			relid = RelationGetRelid(rel);
-	HashScanList l;
-
-	for (l = HashScans; l != NULL; l = l->hashsl_next)
-	{
-		if (relid == l->hashsl_scan->indexRelation->rd_id)
-		{
-			HashScanOpaque so = (HashScanOpaque) l->hashsl_scan->opaque;
-
-			if (so->hashso_bucket_valid &&
-				so->hashso_bucket == bucket)
-				return true;
-		}
-	}
-
-	return false;
-}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..101c5c5 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -67,12 +67,25 @@ _hash_next(IndexScanDesc scan, ScanDirection dir)
  */
 static void
 _hash_readnext(Relation rel,
-			   Buffer *bufp, Page *pagep, HashPageOpaque *opaquep)
+			   Buffer *bufp, Page *pagep, HashPageOpaque *opaquep,
+			   bool primary_page)
 {
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_nextblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on primary bucket page till the end of scan to ensure
+	 * that vacuum can't delete the tuples that are moved by split to new
+	 * bucket.  Such tuples are required by the scans that are started on
+	 * buckets where split is in progress, before a new bucket's split in
+	 * progress flag (LH_BUCKET_BEING_POPULATED) is cleared.
+	 */
+	if (primary_page)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -89,12 +102,22 @@ _hash_readnext(Relation rel,
  */
 static void
 _hash_readprev(Relation rel,
-			   Buffer *bufp, Page *pagep, HashPageOpaque *opaquep)
+			   Buffer *bufp, Page *pagep, HashPageOpaque *opaquep,
+			   bool primary_page)
 {
 	BlockNumber blkno;
 
 	blkno = (*opaquep)->hasho_prevblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on the primary bucket page till the end of the scan.
+	 * See comments in _hash_readnext for the reason we retain the pin.
+	 */
+	if (primary_page)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -104,6 +127,13 @@ _hash_readprev(Relation rel,
 							 LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 		*pagep = BufferGetPage(*bufp);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
+
+		/*
+		 * We always maintain the pin on bucket page for whole scan operation,
+		 * so releasing the additional pin we have acquired here.
+		 */
+		if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+			_hash_dropbuf(rel, *bufp);
 	}
 }
 
@@ -218,9 +248,11 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 		{
 			if (oldblkno == blkno)
 				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
+			_hash_relbuf(rel, buf);
 		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+
+		/* Fetch the primary bucket page for the bucket */
+		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
 
 		/*
 		 * Reacquire metapage lock and check that no bucket split has taken
@@ -234,22 +266,64 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	/* done with the metapage */
 	_hash_dropbuf(rel, metabuf);
 
-	/* Update scan opaque state to show we have lock on the bucket */
-	so->hashso_bucket = bucket;
-	so->hashso_bucket_valid = true;
-	so->hashso_bucket_blkno = blkno;
-
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
 	page = BufferGetPage(buf);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(opaque->hasho_bucket == bucket);
 
+	so->hashso_bucket_buf = buf;
+
+	/*
+	 * If the bucket split is in progress, then we need to skip tuples that
+	 * are moved from old bucket.  To ensure that vacuum doesn't clean any
+	 * tuples from old or new buckets till this scan is in progress, maintain
+	 * a pin on both of the buckets.  Here, we have to be cautious about
+	 * locking order, first acquire the lock on old bucket, release the lock
+	 * on old bucket, but not pin, then acquire the lock on new bucket and
+	 * again re-verify whether the bucket split still is in progress.
+	 * Acquiring lock on old bucket first ensures that the vacuum waits for
+	 * this scan to finish.
+	 */
+	if (H_BUCKET_BEING_POPULATED(opaque))
+	{
+		BlockNumber old_blkno;
+		Buffer		old_buf;
+
+		old_blkno = _hash_get_oldblock_from_newbucket(rel, opaque->hasho_bucket);
+
+		/*
+		 * release the lock on new bucket and re-acquire it after acquiring
+		 * the lock on old bucket.
+		 */
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+
+		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+
+		/*
+		 * remember the old bucket buffer so as to use it later for scanning.
+		 */
+		so->hashso_old_bucket_buf = old_buf;
+		_hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK);
+
+		_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		Assert(opaque->hasho_bucket == bucket);
+
+		if (H_BUCKET_BEING_POPULATED(opaque))
+			so->hashso_skip_moved_tuples = true;
+		else
+		{
+			_hash_dropbuf(rel, so->hashso_old_bucket_buf);
+			so->hashso_old_bucket_buf = InvalidBuffer;
+		}
+	}
+
 	/* If a backwards scan is requested, move to the end of the chain */
 	if (ScanDirectionIsBackward(dir))
 	{
 		while (BlockNumberIsValid(opaque->hasho_nextblkno))
-			_hash_readnext(rel, &buf, &page, &opaque);
+			_hash_readnext(rel, &buf, &page, &opaque,
+					   (opaque->hasho_flag & LH_BUCKET_PAGE) ? true : false);
 	}
 
 	/* Now find the first tuple satisfying the qualification */
@@ -273,6 +347,13 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
  *		false.  Else, return true and set the hashso_curpos for the
  *		scan to the right thing.
  *
+ *		Here we also scan the old bucket if the split for the current bucket
+ *		was in progress at the start of the scan.  The basic idea is to
+ *		skip the tuples that were moved by the split while scanning the
+ *		current bucket and then scan the old bucket to cover all such
+ *		tuples.  This ensures that we don't miss any tuples in scans that
+ *		started during the split.
+ *
  *		'bufP' points to the current buffer, which is pinned and read-locked.
  *		On success exit, we have pin and read-lock on whichever page
  *		contains the right item; on failure, we have released all buffers.
@@ -338,6 +419,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum >= FirstOffsetNumber);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip tuples that were moved by the split operation,
+						 * because this scan started while the split was in
+						 * progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberNext(offnum);	/* move forward */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -345,7 +439,8 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					/*
 					 * ran off the end of this page, try the next
 					 */
-					_hash_readnext(rel, &buf, &page, &opaque);
+					_hash_readnext(rel, &buf, &page, &opaque,
+					   (opaque->hasho_flag & LH_BUCKET_PAGE) ? true : false);
 					if (BufferIsValid(buf))
 					{
 						maxoff = PageGetMaxOffsetNumber(page);
@@ -353,9 +448,42 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							buf = so->hashso_old_bucket_buf;
+
+							/*
+							 * old bucket buffer must be valid as we acquire
+							 * the pin on it before the start of the scan and
+							 * retain it till the end of the scan.
+							 */
+							Assert(BufferIsValid(buf));
+
+							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+							page = BufferGetPage(buf);
+							opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for tuples that are
+							 * moved by split in old bucket and it also
+							 * ensures that we won't retry to scan the old
+							 * bucket once the scan for same is finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
@@ -379,6 +507,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum <= maxoff);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip tuples that were moved by the split operation,
+						 * because this scan started while the split was in
+						 * progress
+						 */
+						if (so->hashso_skip_moved_tuples &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberPrev(offnum);	/* move back */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -386,7 +527,8 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					/*
 					 * ran off the end of this page, try the next
 					 */
-					_hash_readprev(rel, &buf, &page, &opaque);
+					_hash_readprev(rel, &buf, &page, &opaque,
+					   (opaque->hasho_flag & LH_BUCKET_PAGE) ? true : false);
 					if (BufferIsValid(buf))
 					{
 						maxoff = PageGetMaxOffsetNumber(page);
@@ -394,9 +536,42 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
-						itup = NULL;
-						break;	/* exit for-loop */
+						/*
+						 * end of bucket, scan old bucket if there was a split
+						 * in progress at the start of scan.
+						 */
+						if (so->hashso_skip_moved_tuples)
+						{
+							buf = so->hashso_old_bucket_buf;
+
+							/*
+							 * old bucket buffer must be valid as we acquire
+							 * the pin on it before the start of the scan and
+							 * retain it till the end of the scan.
+							 */
+							Assert(BufferIsValid(buf));
+
+							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+							page = BufferGetPage(buf);
+							opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+							maxoff = PageGetMaxOffsetNumber(page);
+							offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+							/*
+							 * setting hashso_skip_moved_tuples to false
+							 * ensures that we don't check for tuples that are
+							 * moved by split in old bucket and it also
+							 * ensures that we won't retry to scan the old
+							 * bucket once the scan for same is finished.
+							 */
+							so->hashso_skip_moved_tuples = false;
+						}
+						else
+						{
+							itup = NULL;
+							break;		/* exit for-loop */
+						}
 					}
 				}
 				break;
@@ -410,9 +585,16 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 
 		if (itup == NULL)
 		{
-			/* we ran off the end of the bucket without finding a match */
+			/*
+			 * We ran off the end of the bucket without finding a match.
+			 * Release the pin on bucket buffers.  Normally, such pins are
+			 * released at end of scan, however scrolling cursors can
+			 * reacquire the bucket lock and pin in the same scan multiple
+			 * times.
+			 */
 			*bufP = so->hashso_curbuf = InvalidBuffer;
 			ItemPointerSetInvalid(current);
+			_hash_dropscanbuf(rel, so);
 			return false;
 		}
 
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 822862d..cf464e9 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -20,6 +20,8 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 
+#define CALC_NEW_BUCKET(old_bucket, lowmask) \
+			((old_bucket) | ((lowmask) + 1))
 
 /*
  * _hash_checkqual -- does the index tuple satisfy the scan conditions?
@@ -352,3 +354,95 @@ _hash_binsearch_last(Page page, uint32 hash_value)
 
 	return lower;
 }
+
+/*
+ *	_hash_get_oldblock_from_newbucket() -- get the block number of a bucket
+ *			from which current (new) bucket is being split.
+ */
+BlockNumber
+_hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bucket)
+{
+	Bucket		old_bucket;
+	uint32		mask;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	/*
+	 * To get the old bucket from the current bucket, we need a mask to modulo
+	 * into lower half of table.  This mask is stored in meta page as
+	 * hashm_lowmask, but here we can't rely on the same, because we need a
+	 * value of lowmask that was prevalent at the time when bucket split was
+	 * started.  Masking the most significant bit of new bucket would give us
+	 * old bucket.
+	 */
+	mask = (((uint32) 1) << fls(new_bucket)) - 1;
+	old_bucket = new_bucket & mask;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	blkno = BUCKET_TO_BLKNO(metap, old_bucket);
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_newblock_from_oldbucket() -- get the block number of a bucket
+ *			that will be generated after split from old bucket.
+ *
+ * This is used to find the new bucket from old bucket based on current table
+ * half.  It is mainly required to finish incomplete splits, where we are
+ * sure that no more than one split can be in progress from the old
+ * bucket.
+ */
+BlockNumber
+_hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket)
+{
+	Bucket		new_bucket;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	new_bucket = _hash_get_newbucket_from_oldbucket(rel, old_bucket,
+													metap->hashm_lowmask,
+													metap->hashm_maxbucket);
+	blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_newbucket_from_oldbucket() -- get the new bucket that will be
+ *			generated after split from current (old) bucket.
+ *
+ * This is used to find the new bucket from old bucket.  New bucket can be
+ * obtained by OR'ing old bucket with most significant bit of current table
+ * half (lowmask passed in this function can be used to identify msb of
+ * current table half).  There could be multiple buckets that have split
+ * from the current bucket.  We need the first such bucket that exists.
+ * The caller must ensure that no more than one split has happened from the
+ * old bucket.
+ */
+Bucket
+_hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
+								   uint32 lowmask, uint32 maxbucket)
+{
+	Bucket		new_bucket;
+
+	new_bucket = CALC_NEW_BUCKET(old_bucket, lowmask);
+	if (new_bucket > maxbucket)
+	{
+		lowmask = lowmask >> 1;
+		new_bucket = CALC_NEW_BUCKET(old_bucket, lowmask);
+	}
+
+	return new_bucket;
+}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 07075ce..cdc460b 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -668,9 +668,6 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintFileLeakWarning(res);
 			FileClose(res);
 		}
-
-		/* Clean up index scans too */
-		ReleaseResources_hash();
 	}
 
 	/* Let add-on modules get a chance too */
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 725e2f2..52bbedc 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -24,6 +24,7 @@
 #include "lib/stringinfo.h"
 #include "storage/bufmgr.h"
 #include "storage/lockdefs.h"
+#include "utils/hsearch.h"
 #include "utils/relcache.h"
 
 /*
@@ -32,6 +33,8 @@
  */
 typedef uint32 Bucket;
 
+#define InvalidBucket	((Bucket) 0xFFFFFFFF)
+
 #define BUCKET_TO_BLKNO(metap,B) \
 		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
 
@@ -51,6 +54,9 @@ typedef uint32 Bucket;
 #define LH_BUCKET_PAGE			(1 << 1)
 #define LH_BITMAP_PAGE			(1 << 2)
 #define LH_META_PAGE			(1 << 3)
+#define LH_BUCKET_BEING_POPULATED	(1 << 4)
+#define LH_BUCKET_BEING_SPLIT	(1 << 5)
+#define LH_BUCKET_NEEDS_SPLIT_CLEANUP	(1 << 6)
 
 typedef struct HashPageOpaqueData
 {
@@ -63,6 +69,10 @@ typedef struct HashPageOpaqueData
 
 typedef HashPageOpaqueData *HashPageOpaque;
 
+#define H_NEEDS_SPLIT_CLEANUP(opaque)	((opaque)->hasho_flag & LH_BUCKET_NEEDS_SPLIT_CLEANUP)
+#define H_BUCKET_BEING_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_BEING_SPLIT)
+#define H_BUCKET_BEING_POPULATED(opaque)	((opaque)->hasho_flag & LH_BUCKET_BEING_POPULATED)
+
 /*
  * The page ID is for the convenience of pg_filedump and similar utilities,
  * which otherwise would have a hard time telling pages of different index
@@ -80,19 +90,6 @@ typedef struct HashScanOpaqueData
 	uint32		hashso_sk_hash;
 
 	/*
-	 * By definition, a hash scan should be examining only one bucket. We
-	 * record the bucket number here as soon as it is known.
-	 */
-	Bucket		hashso_bucket;
-	bool		hashso_bucket_valid;
-
-	/*
-	 * If we have a share lock on the bucket, we record it here.  When
-	 * hashso_bucket_blkno is zero, we have no such lock.
-	 */
-	BlockNumber hashso_bucket_blkno;
-
-	/*
 	 * We also want to remember which buffer we're currently examining in the
 	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
 	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
@@ -100,11 +97,23 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_curbuf;
 
+	/* remember the buffer associated with primary bucket */
+	Buffer		hashso_bucket_buf;
+
+	/*
+	 * remember the buffer associated with old primary bucket which is
+	 * required during the scan of the bucket for which split is in progress.
+	 */
+	Buffer		hashso_old_bucket_buf;
+
 	/* Current position of the scan, as an index TID */
 	ItemPointerData hashso_curpos;
 
 	/* Current position of the scan, as a heap TID */
 	ItemPointerData hashso_heappos;
+
+	/* Whether scan needs to skip tuples that are moved by split */
+	bool		hashso_skip_moved_tuples;
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -175,6 +184,8 @@ typedef HashMetaPageData *HashMetaPage;
 				  sizeof(ItemIdData) - \
 				  MAXALIGN(sizeof(HashPageOpaqueData)))
 
+#define INDEX_MOVED_BY_SPLIT_MASK	0x2000
+
 #define HASH_MIN_FILLFACTOR			10
 #define HASH_DEFAULT_FILLFACTOR		75
 
@@ -223,9 +234,6 @@ typedef HashMetaPageData *HashMetaPage;
 #define HASH_WRITE		BUFFER_LOCK_EXCLUSIVE
 #define HASH_NOLOCK		(-1)
 
-#define HASH_SHARE		ShareLock
-#define HASH_EXCLUSIVE	ExclusiveLock
-
 /*
  *	Strategy number. There's only one valid strategy for hashing: equality.
  */
@@ -297,21 +305,21 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
 
 /* hashovfl.c */
-extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf);
+extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
 extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
-				   BufferAccessStrategy bstrategy);
+				   BlockNumber bucket_blkno, BufferAccessStrategy bstrategy);
 extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
 				 BlockNumber blkno, ForkNumber forkNum);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy);
 
 /* hashpage.c */
-extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
-extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
-extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
 extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
 			 int access, int flags);
+extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
+								   BlockNumber blkno, int flags);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
@@ -320,6 +328,7 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 						   BufferAccessStrategy bstrategy);
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
+extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
 extern void _hash_wrtbuf(Relation rel, Buffer buf);
 extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
 				   int to_access);
@@ -327,12 +336,9 @@ extern uint32 _hash_metapinit(Relation rel, double num_tuples,
 				ForkNumber forkNum);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
-
-/* hashscan.c */
-extern void _hash_regscan(IndexScanDesc scan);
-extern void _hash_dropscan(IndexScanDesc scan);
-extern bool _hash_has_active_scan(Relation rel, Bucket bucket);
-extern void ReleaseResources_hash(void);
+extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
+				   Bucket obucket, uint32 maxbucket, uint32 highmask,
+				   uint32 lowmask);
 
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
@@ -362,5 +368,18 @@ extern bool _hash_convert_tuple(Relation index,
 					Datum *index_values, bool *index_isnull);
 extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
 extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
+extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bucket);
+extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
+extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
+								   uint32 lowmask, uint32 maxbucket);
+
+/* hash.c */
+extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
+				  Buffer bucket_buf, BlockNumber bucket_blkno,
+				  BufferAccessStrategy bstrategy,
+				  uint32 maxbucket, uint32 highmask, uint32 lowmask,
+				  double *tuples_removed, double *num_index_tuples,
+				  bool bucket_has_garbage,
+				  IndexBulkDeleteCallback callback, void *callback_state);
 
 #endif   /* HASH_H */
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 8350fa0..788ba9f 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -63,7 +63,7 @@ typedef IndexAttributeBitMapData *IndexAttributeBitMap;
  * t_info manipulation macros
  */
 #define INDEX_SIZE_MASK 0x1FFF
-/* bit 0x2000 is not used at present */
+/* bit 0x2000 is reserved for index-AM specific usage */
 #define INDEX_VAR_MASK	0x4000
 #define INDEX_NULL_MASK 0x8000
 
#144Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#143)
Re: Hash Indexes

On Sat, Nov 12, 2016 at 12:49 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

You are right and I have changed the code as per your suggestion.

So...

+        /*
+         * We always maintain the pin on bucket page for whole scan operation,
+         * so releasing the additional pin we have acquired here.
+         */
+        if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+            _hash_dropbuf(rel, *bufp);

This relies on the page contents to know whether we took a pin; that
seems like a bad plan. We need to know intrinsically whether we took
a pin.
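
For instance, the scan already remembers the primary bucket buffer in
so->hashso_bucket_buf, so -- just a sketch, assuming the scan opaque is
reachable at that point -- the extra pin could be dropped based on our own
bookkeeping rather than on the page's special space:

    /*
     * We stepped back onto the primary bucket page, on which we already
     * hold a pin from the start of the scan, so drop the extra pin.
     */
    if (*bufp == so->hashso_bucket_buf)
        _hash_dropbuf(rel, *bufp);

Then the flag test on the page wouldn't be needed at all.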

+     * If the bucket split is in progress, then we need to skip tuples that
+     * are moved from old bucket.  To ensure that vacuum doesn't clean any
+     * tuples from old or new buckets till this scan is in progress, maintain
+     * a pin on both of the buckets.  Here, we have to be cautious about

It wouldn't be a problem if VACUUM removed tuples from the new bucket,
because they'd have to be dead anyway. It also wouldn't be a problem
if it removed tuples from the old bucket that were actually dead. The
real issue isn't vacuum anyway, but the process of cleaning up after a
split. We need to hold the pin so that tuples being moved from the
old bucket to the new bucket by the split don't get removed from the
old bucket until our scan is done.

+ old_blkno = _hash_get_oldblock_from_newbucket(rel,
opaque->hasho_bucket);

Couldn't you pass "bucket" here instead of "opaque->hasho_bucket"? I
feel like I'm repeating this ad nauseum, but I really think it's bad
to rely on the special space instead of our own local variables!

-            /* we ran off the end of the bucket without finding a match */
+            /*
+             * We ran off the end of the bucket without finding a match.
+             * Release the pin on bucket buffers.  Normally, such pins are
+             * released at end of scan, however scrolling cursors can
+             * reacquire the bucket lock and pin in the same scan multiple
+             * times.
+             */
             *bufP = so->hashso_curbuf = InvalidBuffer;
             ItemPointerSetInvalid(current);
+            _hash_dropscanbuf(rel, so);

I think this comment is saying that we'll release the pin on the
primary bucket page for now, and then reacquire it later if the user
reverses the scan direction. But that doesn't sound very safe,
because the bucket could be split in the meantime and the order in
which tuples are returned could change. I think we want that to
remain stable within a single query execution.

+            _hash_readnext(rel, &buf, &page, &opaque,
+                       (opaque->hasho_flag & LH_BUCKET_PAGE) ? true : false);

Same comment: don't rely on the special space to figure this out.
Keep track. Also != 0 would be better than ? true : false.

+                            /*
+                             * setting hashso_skip_moved_tuples to false
+                             * ensures that we don't check for tuples that are
+                             * moved by split in old bucket and it also
+                             * ensures that we won't retry to scan the old
+                             * bucket once the scan for same is finished.
+                             */
+                            so->hashso_skip_moved_tuples = false;

I think you've got a big problem here. Suppose the user starts the
scan in the new bucket and runs it forward until they end up in the
old bucket. Then they turn around and run the scan backward. When
they reach the beginning of the old bucket, they're going to stop, not
move back to the new bucket, AFAICS. Oops.

_hash_first() has a related problem: a backward scan starts at the end
of the new bucket and moves backward, but it should start at the end
of the old bucket, and then when it reaches the beginning, flip to the
new bucket and move backward through that one. Otherwise, a backward
scan and a forward scan don't return tuples in opposite order, which
they should.

I think what you need to do to fix both of these problems is a more
thorough job gluing the two buckets together. I'd suggest that the
responsibility for switching between the two buckets should probably
be given to _hash_readprev() and _hash_readnext(), because every place
that needs to advance to the next or previous page that cares about
this. Right now you are trying to handle it mostly in the functions
that call those functions, but that is prone to errors of omission.
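
For illustration, the forward case inside _hash_readnext() might end up
looking roughly like this -- a sketch only, assuming the function is given
the scan opaque, and using an invented hashso_in_old_bucket field to
remember which of the two buckets the scan is currently walking:

    blkno = (*opaquep)->hasho_nextblkno;

    /* ... release the lock (and, for overflow pages, the pin) on the current page ... */

    if (BlockNumberIsValid(blkno))
    {
        /* more pages in the current bucket's chain */
        *bufp = _hash_getbuf(rel, blkno, HASH_READ,
                             LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
    }
    else if (so->hashso_skip_moved_tuples && !so->hashso_in_old_bucket)
    {
        /* ran off the end of the new bucket; continue into the old bucket */
        so->hashso_in_old_bucket = true;
        *bufp = so->hashso_old_bucket_buf;          /* pin is already held */
        _hash_chgbufaccess(rel, *bufp, HASH_NOLOCK, HASH_READ);
    }
    else
        *bufp = InvalidBuffer;                      /* genuinely at the end */

    if (BufferIsValid(*bufp))
    {
        *pagep = BufferGetPage(*bufp);
        *opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
    }

(_hash_step would then apply the moved-by-split filter only while
hashso_in_old_bucket is false, and _hash_readprev would do the mirror-image
switch for backward scans.)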

Also, I think that so->hashso_skip_moved_tuples is badly designed.
There are two separate facts you need to know: (1) whether you are
scanning a bucket that was still being populated at the start of your
scan and (2) if yes, whether you are scanning the bucket being
populated or whether you are instead scanning the corresponding "old"
bucket. You're trying to keep track of that using one Boolean, but
one Boolean only has two states and there are three possible states
here.
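
Purely to illustrate (the names here are invented, not from your patch),
that could be something like:

    /* illustration only -- not code from the patch */
    typedef enum HashScanBucketState
    {
        HASHSO_BUCKET_NORMAL,   /* no split in progress at scan start */
        HASHSO_SCAN_NEW_BUCKET, /* split in progress; in the bucket being populated */
        HASHSO_SCAN_OLD_BUCKET  /* split in progress; in the corresponding old bucket */
    } HashScanBucketState;

with a single field of that type in HashScanOpaqueData instead of the
boolean.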

+    if (H_BUCKET_BEING_SPLIT(pageopaque) && IsBufferCleanupOK(buf))
+    {
+
+        /* release the lock on bucket buffer, before completing the split. */

Extra blank line.

+moved-by-split flag on a tuple indicates that tuple is moved from old to new
+bucket.  The concurrent scans can skip such tuples till the split operation is
+finished.  Once the tuple is marked as moved-by-split, it will remain so forever
+but that does no harm.  We have intentionally not cleared it as that can generate
+an additional I/O which is not necessary.

The first sentence needs to start with "the" but the second sentence shouldn't.

It would be good to adjust this part a bit to more clearly explain
that the split-in-progress and split-cleanup flags are bucket-level
flags, while moved-by-split is a per-tuple flag. It's possible to
figure this out from what you've written, but I think it could be more
clear. Another thing that is strange is that the code uses THREE
flags, bucket-being-split, bucket-being-populated, and
needs-split-cleanup, but the README conflates the first two and uses a
different name.

+previously-acquired content lock, but not pin and repeat the process using the

s/but not pin/but not the pin,/

 A problem is that if a split fails partway through (eg due to insufficient
-disk space) the index is left corrupt.  The probability of that could be
-made quite low if we grab a free page or two before we update the meta
-page, but the only real solution is to treat a split as a WAL-loggable,
+disk space or crash) the index is left corrupt.  The probability of that
+could be made quite low if we grab a free page or two before we update the
+meta page, but the only real solution is to treat a split as a WAL-loggable,
 must-complete action.  I'm not planning to teach hash about WAL in this
-go-round.
+go-round.  However, we do try to finish the incomplete splits during insert
+and split.

I think this paragraph needs a much heavier rewrite explaining the new
incomplete split handling. It's basically wrong now. Perhaps replace
it with something like this:

--
If a split fails partway through (e.g. due to insufficient disk space
or an interrupt), the index will not be corrupted. Instead, we'll
retry the split every time a tuple is inserted into the old bucket
prior to inserting the new tuple; eventually, we should succeed. The
fact that a split is left unfinished doesn't prevent subsequent
buckets from being split, but we won't try to split the bucket again
until the prior split is finished. In other words, a bucket can be in
the middle of being split for some time, but it can't be in the middle
of two splits at the same time.

Although we can survive a failure to split a bucket, a crash is likely
to corrupt the index, since hash indexes are not yet WAL-logged.
--
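
In code terms, the retry happens in the insert path along these lines (a
simplified sketch of the idea, not the exact patch text; the lock-mode
detail and the local variables -- bucket, maxbucket and so on -- are just
the usual ones of the insert path):

    /*
     * Sketch: on insert into a bucket whose previous split was interrupted,
     * try to finish that split before inserting the new tuple.
     */
    if (H_BUCKET_BEING_SPLIT(pageopaque) && IsBufferCleanupOK(buf))
    {
        /* drop the content lock; _hash_finish_split wants a pin but no lock */
        _hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);

        _hash_finish_split(rel, metabuf, buf, bucket,
                           maxbucket, highmask, lowmask);

        /* then restart the insertion, whether or not the split completed */
    }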

+        Acquire cleanup lock on target bucket
+        Scan and remove tuples
+        For overflow page, first we need to lock the next page and then
+        release the lock on current bucket or overflow page
+        Ensure to have buffer content lock in exclusive mode on bucket page
+        If buffer pincount is one, then compact free space as needed
+        Release lock

I don't think this summary is particularly correct. You would never
guess from this that we lock each bucket page in turn and then go back
and try to relock the primary bucket page at the end. It's more like:

acquire cleanup lock on primary bucket page
loop:
    scan and remove tuples
    if this is the last bucket page, break out of loop
    pin and x-lock next page
    release prior lock and pin (except keep pin on primary bucket page)
if the page we have locked is not the primary bucket page:
    release lock and take exclusive lock on primary bucket page
if there are no other pins on the primary bucket page:
    squeeze the bucket to remove free space

Come to think of it, I'm a little worried about the locking in
_hash_squeezebucket(). It seems like we drop the lock on each "write"
bucket page before taking the lock on the next one. So a concurrent
scan could get ahead of the cleanup process. That would be bad,
wouldn't it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#145Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#144)
Re: Hash Indexes

On Thu, Nov 17, 2016 at 3:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Nov 12, 2016 at 12:49 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

You are right and I have changed the code as per your suggestion.

So...

+        /*
+         * We always maintain the pin on bucket page for whole scan operation,
+         * so releasing the additional pin we have acquired here.
+         */
+        if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+            _hash_dropbuf(rel, *bufp);

This relies on the page contents to know whether we took a pin; that
seems like a bad plan. We need to know intrinsically whether we took
a pin.

Okay, I think we can do that as we have bucket buffer information
(hashso_bucket_buf) in HashScanOpaqueData. We might need to pass this
information to _hash_readprev.

+     * If the bucket split is in progress, then we need to skip tuples that
+     * are moved from old bucket.  To ensure that vacuum doesn't clean any
+     * tuples from old or new buckets till this scan is in progress, maintain
+     * a pin on both of the buckets.  Here, we have to be cautious about

It wouldn't be a problem if VACUUM removed tuples from the new bucket,
because they'd have to be dead anyway. It also wouldn't be a problem
if it removed tuples from the old bucket that were actually dead. The
real issue isn't vacuum anyway, but the process of cleaning up after a
split. We need to hold the pin so that tuples being moved from the
old bucket to the new bucket by the split don't get removed from the
old bucket until our scan is done.

Are you expecting a comment change here?

+ old_blkno = _hash_get_oldblock_from_newbucket(rel,
opaque->hasho_bucket);

Couldn't you pass "bucket" here instead of "opaque->hasho_bucket"? I
feel like I'm repeating this ad nauseum, but I really think it's bad
to rely on the special space instead of our own local variables!

Sure, we can pass bucket as well. However, if you see a few lines below
(while (BlockNumberIsValid(opaque->hasho_nextblkno))), we are already
relying on special space to pass variables. In general, we are using
special space to pass variables to functions in many other places in
the code. What exactly are you bothered about in accessing special
space, if it is safe to do?

-            /* we ran off the end of the bucket without finding a match */
+            /*
+             * We ran off the end of the bucket without finding a match.
+             * Release the pin on bucket buffers.  Normally, such pins are
+             * released at end of scan, however scrolling cursors can
+             * reacquire the bucket lock and pin in the same scan multiple
+             * times.
+             */
*bufP = so->hashso_curbuf = InvalidBuffer;
ItemPointerSetInvalid(current);
+            _hash_dropscanbuf(rel, so);

I think this comment is saying that we'll release the pin on the
primary bucket page for now, and then reacquire it later if the user
reverses the scan direction. But that doesn't sound very safe,
because the bucket could be split in the meantime and the order in
which tuples are returned could change. I think we want that to
remain stable within a single query execution.

Isn't that possible even without the patch? Basically, after reaching
end of forward scan and for doing backward *all* scan, we need to
perform portal rewind which will in turn call hashrescan where we will
drop the lock on bucket and then again when we try to move cursor
forward we acquire lock in _hash_first(), so in between when we don't
have the lock, the split could happen and next scan results could
differ.

Also, in the documentation, it is mentioned that "The SQL standard
says that it is implementation-dependent whether cursors are sensitive
to concurrent updates of the underlying data by default. In
PostgreSQL, cursors are insensitive by default, and can be made
sensitive by specifying FOR UPDATE." which I think indicates that
results can't be guaranteed for forward and backward scans.

So, even if we try to come up with some solution for stable results in
some scenarios, I am not sure that can be guaranteed for all
scenarios.

+                            /*
+                             * setting hashso_skip_moved_tuples to false
+                             * ensures that we don't check for tuples that are
+                             * moved by split in old bucket and it also
+                             * ensures that we won't retry to scan the old
+                             * bucket once the scan for same is finished.
+                             */
+                            so->hashso_skip_moved_tuples = false;

I think you've got a big problem here. Suppose the user starts the
scan in the new bucket and runs it forward until they end up in the
old bucket. Then they turn around and run the scan backward. When
they reach the beginning of the old bucket, they're going to stop, not
move back to the new bucket, AFAICS. Oops.

After the scan has finished the old bucket and turned back, it will
actually restart the scan (_hash_first) and start from the end of
the new bucket. That is also a problem; it should instead start from
the end of the old bucket, which is what you have mentioned as the
next problem. So, I think if we fix the next problem, we are okay.

_hash_first() has a related problem: a backward scan starts at the end
of the new bucket and moves backward, but it should start at the end
of the old bucket, and then when it reaches the beginning, flip to the
new bucket and move backward through that one. Otherwise, a backward
scan and a forward scan don't return tuples in opposite order, which
they should.

I think what you need to do to fix both of these problems is a more
thorough job gluing the two buckets together. I'd suggest that the
responsibility for switching between the two buckets should probably
be given to _hash_readprev() and _hash_readnext(), because every place
that needs to advance to the next or previous page cares about
this. Right now you are trying to handle it mostly in the functions
that call those functions, but that is prone to errors of omission.

It seems like a better way, so will change accordingly.

Also, I think that so->hashso_skip_moved_tuples is badly designed.
There are two separate facts you need to know: (1) whether you are
scanning a bucket that was still being populated at the start of your
scan and (2) if yes, whether you are scanning the bucket being
populated or whether you are instead scanning the corresponding "old"
bucket. You're trying to keep track of that using one Boolean, but
one Boolean only has two states and there are three possible states
here.

So do you prefer to have two booleans to track those facts, or a
uint8 with a tri-state value, or something else?

acquire cleanup lock on primary bucket page
loop:
    scan and remove tuples
    if this is the last bucket page, break out of loop
    pin and x-lock next page
    release prior lock and pin (except keep pin on primary bucket page)
if the page we have locked is not the primary bucket page:
    release lock and take exclusive lock on primary bucket page
if there are no other pins on the primary bucket page:
    squeeze the bucket to remove free space

Come to think of it, I'm a little worried about the locking in
_hash_squeezebucket(). It seems like we drop the lock on each "write"
bucket page before taking the lock on the next one. So a concurrent
scan could get ahead of the cleanup process. That would be bad,
wouldn't it?

Yeah, that would be bad if it happens, but no concurrent scan can
happen during squeeze phase because we take an exclusive lock on a
bucket page and maintain it throughout the operation.

Thanks for such a detailed review.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#146Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#145)
Re: Hash Indexes

On Thu, Nov 17, 2016 at 12:05 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Are you expecting a comment change here?

+ old_blkno = _hash_get_oldblock_from_newbucket(rel,
opaque->hasho_bucket);

Couldn't you pass "bucket" here instead of "hasho->opaque_bucket"? I
feel like I'm repeating this ad nauseum, but I really think it's bad
to rely on the special space instead of our own local variables!

Sure, we can pass bucket as well. However, if you see few lines below
(while (BlockNumberIsValid(opaque->hasho_nextblkno))), we are already
relying on special space to pass variables. In general, we are using
special space to pass variables to functions in many other places in
the code. What exactly are you bothered about in accessing special
space, if it is safe to do?

I don't want to rely on the special space to know which buffers we
have locked or pinned. We obviously need the special space to find
the next and previous buffers in the block chain -- there's no other
way to know that. However, we should be more careful about locks and
pins. If the special space is corrupted in some way, we still
shouldn't get confused about which buffers we have locked or pinned.

I think this comment is saying that we'll release the pin on the
primary bucket page for now, and then reacquire it later if the user
reverses the scan direction. But that doesn't sound very safe,
because the bucket could be split in the meantime and the order in
which tuples are returned could change. I think we want that to
remain stable within a single query execution.

Isn't that possible even without the patch? Basically, after reaching
end of forward scan and for doing backward *all* scan, we need to
perform portal rewind which will in turn call hashrescan where we will
drop the lock on bucket and then again when we try to move cursor
forward we acquire lock in _hash_first(), so in between when we don't
have the lock, the split could happen and next scan results could
differ.

Well, the existing code doesn't drop the heavyweight lock at that
location, but your patch does drop the pin that serves the same
function, so I feel like there must be some difference.

Also, I think that so->hashso_skip_moved_tuples is badly designed.
There are two separate facts you need to know: (1) whether you are
scanning a bucket that was still being populated at the start of your
scan and (2) if yes, whether you are scanning the bucket being
populated or whether you are instead scanning the corresponding "old"
bucket. You're trying to keep track of that using one Boolean, but
one Boolean only has two states and there are three possible states
here.

So do you prefer to have two booleans to track those facts, or a
uint8 with a tri-state value, or something else?

I don't currently have a preference.

Come to think of it, I'm a little worried about the locking in
_hash_squeezebucket(). It seems like we drop the lock on each "write"
bucket page before taking the lock on the next one. So a concurrent
scan could get ahead of the cleanup process. That would be bad,
wouldn't it?

Yeah, that would be bad if it happens, but no concurrent scan can
happen during squeeze phase because we take an exclusive lock on a
bucket page and maintain it throughout the operation.

Well, that's completely unacceptable. A major reason the current code
uses heavyweight locks is because you can't hold lightweight locks
across arbitrary amounts of work -- because, just to take one example,
a process holding or waiting for an LWLock isn't interruptible. The
point of this redesign was to get rid of that, so that LWLocks are
only held for short periods. I dislike the lock-chaining approach
(take the next lock before releasing the previous one) quite a bit and
really would like to find a way to get rid of that, but the idea of
holding a buffer lock across a complete traversal of an unbounded
number of overflow buckets is far worse. We've got to come up with a
design that doesn't require that, or else completely redesign the
bucket-squeezing stuff.

(Would it make any sense to change the order of the hash index patches
we've got outstanding? For instance, if we did the page-at-a-time
stuff first, it would make life simpler for this patch in several
ways, possibly including this issue.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#147Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#146)
Re: Hash Indexes

On Thu, Nov 17, 2016 at 10:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Nov 17, 2016 at 12:05 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think this comment is saying that we'll release the pin on the
primary bucket page for now, and then reacquire it later if the user
reverses the scan direction. But that doesn't sound very safe,
because the bucket could be split in the meantime and the order in
which tuples are returned could change. I think we want that to
remain stable within a single query execution.

Isn't that possible even without the patch? Basically, after reaching
end of forward scan and for doing backward *all* scan, we need to
perform portal rewind which will in turn call hashrescan where we will
drop the lock on bucket and then again when we try to move cursor
forward we acquire lock in _hash_first(), so in between when we don't
have the lock, the split could happen and next scan results could
differ.

Well, the existing code doesn't drop the heavyweight lock at that
location, but your patch does drop the pin that serves the same
function, so I feel like there must be some difference.

Yes, but I am not sure if existing code is right. Consider below scenario,

Session-1

Begin;
Declare c cursor for select * from t4 where c1=1;
Fetch forward all from c; --here shared heavy-weight lock count becomes 1
Fetch prior from c; --here shared heavy-weight lock count becomes 2
close c; -- here, lock release will reduce the lock count and shared
heavy-weight lock count becomes 1

Now, if we try to insert from another session, such that it leads to
bucket-split of the bucket for which session-1 had used a cursor, it
will wait for session-1. The insert can only proceed after session-1
performs the commit. I think after the cursor is closed in session-1,
the insert from another session should succeed, don't you think so?

Come to think of it, I'm a little worried about the locking in
_hash_squeezebucket(). It seems like we drop the lock on each "write"
bucket page before taking the lock on the next one. So a concurrent
scan could get ahead of the cleanup process. That would be bad,
wouldn't it?

Yeah, that would be bad if it happens, but no concurrent scan can
happen during squeeze phase because we take an exclusive lock on a
bucket page and maintain it throughout the operation.

Well, that's completely unacceptable. A major reason the current code
uses heavyweight locks is because you can't hold lightweight locks
across arbitrary amounts of work -- because, just to take one example,
a process holding or waiting for an LWLock isn't interruptible. The
point of this redesign was to get rid of that, so that LWLocks are
only held for short periods. I dislike the lock-chaining approach
(take the next lock before releasing the previous one) quite a bit and
really would like to find a way to get rid of that, but the idea of
holding a buffer lock across a complete traversal of an unbounded
number of overflow buckets is far worse. We've got to come up with a
design that doesn't require that, or else completely redesign the
bucket-squeezing stuff.

I think we can use the idea of lock-chaining (take the next lock
before releasing the previous one) for the squeeze phase to solve this
issue. Basically, for the squeeze operation, what we need to ensure is
that there shouldn't be any scan already in the bucket before we start
the squeeze, and that any scan which starts afterward always stays
behind the write end of the squeeze. If we follow this, then there
shouldn't be any problem even for backward scans, because a backward
scan needs to start with the first bucket page and reach the last
bucket page by locking each bucket page in read mode.
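
To illustrate the shape of that chaining for the write end (a sketch of
the idea only, not tested code; the real _hash_squeezebucket also has to
keep a separate "read" page at the tail of the chain from which tuples are
pulled, and rel, bucket_buf and bstrategy are the usual locals of that
function):

    /*
     * Sketch of the chaining rule only: never give up the lock on the
     * current write page before the lock on its successor is held, so a
     * concurrent scan can never overtake the squeeze.
     */
    wbuf = bucket_buf;      /* write page; starts at the primary bucket page */
    wopaque = (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(wbuf));

    while (BlockNumberIsValid(wopaque->hasho_nextblkno))
    {
        Buffer      nbuf;

        /* lock the next overflow page before releasing the current one */
        nbuf = _hash_getbuf_with_strategy(rel, wopaque->hasho_nextblkno,
                                          HASH_WRITE, LH_OVERFLOW_PAGE,
                                          bstrategy);

        /* ... move tuples from the tail of the chain into wbuf here ... */

        if (wbuf == bucket_buf)
            _hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK); /* keep the pin */
        else
            _hash_relbuf(rel, wbuf);

        wbuf = nbuf;
        wopaque = (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(wbuf));
    }
    /* finally release the lock (and pin, unless primary page) on the last write page */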

(Would it make any sense to change the order of the hash index patches
we've got outstanding? For instance, if we did the page-at-a-time
stuff first, it would make life simpler for this patch in several
ways, possibly including this issue.)

I agree that page-at-a-time can help hash indexes, but I don't think
it can help with this particular issue of squeeze operation. While
cleaning dead-tuples, it would be okay even if scan went ahead of
cleanup (considering we have page-at-a-time mode), but for squeeze, we
can't afford that because it can move some tuples to a prior bucket
page and scan would miss those tuples. Also, page-at-a-time will help
cleaning tuples only for MVCC scans (it might not help for unlogged
tables scan or non-MVCC scans). Another point is that we don't have a
patch for page-at-a-time scan ready at this stage.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#148Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#147)
Re: Hash Indexes

On Fri, Nov 18, 2016 at 12:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Nov 17, 2016 at 10:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Nov 17, 2016 at 12:05 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think this comment is saying that we'll release the pin on the
primary bucket page for now, and then reacquire it later if the user
reverses the scan direction. But that doesn't sound very safe,
because the bucket could be split in the meantime and the order in
which tuples are returned could change. I think we want that to
remain stable within a single query execution.

Isn't that possible even without the patch? Basically, after reaching
end of forward scan and for doing backward *all* scan, we need to
perform portal rewind which will in turn call hashrescan where we will
drop the lock on bucket and then again when we try to move cursor
forward we acquire lock in _hash_first(), so in between when we don't
have the lock, the split could happen and next scan results could
differ.

Well, the existing code doesn't drop the heavyweight lock at that
location, but your patch does drop the pin that serves the same
function, so I feel like there must be some difference.

Yes, but I am not sure if existing code is right. Consider below scenario,

Session-1

Begin;
Declare c cursor for select * from t4 where c1=1;
Fetch forward all from c; --here shared heavy-weight lock count becomes 1
Fetch prior from c; --here shared heavy-weight lock count becomes 2
close c; -- here, lock release will reduce the lock count and shared
heavy-weight lock count becomes 1

Now, if we try to insert from another session, such that it leads to
bucket-split of the bucket for which session-1 had used a cursor, it
will wait for session-1.

It will not wait, but will just skip the split because we are using a
try lock. However, the point remains that a select should not hold
bucket-level locks even after the cursor is closed.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#149Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#144)
1 attachment(s)
Re: Hash Indexes

On Thu, Nov 17, 2016 at 3:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Nov 12, 2016 at 12:49 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

You are right and I have changed the code as per your suggestion.

So...

+        /*
+         * We always maintain the pin on bucket page for whole scan operation,
+         * so releasing the additional pin we have acquired here.
+         */
+        if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+            _hash_dropbuf(rel, *bufp);

This relies on the page contents to know whether we took a pin; that
seems like a bad plan. We need to know intrinsically whether we took
a pin.

Okay, changed to not rely on page contents.

+     * If the bucket split is in progress, then we need to skip tuples that
+     * are moved from old bucket.  To ensure that vacuum doesn't clean any
+     * tuples from old or new buckets till this scan is in progress, maintain
+     * a pin on both of the buckets.  Here, we have to be cautious about

It wouldn't be a problem if VACUUM removed tuples from the new bucket,
because they'd have to be dead anyway. It also wouldn't be a problem
if it removed tuples from the old bucket that were actually dead. The
real issue isn't vacuum anyway, but the process of cleaning up after a
split. We need to hold the pin so that tuples being moved from the
old bucket to the new bucket by the split don't get removed from the
old bucket until our scan is done.

Updated comments to explain clearly.

+ old_blkno = _hash_get_oldblock_from_newbucket(rel,
opaque->hasho_bucket);

Couldn't you pass "bucket" here instead of "opaque->hasho_bucket"? I
feel like I'm repeating this ad nauseum, but I really think it's bad
to rely on the special space instead of our own local variables!

Okay, changed as per suggestion.

-            /* we ran off the end of the bucket without finding a match */
+            /*
+             * We ran off the end of the bucket without finding a match.
+             * Release the pin on bucket buffers.  Normally, such pins are
+             * released at end of scan, however scrolling cursors can
+             * reacquire the bucket lock and pin in the same scan multiple
+             * times.
+             */
*bufP = so->hashso_curbuf = InvalidBuffer;
ItemPointerSetInvalid(current);
+            _hash_dropscanbuf(rel, so);

I think this comment is saying that we'll release the pin on the
primary bucket page for now, and then reacquire it later if the user
reverses the scan direction. But that doesn't sound very safe,
because the bucket could be split in the meantime and the order in
which tuples are returned could change. I think we want that to
remain stable within a single query execution.

As explained [1]/messages/by-id/CAA4eK1JJDWFY0_Ezs4ZxXgnrGtTn48vFuXniOLmL7FOWX-tKNw@mail.gmail.com, this shouldn't be a problem.

+            _hash_readnext(rel, &buf, &page, &opaque,
+                       (opaque->hasho_flag & LH_BUCKET_PAGE) ? true : false);

Same comment: don't rely on the special space to figure this out.
Keep track. Also != 0 would be better than ? true : false.

After gluing the scans of the old and new buckets in the _hash_read* APIs,
this is no longer required.

+                            /*
+                             * setting hashso_skip_moved_tuples to false
+                             * ensures that we don't check for tuples that are
+                             * moved by split in old bucket and it also
+                             * ensures that we won't retry to scan the old
+                             * bucket once the scan for same is finished.
+                             */
+                            so->hashso_skip_moved_tuples = false;

I think you've got a big problem here. Suppose the user starts the
scan in the new bucket and runs it forward until they end up in the
old bucket. Then they turn around and run the scan backward. When
they reach the beginning of the old bucket, they're going to stop, not
move back to the new bucket, AFAICS. Oops.

_hash_first() has a related problem: a backward scan starts at the end
of the new bucket and moves backward, but it should start at the end
of the old bucket, and then when it reaches the beginning, flip to the
new bucket and move backward through that one. Otherwise, a backward
scan and a forward scan don't return tuples in opposite order, which
they should.

I think what you need to do to fix both of these problems is a more
thorough job gluing the two buckets together. I'd suggest that the
responsibility for switching between the two buckets should probably
be given to _hash_readprev() and _hash_readnext(), because every place
that needs to advance to the next or previous page cares about this.
Right now you are trying to handle it mostly in the functions
that call those functions, but that is prone to errors of omission.

Changed the APIs as per this idea and fixed the problem.
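
As a rough illustration of the "gluing" idea (a toy sketch with made-up
structures, not the real _hash_readnext()/_hash_readprev() code): while a
split is in progress, a scan of the new bucket skips the moved-by-split
copies and then continues into the old bucket, so each tuple is seen exactly
once; the real code additionally filters on the hash key and handles backward
scans, both omitted here:

#include <stdbool.h>
#include <stdio.h>

/* Toy tuple: just a key plus the moved-by-split marker. */
struct toy_tuple
{
	int		key;
	bool	moved_by_split;
};

/*
 * Scan the new bucket, ignoring tuples copied there by the in-progress split,
 * then continue into the old bucket where the originals still live.
 */
static void
scan_glued_buckets(const struct toy_tuple *new_bucket, int nnew,
				   const struct toy_tuple *old_bucket, int nold)
{
	for (int i = 0; i < nnew; i++)
		if (!new_bucket[i].moved_by_split)
			printf("new bucket: key %d\n", new_bucket[i].key);

	for (int i = 0; i < nold; i++)
		printf("old bucket: key %d\n", old_bucket[i].key);
}

int
main(void)
{
	/* key 42 was copied to the new bucket by the split and flagged there */
	struct toy_tuple newb[] = {{10, false}, {42, true}};
	struct toy_tuple oldb[] = {{42, false}, {7, false}};

	scan_glued_buckets(newb, 2, oldb, 2);
	return 0;
}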

Also, I think that so->hashso_skip_moved_tuples is badly designed.
There are two separate facts you need to know: (1) whether you are
scanning a bucket that was still being populated at the start of your
scan and (2) if yes, whether you are scanning the bucket being
populated or whether you are instead scanning the corresponding "old"
bucket. You're trying to keep track of that using one Boolean, but
one Boolean only has two states and there are three possible states
here.

The updated patch uses two boolean variables to track the bucket state.
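
For reference, a minimal stand-alone C sketch of how two booleans cover the
three states described above (the field names are shortened from the patch's
hashso_buc_populated/hashso_buc_split; this is an illustration, not the actual
scan-opaque structure):

#include <stdbool.h>
#include <stdio.h>

/*
 * Illustrative only: two booleans are enough for the three scan states.
 *
 *   buc_populated  buc_split   meaning
 *   false          false       no split was in progress at scan start
 *   true           false       scanning the new bucket being populated
 *   true           true        scanning the corresponding old bucket
 */
struct scan_state
{
	bool	buc_populated;	/* bucket was being populated at scan start */
	bool	buc_split;		/* currently scanning the old ("split") bucket */
};

static const char *
describe(const struct scan_state *s)
{
	if (!s->buc_populated)
		return "no split in progress at scan start";
	return s->buc_split ? "scanning old bucket of the split"
						: "scanning new bucket being populated";
}

int
main(void)
{
	struct scan_state states[] = {
		{false, false}, {true, false}, {true, true}
	};

	for (int i = 0; i < 3; i++)
		printf("%s\n", describe(&states[i]));
	return 0;
}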

+    if (H_BUCKET_BEING_SPLIT(pageopaque) && IsBufferCleanupOK(buf))
+    {
+
+        /* release the lock on bucket buffer, before completing the split. */

Extra blank line.

Removed.

+moved-by-split flag on a tuple indicates that tuple is moved from old to new
+bucket.  The concurrent scans can skip such tuples till the split operation is
+finished.  Once the tuple is marked as moved-by-split, it will remain so forever
+but that does no harm.  We have intentionally not cleared it as that can generate
+an additional I/O which is not necessary.

The first sentence needs to start with "the" but the second sentence shouldn't.

Changed.

It would be good to adjust this part a bit to more clearly explain
that the split-in-progress and split-cleanup flags are bucket-level
flags, while moved-by-split is a per-tuple flag. It's possible to
figure this out from what you've written, but I think it could be more
clear. Another thing that is strange is that the code uses THREE
flags, bucket-being-split, bucket-being-populated, and
needs-split-cleanup, but the README conflates the first two and uses a
different name.

Updated the patch to use bucket-being-split and bucket-being-populated to
explain the split operation in the README. I have also changed the README
to clearly indicate which flags are bucket-level and which are tuple-level.

+previously-acquired content lock, but not pin and repeat the process using the

s/but not pin/but not the pin,/

Changed.

A problem is that if a split fails partway through (eg due to insufficient
-disk space) the index is left corrupt.  The probability of that could be
-made quite low if we grab a free page or two before we update the meta
-page, but the only real solution is to treat a split as a WAL-loggable,
+disk space or crash) the index is left corrupt.  The probability of that
+could be made quite low if we grab a free page or two before we update the
+meta page, but the only real solution is to treat a split as a WAL-loggable,
must-complete action.  I'm not planning to teach hash about WAL in this
-go-round.
+go-round.  However, we do try to finish the incomplete splits during insert
+and split.

I think this paragraph needs a much heavier rewrite explaining the new
incomplete split handling. It's basically wrong now. Perhaps replace
it with something like this:

--
If a split fails partway through (e.g. due to insufficient disk space
or an interrupt), the index will not be corrupted. Instead, we'll
retry the split every time a tuple is inserted into the old bucket
prior to inserting the new tuple; eventually, we should succeed. The
fact that a split is left unfinished doesn't prevent subsequent
buckets from being split, but we won't try to split the bucket again
until the prior split is finished. In other words, a bucket can be in
the middle of being split for some time, but ti can't be in the middle
of two splits at the same time.

Although we can survive a failure to split a bucket, a crash is likely
to corrupt the index, since hash indexes are not yet WAL-logged.
--

s/ti/it
Fixed the typo and used the suggested text in README.

+        Acquire cleanup lock on target bucket
+        Scan and remove tuples
+        For overflow page, first we need to lock the next page and then
+        release the lock on current bucket or overflow page
+        Ensure to have buffer content lock in exclusive mode on bucket page
+        If buffer pincount is one, then compact free space as needed
+        Release lock

I don't think this summary is particularly correct. You would never
guess from this that we lock each bucket page in turn and then go back
and try to relock the primary bucket page at the end. It's more like:

acquire cleanup lock on primary bucket page
loop:
    scan and remove tuples
    if this is the last bucket page, break out of loop
    pin and x-lock next page
    release prior lock and pin (except keep pin on primary bucket page)
if the page we have locked is not the primary bucket page:
    release lock and take exclusive lock on primary bucket page
if there are no other pins on the primary bucket page:
    squeeze the bucket to remove free space

Yeah, it is clearer, so I have used it in the README.

Come to think of it, I'm a little worried about the locking in
_hash_squeezebucket(). It seems like we drop the lock on each "write"
bucket page before taking the lock on the next one. So a concurrent
scan could get ahead of the cleanup process. That would be bad,
wouldn't it?

As discussed [2]/messages/by-id/CAA4eK1J+0OYWKswWYNEjrBk3LfGpGJ9iSV8bYPQ3M=-qpkMtwQ@mail.gmail.com, I have changed the code to use lock-chaining during
the squeeze phase.
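
For readers unfamiliar with the term, here is a small self-contained C sketch
of lock chaining (lock coupling) in the abstract, using plain pthread mutexes
rather than PostgreSQL buffer locks; the only point is the ordering, i.e.
acquire the next page's lock before releasing the current one, so a scan
following behind can never overtake the cleanup pass:

#include <pthread.h>
#include <stdio.h>

#define NPAGES 4

/* One placeholder lock per "page" in a bucket chain (not real buffers). */
static pthread_mutex_t page_lock[NPAGES];

static void
cleanup_chain(void)
{
	pthread_mutex_lock(&page_lock[0]);
	for (int i = 0; i < NPAGES; i++)
	{
		printf("cleaning page %d\n", i);	/* remove dead tuples here */
		if (i + 1 < NPAGES)
			pthread_mutex_lock(&page_lock[i + 1]);	/* lock next page first... */
		pthread_mutex_unlock(&page_lock[i]);		/* ...then release current */
	}
}

int
main(void)
{
	for (int i = 0; i < NPAGES; i++)
		pthread_mutex_init(&page_lock[i], NULL);
	cleanup_chain();
	return 0;
}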

Apart from the above, I have fixed a bug in the calculation of lowmask in
_hash_get_oldblock_from_newbucket().

[1]: /messages/by-id/CAA4eK1JJDWFY0_Ezs4ZxXgnrGtTn48vFuXniOLmL7FOWX-tKNw@mail.gmail.com
[2]: /messages/by-id/CAA4eK1J+0OYWKswWYNEjrBk3LfGpGJ9iSV8bYPQ3M=-qpkMtwQ@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

concurrent_hash_index_v12.patch (application/octet-stream)
diff --git a/src/backend/access/hash/Makefile b/src/backend/access/hash/Makefile
index 5d3bd94..e2e7e91 100644
--- a/src/backend/access/hash/Makefile
+++ b/src/backend/access/hash/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/access/hash
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashscan.o \
-       hashsearch.o hashsort.o hashutil.o hashvalidate.o
+OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashsearch.o \
+       hashsort.o hashutil.o hashvalidate.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 0a7da89..4259de9 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -125,54 +125,61 @@ the initially created buckets.
 
 Lock Definitions
 ----------------
-
-We use both lmgr locks ("heavyweight" locks) and buffer context locks
-(LWLocks) to control access to a hash index.  lmgr locks are needed for
-long-term locking since there is a (small) risk of deadlock, which we must
-be able to detect.  Buffer context locks are used for short-term access
-control to individual pages of the index.
-
-LockPage(rel, page), where page is the page number of a hash bucket page,
-represents the right to split or compact an individual bucket.  A process
-splitting a bucket must exclusive-lock both old and new halves of the
-bucket until it is done.  A process doing VACUUM must exclusive-lock the
-bucket it is currently purging tuples from.  Processes doing scans or
-insertions must share-lock the bucket they are scanning or inserting into.
-(It is okay to allow concurrent scans and insertions.)
-
-The lmgr lock IDs corresponding to overflow pages are currently unused.
-These are available for possible future refinements.  LockPage(rel, 0)
-is also currently undefined (it was previously used to represent the right
-to modify the hash-code-to-bucket mapping, but it is no longer needed for
-that purpose).
-
-Note that these lock definitions are conceptually distinct from any sort
-of lock on the pages whose numbers they share.  A process must also obtain
-read or write buffer lock on the metapage or bucket page before accessing
-said page.
-
-Processes performing hash index scans must hold share lock on the bucket
-they are scanning throughout the scan.  This seems to be essential, since
-there is no reasonable way for a scan to cope with its bucket being split
-underneath it.  This creates a possibility of deadlock external to the
-hash index code, since a process holding one of these locks could block
-waiting for an unrelated lock held by another process.  If that process
-then does something that requires exclusive lock on the bucket, we have
-deadlock.  Therefore the bucket locks must be lmgr locks so that deadlock
-can be detected and recovered from.
-
-Processes must obtain read (share) buffer context lock on any hash index
-page while reading it, and write (exclusive) lock while modifying it.
-To prevent deadlock we enforce these coding rules: no buffer lock may be
-held long term (across index AM calls), nor may any buffer lock be held
-while waiting for an lmgr lock, nor may more than one buffer lock
-be held at a time by any one process.  (The third restriction is probably
-stronger than necessary, but it makes the proof of no deadlock obvious.)
+Concurrency control for hash indexes is provided using buffer content
+locks, buffer pins, and cleanup locks.   Here as elsewhere in PostgreSQL,
+cleanup lock means that we hold an exclusive lock on the buffer and have
+observed at some point after acquiring the lock that we hold the only pin
+on that buffer.  For hash indexes, a cleanup lock on a primary bucket page
+represents the right to perform an arbitrary reorganization of the entire
+bucket.  Therefore, scans retain a pin on the primary bucket page for the
+bucket they are currently scanning.  Splitting a bucket requires a cleanup
+lock on both the old and new primary bucket pages.  VACUUM therefore takes
+a cleanup lock on every bucket page in order to remove tuples.  It can also
+remove tuples copied to a new bucket by any previous split operation, because
+the cleanup lock taken on the primary bucket page guarantees that no scans
+which started prior to the most recent split can still be in progress.  After
+cleaning each page individually, it attempts to take a cleanup lock on the
+primary bucket page in order to "squeeze" the bucket down to the minimum
+possible number of pages.
+
+To avoid deadlocks, we must be consistent about the lock order in which we
+lock the buckets for operations that require locks on two different buckets.
+We use the rule "first lock the old bucket and then the new bucket", i.e.
+lock the lower-numbered bucket first.
+
+To avoid deadlock in operations that require locking the metapage and other
+buckets, we always lock the other bucket first and then the metapage.
 
 
 Pseudocode Algorithms
 ---------------------
 
+Various flags that are used in hash index operations are described as below:
+
+The bucket-being-split and bucket-being-populated flags indicate that a split
+operation is in progress for a bucket.  During a split operation, the
+bucket-being-split flag is set on the old bucket and the bucket-being-populated
+flag is set on the new bucket.  These flags are cleared once the split operation
+is finished.  Both are bucket-level flags.
+
+The moved-by-split flag on a tuple indicates that the tuple was moved from the
+old to the new bucket.  Concurrent scans can skip such tuples until the split
+operation is finished.  Once a tuple is marked as moved-by-split, it remains so
+forever, but that does no harm.  We intentionally do not clear it, as doing so
+would generate additional I/O that is not necessary.  This is a tuple-level flag.
+
+The split-cleanup flag indicates that the bucket contains tuples that were moved
+due to a split.  It is set only on the old bucket.  We need it in addition to
+the bucket-being-split flag to distinguish the case when the split is over
+(i.e. the bucket-being-split flag has been cleared).  It is used both by vacuum
+and during a re-split operation.  Vacuum uses it to decide whether it needs to
+clear the tuples that were moved by split from the bucket along with dead tuples.
+A re-split of the bucket uses it to ensure that it doesn't start a new split from
+a bucket without first clearing the previously moved tuples from the old bucket.
+This usage by re-split helps to keep bloat under control and makes the design
+somewhat simpler, as we never have to handle the situation where a bucket can
+contain dead tuples from multiple splits.  This is a bucket-level flag.
+
 The operations we need to support are: readers scanning the index for
 entries of a particular hash code (which by definition are all in the same
 bucket); insertion of a new tuple into the correct bucket; enlarging the
@@ -193,38 +200,51 @@ The reader algorithm is:
 		release meta page buffer content lock
 		if (correct bucket page is already locked)
 			break
-		release any existing bucket page lock (if a concurrent split happened)
-		take heavyweight bucket lock
+		release any existing bucket page buffer content lock (if a concurrent split happened)
+		take the buffer content lock on bucket page in shared mode
 		retake meta page buffer content lock in shared mode
--- then, per read request:
 	release pin on metapage
-	read current page of bucket and take shared buffer content lock
-		step to next page if necessary (no chaining of locks)
+	if the split is in progress for current bucket and this is a new bucket
+		release the buffer content lock on current bucket page
+		pin and acquire the buffer content lock on old bucket in shared mode
+		release the buffer content lock on old bucket, but not pin
+		retake the buffer content lock on new bucket
+		mark the scan such that it skips the tuples that are marked as moved by split
+-- then, per read request:
+	step to next page if necessary (no chaining of locks)
+		if the scan indicates moved by split, then move to old bucket after the scan
+		of current bucket is finished
 	get tuple
 	release buffer content lock and pin on current page
 -- at scan shutdown:
-	release bucket share-lock
-
-We can't hold the metapage lock while acquiring a lock on the target bucket,
-because that might result in an undetected deadlock (lwlocks do not participate
-in deadlock detection).  Instead, we relock the metapage after acquiring the
-bucket page lock and check whether the bucket has been split.  If not, we're
-done.  If so, we release our previously-acquired lock and repeat the process
-using the new bucket number.  Holding the bucket sharelock for
-the remainder of the scan prevents the reader's current-tuple pointer from
-being invalidated by splits or compactions.  Notice that the reader's lock
-does not prevent other buckets from being split or compacted.
+	release any pin we hold on current buffer, old bucket buffer, new bucket buffer
+
+We don't want to hold the meta page lock while acquiring the content lock on
+bucket page, because that might result in poor concurrency.  Instead, we relock
+the metapage after acquiring the bucket page content lock and check whether the
+bucket has been split.  If not, we're done.  If so, we release our
+previously-acquired content lock, but not the pin, and repeat the process using
+the new bucket number.  Holding the buffer pin on bucket page for the remainder
+of the scan prevents the reader's current-tuple pointer from being invalidated
+by splits or compactions.  Notice that the reader's pin does not prevent other
+buckets from being split or compacted.
 
 To keep concurrency reasonably good, we require readers to cope with
 concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the page sharelock.  Since
-deletion is not possible while a reader holds the bucket sharelock, and
-we assume that heap tuple TIDs are unique, this can be implemented by
+their current scan position after re-acquiring the buffer content lock on
+page.  Since deletion is not possible while a reader holds the pin on bucket,
+and we assume that heap tuple TIDs are unique, this can be implemented by
 searching for the same heap tuple TID previously returned.  Insertion does
 not move index entries across pages, so the previously-returned index entry
 should always be on the same page, at the same or higher offset number,
 as it was before.
 
+To allow scans during a bucket split: if, at the start of the scan, the bucket
+is marked as bucket-being-populated, the scan reads all the tuples in that
+bucket except those marked as moved-by-split.  Once it finishes scanning all the
+tuples in the current bucket, it scans the old bucket from which this bucket was
+formed by the split.  This happens only for the new half of a split bucket.
+
 The insertion algorithm is rather similar:
 
 	pin meta page and take buffer content lock in shared mode
@@ -233,18 +253,27 @@ The insertion algorithm is rather similar:
 		release meta page buffer content lock
 		if (correct bucket page is already locked)
 			break
-		release any existing bucket page lock (if a concurrent split happened)
-		take heavyweight bucket lock in shared mode
+		release any existing bucket page buffer content lock (if a concurrent split happened)
+		take the buffer content lock on bucket page in exclusive mode
 		retake meta page buffer content lock in shared mode
--- (so far same as reader)
 	release pin on metapage
-	pin current page of bucket and take exclusive buffer content lock
-	if full, release, read/exclusive-lock next page; repeat as needed
+-- (so far same as reader, except for acquisition of buffer content lock in
+	exclusive mode on primary bucket page)
+	if the bucket-being-split flag is set for a bucket and pin count on it is
+	one, then finish the split
+		release the buffer content lock on current bucket
+		get the new bucket (bucket which was in process of split from current bucket) using current bucket
+		scan the new bucket and form the hash table of TIDs
+		conditionally get the cleanup lock on old and new buckets
+		if we get the lock on both the buckets
+			finish the split using algorithm mentioned below for split
+		release the pin on old bucket and restart the insert from beginning.
+	if current page is full, release lock but not pin, read/exclusive-lock next page; repeat as needed
 	>> see below if no space in any page of bucket
 	insert tuple at appropriate place in page
 	mark current page dirty and release buffer content lock and pin
-	release heavyweight share-lock
-	pin meta page and take buffer content lock in shared mode
+	if the current page is not a bucket page, release the pin on bucket page
+	pin meta page and take buffer content lock in exclusive mode
 	increment tuple count, decide if split needed
 	mark meta page dirty and release buffer content lock and pin
 	done if no split needed, else enter Split algorithm below
@@ -256,11 +285,13 @@ bucket that is being actively scanned, because readers can cope with this
 as explained above.  We only need the short-term buffer locks to ensure
 that readers do not see a partially-updated page.
 
-It is clearly impossible for readers and inserters to deadlock, and in
-fact this algorithm allows them a very high degree of concurrency.
-(The exclusive metapage lock taken to update the tuple count is stronger
-than necessary, since readers do not care about the tuple count, but the
-lock is held for such a short time that this is probably not an issue.)
+To avoid deadlock between readers and inserters, whenever there is a need
+to lock multiple buckets, we always take in the order suggested in Lock
+Definitions above.  This algorithm allows them a very high degree of
+concurrency.  (The exclusive metapage lock taken to update the tuple count
+is stronger than necessary, since readers do not care about the tuple count,
+but the lock is held for such a short time that this is probably not an
+issue.)
 
 When an inserter cannot find space in any existing page of a bucket, it
 must obtain an overflow page and add that page to the bucket's chain.
@@ -271,46 +302,72 @@ index is overfull (has a higher-than-wanted ratio of tuples to buckets).
 The algorithm attempts, but does not necessarily succeed, to split one
 existing bucket in two, thereby lowering the fill ratio:
 
-	pin meta page and take buffer content lock in exclusive mode
-	check split still needed
-	if split not needed anymore, drop buffer content lock and pin and exit
-	decide which bucket to split
-	Attempt to X-lock old bucket number (definitely could fail)
-	Attempt to X-lock new bucket number (shouldn't fail, but...)
-	if above fail, drop locks and pin and exit
+	expand:
+		take buffer content lock in exclusive mode on meta page
+		check split still needed
+		if split not needed anymore, drop buffer content lock and exit
+		decide which bucket to split
+		Attempt to acquire cleanup lock on old bucket number (definitely could fail)
+		if above fail, release lock and pin and exit
+		if the bucket-being-split flag is set, then finish the split
+			conditionally get the content lock on new bucket which was involved in split
+			if got the lock on new bucket
+				finish the split using algorithm mentioned below for split
+				release the buffer content lock and pin on old and new buckets
+				try to expand from start
+			else
+				release the buffer content lock and pin on old bucket and exit
+		if the split-cleanup flag (indicates that tuples are moved by split) is set on bucket
+			release the buffer content lock on meta page
+			remove the tuples that doesn't belong to this bucket; see bucket cleanup below
+	Attempt to acquire cleanup lock on new bucket number (shouldn't fail, but...)
 	update meta page to reflect new number of buckets
-	mark meta page dirty and release buffer content lock and pin
+	mark meta page dirty and release buffer content lock
 	-- now, accesses to all other buckets can proceed.
 	Perform actual split of bucket, moving tuples as needed
 	>> see below about acquiring needed extra space
-	Release X-locks of old and new buckets
+
+	split guts
+	mark the old and new buckets indicating split is in progress
+	if we are finishing the incomplete split
+		probe the temporary hash table to check if the value already exists in new bucket
+	copy the tuples that belong to the new bucket from the old bucket
+	during the copy, mark such tuples as moved-by-split
+	release lock but not pin for primary bucket page of old bucket,
+	read/shared-lock next page; repeat as needed
+	>> see below if no space in bucket page of new bucket
+	ensure to have exclusive-lock on both old and new buckets in that order
+	clear the bucket-being-split and bucket-being-populated flag from both the buckets respectively
+	mark the old bucket indicating split-cleanup
+	mark buffers dirty and release the locks and pins on both old and new buckets
 
 Note the metapage lock is not held while the actual tuple rearrangement is
 performed, so accesses to other buckets can proceed in parallel; in fact,
 it's possible for multiple bucket splits to proceed in parallel.
 
-Split's attempt to X-lock the old bucket number could fail if another
-process holds S-lock on it.  We do not want to wait if that happens, first
-because we don't want to wait while holding the metapage exclusive-lock,
-and second because it could very easily result in deadlock.  (The other
-process might be out of the hash AM altogether, and could do something
-that blocks on another lock this process holds; so even if the hash
-algorithm itself is deadlock-free, a user-induced deadlock could occur.)
-So, this is a conditional LockAcquire operation, and if it fails we just
-abandon the attempt to split.  This is all right since the index is
-overfull but perfectly functional.  Every subsequent inserter will try to
-split, and eventually one will succeed.  If multiple inserters failed to
-split, the index might still be overfull, but eventually, the index will
+The split operation's attempt to acquire cleanup-lock on the old bucket number
+could fail if another process holds any lock or pin on it.  We do not want to
+wait if that happens, because we don't want to wait while holding the metapage
+exclusive-lock.  So, this is a conditional LWLockAcquire operation, and if
+it fails we just abandon the attempt to split.  This is all right since the
+index is overfull but perfectly functional.  Every subsequent inserter will
+try to split, and eventually one will succeed.  If multiple inserters failed
+to split, the index might still be overfull, but eventually, the index will
 not be overfull and split attempts will stop.  (We could make a successful
 splitter loop to see if the index is still overfull, but it seems better to
 distribute the split overhead across successive insertions.)
 
-A problem is that if a split fails partway through (eg due to insufficient
-disk space) the index is left corrupt.  The probability of that could be
-made quite low if we grab a free page or two before we update the meta
-page, but the only real solution is to treat a split as a WAL-loggable,
-must-complete action.  I'm not planning to teach hash about WAL in this
-go-round.
+If a split fails partway through (e.g. due to insufficient disk space or an
+interrupt), the index will not be corrupted.  Instead, we'll retry the split
+every time a tuple is inserted into the old bucket prior to inserting the new
+tuple; eventually, we should succeed.  The fact that a split is left
+unfinished doesn't prevent subsequent buckets from being split, but we won't
+try to split the bucket again until the prior split is finished.  In other
+words, a bucket can be in the middle of being split for some time, but it can't
+be in the middle of two splits at the same time.
+
+Although we can survive a failure to split a bucket, a crash is likely to
+corrupt the index, since hash indexes are not yet WAL-logged.
 
 The fourth operation is garbage collection (bulk deletion):
 
@@ -319,9 +376,17 @@ The fourth operation is garbage collection (bulk deletion):
 	fetch current max bucket number
 	release meta page buffer content lock and pin
 	while next bucket <= max bucket do
-		Acquire X lock on target bucket
-		Scan and remove tuples, compact free space as needed
-		Release X lock
+		acquire cleanup lock on primary bucket page
+		loop:
+			scan and remove tuples
+			if this is the last bucket page, break out of loop
+			pin and x-lock next page
+			release prior lock and pin (except keep pin on primary bucket page)
+		if the page we have locked is not the primary bucket page:
+			release lock and take exclusive lock on primary bucket page
+		if there are no other pins on the primary bucket page:
+			squeeze the bucket to remove free space
+		release the pin on primary bucket page
 		next bucket ++
 	end loop
 	pin metapage and take buffer content lock in exclusive mode
@@ -330,20 +395,24 @@ The fourth operation is garbage collection (bulk deletion):
 	else update metapage tuple count
 	mark meta page dirty and release buffer content lock and pin
 
-Note that this is designed to allow concurrent splits.  If a split occurs,
-tuples relocated into the new bucket will be visited twice by the scan,
-but that does no harm.  (We must however be careful about the statistics
-reported by the VACUUM operation.  What we can do is count the number of
-tuples scanned, and believe this in preference to the stored tuple count
-if the stored tuple count and number of buckets did *not* change at any
-time during the scan.  This provides a way of correcting the stored tuple
-count if it gets out of sync for some reason.  But if a split or insertion
-does occur concurrently, the scan count is untrustworthy; instead,
-subtract the number of tuples deleted from the stored tuple count and
-use that.)
-
-The exclusive lock request could deadlock in some strange scenarios, but
-we can just error out without any great harm being done.
+Note that this is designed to allow concurrent splits and scans.  If a
+split occurs, tuples relocated into the new bucket will be visited twice
+by the scan, but that does no harm.  As we release the lock on the bucket page
+during the cleanup scan of a bucket, a concurrent scan can start on the bucket,
+but it is guaranteed to stay behind cleanup.  It is essential to keep scans
+behind cleanup, else vacuum could move tuples that the scan still needs to
+lower TIDs.  Since a scan that returns multiple tuples from the same bucket
+page always expects the next valid TID to be greater than or equal to the
+current TID, it could then miss tuples.  This holds true for backward scans as
+well (backward scans first traverse each bucket starting from the first bucket
+to the last overflow page in the chain).  We must be careful about the statistics
+reported by the VACUUM operation.  What we can do is count the number of tuples
+scanned, and believe this in preference to the stored tuple count if the stored
+tuple count and number of buckets did *not* change at any time during the scan.
+This provides a way of correcting the stored tuple count if it gets out of sync
+for some reason.  But if a split or insertion does occur concurrently, the scan
+count is untrustworthy; instead, subtract the number of tuples deleted from the
+stored tuple count and use that.
 
 
 Free Space Management
@@ -417,13 +486,11 @@ free page; there can be no other process holding lock on it.
 
 Bucket splitting uses a similar algorithm if it has to extend the new
 bucket, but it need not worry about concurrent extension since it has
-exclusive lock on the new bucket.
+buffer content lock in exclusive mode on the new bucket.
 
-Freeing an overflow page is done by garbage collection and by bucket
-splitting (the old bucket may contain no-longer-needed overflow pages).
-In both cases, the process holds exclusive lock on the containing bucket,
-so need not worry about other accessors of pages in the bucket.  The
-algorithm is:
+Freeing an overflow page requires the process to hold buffer content lock in
+exclusive mode on the containing bucket, so need not worry about other
+accessors of pages in the bucket.  The algorithm is:
 
 	delink overflow page from bucket chain
 	(this requires read/update/write/release of fore and aft siblings)
@@ -454,14 +521,6 @@ locks.  Since they need no lmgr locks, deadlock is not possible.
 Other Notes
 -----------
 
-All the shenanigans with locking prevent a split occurring while *another*
-process is stopped in a given bucket.  They do not ensure that one of
-our *own* backend's scans is not stopped in the bucket, because lmgr
-doesn't consider a process's own locks to conflict.  So the Split
-algorithm must check for that case separately before deciding it can go
-ahead with the split.  VACUUM does not have this problem since nothing
-else can be happening within the vacuuming backend.
-
-Should we instead try to fix the state of any conflicting local scan?
-Seems mighty ugly --- got to move the held bucket S-lock as well as lots
-of other messiness.  For now, just punt and don't split.
+Cleanup locks prevent a split from occurring while *another* process is stopped
+in a given bucket.  They also ensure that one of our *own* backend's scans is not
+stopped in the bucket.
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..1e807cf 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -287,10 +287,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		/*
 		 * An insertion into the current index page could have happened while
 		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold share lock on
-		 * the bucket, no deletions or splits could have occurred; therefore
-		 * we can expect that the TID still exists in the current index page,
-		 * at an offset >= where we were.)
+		 * for the TID we previously returned.  (Because we hold a pin on the
+		 * primary bucket page, no deletions or splits could have occurred;
+		 * therefore we can expect that the TID still exists in the current
+		 * index page, at an offset >= where we were.)
 		 */
 		OffsetNumber maxoffnum;
 
@@ -424,17 +424,17 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_bucket_valid = false;
-	so->hashso_bucket_blkno = 0;
 	so->hashso_curbuf = InvalidBuffer;
+	so->hashso_bucket_buf = InvalidBuffer;
+	so->hashso_split_bucket_buf = InvalidBuffer;
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
 	ItemPointerSetInvalid(&(so->hashso_heappos));
 
-	scan->opaque = so;
+	so->hashso_buc_populated = false;
+	so->hashso_buc_split = false;
 
-	/* register scan in case we change pages it's using */
-	_hash_regscan(scan);
+	scan->opaque = so;
 
 	return scan;
 }
@@ -449,15 +449,7 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
-
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	_hash_dropscanbuf(rel, so);
 
 	/* set position invalid (this will cause _hash_first call) */
 	ItemPointerSetInvalid(&(so->hashso_curpos));
@@ -469,8 +461,10 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 		memmove(scan->keyData,
 				scankey,
 				scan->numberOfKeys * sizeof(ScanKeyData));
-		so->hashso_bucket_valid = false;
 	}
+
+	so->hashso_buc_populated = false;
+	so->hashso_buc_split = false;
 }
 
 /*
@@ -482,18 +476,7 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/* don't need scan registered anymore */
-	_hash_dropscan(scan);
-
-	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
-
-	/* release lock on bucket, too */
-	if (so->hashso_bucket_blkno)
-		_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
-	so->hashso_bucket_blkno = 0;
+	_hash_dropscanbuf(rel, so);
 
 	pfree(so);
 	scan->opaque = NULL;
@@ -504,6 +487,9 @@ hashendscan(IndexScanDesc scan)
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
  *
+ * This function also deletes the tuples that are moved by split to other
+ * bucket.
+ *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 IndexBulkDeleteResult *
@@ -548,83 +534,47 @@ loop_top:
 	{
 		BlockNumber bucket_blkno;
 		BlockNumber blkno;
-		bool		bucket_dirty = false;
+		Buffer		bucket_buf;
+		Buffer		buf;
+		HashPageOpaque bucket_opaque;
+		Page		page;
+		bool		split_cleanup = false;
 
 		/* Get address of bucket's start page */
 		bucket_blkno = BUCKET_TO_BLKNO(&local_metapage, cur_bucket);
 
-		/* Exclusive-lock the bucket so we can shrink it */
-		_hash_getlock(rel, bucket_blkno, HASH_EXCLUSIVE);
-
-		/* Shouldn't have any active scans locally, either */
-		if (_hash_has_active_scan(rel, cur_bucket))
-			elog(ERROR, "hash index has active scan during VACUUM");
-
-		/* Scan each page in bucket */
 		blkno = bucket_blkno;
-		while (BlockNumberIsValid(blkno))
-		{
-			Buffer		buf;
-			Page		page;
-			HashPageOpaque opaque;
-			OffsetNumber offno;
-			OffsetNumber maxoffno;
-			OffsetNumber deletable[MaxOffsetNumber];
-			int			ndeletable = 0;
-
-			vacuum_delay_point();
-
-			buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-											 info->strategy);
-			page = BufferGetPage(buf);
-			opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-			Assert(opaque->hasho_bucket == cur_bucket);
-
-			/* Scan each tuple in page */
-			maxoffno = PageGetMaxOffsetNumber(page);
-			for (offno = FirstOffsetNumber;
-				 offno <= maxoffno;
-				 offno = OffsetNumberNext(offno))
-			{
-				IndexTuple	itup;
-				ItemPointer htup;
 
-				itup = (IndexTuple) PageGetItem(page,
-												PageGetItemId(page, offno));
-				htup = &(itup->t_tid);
-				if (callback(htup, callback_state))
-				{
-					/* mark the item for deletion */
-					deletable[ndeletable++] = offno;
-					tuples_removed += 1;
-				}
-				else
-					num_index_tuples += 1;
-			}
+		/*
+		 * We need to acquire a cleanup lock on the primary bucket page to wait
+		 * out concurrent scans before deleting the dead tuples.
+		 */
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy);
+		LockBufferForCleanup(buf);
+		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
 
-			/*
-			 * Apply deletions and write page if needed, advance to next page.
-			 */
-			blkno = opaque->hasho_nextblkno;
+		page = BufferGetPage(buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
-			if (ndeletable > 0)
-			{
-				PageIndexMultiDelete(page, deletable, ndeletable);
-				_hash_wrtbuf(rel, buf);
-				bucket_dirty = true;
-			}
-			else
-				_hash_relbuf(rel, buf);
-		}
+		/*
+		 * If the bucket contains tuples that are moved by split, then we need
+		 * to delete such tuples.  We can't delete such tuples if the split
+		 * operation on bucket is not finished as those are needed by scans.
+		 */
+		if (!H_BUCKET_BEING_SPLIT(bucket_opaque) &&
+			H_NEEDS_SPLIT_CLEANUP(bucket_opaque))
+			split_cleanup = true;
+
+		bucket_buf = buf;
 
-		/* If we deleted anything, try to compact free space */
-		if (bucket_dirty)
-			_hash_squeezebucket(rel, cur_bucket, bucket_blkno,
-								info->strategy);
+		hashbucketcleanup(rel, cur_bucket, bucket_buf, blkno, info->strategy,
+						  local_metapage.hashm_maxbucket,
+						  local_metapage.hashm_highmask,
+						  local_metapage.hashm_lowmask, &tuples_removed,
+						  &num_index_tuples, split_cleanup,
+						  callback, callback_state);
 
-		/* Release bucket lock */
-		_hash_droplock(rel, bucket_blkno, HASH_EXCLUSIVE);
+		_hash_dropbuf(rel, bucket_buf);
 
 		/* Advance to next bucket */
 		cur_bucket++;
@@ -705,6 +655,210 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	return stats;
 }
 
+/*
+ * Helper function to perform deletion of index entries from a bucket.
+ *
+ * This function expects that the caller has acquired a cleanup lock on the
+ * primary bucket page, and will retrun with a write lock again held on the
+ * primary bucket page, and will return with a write lock again held on the
+ * though, because we'll release it when visiting overflow pages.
+ *
+ * It would be very bad if this function cleaned a page while some other
+ * backend was in the midst of scanning it, because hashgettuple assumes
+ * that the next valid TID will be greater than or equal to the current
+ * valid TID.  There can't be any concurrent scans in progress when we first
+ * enter this function because of the cleanup lock we hold on the primary
+ * bucket page, but as soon as we release that lock, there might be.  We
+ * handle that by conspiring to prevent those scans from passing our cleanup
+ * scan.  To do that, we lock the next page in the bucket chain before
+ * releasing the lock on the previous page.  (This type of lock chaining is
+ * not ideal, so we might want to look for a better solution at some point.)
+ *
+ * We need to retain a pin on the primary bucket to ensure that no concurrent
+ * split can start.
+ */
+void
+hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
+				  BlockNumber bucket_blkno, BufferAccessStrategy bstrategy,
+				  uint32 maxbucket, uint32 highmask, uint32 lowmask,
+				  double *tuples_removed, double *num_index_tuples,
+				  bool split_cleanup,
+				  IndexBulkDeleteCallback callback, void *callback_state)
+{
+	BlockNumber blkno;
+	Buffer		buf;
+	Bucket new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
+	bool		bucket_dirty = false;
+
+	blkno = bucket_blkno;
+	buf = bucket_buf;
+
+	if (split_cleanup)
+		new_bucket = _hash_get_newbucket_from_oldbucket(rel, cur_bucket,
+														lowmask, maxbucket);
+
+	/* Scan each page in bucket */
+	for (;;)
+	{
+		HashPageOpaque opaque;
+		OffsetNumber offno;
+		OffsetNumber maxoffno;
+		Buffer		next_buf;
+		Page		page;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+		bool		retain_pin = false;
+		bool		curr_page_dirty = false;
+
+		vacuum_delay_point();
+
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		/* Scan each tuple in page */
+		maxoffno = PageGetMaxOffsetNumber(page);
+		for (offno = FirstOffsetNumber;
+			 offno <= maxoffno;
+			 offno = OffsetNumberNext(offno))
+		{
+			ItemPointer htup;
+			IndexTuple	itup;
+			Bucket		bucket;
+			bool		kill_tuple = false;
+
+			itup = (IndexTuple) PageGetItem(page,
+											PageGetItemId(page, offno));
+			htup = &(itup->t_tid);
+
+			/*
+			 * To remove the dead tuples, we strictly want to rely on the
+			 * results of the callback function.  Refer to btvacuumpage for details.
+			 */
+			if (callback && callback(htup, callback_state))
+			{
+				kill_tuple = true;
+				if (tuples_removed)
+					*tuples_removed += 1;
+			}
+			else if (split_cleanup)
+			{
+				/* delete the tuples that are moved by split. */
+				bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+											  maxbucket,
+											  highmask,
+											  lowmask);
+				/* mark the item for deletion */
+				if (bucket != cur_bucket)
+				{
+					/*
+					 * We expect tuples to belong either to the current bucket or
+					 * new_bucket.  This is ensured because we don't allow
+					 * further splits from bucket that contains garbage. See
+					 * comments in _hash_expandtable.
+					 */
+					Assert(bucket == new_bucket);
+					kill_tuple = true;
+				}
+			}
+
+			if (kill_tuple)
+			{
+				/* mark the item for deletion */
+				deletable[ndeletable++] = offno;
+			}
+			else
+			{
+				/* we're keeping it, so count it */
+				if (num_index_tuples)
+					*num_index_tuples += 1;
+			}
+		}
+
+		/* retain the pin on primary bucket page till end of bucket scan */
+		if (blkno == bucket_blkno)
+			retain_pin = true;
+		else
+			retain_pin = false;
+
+		blkno = opaque->hasho_nextblkno;
+
+		/*
+		 * Apply deletions, advance to next page and write page if needed.
+		 */
+		if (ndeletable > 0)
+		{
+			PageIndexMultiDelete(page, deletable, ndeletable);
+			bucket_dirty = true;
+			curr_page_dirty = true;
+		}
+
+		/* bail out if there are no more pages to scan. */
+		if (!BlockNumberIsValid(blkno))
+			break;
+
+		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											  LH_OVERFLOW_PAGE,
+											  bstrategy);
+
+		/*
+		 * release the lock on previous page after acquiring the lock on next
+		 * page
+		 */
+		if (curr_page_dirty)
+		{
+			if (retain_pin)
+				_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+			else
+				_hash_wrtbuf(rel, buf);
+			curr_page_dirty = false;
+		}
+		else if (retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
+
+		buf = next_buf;
+	}
+
+	/*
+	 * lock the bucket page to clear the garbage flag and squeeze the bucket.
+	 * Lock the bucket page to clear the garbage flag and squeeze the bucket.
+	 * If the current buffer is the same as the bucket buffer, we already have
+	 * a lock on the bucket page.
+	if (buf != bucket_buf)
+	{
+		_hash_relbuf(rel, buf);
+		_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+	}
+
+	/*
+	 * Clear the garbage flag from the bucket after deleting the tuples that
+	 * were moved by the split.  We purposefully clear the flag before squeezing
+	 * the bucket, so that after a restart, vacuum won't again try to delete
+	 * the moved-by-split tuples.
+	 */
+	if (split_cleanup)
+	{
+		HashPageOpaque bucket_opaque;
+		Page		page;
+
+		page = BufferGetPage(bucket_buf);
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
+	}
+
+	/*
+	 * If we have deleted anything, try to compact free space.  For squeezing
+	 * the bucket, we must have a cleanup lock, else it can impact the
+	 * ordering of tuples for a scan that has started before it.
+	 */
+	if (bucket_dirty && IsBufferCleanupOK(bucket_buf))
+		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
+							bstrategy);
+	else
+		_hash_chgbufaccess(rel, bucket_buf, HASH_WRITE, HASH_NOLOCK);
+}
 
 void
 hash_redo(XLogReaderState *record)
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index acd2e64..572146a 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -28,18 +28,22 @@
 void
 _hash_doinsert(Relation rel, IndexTuple itup)
 {
-	Buffer		buf;
+	Buffer		buf = InvalidBuffer;
+	Buffer		bucket_buf;
 	Buffer		metabuf;
 	HashMetaPage metap;
 	BlockNumber blkno;
-	BlockNumber oldblkno = InvalidBlockNumber;
-	bool		retry = false;
+	BlockNumber oldblkno;
+	bool		retry;
 	Page		page;
 	HashPageOpaque pageopaque;
 	Size		itemsz;
 	bool		do_expand;
 	uint32		hashkey;
 	Bucket		bucket;
+	uint32		maxbucket;
+	uint32		highmask;
+	uint32		lowmask;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -51,6 +55,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+restart_insert:
 	/* Read the metapage */
 	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
 	metap = HashPageGetMeta(BufferGetPage(metabuf));
@@ -69,6 +74,9 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 						itemsz, HashMaxItemSize((Page) metap)),
 			errhint("Values larger than a buffer page cannot be indexed.")));
 
+	oldblkno = InvalidBlockNumber;
+	retry = false;
+
 	/*
 	 * Loop until we get a lock on the correct target bucket.
 	 */
@@ -84,21 +92,32 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 
 		blkno = BUCKET_TO_BLKNO(metap, bucket);
 
+		/*
+		 * Copy bucket mapping info now; refer the comment in
+		 * _hash_expandtable where we copy this information before calling
+		 * _hash_splitbucket to see why this is okay.
+		 */
+		maxbucket = metap->hashm_maxbucket;
+		highmask = metap->hashm_highmask;
+		lowmask = metap->hashm_lowmask;
+
 		/* Release metapage lock, but keep pin. */
 		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
 
 		/*
-		 * If the previous iteration of this loop locked what is still the
-		 * correct target bucket, we are done.  Otherwise, drop any old lock
-		 * and lock what now appears to be the correct bucket.
+		 * If the previous iteration of this loop locked the primary page of
+		 * what is still the correct target bucket, we are done.  Otherwise,
+		 * drop any old lock before acquiring the new one.
 		 */
 		if (retry)
 		{
 			if (oldblkno == blkno)
 				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
+			_hash_relbuf(rel, buf);
 		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+
+		/* Fetch and lock the primary bucket page for the target bucket */
+		buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
 
 		/*
 		 * Reacquire metapage lock and check that no bucket split has taken
@@ -109,12 +128,36 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		retry = true;
 	}
 
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+	/* remember the primary bucket buffer to release the pin on it at end. */
+	bucket_buf = buf;
+
 	page = BufferGetPage(buf);
 	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(pageopaque->hasho_bucket == bucket);
 
+	/*
+	 * If this bucket is in the process of being split, try to finish the
+	 * split before inserting, because that might create room for the
+	 * insertion to proceed without allocating an additional overflow page.
+	 * It's only interesting to finish the split if we're trying to insert
+	 * into the bucket from which we're removing tuples (the "old" bucket),
+	 * not if we're trying to insert into the bucket into which tuples are
+	 * being moved (the "new" bucket).
+	 */
+	if (H_BUCKET_BEING_SPLIT(pageopaque) && IsBufferCleanupOK(buf))
+	{
+		/* release the lock on bucket buffer, before completing the split. */
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+
+		_hash_finish_split(rel, metabuf, buf, pageopaque->hasho_bucket,
+						   maxbucket, highmask, lowmask);
+
+		/* release the pin on old and meta buffer.  retry for insert. */
+		_hash_dropbuf(rel, buf);
+		_hash_dropbuf(rel, metabuf);
+		goto restart_insert;
+	}
+
 	/* Do the insertion */
 	while (PageGetFreeSpace(page) < itemsz)
 	{
@@ -127,9 +170,15 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		{
 			/*
 			 * ovfl page exists; go get it.  if it doesn't have room, we'll
-			 * find out next pass through the loop test above.
+			 * find out next pass through the loop test above.  we always
+			 * release both the lock and pin if this is an overflow page, but
+			 * only the lock if this is the primary bucket page, since the pin
+			 * on the primary bucket must be retained throughout the scan.
 			 */
-			_hash_relbuf(rel, buf);
+			if (buf != bucket_buf)
+				_hash_relbuf(rel, buf);
+			else
+				_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 			buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 			page = BufferGetPage(buf);
 		}
@@ -144,7 +193,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 
 			/* chain to a new overflow page */
-			buf = _hash_addovflpage(rel, metabuf, buf);
+			buf = _hash_addovflpage(rel, metabuf, buf, (buf == bucket_buf) ? true : false);
 			page = BufferGetPage(buf);
 
 			/* should fit now, given test above */
@@ -158,11 +207,14 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	/* found page with enough space, so add the item here */
 	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
 
-	/* write and release the modified page */
+	/*
+	 * write and release the modified page.  if the page we modified was an
+	 * overflow page, we also need to separately drop the pin we retained on
+	 * the primary bucket page.
+	 */
 	_hash_wrtbuf(rel, buf);
-
-	/* We can drop the bucket lock now */
-	_hash_droplock(rel, blkno, HASH_SHARE);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
 
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index df7af3e..e2d208e 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -82,23 +82,20 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *
  *	On entry, the caller must hold a pin but no lock on 'buf'.  The pin is
  *	dropped before exiting (we assume the caller is not interested in 'buf'
- *	anymore).  The returned overflow page will be pinned and write-locked;
- *	it is guaranteed to be empty.
+ *	anymore) if not asked to retain.  The pin will be retained only for the
+ *	primary bucket.  The returned overflow page will be pinned and
+ *	write-locked; it is guaranteed to be empty.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
  *
- *	The caller must hold at least share lock on the bucket, to ensure that
- *	no one else tries to compact the bucket meanwhile.  This guarantees that
- *	'buf' won't stop being part of the bucket while it's unlocked.
- *
  * NB: since this could be executed concurrently by multiple processes,
  * one should not assume that the returned overflow page will be the
  * immediate successor of the originally passed 'buf'.  Additional overflow
  * pages might have been added to the bucket chain in between.
  */
 Buffer
-_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
+_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 {
 	Buffer		ovflbuf;
 	Page		page;
@@ -131,7 +128,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 			break;
 
 		/* we assume we do not need to write the unmodified page */
-		_hash_relbuf(rel, buf);
+		if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
 
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
@@ -149,7 +149,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
 
 	/* logically chain overflow page to previous page */
 	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	_hash_wrtbuf(rel, buf);
+	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+		_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_wrtbuf(rel, buf);
 
 	return ovflbuf;
 }
@@ -369,21 +372,25 @@ _hash_firstfreebit(uint32 map)
  *	Returns the block number of the page that followed the given page
  *	in the bucket, or InvalidBlockNumber if no following page.
  *
- *	NB: caller must not hold lock on metapage, nor on either page that's
- *	adjacent in the bucket chain.  The caller had better hold exclusive lock
- *	on the bucket, too.
+ *	NB: caller must not hold a lock on the metapage, nor on the page that is
+ *	next to ovflbuf in the bucket chain.  We don't acquire the lock on the
+ *	page prior to ovflbuf in the chain if it is the same as wbuf, because the
+ *	caller already holds a lock on it.  This function releases the lock on
+ *	wbuf; the caller is responsible for releasing the pin on it.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf,
-				   BufferAccessStrategy bstrategy)
+_hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
+				   bool wbuf_dirty, BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
 	Buffer		metabuf;
 	Buffer		mapbuf;
+	Buffer		prevbuf = InvalidBuffer;
 	BlockNumber ovflblkno;
 	BlockNumber prevblkno;
 	BlockNumber blkno;
 	BlockNumber nextblkno;
+	BlockNumber writeblkno;
 	HashPageOpaque ovflopaque;
 	Page		ovflpage;
 	Page		mappage;
@@ -400,6 +407,7 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf,
 	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
 	nextblkno = ovflopaque->hasho_nextblkno;
 	prevblkno = ovflopaque->hasho_prevblkno;
+	writeblkno = BufferGetBlockNumber(wbuf);
 	bucket = ovflopaque->hasho_bucket;
 
 	/*
@@ -413,23 +421,39 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf,
 	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
-	 * deleted.  No concurrency issues since we hold exclusive lock on the
-	 * entire bucket.
+	 * deleted.  Concurrency issues are avoided by using lock chaining as
+	 * described atop hashbucketcleanup.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
-														 prevblkno,
-														 HASH_WRITE,
+		Page		prevpage;
+		HashPageOpaque prevopaque;
+
+		if (prevblkno == writeblkno)
+			prevbuf = wbuf;
+		else
+			prevbuf = _hash_getbuf_with_strategy(rel,
+												 prevblkno,
+												 HASH_WRITE,
 										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		prevpage = BufferGetPage(prevbuf);
-		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+												 bstrategy);
+
+		prevpage = BufferGetPage(prevbuf);
+		prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
 
 		Assert(prevopaque->hasho_bucket == bucket);
 		prevopaque->hasho_nextblkno = nextblkno;
-		_hash_wrtbuf(rel, prevbuf);
+
+		if (prevblkno != writeblkno)
+			_hash_wrtbuf(rel, prevbuf);
 	}
+
+	/* write and unlock the write buffer */
+	if (wbuf_dirty)
+		_hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
+
 	if (BlockNumberIsValid(nextblkno))
 	{
 		Buffer		nextbuf = _hash_getbuf_with_strategy(rel,
@@ -570,8 +594,15 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
  *	required that to be true on entry as well, but it's a lot easier for
  *	callers to leave empty overflow pages and let this guy clean it up.
  *
- *	Caller must hold exclusive lock on the target bucket.  This allows
- *	us to safely lock multiple pages in the bucket.
+ *	Caller must acquire a cleanup lock on the primary page of the target
+ *	bucket to exclude any scans that are in progress, which could easily
+ *	be confused into returning the same tuple more than once, or some tuples
+ *	not at all, by the rearrangement we are performing here.  To prevent
+ *	any concurrent scan from crossing the squeeze scan we use lock chaining
+ *	similar to hashbucketcleanup.  Refer to the comments atop
+ *	hashbucketcleanup.
+ *
+ *	We need to retain a pin on the primary bucket page to ensure that no
+ *	concurrent split can start.
  *
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
@@ -580,6 +611,7 @@ void
 _hash_squeezebucket(Relation rel,
 					Bucket bucket,
 					BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy)
 {
 	BlockNumber wblkno;
@@ -593,23 +625,20 @@ _hash_squeezebucket(Relation rel,
 	bool		wbuf_dirty;
 
 	/*
-	 * start squeezing into the base bucket page.
+	 * start squeezing into the primary bucket page.
 	 */
 	wblkno = bucket_blkno;
-	wbuf = _hash_getbuf_with_strategy(rel,
-									  wblkno,
-									  HASH_WRITE,
-									  LH_BUCKET_PAGE,
-									  bstrategy);
+	wbuf = bucket_buf;
 	wpage = BufferGetPage(wbuf);
 	wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 
 	/*
-	 * if there aren't any overflow pages, there's nothing to squeeze.
+	 * if there aren't any overflow pages, there's nothing to squeeze.  The
+	 * caller is responsible for releasing the pin on the primary bucket page.
 	 */
 	if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
 	{
-		_hash_relbuf(rel, wbuf);
+		_hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK);
 		return;
 	}
 
@@ -646,6 +675,7 @@ _hash_squeezebucket(Relation rel,
 		OffsetNumber maxroffnum;
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable = 0;
+		bool		retain_pin = false;
 
 		/* Scan each tuple in "read" page */
 		maxroffnum = PageGetMaxOffsetNumber(rpage);
@@ -671,13 +701,37 @@ _hash_squeezebucket(Relation rel,
 			 */
 			while (PageGetFreeSpace(wpage) < itemsz)
 			{
+				Buffer		next_wbuf = InvalidBuffer;
+
 				Assert(!PageIsEmpty(wpage));
 
+				if (wblkno == bucket_blkno)
+					retain_pin = true;
+
 				wblkno = wopaque->hasho_nextblkno;
 				Assert(BlockNumberIsValid(wblkno));
 
+				/* don't need to move to next page if we reached the read page */
+				if (wblkno != rblkno)
+					next_wbuf = _hash_getbuf_with_strategy(rel,
+														   wblkno,
+														   HASH_WRITE,
+														   LH_OVERFLOW_PAGE,
+														   bstrategy);
+
+				/*
+				 * release the lock on previous page after acquiring the lock
+				 * on next page
+				 */
 				if (wbuf_dirty)
-					_hash_wrtbuf(rel, wbuf);
+				{
+					if (retain_pin)
+						_hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK);
+					else
+						_hash_wrtbuf(rel, wbuf);
+				}
+				else if (retain_pin)
+					_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
 				else
 					_hash_relbuf(rel, wbuf);
 
@@ -695,15 +749,12 @@ _hash_squeezebucket(Relation rel,
 					return;
 				}
 
-				wbuf = _hash_getbuf_with_strategy(rel,
-												  wblkno,
-												  HASH_WRITE,
-												  LH_OVERFLOW_PAGE,
-												  bstrategy);
+				wbuf = next_wbuf;
 				wpage = BufferGetPage(wbuf);
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
 				wbuf_dirty = false;
+				retain_pin = false;
 			}
 
 			/*
@@ -728,28 +779,29 @@ _hash_squeezebucket(Relation rel,
 		 * Tricky point here: if our read and write pages are adjacent in the
 		 * bucket chain, our write lock on wbuf will conflict with
 		 * _hash_freeovflpage's attempt to update the sibling links of the
-		 * removed page.  However, in that case we are done anyway, so we can
-		 * simply drop the write lock before calling _hash_freeovflpage.
+		 * removed page.  In that case, we don't need to lock it again; we
+		 * always release the lock on wbuf in _hash_freeovflpage and then
+		 * retake it here.  This not only simplifies the code, but is also
+		 * required to log the changes atomically, which will be helpful once
+		 * we write WAL for hash indexes.
 		 */
 		rblkno = ropaque->hasho_prevblkno;
 		Assert(BlockNumberIsValid(rblkno));
 
+		/* free this overflow page (releases rbuf) */
+		_hash_freeovflpage(rel, rbuf, wbuf, wbuf_dirty, bstrategy);
+
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
 		{
-			/* yes, so release wbuf lock first */
-			if (wbuf_dirty)
-				_hash_wrtbuf(rel, wbuf);
-			else
-				_hash_relbuf(rel, wbuf);
-			/* free this overflow page (releases rbuf) */
-			_hash_freeovflpage(rel, rbuf, bstrategy);
-			/* done */
+			/* retain the pin on primary bucket page till end of bucket scan */
+			if (wblkno != bucket_blkno)
+				_hash_dropbuf(rel, wbuf);
 			return;
 		}
 
-		/* free this overflow page, then get the previous one */
-		_hash_freeovflpage(rel, rbuf, bstrategy);
+		/* lock the overflow page being written, then get the previous one */
+		_hash_chgbufaccess(rel, wbuf, HASH_NOLOCK, HASH_WRITE);
 
 		rbuf = _hash_getbuf_with_strategy(rel,
 										  rblkno,
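The squeeze and free-overflow-page paths above rely on hand-over-hand locking ("lock chaining"): the lock on the next page in the chain is taken before the lock on the current page is released, so a concurrent scan can never overtake the cleanup between two pages.  Here is a minimal self-contained sketch of that pattern on a toy linked list guarded by pthread mutexes; node_t and walk_chain() are invented for the sketch and are not the PostgreSQL buffer-lock API:

/*
 * Hand-over-hand ("lock chaining") traversal of a toy chain.  The lock
 * on the next node is acquired before the lock on the current node is
 * released, mirroring how the squeeze code steps along the bucket chain.
 */
#include <pthread.h>
#include <stdio.h>

typedef struct node_t
{
	pthread_mutex_t lock;		/* plays the role of the page lock */
	int			value;			/* stand-in for page contents */
	struct node_t *next;		/* stand-in for hasho_nextblkno */
} node_t;

static void
walk_chain(node_t *head)
{
	node_t	   *cur = head;

	pthread_mutex_lock(&cur->lock);
	while (cur != NULL)
	{
		node_t	   *next = cur->next;

		printf("visiting %d\n", cur->value);

		/* lock the next page *before* releasing the current one */
		if (next != NULL)
			pthread_mutex_lock(&next->lock);
		pthread_mutex_unlock(&cur->lock);
		cur = next;
	}
}

int
main(void)
{
	node_t		nodes[3];
	int			i;

	for (i = 0; i < 3; i++)
	{
		pthread_mutex_init(&nodes[i].lock, NULL);
		nodes[i].value = i + 1;
		nodes[i].next = (i < 2) ? &nodes[i + 1] : NULL;
	}

	walk_chain(&nodes[0]);

	for (i = 0; i < 3; i++)
		pthread_mutex_destroy(&nodes[i].lock);
	return 0;
}
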
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index a5e9d17..ee7cbba 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -38,10 +38,14 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
 					uint32 nblocks);
 static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
+static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
+					   Bucket obucket, Bucket nbucket, Buffer obuf,
+					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
+					   uint32 highmask, uint32 lowmask);
 
 
 /*
@@ -55,46 +59,6 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 
 
 /*
- * _hash_getlock() -- Acquire an lmgr lock.
- *
- * 'whichlock' should the block number of a bucket's primary bucket page to
- * acquire the per-bucket lock.  (See README for details of the use of these
- * locks.)
- *
- * 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
- */
-void
-_hash_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		LockPage(rel, whichlock, access);
-}
-
-/*
- * _hash_try_getlock() -- Acquire an lmgr lock, but only if it's free.
- *
- * Same as above except we return FALSE without blocking if lock isn't free.
- */
-bool
-_hash_try_getlock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		return ConditionalLockPage(rel, whichlock, access);
-	else
-		return true;
-}
-
-/*
- * _hash_droplock() -- Release an lmgr lock.
- */
-void
-_hash_droplock(Relation rel, BlockNumber whichlock, int access)
-{
-	if (USELOCKING(rel))
-		UnlockPage(rel, whichlock, access);
-}
-
-/*
  *	_hash_getbuf() -- Get a buffer by block number for read or write.
  *
  *		'access' must be HASH_READ, HASH_WRITE, or HASH_NOLOCK.
@@ -132,6 +96,35 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access, int flags)
 }
 
 /*
+ * _hash_getbuf_with_condlock_cleanup() -- Try to get a buffer for cleanup.
+ *
+ *		We read the page and try to acquire a cleanup lock.  If we get it,
+ *		we return the buffer; otherwise, we return InvalidBuffer.
+ */
+Buffer
+_hash_getbuf_with_condlock_cleanup(Relation rel, BlockNumber blkno, int flags)
+{
+	Buffer		buf;
+
+	if (blkno == P_NEW)
+		elog(ERROR, "hash AM does not use P_NEW");
+
+	buf = ReadBuffer(rel, blkno);
+
+	if (!ConditionalLockBufferForCleanup(buf))
+	{
+		ReleaseBuffer(buf);
+		return InvalidBuffer;
+	}
+
+	/* ref count and lock type are correct */
+
+	_hash_checkpage(rel, buf, flags);
+
+	return buf;
+}
+
+/*
  *	_hash_getinitbuf() -- Get and initialize a buffer by block number.
  *
  *		This must be used only to fetch pages that are known to be before
@@ -266,6 +259,37 @@ _hash_dropbuf(Relation rel, Buffer buf)
 }
 
 /*
+ *	_hash_dropscanbuf() -- release buffers used in scan.
+ *
+ * This routine unpins the buffers used during scan on which we
+ * hold no lock.
+ */
+void
+_hash_dropscanbuf(Relation rel, HashScanOpaque so)
+{
+	/* release pin we hold on primary bucket page */
+	if (BufferIsValid(so->hashso_bucket_buf) &&
+		so->hashso_bucket_buf != so->hashso_curbuf)
+		_hash_dropbuf(rel, so->hashso_bucket_buf);
+	so->hashso_bucket_buf = InvalidBuffer;
+
+	/* release pin we hold on primary bucket page of bucket being split */
+	if (BufferIsValid(so->hashso_split_bucket_buf) &&
+		so->hashso_split_bucket_buf != so->hashso_curbuf)
+		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
+	so->hashso_split_bucket_buf = InvalidBuffer;
+
+	/* release any pin we still hold */
+	if (BufferIsValid(so->hashso_curbuf))
+		_hash_dropbuf(rel, so->hashso_curbuf);
+	so->hashso_curbuf = InvalidBuffer;
+
+	/* reset split scan */
+	so->hashso_buc_populated = false;
+	so->hashso_buc_split = false;
+}
+
+/*
  *	_hash_wrtbuf() -- write a hash page to disk.
  *
  *		This routine releases the lock held on the buffer and our refcount
@@ -489,9 +513,11 @@ _hash_pageinit(Page page, Size size)
 /*
  * Attempt to expand the hash table by creating one new bucket.
  *
- * This will silently do nothing if it cannot get the needed locks.
+ * This will silently do nothing if we cannot get a cleanup lock on the old
+ * or the new bucket.
  *
- * The caller should hold no locks on the hash index.
+ * It also completes any pending split and removes tuples left over in the
+ * old bucket from a previous split.
  *
  * The caller must hold a pin, but no lock, on the metapage buffer.
  * The buffer is returned in the same state.
@@ -506,10 +532,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	BlockNumber start_oblkno;
 	BlockNumber start_nblkno;
 	Buffer		buf_nblkno;
+	Buffer		buf_oblkno;
+	Page		opage;
+	HashPageOpaque oopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
 
+restart_expand:
+
 	/*
 	 * Write-lock the meta page.  It used to be necessary to acquire a
 	 * heavyweight lock to begin a split, but that is no longer required.
@@ -548,11 +579,16 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		goto fail;
 
 	/*
-	 * Determine which bucket is to be split, and attempt to lock the old
-	 * bucket.  If we can't get the lock, give up.
+	 * Determine which bucket is to be split, and attempt to take cleanup lock
+	 * on the old bucket.  If we can't get the lock, give up.
+	 *
+	 * The cleanup lock protects us not only against other backends, but
+	 * against our own backend as well.
 	 *
-	 * The lock protects us against other backends, but not against our own
-	 * backend.  Must check for active scans separately.
+	 * The cleanup lock is mainly to protect the split from concurrent
+	 * inserts.  See src/backend/access/hash/README, Lock Definitions, for
+	 * further details.  Due to this locking restriction, if there is any
+	 * pending scan, the split will give up, which is unfortunate but
+	 * harmless.
 	 */
 	new_bucket = metap->hashm_maxbucket + 1;
 
@@ -560,14 +596,78 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 
 	start_oblkno = BUCKET_TO_BLKNO(metap, old_bucket);
 
-	if (_hash_has_active_scan(rel, old_bucket))
+	buf_oblkno = _hash_getbuf_with_condlock_cleanup(rel, start_oblkno, LH_BUCKET_PAGE);
+	if (!buf_oblkno)
 		goto fail;
 
-	if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
-		goto fail;
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	/*
+	 * We want to finish any pending split from the old bucket before starting
+	 * a new one: there is no apparent benefit in deferring it, and the code
+	 * would become complicated if it had to finish splits involving multiple
+	 * buckets, considering the case where the new split also fails.  We don't
+	 * need to consider the new bucket for completing the split here, as a
+	 * re-split of the new bucket cannot start while a split from the old
+	 * bucket is still pending.
+	 */
+	if (H_BUCKET_BEING_SPLIT(oopaque))
+	{
+		/*
+		 * Copy bucket mapping info now; refer the comment in code below where
+		 * we copy this information before calling _hash_splitbucket to see
+		 * why this is okay.
+		 */
+		maxbucket = metap->hashm_maxbucket;
+		highmask = metap->hashm_highmask;
+		lowmask = metap->hashm_lowmask;
+
+		/*
+		 * Release the lock on metapage and old_bucket, before completing the
+		 * split.
+		 */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		_hash_chgbufaccess(rel, buf_oblkno, HASH_READ, HASH_NOLOCK);
+
+		_hash_finish_split(rel, metabuf, buf_oblkno, old_bucket, maxbucket,
+						   highmask, lowmask);
+
+		/* release the pin on old buffer and retry for expand. */
+		_hash_dropbuf(rel, buf_oblkno);
+
+		goto restart_expand;
+	}
 
 	/*
-	 * Likewise lock the new bucket (should never fail).
+	 * Clean up the tuples remaining from the previous split.  This operation
+	 * requires a cleanup lock, and we already have one on the old bucket, so
+	 * let's do it.  We also don't want to allow further splits from this
+	 * bucket till the garbage of the previous split is cleaned.  This has two
+	 * advantages: first, it helps avoid bloat due to the garbage; second,
+	 * during cleanup of the bucket we can be sure that the garbage tuples
+	 * belong to the most recently split bucket.  If, on the contrary, we
+	 * allowed cleanup of the bucket after the meta page is updated to
+	 * indicate the new split but before the actual split, the cleanup
+	 * operation couldn't decide whether a tuple had been moved to the newly
+	 * created bucket and might end up deleting such tuples.
+	 */
+	if (H_NEEDS_SPLIT_CLEANUP(oopaque))
+	{
+		/* Release the metapage lock. */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+		hashbucketcleanup(rel, old_bucket, buf_oblkno, start_oblkno, NULL,
+						  metap->hashm_maxbucket, metap->hashm_highmask,
+						  metap->hashm_lowmask, NULL,
+						  NULL, true, NULL, NULL);
+
+		_hash_dropbuf(rel, buf_oblkno);
+
+		goto restart_expand;
+	}
+
+	/*
+	 * There shouldn't be any active scan on the new bucket.
 	 *
 	 * Note: it is safe to compute the new bucket's blkno here, even though we
 	 * may still need to update the BUCKET_TO_BLKNO mapping.  This is because
@@ -576,12 +676,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	 */
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
-	if (_hash_has_active_scan(rel, new_bucket))
-		elog(ERROR, "scan in progress on supposedly new bucket");
-
-	if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
-		elog(ERROR, "could not get lock on supposedly new bucket");
-
 	/*
 	 * If the split point is increasing (hashm_maxbucket's log base 2
 	 * increases), we need to allocate a new batch of bucket pages.
@@ -600,8 +694,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
 			/* can't split due to BlockNumber overflow */
-			_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-			_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
+			_hash_relbuf(rel, buf_oblkno);
 			goto fail;
 		}
 	}
@@ -609,9 +702,18 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/*
 	 * Physically allocate the new bucket's primary page.  We want to do this
 	 * before changing the metapage's mapping info, in case we can't get the
-	 * disk space.
+	 * disk space.  Ideally, we wouldn't need to check for a cleanup lock on
+	 * the new bucket, as no other backend can find this bucket until the meta
+	 * page is updated.  However, it is good to be consistent with the old
+	 * bucket's locking.
 	 */
 	buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM);
+	if (!IsBufferCleanupOK(buf_nblkno))
+	{
+		_hash_relbuf(rel, buf_oblkno);
+		_hash_relbuf(rel, buf_nblkno);
+		goto fail;
+	}
+
 
 	/*
 	 * Okay to proceed with split.  Update the metapage bucket mapping info.
@@ -665,13 +767,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  start_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno,
 					  maxbucket, highmask, lowmask);
 
-	/* Release bucket locks, allowing others to access them */
-	_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
-	_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
-
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -738,13 +836,17 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
  * belong in the new bucket, and compress out any free space in the old
  * bucket.
  *
- * The caller must hold exclusive locks on both buckets to ensure that
+ * The caller must hold cleanup locks on both buckets to ensure that
  * no one else is trying to access them (see README).
  *
  * The caller must hold a pin, but no lock, on the metapage buffer.
  * The buffer is returned in the same state.  (The metapage is only
  * touched if it becomes necessary to add or remove overflow pages.)
  *
+ * Split needs to retain pins on the primary bucket pages of both the old and
+ * new buckets till the end of the operation.  This is to prevent vacuum from
+ * starting while a split is in progress.
+ *
  * In addition, the caller must have created the new bucket's base page,
  * which is passed in buffer nbuf, pinned and write-locked.  That lock and
  * pin are released here.  (The API is set up this way because we must do
@@ -756,37 +858,86 @@ _hash_splitbucket(Relation rel,
 				  Buffer metabuf,
 				  Bucket obucket,
 				  Bucket nbucket,
-				  BlockNumber start_oblkno,
+				  Buffer obuf,
 				  Buffer nbuf,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Buffer		obuf;
 	Page		opage;
 	Page		npage;
 	HashPageOpaque oopaque;
 	HashPageOpaque nopaque;
 
-	/*
-	 * It should be okay to simultaneously write-lock pages from each bucket,
-	 * since no one else can be trying to acquire buffer lock on pages of
-	 * either bucket.
-	 */
-	obuf = _hash_getbuf(rel, start_oblkno, HASH_WRITE, LH_BUCKET_PAGE);
 	opage = BufferGetPage(obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 
+	/*
+	 * Mark the old bucket to indicate that a split is in progress.  At the
+	 * end of the operation, we clear the split-in-progress flag.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT;
+
 	npage = BufferGetPage(nbuf);
 
-	/* initialize the new bucket's primary page */
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 	nopaque->hasho_prevblkno = InvalidBlockNumber;
 	nopaque->hasho_nextblkno = InvalidBlockNumber;
 	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_BEING_POPULATED;
 	nopaque->hasho_page_id = HASHO_PAGE_ID;
 
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, nbuf, NULL,
+						   maxbucket, highmask, lowmask);
+
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, obuf);
+	_hash_relbuf(rel, nbuf);
+}
+
+/*
+ * _hash_splitbucket_guts -- Helper function to perform the split operation
+ *
+ * This routine is used to partition the tuples between the old and new
+ * buckets, and also to finish incomplete split operations.  To finish a
+ * previously interrupted split, the caller needs to fill htab.  If htab is
+ * set, we skip moving tuples that already exist in htab; a NULL htab means
+ * all tuples that belong to the new bucket are moved.
+ *
+ * Caller needs to lock and unlock the old and new primary buckets.
+ */
+static void
+_hash_splitbucket_guts(Relation rel,
+					   Buffer metabuf,
+					   Bucket obucket,
+					   Bucket nbucket,
+					   Buffer obuf,
+					   Buffer nbuf,
+					   HTAB *htab,
+					   uint32 maxbucket,
+					   uint32 highmask,
+					   uint32 lowmask)
+{
+	Buffer		bucket_obuf;
+	Buffer		bucket_nbuf;
+	Page		opage;
+	Page		npage;
+	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
+
+	bucket_obuf = obuf;
+	opage = BufferGetPage(obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	bucket_nbuf = nbuf;
+	npage = BufferGetPage(nbuf);
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
 	/*
 	 * Partition the tuples in the old bucket between the old bucket and the
 	 * new bucket, advancing along the old bucket's overflow bucket chain and
@@ -798,8 +949,6 @@ _hash_splitbucket(Relation rel,
 		BlockNumber oblkno;
 		OffsetNumber ooffnum;
 		OffsetNumber omaxoffnum;
-		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
 
 		/* Scan each tuple in old page */
 		omaxoffnum = PageGetMaxOffsetNumber(opage);
@@ -810,33 +959,52 @@ _hash_splitbucket(Relation rel,
 			IndexTuple	itup;
 			Size		itemsz;
 			Bucket		bucket;
+			bool		found = false;
 
 			/* skip dead tuples */
 			if (ItemIdIsDead(PageGetItemId(opage, ooffnum)))
 				continue;
 
 			/*
-			 * Fetch the item's hash key (conveniently stored in the item) and
-			 * determine which bucket it now belongs in.
+			 * Before inserting a tuple, probe the hash table containing the
+			 * TIDs of tuples belonging to the new bucket; if we find a match,
+			 * skip that tuple.  Otherwise, fetch the item's hash key
+			 * (conveniently stored in the item) and determine which bucket it
+			 * now belongs in.
 			 */
 			itup = (IndexTuple) PageGetItem(opage,
 											PageGetItemId(opage, ooffnum));
+
+			if (htab)
+				(void) hash_search(htab, &itup->t_tid, HASH_FIND, &found);
+
+			if (found)
+				continue;
+
 			bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
 										  maxbucket, highmask, lowmask);
 
 			if (bucket == nbucket)
 			{
+				IndexTuple	new_itup;
+
+				/*
+				 * make a copy of index tuple as we have to scribble on it.
+				 */
+				new_itup = CopyIndexTuple(itup);
+
+				/*
+				 * mark the index tuple as moved by split; such tuples are
+				 * skipped by scans while a split is in progress for the
+				 * bucket.
+				 */
+				new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
+
 				/*
 				 * insert the tuple into the new bucket.  if it doesn't fit on
 				 * the current page in the new bucket, we must allocate a new
 				 * overflow page and place the tuple on that page instead.
-				 *
-				 * XXX we have a problem here if we fail to get space for a
-				 * new overflow page: we'll error out leaving the bucket split
-				 * only partially complete, meaning the index is corrupt,
-				 * since searches may fail to find entries they should find.
 				 */
-				itemsz = IndexTupleDSize(*itup);
+				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
 				if (PageGetFreeSpace(npage) < itemsz)
@@ -844,9 +1012,9 @@ _hash_splitbucket(Relation rel,
 					/* write out nbuf and drop lock, but keep pin */
 					_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
 					/* chain to a new overflow page */
-					nbuf = _hash_addovflpage(rel, metabuf, nbuf);
+					nbuf = _hash_addovflpage(rel, metabuf, nbuf, (nbuf == bucket_nbuf) ? true : false);
 					npage = BufferGetPage(nbuf);
-					/* we don't need nopaque within the loop */
+					nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 				}
 
 				/*
@@ -856,12 +1024,10 @@ _hash_splitbucket(Relation rel,
 				 * Possible future improvement: accumulate all the items for
 				 * the new page and qsort them before insertion.
 				 */
-				(void) _hash_pgaddtup(rel, nbuf, itemsz, itup);
+				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
-				/*
-				 * Mark tuple for deletion from old page.
-				 */
-				deletable[ndeletable++] = ooffnum;
+				/* be tidy */
+				pfree(new_itup);
 			}
 			else
 			{
@@ -874,15 +1040,9 @@ _hash_splitbucket(Relation rel,
 
 		oblkno = oopaque->hasho_nextblkno;
 
-		/*
-		 * Done scanning this old page.  If we moved any tuples, delete them
-		 * from the old page.
-		 */
-		if (ndeletable > 0)
-		{
-			PageIndexMultiDelete(opage, deletable, ndeletable);
-			_hash_wrtbuf(rel, obuf);
-		}
+		/* retain the pin on the old primary bucket */
+		if (obuf == bucket_obuf)
+			_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
 		else
 			_hash_relbuf(rel, obuf);
 
@@ -891,18 +1051,169 @@ _hash_splitbucket(Relation rel,
 			break;
 
 		/* Else, advance to next old page */
-		obuf = _hash_getbuf(rel, oblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
+		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
 		opage = BufferGetPage(obuf);
 		oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 	}
 
 	/*
 	 * We're at the end of the old bucket chain, so we're done partitioning
-	 * the tuples.  Before quitting, call _hash_squeezebucket to ensure the
-	 * tuples remaining in the old bucket (including the overflow pages) are
-	 * packed as tightly as possible.  The new bucket is already tight.
+	 * the tuples.  Mark the old and new buckets to indicate that the split is
+	 * finished.
+	 *
+	 * To avoid deadlocks due to the bucket locking order, first lock the old
+	 * bucket and then the new bucket.
 	 */
-	_hash_wrtbuf(rel, nbuf);
+	if (nbuf == bucket_nbuf)
+		_hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_wrtbuf(rel, nbuf);
+
+	_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+	opage = BufferGetPage(bucket_obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	_hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);
+	npage = BufferGetPage(bucket_nbuf);
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
+	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
+	nopaque->hasho_flag &= ~LH_BUCKET_BEING_POPULATED;
+
+	/*
+	 * After the split is finished, mark the old bucket to indicate that it
+	 * contains deletable tuples.  Vacuum will clear the split-cleanup flag
+	 * after deleting such tuples.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_NEEDS_SPLIT_CLEANUP;
+
+	/*
+	 * now mark the buffers dirty; we don't release the locks here, as the
+	 * caller is responsible for releasing them.
+	 */
+	MarkBufferDirty(bucket_obuf);
+	MarkBufferDirty(bucket_nbuf);
+}
+
+/*
+ *	_hash_finish_split() -- Finish the previously interrupted split operation
+ *
+ * To complete the split operation, we build a hash table of the TIDs already
+ * present in the new bucket; the split operation then uses it to skip tuples
+ * that were moved before the split was interrupted.
+ *
+ * The caller must hold a pin, but no lock, on the metapage and the old
+ * bucket's primary page buffer.  The buffers are returned in the same state.
+ * (The metapage is only touched if it becomes necessary to add or remove
+ * overflow pages.)
+ */
+void
+_hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
+				   uint32 maxbucket, uint32 highmask, uint32 lowmask)
+{
+	HASHCTL		hash_ctl;
+	HTAB	   *tidhtab;
+	Buffer		bucket_nbuf = InvalidBuffer;
+	Buffer		nbuf;
+	Page		npage;
+	BlockNumber nblkno;
+	BlockNumber bucket_nblkno;
+	HashPageOpaque npageopaque;
+	Bucket		nbucket;
+	bool		found;
+
+	/* Initialize hash tables used to track TIDs */
+	memset(&hash_ctl, 0, sizeof(hash_ctl));
+	hash_ctl.keysize = sizeof(ItemPointerData);
+	hash_ctl.entrysize = sizeof(ItemPointerData);
+	hash_ctl.hcxt = CurrentMemoryContext;
+
+	tidhtab =
+		hash_create("bucket ctids",
+					256,		/* arbitrary initial size */
+					&hash_ctl,
+					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+	bucket_nblkno = nblkno = _hash_get_newblock_from_oldbucket(rel, obucket);
+
+	/*
+	 * Scan the new bucket and build hash table of TIDs
+	 */
+	for (;;)
+	{
+		OffsetNumber noffnum;
+		OffsetNumber nmaxoffnum;
+
+		nbuf = _hash_getbuf(rel, nblkno, HASH_READ,
+							LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+
+		/* remember the primary bucket buffer to acquire cleanup lock on it. */
+		if (nblkno == bucket_nblkno)
+			bucket_nbuf = nbuf;
+
+		npage = BufferGetPage(nbuf);
+		npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
+		/* Scan each tuple in new page */
+		nmaxoffnum = PageGetMaxOffsetNumber(npage);
+		for (noffnum = FirstOffsetNumber;
+			 noffnum <= nmaxoffnum;
+			 noffnum = OffsetNumberNext(noffnum))
+		{
+			IndexTuple	itup;
+
+			/* Fetch the item's TID and insert it in hash table. */
+			itup = (IndexTuple) PageGetItem(npage,
+											PageGetItemId(npage, noffnum));
+
+			(void) hash_search(tidhtab, &itup->t_tid, HASH_ENTER, &found);
+
+			Assert(!found);
+		}
+
+		nblkno = npageopaque->hasho_nextblkno;
+
+		/*
+		 * release our lock without modifying the buffer, and make sure to
+		 * retain the pin on the primary bucket page.
+		 */
+		if (nbuf == bucket_nbuf)
+			_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, nbuf);
+
+		/* Exit loop if no more overflow pages in new bucket */
+		if (!BlockNumberIsValid(nblkno))
+			break;
+	}
+
+	/*
+	 * Conditionally get the cleanup locks on the old and new buckets to
+	 * perform the split operation.  If we don't get the cleanup locks,
+	 * silently give up; the next insertion into the old bucket will try again
+	 * to complete the split.
+	 */
+	if (!ConditionalLockBufferForCleanup(obuf))
+	{
+		hash_destroy(tidhtab);
+		return;
+	}
+	if (!ConditionalLockBufferForCleanup(bucket_nbuf))
+	{
+		_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
+		hash_destroy(tidhtab);
+		return;
+	}
+
+	npage = BufferGetPage(bucket_nbuf);
+	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nbucket = npageopaque->hasho_bucket;
+
+	_hash_splitbucket_guts(rel, metabuf, obucket,
+						   nbucket, obuf, bucket_nbuf, tidhtab,
+						   maxbucket, highmask, lowmask);
 
-	_hash_squeezebucket(rel, obucket, start_oblkno, NULL);
+	_hash_relbuf(rel, bucket_nbuf);
+	_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
+	hash_destroy(tidhtab);
 }
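The split code above drives a small flag lifecycle on the two primary bucket pages: LH_BUCKET_BEING_SPLIT on the old bucket and LH_BUCKET_BEING_POPULATED on the new one while tuples are being copied, and LH_BUCKET_NEEDS_SPLIT_CLEANUP on the old bucket once the copy is done.  The standalone sketch below only illustrates that ordering; the flag values mirror the hash.h additions in this patch, while toy_bucket, begin_split() and finish_split() are simplified stand-ins, not the real page handling:

/*
 * Standalone sketch of the split-flag lifecycle: mark both buckets at
 * the start, clear the in-progress flags at the end, and leave a
 * cleanup reminder on the old bucket for vacuum.
 */
#include <stdint.h>
#include <stdio.h>

#define LH_BUCKET_BEING_POPULATED		(1 << 4)
#define LH_BUCKET_BEING_SPLIT			(1 << 5)
#define LH_BUCKET_NEEDS_SPLIT_CLEANUP	(1 << 6)

typedef struct
{
	uint16		hasho_flag;
} toy_bucket;

static void
begin_split(toy_bucket *old_bucket, toy_bucket *new_bucket)
{
	/* both primary pages are marked before any tuple is moved */
	old_bucket->hasho_flag |= LH_BUCKET_BEING_SPLIT;
	new_bucket->hasho_flag |= LH_BUCKET_BEING_POPULATED;
}

static void
finish_split(toy_bucket *old_bucket, toy_bucket *new_bucket)
{
	/* clear the in-progress flags once all tuples are copied ... */
	old_bucket->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
	new_bucket->hasho_flag &= ~LH_BUCKET_BEING_POPULATED;

	/* ... and remember that the old bucket still holds dead copies */
	old_bucket->hasho_flag |= LH_BUCKET_NEEDS_SPLIT_CLEANUP;
}

int
main(void)
{
	toy_bucket	old_bucket = {0};
	toy_bucket	new_bucket = {0};

	begin_split(&old_bucket, &new_bucket);
	finish_split(&old_bucket, &new_bucket);

	/* prints "old bucket needs cleanup: 1" */
	printf("old bucket needs cleanup: %d\n",
		   (old_bucket.hasho_flag & LH_BUCKET_NEEDS_SPLIT_CLEANUP) != 0);
	return 0;
}
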
diff --git a/src/backend/access/hash/hashscan.c b/src/backend/access/hash/hashscan.c
deleted file mode 100644
index fe97ef2..0000000
--- a/src/backend/access/hash/hashscan.c
+++ /dev/null
@@ -1,153 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * hashscan.c
- *	  manage scans on hash tables
- *
- * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
- * Portions Copyright (c) 1994, Regents of the University of California
- *
- *
- * IDENTIFICATION
- *	  src/backend/access/hash/hashscan.c
- *
- *-------------------------------------------------------------------------
- */
-
-#include "postgres.h"
-
-#include "access/hash.h"
-#include "access/relscan.h"
-#include "utils/memutils.h"
-#include "utils/rel.h"
-#include "utils/resowner.h"
-
-
-/*
- * We track all of a backend's active scans on hash indexes using a list
- * of HashScanListData structs, which are allocated in TopMemoryContext.
- * It's okay to use a long-lived context because we rely on the ResourceOwner
- * mechanism to clean up unused entries after transaction or subtransaction
- * abort.  We can't safely keep the entries in the executor's per-query
- * context, because that might be already freed before we get a chance to
- * clean up the list.  (XXX seems like there should be a better way to
- * manage this...)
- */
-typedef struct HashScanListData
-{
-	IndexScanDesc hashsl_scan;
-	ResourceOwner hashsl_owner;
-	struct HashScanListData *hashsl_next;
-} HashScanListData;
-
-typedef HashScanListData *HashScanList;
-
-static HashScanList HashScans = NULL;
-
-
-/*
- * ReleaseResources_hash() --- clean up hash subsystem resources.
- *
- * This is here because it needs to touch this module's static var HashScans.
- */
-void
-ReleaseResources_hash(void)
-{
-	HashScanList l;
-	HashScanList prev;
-	HashScanList next;
-
-	/*
-	 * Release all HashScanList items belonging to the current ResourceOwner.
-	 * Note that we do not release the underlying IndexScanDesc; that's in
-	 * executor memory and will go away on its own (in fact quite possibly has
-	 * gone away already, so we mustn't try to touch it here).
-	 *
-	 * Note: this should be a no-op during normal query shutdown. However, in
-	 * an abort situation ExecutorEnd is not called and so there may be open
-	 * index scans to clean up.
-	 */
-	prev = NULL;
-
-	for (l = HashScans; l != NULL; l = next)
-	{
-		next = l->hashsl_next;
-		if (l->hashsl_owner == CurrentResourceOwner)
-		{
-			if (prev == NULL)
-				HashScans = next;
-			else
-				prev->hashsl_next = next;
-
-			pfree(l);
-			/* prev does not change */
-		}
-		else
-			prev = l;
-	}
-}
-
-/*
- *	_hash_regscan() -- register a new scan.
- */
-void
-_hash_regscan(IndexScanDesc scan)
-{
-	HashScanList new_el;
-
-	new_el = (HashScanList) MemoryContextAlloc(TopMemoryContext,
-											   sizeof(HashScanListData));
-	new_el->hashsl_scan = scan;
-	new_el->hashsl_owner = CurrentResourceOwner;
-	new_el->hashsl_next = HashScans;
-	HashScans = new_el;
-}
-
-/*
- *	_hash_dropscan() -- drop a scan from the scan list
- */
-void
-_hash_dropscan(IndexScanDesc scan)
-{
-	HashScanList chk,
-				last;
-
-	last = NULL;
-	for (chk = HashScans;
-		 chk != NULL && chk->hashsl_scan != scan;
-		 chk = chk->hashsl_next)
-		last = chk;
-
-	if (chk == NULL)
-		elog(ERROR, "hash scan list trashed; cannot find 0x%p", (void *) scan);
-
-	if (last == NULL)
-		HashScans = chk->hashsl_next;
-	else
-		last->hashsl_next = chk->hashsl_next;
-
-	pfree(chk);
-}
-
-/*
- * Is there an active scan in this bucket?
- */
-bool
-_hash_has_active_scan(Relation rel, Bucket bucket)
-{
-	Oid			relid = RelationGetRelid(rel);
-	HashScanList l;
-
-	for (l = HashScans; l != NULL; l = l->hashsl_next)
-	{
-		if (relid == l->hashsl_scan->indexRelation->rd_id)
-		{
-			HashScanOpaque so = (HashScanOpaque) l->hashsl_scan->opaque;
-
-			if (so->hashso_bucket_valid &&
-				so->hashso_bucket == bucket)
-				return true;
-		}
-	}
-
-	return false;
-}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..8d43b38 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -63,38 +63,94 @@ _hash_next(IndexScanDesc scan, ScanDirection dir)
 }
 
 /*
- * Advance to next page in a bucket, if any.
+ * Advance to next page in a bucket, if any.  If we are scanning the bucket
+ * being populated during a split, then after the last page of that bucket
+ * this function advances to the bucket being split.
  */
 static void
-_hash_readnext(Relation rel,
+_hash_readnext(IndexScanDesc scan,
 			   Buffer *bufp, Page *pagep, HashPageOpaque *opaquep)
 {
 	BlockNumber blkno;
+	Relation	rel = scan->indexRelation;
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	bool		block_found = false;
 
 	blkno = (*opaquep)->hasho_nextblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on the primary bucket page till the end of the scan.
+	 * See the comments in _hash_first for why the pin is retained.
+	 */
+	if (*bufp == so->hashso_bucket_buf || *bufp == so->hashso_split_bucket_buf)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
 	if (BlockNumberIsValid(blkno))
 	{
 		*bufp = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+		block_found = true;
+	}
+	else if (so->hashso_buc_populated && !so->hashso_buc_split)
+	{
+		/*
+		 * end of bucket, scan bucket being split if there was a split in
+		 * progress at the start of scan.
+		 */
+		*bufp = so->hashso_split_bucket_buf;
+
+		/*
+		 * buffer for bucket being split must be valid as we acquire the pin
+		 * on it before the start of scan and retain it till end of scan.
+		 */
+		Assert(BufferIsValid(*bufp));
+
+		_hash_chgbufaccess(rel, *bufp, HASH_NOLOCK, HASH_READ);
+
+		/*
+		 * setting hashso_buc_split to true indicates that we are scanning
+		 * bucket being split.
+		 */
+		so->hashso_buc_split = true;
+
+		block_found = true;
+	}
+
+	if (block_found)
+	{
 		*pagep = BufferGetPage(*bufp);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
 	}
 }
 
 /*
- * Advance to previous page in a bucket, if any.
+ * Advance to previous page in a bucket, if any.  If the current scan started
+ * during a split, then after the first page of the bucket being split this
+ * function advances to the bucket being populated.
  */
 static void
-_hash_readprev(Relation rel,
+_hash_readprev(IndexScanDesc scan,
 			   Buffer *bufp, Page *pagep, HashPageOpaque *opaquep)
 {
 	BlockNumber blkno;
+	Relation	rel = scan->indexRelation;
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 
 	blkno = (*opaquep)->hasho_prevblkno;
-	_hash_relbuf(rel, *bufp);
+
+	/*
+	 * Retain the pin on the primary bucket page till the end of the scan.
+	 * See the comments in _hash_first for why the pin is retained.
+	 */
+	if (*bufp == so->hashso_bucket_buf || *bufp == so->hashso_split_bucket_buf)
+		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, *bufp);
+
 	*bufp = InvalidBuffer;
 	/* check for interrupts while we're not holding any buffer lock */
 	CHECK_FOR_INTERRUPTS();
@@ -104,6 +160,41 @@ _hash_readprev(Relation rel,
 							 LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 		*pagep = BufferGetPage(*bufp);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
+
+		/*
+		 * We always maintain the pin on the bucket page for the whole scan,
+		 * so release the additional pin we have acquired here.
+		 */
+		if (*bufp == so->hashso_bucket_buf || *bufp == so->hashso_split_bucket_buf)
+			_hash_dropbuf(rel, *bufp);
+	}
+	else if (so->hashso_buc_populated && so->hashso_buc_split)
+	{
+		/*
+		 * end of bucket, scan bucket being populated if there was a split in
+		 * progress at the start of scan.
+		 */
+		*bufp = so->hashso_bucket_buf;
+
+		/*
+		 * buffer for bucket being populated must be valid as we acquire the
+		 * pin on it before the start of scan and retain it till end of scan.
+		 */
+		Assert(BufferIsValid(*bufp));
+
+		_hash_chgbufaccess(rel, *bufp, HASH_NOLOCK, HASH_READ);
+		*pagep = BufferGetPage(*bufp);
+		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
+
+		/* move to the end of bucket chain */
+		while (BlockNumberIsValid((*opaquep)->hasho_nextblkno))
+			_hash_readnext(scan, bufp, pagep, opaquep);
+
+		/*
+		 * setting hashso_buc_split to false indicates that we are scanning
+		 * bucket being populated.
+		 */
+		so->hashso_buc_split = false;
 	}
 }
 
@@ -218,9 +309,11 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 		{
 			if (oldblkno == blkno)
 				break;
-			_hash_droplock(rel, oldblkno, HASH_SHARE);
+			_hash_relbuf(rel, buf);
 		}
-		_hash_getlock(rel, blkno, HASH_SHARE);
+
+		/* Fetch the primary bucket page for the bucket */
+		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
 
 		/*
 		 * Reacquire metapage lock and check that no bucket split has taken
@@ -234,22 +327,73 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	/* done with the metapage */
 	_hash_dropbuf(rel, metabuf);
 
-	/* Update scan opaque state to show we have lock on the bucket */
-	so->hashso_bucket = bucket;
-	so->hashso_bucket_valid = true;
-	so->hashso_bucket_blkno = blkno;
-
-	/* Fetch the primary bucket page for the bucket */
-	buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
 	page = BufferGetPage(buf);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(opaque->hasho_bucket == bucket);
 
+	so->hashso_bucket_buf = buf;
+
+	/*
+	 * If a bucket split is in progress, then while scanning the bucket being
+	 * populated we need to skip tuples that were moved from the bucket being
+	 * split.  We need to maintain a pin on the bucket being split to ensure
+	 * that the split-cleanup work done by vacuum doesn't remove tuples from
+	 * it till this scan is done.  We also need to maintain a pin on the
+	 * bucket being populated to ensure that vacuum doesn't squeeze that
+	 * bucket till this scan is complete; otherwise, the ordering of tuples
+	 * can't be maintained during forward and backward scans.  Here, we have
+	 * to be cautious about locking order: first acquire the lock on the
+	 * bucket being split, release the lock on it (but not the pin), then
+	 * acquire the lock on the bucket being populated and re-verify whether
+	 * the bucket split is still in progress.  Acquiring the lock on the
+	 * bucket being split first ensures that vacuum waits for this scan to
+	 * finish.
+	 */
+	if (H_BUCKET_BEING_POPULATED(opaque))
+	{
+		BlockNumber old_blkno;
+		Buffer		old_buf;
+
+		old_blkno = _hash_get_oldblock_from_newbucket(rel, bucket);
+
+		/*
+		 * release the lock on new bucket and re-acquire it after acquiring
+		 * the lock on old bucket.
+		 */
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+
+		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+
+		/*
+		 * remember the split bucket buffer so as to use it later for
+		 * scanning.
+		 */
+		so->hashso_split_bucket_buf = old_buf;
+		_hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK);
+
+		_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+		page = BufferGetPage(buf);
+		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		Assert(opaque->hasho_bucket == bucket);
+
+		if (H_BUCKET_BEING_POPULATED(opaque))
+			so->hashso_buc_populated = true;
+		else
+		{
+			_hash_dropbuf(rel, so->hashso_split_bucket_buf);
+			so->hashso_split_bucket_buf = InvalidBuffer;
+		}
+	}
+
 	/* If a backwards scan is requested, move to the end of the chain */
 	if (ScanDirectionIsBackward(dir))
 	{
-		while (BlockNumberIsValid(opaque->hasho_nextblkno))
-			_hash_readnext(rel, &buf, &page, &opaque);
+		/*
+		 * Backward scans that start during a split need to start from the end
+		 * of the bucket being split.
+		 */
+		while (BlockNumberIsValid(opaque->hasho_nextblkno) ||
+			   (so->hashso_buc_populated && !so->hashso_buc_split))
+			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
 	/* Now find the first tuple satisfying the qualification */
@@ -273,6 +417,12 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
  *		false.  Else, return true and set the hashso_curpos for the
  *		scan to the right thing.
  *
+ *		If the scan started during a split, we must skip the tuples moved by
+ *		the split while scanning the bucket being populated, and then also
+ *		scan the bucket being split to cover all such tuples.  This ensures
+ *		that scans started during a split don't miss any tuples.
+ *
  *		'bufP' points to the current buffer, which is pinned and read-locked.
  *		On success exit, we have pin and read-lock on whichever page
  *		contains the right item; on failure, we have released all buffers.
@@ -338,6 +488,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum >= FirstOffsetNumber);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip tuples moved by the split operation if this
+						 * scan started while the split was in progress
+						 */
+						if (so->hashso_buc_populated && !so->hashso_buc_split &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberNext(offnum);	/* move forward */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -345,7 +508,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					/*
 					 * ran off the end of this page, try the next
 					 */
-					_hash_readnext(rel, &buf, &page, &opaque);
+					_hash_readnext(scan, &buf, &page, &opaque);
 					if (BufferIsValid(buf))
 					{
 						maxoff = PageGetMaxOffsetNumber(page);
@@ -353,7 +516,6 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
 						itup = NULL;
 						break;	/* exit for-loop */
 					}
@@ -379,6 +541,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					{
 						Assert(offnum <= maxoff);
 						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+						/*
+						 * skip tuples moved by the split operation if this
+						 * scan started while the split was in progress
+						 */
+						if (so->hashso_buc_populated && !so->hashso_buc_split &&
+							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+						{
+							offnum = OffsetNumberPrev(offnum);	/* move back */
+							continue;
+						}
+
 						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
 							break;		/* yes, so exit for-loop */
 					}
@@ -386,7 +561,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					/*
 					 * ran off the end of this page, try the next
 					 */
-					_hash_readprev(rel, &buf, &page, &opaque);
+					_hash_readprev(scan, &buf, &page, &opaque);
 					if (BufferIsValid(buf))
 					{
 						maxoff = PageGetMaxOffsetNumber(page);
@@ -394,7 +569,6 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					}
 					else
 					{
-						/* end of bucket */
 						itup = NULL;
 						break;	/* exit for-loop */
 					}
@@ -410,9 +584,16 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 
 		if (itup == NULL)
 		{
-			/* we ran off the end of the bucket without finding a match */
+			/*
+			 * We ran off the end of the bucket without finding a match.
+			 * Release the pins on the bucket buffers.  Normally, such pins
+			 * are released at the end of the scan; however, a scrollable
+			 * cursor can reacquire the bucket lock and pin multiple times
+			 * within the same scan.
+			 */
 			*bufP = so->hashso_curbuf = InvalidBuffer;
 			ItemPointerSetInvalid(current);
+			_hash_dropscanbuf(rel, so);
 			return false;
 		}
 
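The scan-side filtering above boils down to one test: while scanning the bucket being populated (and before switching over to the bucket being split), any tuple carrying INDEX_MOVED_BY_SPLIT_MASK is skipped, because the old bucket still holds a visible copy of it.  Here is a standalone sketch of that test; the mask value matches the hash.h addition in this patch, while toy_tuple and scan_page() are invented stand-ins for the real index-tuple and _hash_step() machinery:

/*
 * Standalone sketch of moved-by-split filtering during a scan of the
 * bucket being populated.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define INDEX_MOVED_BY_SPLIT_MASK	0x2000

typedef struct
{
	uint16		t_info;
	uint32		hashkey;
} toy_tuple;

/*
 * Count tuples on the "page" that match 'hashkey'.  While scanning the
 * bucket being populated (and not yet the bucket being split), tuples
 * flagged as moved-by-split are skipped; the scan will see them later
 * when it moves on to the old bucket.
 */
static int
scan_page(const toy_tuple *tuples, int ntuples, uint32 hashkey,
		  bool buc_populated, bool buc_split)
{
	int			nmatch = 0;
	int			i;

	for (i = 0; i < ntuples; i++)
	{
		if (buc_populated && !buc_split &&
			(tuples[i].t_info & INDEX_MOVED_BY_SPLIT_MASK))
			continue;			/* skip; old bucket still has a copy */

		if (tuples[i].hashkey == hashkey)
			nmatch++;
	}
	return nmatch;
}

int
main(void)
{
	toy_tuple	page[] = {
		{0, 42},							/* freshly inserted tuple */
		{INDEX_MOVED_BY_SPLIT_MASK, 42},	/* copied in by the split */
	};

	/* prints "matches seen in new bucket: 1" */
	printf("matches seen in new bucket: %d\n",
		   scan_page(page, 2, 42, true, false));
	return 0;
}
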
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 822862d..3819de9 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -20,6 +20,8 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 
+#define CALC_NEW_BUCKET(old_bucket, lowmask) \
+			((old_bucket) | ((lowmask) + 1))
 
 /*
  * _hash_checkqual -- does the index tuple satisfy the scan conditions?
@@ -352,3 +354,95 @@ _hash_binsearch_last(Page page, uint32 hash_value)
 
 	return lower;
 }
+
+/*
+ *	_hash_get_oldblock_from_newbucket() -- get the block number of a bucket
+ *			from which current (new) bucket is being split.
+ */
+BlockNumber
+_hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bucket)
+{
+	Bucket		old_bucket;
+	uint32		mask;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	/*
+	 * To get the old bucket from the current (new) bucket, we need a mask to
+	 * modulo it into the lower half of the table.  The meta page stores such
+	 * a mask as hashm_lowmask, but we can't rely on it here, because we need
+	 * the value of lowmask that was in effect when this bucket's split
+	 * started.  Masking off the most significant bit of the new bucket gives
+	 * us the old bucket.
+	 */
+	mask = (((uint32) 1) << (fls(new_bucket) - 1)) - 1;
+	old_bucket = new_bucket & mask;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	blkno = BUCKET_TO_BLKNO(metap, old_bucket);
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_newblock_from_oldbucket() -- get the block number of a bucket
+ *			that will be generated after split from old bucket.
+ *
+ * This is used to find the new bucket from the old bucket based on the
+ * current table half.  It is mainly required to finish incomplete splits,
+ * where we are sure that no more than one split from the old bucket can be
+ * in progress.
+ */
+BlockNumber
+_hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket)
+{
+	Bucket		new_bucket;
+	Buffer		metabuf;
+	HashMetaPage metap;
+	BlockNumber blkno;
+
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+	metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+	new_bucket = _hash_get_newbucket_from_oldbucket(rel, old_bucket,
+													metap->hashm_lowmask,
+													metap->hashm_maxbucket);
+	blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+
+	_hash_relbuf(rel, metabuf);
+
+	return blkno;
+}
+
+/*
+ *	_hash_get_newbucket_from_oldbucket() -- get the new bucket that will be
+ *			generated after split from current (old) bucket.
+ *
+ * This is used to find the new bucket from the old bucket.  The new bucket
+ * can be obtained by OR'ing the old bucket with the most significant bit of
+ * the current table half (the lowmask passed to this function identifies the
+ * msb of the current table half).  There could be multiple buckets that have
+ * split from the current bucket; we need the first such bucket that exists.
+ * The caller must ensure that no more than one split has happened from the
+ * old bucket.
+ */
+Bucket
+_hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
+								   uint32 lowmask, uint32 maxbucket)
+{
+	Bucket		new_bucket;
+
+	new_bucket = CALC_NEW_BUCKET(old_bucket, lowmask);
+	if (new_bucket > maxbucket)
+	{
+		lowmask = lowmask >> 1;
+		new_bucket = CALC_NEW_BUCKET(old_bucket, lowmask);
+	}
+
+	return new_bucket;
+}
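The bucket arithmetic in the two helpers above can be checked with a few lines of standalone C.  The formulas mirror _hash_get_oldblock_from_newbucket() and _hash_get_newbucket_from_oldbucket(); msb_mask() is a small helper invented here so that the sketch does not depend on fls(), which is not available on every platform:

/*
 * Standalone sketch of the old/new bucket arithmetic used above.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint32 Bucket;

/* mask covering every bit below the most significant bit of x (x > 0) */
static uint32
msb_mask(uint32 x)
{
	uint32		m = 1;

	while ((m << 1) <= x)
		m <<= 1;
	return m - 1;
}

/* old bucket = new bucket with its most significant bit masked off */
static Bucket
old_from_new(Bucket new_bucket)
{
	return new_bucket & msb_mask(new_bucket);
}

/* new bucket = old bucket OR'd with the MSB of the current table half */
static Bucket
new_from_old(Bucket old_bucket, uint32 lowmask, uint32 maxbucket)
{
	Bucket		new_bucket = old_bucket | (lowmask + 1);

	if (new_bucket > maxbucket)
	{
		lowmask = lowmask >> 1;
		new_bucket = old_bucket | (lowmask + 1);
	}
	return new_bucket;
}

int
main(void)
{
	/* bucket 5 (binary 101) was split from bucket 1 (binary 001) */
	printf("old_from_new(5) = %u\n", old_from_new(5));

	/* with lowmask 3 and maxbucket 5, bucket 1 re-splits into bucket 5 */
	printf("new_from_old(1, 3, 5) = %u\n", new_from_old(1, 3, 5));
	return 0;
}
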
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 07075ce..cdc460b 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -668,9 +668,6 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintFileLeakWarning(res);
 			FileClose(res);
 		}
-
-		/* Clean up index scans too */
-		ReleaseResources_hash();
 	}
 
 	/* Let add-on modules get a chance too */
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 725e2f2..6dfc41f 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -24,6 +24,7 @@
 #include "lib/stringinfo.h"
 #include "storage/bufmgr.h"
 #include "storage/lockdefs.h"
+#include "utils/hsearch.h"
 #include "utils/relcache.h"
 
 /*
@@ -32,6 +33,8 @@
  */
 typedef uint32 Bucket;
 
+#define InvalidBucket	((Bucket) 0xFFFFFFFF)
+
 #define BUCKET_TO_BLKNO(metap,B) \
 		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
 
@@ -51,6 +54,9 @@ typedef uint32 Bucket;
 #define LH_BUCKET_PAGE			(1 << 1)
 #define LH_BITMAP_PAGE			(1 << 2)
 #define LH_META_PAGE			(1 << 3)
+#define LH_BUCKET_BEING_POPULATED	(1 << 4)
+#define LH_BUCKET_BEING_SPLIT	(1 << 5)
+#define LH_BUCKET_NEEDS_SPLIT_CLEANUP	(1 << 6)
 
 typedef struct HashPageOpaqueData
 {
@@ -63,6 +69,10 @@ typedef struct HashPageOpaqueData
 
 typedef HashPageOpaqueData *HashPageOpaque;
 
+#define H_NEEDS_SPLIT_CLEANUP(opaque)	((opaque)->hasho_flag & LH_BUCKET_NEEDS_SPLIT_CLEANUP)
+#define H_BUCKET_BEING_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_BEING_SPLIT)
+#define H_BUCKET_BEING_POPULATED(opaque)	((opaque)->hasho_flag & LH_BUCKET_BEING_POPULATED)
+
 /*
  * The page ID is for the convenience of pg_filedump and similar utilities,
  * which otherwise would have a hard time telling pages of different index
@@ -80,19 +90,6 @@ typedef struct HashScanOpaqueData
 	uint32		hashso_sk_hash;
 
 	/*
-	 * By definition, a hash scan should be examining only one bucket. We
-	 * record the bucket number here as soon as it is known.
-	 */
-	Bucket		hashso_bucket;
-	bool		hashso_bucket_valid;
-
-	/*
-	 * If we have a share lock on the bucket, we record it here.  When
-	 * hashso_bucket_blkno is zero, we have no such lock.
-	 */
-	BlockNumber hashso_bucket_blkno;
-
-	/*
 	 * We also want to remember which buffer we're currently examining in the
 	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
 	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
@@ -100,11 +97,30 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_curbuf;
 
+	/* remember the buffer associated with primary bucket */
+	Buffer		hashso_bucket_buf;
+
+	/*
+	 * remember the buffer associated with primary bucket page of bucket being
+	 * split.  it is required during the scan of the bucket which is being
+	 * populated during split operation.
+	 */
+	Buffer		hashso_split_bucket_buf;
+
 	/* Current position of the scan, as an index TID */
 	ItemPointerData hashso_curpos;
 
 	/* Current position of the scan, as a heap TID */
 	ItemPointerData hashso_heappos;
+
+	/* Whether scan starts on bucket being populated due to split */
+	bool		hashso_buc_populated;
+
+	/*
+	 * Whether scanning bucket being split?  The value of this parameter is
+	 * referred only when hashso_buc_populated is true.
+	 */
+	bool		hashso_buc_split;
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -175,6 +191,8 @@ typedef HashMetaPageData *HashMetaPage;
 				  sizeof(ItemIdData) - \
 				  MAXALIGN(sizeof(HashPageOpaqueData)))
 
+#define INDEX_MOVED_BY_SPLIT_MASK	0x2000
+
 #define HASH_MIN_FILLFACTOR			10
 #define HASH_DEFAULT_FILLFACTOR		75
 
@@ -223,9 +241,6 @@ typedef HashMetaPageData *HashMetaPage;
 #define HASH_WRITE		BUFFER_LOCK_EXCLUSIVE
 #define HASH_NOLOCK		(-1)
 
-#define HASH_SHARE		ShareLock
-#define HASH_EXCLUSIVE	ExclusiveLock
-
 /*
  *	Strategy number. There's only one valid strategy for hashing: equality.
  */
@@ -297,21 +312,21 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
 
 /* hashovfl.c */
-extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf);
-extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
-				   BufferAccessStrategy bstrategy);
+extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
+extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
+				   bool wbuf_dirty, BufferAccessStrategy bstrategy);
 extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
 				 BlockNumber blkno, ForkNumber forkNum);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
+					Buffer bucket_buf,
 					BufferAccessStrategy bstrategy);
 
 /* hashpage.c */
-extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
-extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
-extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
 extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
 			 int access, int flags);
+extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
+								   BlockNumber blkno, int flags);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
@@ -320,6 +335,7 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 						   BufferAccessStrategy bstrategy);
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
+extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
 extern void _hash_wrtbuf(Relation rel, Buffer buf);
 extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
 				   int to_access);
@@ -327,12 +343,9 @@ extern uint32 _hash_metapinit(Relation rel, double num_tuples,
 				ForkNumber forkNum);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
-
-/* hashscan.c */
-extern void _hash_regscan(IndexScanDesc scan);
-extern void _hash_dropscan(IndexScanDesc scan);
-extern bool _hash_has_active_scan(Relation rel, Bucket bucket);
-extern void ReleaseResources_hash(void);
+extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
+				   Bucket obucket, uint32 maxbucket, uint32 highmask,
+				   uint32 lowmask);
 
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
@@ -362,5 +375,18 @@ extern bool _hash_convert_tuple(Relation index,
 					Datum *index_values, bool *index_isnull);
 extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
 extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
+extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bucket);
+extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
+extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
+								   uint32 lowmask, uint32 maxbucket);
+
+/* hash.c */
+extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
+				  Buffer bucket_buf, BlockNumber bucket_blkno,
+				  BufferAccessStrategy bstrategy,
+				  uint32 maxbucket, uint32 highmask, uint32 lowmask,
+				  double *tuples_removed, double *num_index_tuples,
+				  bool bucket_has_garbage,
+				  IndexBulkDeleteCallback callback, void *callback_state);
 
 #endif   /* HASH_H */
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 8350fa0..788ba9f 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -63,7 +63,7 @@ typedef IndexAttributeBitMapData *IndexAttributeBitMap;
  * t_info manipulation macros
  */
 #define INDEX_SIZE_MASK 0x1FFF
-/* bit 0x2000 is not used at present */
+/* bit 0x2000 is reserved for index-AM specific usage */
 #define INDEX_VAR_MASK	0x4000
 #define INDEX_NULL_MASK 0x8000
 
#150Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Amit Kapila (#149)
2 attachment(s)
Re: Hash Indexes

Hi All,

I have executed a few test-cases to validate the v12 patch for
concurrent hash indexes shared upthread and have found no issues. Below
are some of the test-cases I ran:

1) pgbench test on a read-write workload with the following configuration
(this was basically to validate the locking strategy, not for
performance testing)

postgresql non-default configuration:
----------------------------------------------------
min_wal_size=15GB
max_wal_size=20GB
checkpoint_timeout=900
maintenance_work_mem=1GB
checkpoint_completion_target=0.9
max_connections=200
shared_buffers=8GB

pgbench settings:
-------------------------
Scale Factor=300
run time= 30 mins
pgbench -c $thread -j $thread -T $time_for_reading -M prepared postgres

2) As the v12 patch mainly has locking changes related to bucket squeezing
in hash indexes, I ran a small test-case to build a hash index with a
good number of overflow pages and then ran a deletion operation to see
whether bucket squeezing happened. The test script
"test_squeeze_hindex.sh" used for this testing is attached to this
mail and the results are shown below:

=====Number of bucket and overflow pages before delete=====
274671 Tuples only is on.
148390
131263 bucket
17126 overflow
1 bitmap

=====Number of bucket and overflow pages after delete=====
274671 Tuples only is on.
141240
131263 bucket
9976 overflow
1 bitmap

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Wed, Nov 23, 2016 at 7:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Nov 17, 2016 at 3:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Nov 12, 2016 at 12:49 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

You are right and I have changed the code as per your suggestion.

So...

+        /*
+         * We always maintain the pin on bucket page for whole scan operation,
+         * so releasing the additional pin we have acquired here.
+         */
+        if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+            _hash_dropbuf(rel, *bufp);

This relies on the page contents to know whether we took a pin; that
seems like a bad plan. We need to know intrinsically whether we took
a pin.

Okay, changed to not rely on page contents.

+     * If the bucket split is in progress, then we need to skip tuples that
+     * are moved from old bucket.  To ensure that vacuum doesn't clean any
+     * tuples from old or new buckets till this scan is in progress, maintain
+     * a pin on both of the buckets.  Here, we have to be cautious about

It wouldn't be a problem if VACUUM removed tuples from the new bucket,
because they'd have to be dead anyway. It also wouldn't be a problem
if it removed tuples from the old bucket that were actually dead. The
real issue isn't vacuum anyway, but the process of cleaning up after a
split. We need to hold the pin so that tuples being moved from the
old bucket to the new bucket by the split don't get removed from the
old bucket until our scan is done.

Updated comments to explain clearly.
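
The pin-based protection described above boils down to a simple refcount
rule: split cleanup of the old bucket must wait until no scan other than
ours still holds a pin on it. Below is a minimal, self-contained sketch of
that rule in plain C; Buf and cleanup_ok are illustrative names, not code
from the patch.

#include <stdbool.h>
#include <stdio.h>

typedef struct
{
    int     pins;       /* our own pin plus pins held by concurrent scans */
} Buf;

static bool
cleanup_ok(const Buf *b)
{
    /* analogous to a cleanup lock: nobody else may hold a pin */
    return b->pins == 1;
}

int
main(void)
{
    Buf     bucket = {2};   /* we hold one pin, a concurrent scan holds another */

    printf("%d\n", cleanup_ok(&bucket));    /* 0: a scan still pins the bucket */
    bucket.pins = 1;                        /* the scan finished and dropped its pin */
    printf("%d\n", cleanup_ok(&bucket));    /* 1: safe to remove moved-by-split tuples */
    return 0;
}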

+ old_blkno = _hash_get_oldblock_from_newbucket(rel,
opaque->hasho_bucket);

Couldn't you pass "bucket" here instead of "opaque->hasho_bucket"? I
feel like I'm repeating this ad nauseam, but I really think it's bad
to rely on the special space instead of our own local variables!

Okay, changed as per suggestion.

-            /* we ran off the end of the bucket without finding a match */
+            /*
+             * We ran off the end of the bucket without finding a match.
+             * Release the pin on bucket buffers.  Normally, such pins are
+             * released at end of scan, however scrolling cursors can
+             * reacquire the bucket lock and pin in the same scan multiple
+             * times.
+             */
*bufP = so->hashso_curbuf = InvalidBuffer;
ItemPointerSetInvalid(current);
+            _hash_dropscanbuf(rel, so);

I think this comment is saying that we'll release the pin on the
primary bucket page for now, and then reacquire it later if the user
reverses the scan direction. But that doesn't sound very safe,
because the bucket could be split in the meantime and the order in
which tuples are returned could change. I think we want that to
remain stable within a single query execution.

As explained [1], this shouldn't be a problem.

+            _hash_readnext(rel, &buf, &page, &opaque,
+                       (opaque->hasho_flag & LH_BUCKET_PAGE) ? true : false);

Same comment: don't rely on the special space to figure this out.
Keep track. Also != 0 would be better than ? true : false.

After gluing the scans of the old and new buckets in the _hash_read* APIs,
this is no longer required.

+                            /*
+                             * setting hashso_skip_moved_tuples to false
+                             * ensures that we don't check for tuples that are
+                             * moved by split in old bucket and it also
+                             * ensures that we won't retry to scan the old
+                             * bucket once the scan for same is finished.
+                             */
+                            so->hashso_skip_moved_tuples = false;

I think you've got a big problem here. Suppose the user starts the
scan in the new bucket and runs it forward until they end up in the
old bucket. Then they turn around and run the scan backward. When
they reach the beginning of the old bucket, they're going to stop, not
move back to the new bucket, AFAICS. Oops.

_hash_first() has a related problem: a backward scan starts at the end
of the new bucket and moves backward, but it should start at the end
of the old bucket, and then when it reaches the beginning, flip to the
new bucket and move backward through that one. Otherwise, a backward
scan and a forward scan don't return tuples in opposite order, which
they should.

I think what you need to do to fix both of these problems is a more
thorough job gluing the two buckets together. I'd suggest that the
responsibility for switching between the two buckets should probably
be given to _hash_readprev() and _hash_readnext(), because every place
that needs to advance to the next or previous page cares about this.
Right now you are trying to handle it mostly in the functions that
call those functions, but that is prone to errors of omission.

Changed the APIs as per this idea and fixed the problem.
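
To make the ordering invariant above concrete, here is a toy sketch
(arrays standing in for bucket pages, arbitrary values, not PostgreSQL
code) showing that once the new and old buckets are glued into one
logical sequence, a backward scan is the exact mirror of a forward scan.

#include <stdio.h>

static const int newbkt[] = {10, 11, 12};   /* bucket N+1 (being populated) */
static const int oldbkt[] = {20, 21};       /* bucket (N+1)/2 (being split) */

static void
scan_forward(void)
{
    /* forward: new bucket first, then the old bucket */
    for (int i = 0; i < 3; i++) printf("%d ", newbkt[i]);
    for (int i = 0; i < 2; i++) printf("%d ", oldbkt[i]);
    printf("\n");
}

static void
scan_backward(void)
{
    /* backward: old bucket in reverse first, then new bucket in reverse */
    for (int i = 1; i >= 0; i--) printf("%d ", oldbkt[i]);
    for (int i = 2; i >= 0; i--) printf("%d ", newbkt[i]);
    printf("\n");
}

int
main(void)
{
    scan_forward();     /* prints: 10 11 12 20 21 */
    scan_backward();    /* prints: 21 20 12 11 10 */
    return 0;
}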

Also, I think that so->hashso_skip_moved_tuples is badly designed.
There are two separate facts you need to know: (1) whether you are
scanning a bucket that was still being populated at the start of your
scan and (2) if yes, whether you are scanning the bucket being
populated or whether you are instead scanning the corresponding "old"
bucket. You're trying to keep track of that using one Boolean, but
one Boolean only has two states and there are three possible states
here.

The updated patch uses two boolean variables to track the bucket state.
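
A minimal sketch of how two booleans cover the three states, mirroring the
meaning of hashso_buc_populated and hashso_buc_split from the patch (the
struct and function names here are illustrative only):

#include <stdbool.h>
#include <stdio.h>

typedef struct ScanBucketState
{
    bool    buc_populated;  /* did the scan start on a bucket that was being
                             * populated by a split? */
    bool    buc_split;      /* if so, are we currently on the old bucket
                             * being split (true) or the new one (false)? */
} ScanBucketState;

static const char *
describe(const ScanBucketState *s)
{
    if (!s->buc_populated)
        return "ordinary scan: no split in progress at scan start";
    return s->buc_split
        ? "split in progress: scanning the old (being-split) bucket"
        : "split in progress: scanning the new (being-populated) bucket";
}

int
main(void)
{
    ScanBucketState a = {false, false};
    ScanBucketState b = {true, false};
    ScanBucketState c = {true, true};

    printf("%s\n%s\n%s\n", describe(&a), describe(&b), describe(&c));
    return 0;
}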

+    if (H_BUCKET_BEING_SPLIT(pageopaque) && IsBufferCleanupOK(buf))
+    {
+
+        /* release the lock on bucket buffer, before completing the split. */

Extra blank line.

Removed.

+moved-by-split flag on a tuple indicates that tuple is moved from old to new
+bucket.  The concurrent scans can skip such tuples till the split operation is
+finished.  Once the tuple is marked as moved-by-split, it will remain
so forever
+but that does no harm.  We have intentionally not cleared it as that
can generate
+an additional I/O which is not necessary.

The first sentence needs to start with "the" but the second sentence shouldn't.

Changed.

It would be good to adjust this part a bit to more clearly explain
that the split-in-progress and split-cleanup flags are bucket-level
flags, while moved-by-split is a per-tuple flag. It's possible to
figure this out from what you've written, but I think it could be more
clear. Another thing that is strange is that the code uses THREE
flags, bucket-being-split, bucket-being-populated, and
needs-split-cleanup, but the README conflates the first two and uses a
different name.

Updated the patch to use bucket-being-split and bucket-being-populated to
explain the split operation in the README. I have also changed the README
to clearly indicate which are the bucket-level and which are the
tuple-level flags.

+previously-acquired content lock, but not pin and repeat the process using the

s/but not pin/but not the pin,/

Changed.

A problem is that if a split fails partway through (eg due to insufficient
-disk space) the index is left corrupt.  The probability of that could be
-made quite low if we grab a free page or two before we update the meta
-page, but the only real solution is to treat a split as a WAL-loggable,
+disk space or crash) the index is left corrupt.  The probability of that
+could be made quite low if we grab a free page or two before we update the
+meta page, but the only real solution is to treat a split as a WAL-loggable,
must-complete action.  I'm not planning to teach hash about WAL in this
-go-round.
+go-round.  However, we do try to finish the incomplete splits during insert
+and split.

I think this paragraph needs a much heavier rewrite explaining the new
incomplete split handling. It's basically wrong now. Perhaps replace
it with something like this:

--
If a split fails partway through (e.g. due to insufficient disk space
or an interrupt), the index will not be corrupted. Instead, we'll
retry the split every time a tuple is inserted into the old bucket
prior to inserting the new tuple; eventually, we should succeed. The
fact that a split is left unfinished doesn't prevent subsequent
buckets from being split, but we won't try to split the bucket again
until the prior split is finished. In other words, a bucket can be in
the middle of being split for some time, but ti can't be in the middle
of two splits at the same time.

Although we can survive a failure to split a bucket, a crash is likely
to corrupt the index, since hash indexes are not yet WAL-logged.
--

s/ti/it
Fixed the typo and used the suggested text in README.

+        Acquire cleanup lock on target bucket
+        Scan and remove tuples
+        For overflow page, first we need to lock the next page and then
+        release the lock on current bucket or overflow page
+        Ensure to have buffer content lock in exclusive mode on bucket page
+        If buffer pincount is one, then compact free space as needed
+        Release lock

I don't think this summary is particularly correct. You would never
guess from this that we lock each bucket page in turn and then go back
and try to relock the primary bucket page at the end. It's more like:

acquire cleanup lock on primary bucket page
loop:
scan and remove tuples
if this is the last bucket page, break out of loop
pin and x-lock next page
release prior lock and pin (except keep pin on primary bucket page)
if the page we have locked is not the primary bucket page:
release lock and take exclusive lock on primary bucket page
if there are no other pins on the primary bucket page:
squeeze the bucket to remove free space

Yeah, it is clear, so I have used it in README.

Come to think of it, I'm a little worried about the locking in
_hash_squeezebucket(). It seems like we drop the lock on each "write"
bucket page before taking the lock on the next one. So a concurrent
scan could get ahead of the cleanup process. That would be bad,
wouldn't it?

As discussed [2], I have changed the code to use lock-chaining during
the squeeze phase.
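
Lock chaining (hand-over-hand locking) is easiest to see outside the
buffer manager. The sketch below uses pthread mutexes on a linked list
purely to illustrate the ordering rule, acquire the lock on the next
element before releasing the current one, so a concurrent traversal can
never overtake the squeeze; it is not PostgreSQL code (compile with
-pthread).

#include <pthread.h>
#include <stdio.h>

typedef struct Node
{
    int             value;
    struct Node    *next;
    pthread_mutex_t lock;
} Node;

static Node node3 = {3, NULL,   PTHREAD_MUTEX_INITIALIZER};
static Node node2 = {2, &node3, PTHREAD_MUTEX_INITIALIZER};
static Node node1 = {1, &node2, PTHREAD_MUTEX_INITIALIZER};

static void
chained_walk(Node *head)
{
    pthread_mutex_lock(&head->lock);
    for (Node *cur = head; cur != NULL;)
    {
        Node   *next = cur->next;

        if (next != NULL)
            pthread_mutex_lock(&next->lock);    /* lock the next node first ... */
        printf("visiting %d\n", cur->value);
        pthread_mutex_unlock(&cur->lock);       /* ... then release the current one */
        cur = next;
    }
}

int
main(void)
{
    chained_walk(&node1);
    return 0;
}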

Apart from the above, I have fixed a bug in the calculation of lowmask in
_hash_get_oldblock_from_newbucket().

[1] - /messages/by-id/CAA4eK1JJDWFY0_Ezs4ZxXgnrGtTn48vFuXniOLmL7FOWX-tKNw@mail.gmail.com
[2] - /messages/by-id/CAA4eK1J+0OYWKswWYNEjrBk3LfGpGJ9iSV8bYPQ3M=-qpkMtwQ%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Attachments:

test_squeeze_hindex.sh (application/x-sh)
test_squeeze_hindex.sql (application/sql)
#151Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#149)
Re: Hash Indexes

On Wed, Nov 23, 2016 at 8:50 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ new patch ]

Committed with some further cosmetic changes. I guess I won't be very
surprised if this turns out to have a few bugs yet, but I think it's
in fairly good shape at this point.

I think it would be worth testing this code with very long overflow
chains by hacking the fill factor up to 1000 or something of that
sort, so that we get lots and lots of overflow pages before we start
splitting. I think that might find some bugs that aren't obvious
right now because most buckets get split before they even have a
single overflow bucket.

Also, the deadlock hazards that we talked about upthread should
probably be documented in the README somewhere, along with why we're
OK with accepting those hazards.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#152Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#151)
Re: Hash Indexes

On Thu, Dec 1, 2016 at 2:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Nov 23, 2016 at 8:50 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ new patch ]

Committed with some further cosmetic changes.

Thank you very much.

I think it would be worth testing this code with very long overflow
chains by hacking the fill factor up to 1000

1000 is not a valid value for fill factor. Do you intend to say 100?

or something of that

sort, so that we get lots and lots of overflow pages before we start
splitting. I think that might find some bugs that aren't obvious
right now because most buckets get split before they even have a
single overflow bucket.

Also, the deadlock hazards that we talked about upthread should
probably be documented in the README somewhere, along with why we're
OK with accepting those hazards.

That makes sense. I will send a patch along those lines.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#153Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#152)
Re: Hash Indexes

On Thu, Dec 1, 2016 at 12:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Dec 1, 2016 at 2:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Nov 23, 2016 at 8:50 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ new patch ]

Committed with some further cosmetic changes.

Thank you very much.

I think it would be worth testing this code with very long overflow
chains by hacking the fill factor up to 1000

1000 is not a valid value for fill factor. Do you intend to say 100?

No. IIUC, 100 would mean split when the average bucket contains 1
page worth of tuples. I want to split when the average bucket
contains 10 pages worth of tuples.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#154Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#153)
Re: Hash Indexes

On Thu, Dec 1, 2016 at 8:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Dec 1, 2016 at 12:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Dec 1, 2016 at 2:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Nov 23, 2016 at 8:50 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ new patch ]

Committed with some further cosmetic changes.

Thank you very much.

I think it would be worth testing this code with very long overflow
chains by hacking the fill factor up to 1000

1000 is not a valid value for fill factor. Do you intend to say 100?

No. IIUC, 100 would mean split when the average bucket contains 1
page worth of tuples.

I also think so.

I want to split when the average bucket
contains 10 pages worth of tuples.

Oh, I think what you mean to say is to hack the code to bump the fill
factor and then test it. I was confused about how a user can do that
from a SQL command.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#155Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#154)
Re: Hash Indexes

On Fri, Dec 2, 2016 at 1:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I want to split when the average bucket
contains 10 pages worth of tuples.

Oh, I think what you mean to say is to hack the code to bump the fill
factor and then test it. I was confused about how a user can do that
from a SQL command.

Yes, that's why I said "hacking the fill factor up to 1000" when I
originally mentioned it.

Actually, for hash indexes, there's no reason why we couldn't allow
fillfactor settings greater than 100, and it might be useful.
Possibly it should be the default. Not 1000, certainly, but I'm not
sure that the current value of 75 is at all optimal. The optimal
value might be 100 or 125 or 150 or something like that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#156Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#155)
Re: Hash Indexes

On Sat, Dec 3, 2016 at 12:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Dec 2, 2016 at 1:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I want to split when the average bucket
contains 10 pages worth of tuples.

Oh, I think what you mean to say is to hack the code to bump the fill
factor and then test it. I was confused about how a user can do that
from a SQL command.

Yes, that's why I said "hacking the fill factor up to 1000" when I
originally mentioned it.

Actually, for hash indexes, there's no reason why we couldn't allow
fillfactor settings greater than 100, and it might be useful.

Yeah, I agree with that, but as of now, it might be tricky to support
a different range of fill factor for just one of the index types. Another
idea could be to have an additional storage parameter like
split_bucket_length or something like that for hash indexes, which would
indicate that a split will occur after the average bucket contains
"split_bucket_length * page" worth of tuples. We do have additional
storage parameters for other types of indexes, so having one for the
hash index should not be a problem.
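
A rough sketch of what such a hypothetical split_bucket_length parameter
could mean; the function, parameter names, and numbers below are
illustrative assumptions, not an existing API:

#include <stdbool.h>
#include <stdio.h>

/*
 * Split once the average bucket holds more than split_bucket_length
 * pages' worth of tuples.
 */
static bool
needs_split(double ntuples, unsigned maxbucket,
            unsigned tuples_per_page, unsigned split_bucket_length)
{
    double  avg_tuples_per_bucket = ntuples / (maxbucket + 1);

    return avg_tuples_per_bucket >
           (double) tuples_per_page * split_bucket_length;
}

int
main(void)
{
    /* 1M tuples, 256 buckets, roughly 300 tuples per page */
    printf("%d\n", needs_split(1e6, 255, 300, 1));   /* 1: avg ~3906 > 300 */
    printf("%d\n", needs_split(1e6, 255, 300, 20));  /* 0: avg ~3906 < 6000 */
    return 0;
}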

I think this is important because a split immediately roughly doubles the
size of the hash index. We might want to change that algorithm someday,
but the above idea will prevent that in many cases.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#157Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#156)
Re: Hash Indexes

On Fri, Dec 2, 2016 at 10:54 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Dec 3, 2016 at 12:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Dec 2, 2016 at 1:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I want to split when the average bucket
contains 10 pages worth of tuples.

Oh, I think what you mean to say is to hack the code to bump the fill
factor and then test it. I was confused about how a user can do that
from a SQL command.

Yes, that's why I said "hacking the fill factor up to 1000" when I
originally mentioned it.

Actually, for hash indexes, there's no reason why we couldn't allow
fillfactor settings greater than 100, and it might be useful.

Yeah, I agree with that, but as of now, it might be tricky to support
a different range of fill factor for just one of the index types. Another
idea could be to have an additional storage parameter like
split_bucket_length or something like that for hash indexes, which would
indicate that a split will occur after the average bucket contains
"split_bucket_length * page" worth of tuples. We do have additional
storage parameters for other types of indexes, so having one for the
hash index should not be a problem.

Agreed.

I think this is important because a split immediately roughly doubles the
size of the hash index. We might want to change that algorithm someday,
but the above idea will prevent that in many cases.

Also agreed.

But the first thing is that you should probably do some testing in
that area via a quick hack to see if anything breaks in an obvious
way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#158Jeff Janes
jeff.janes@gmail.com
In reply to: Amit Kapila (#154)
2 attachment(s)
Re: Hash Indexes

On Thu, Dec 1, 2016 at 10:54 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Thu, Dec 1, 2016 at 8:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Dec 1, 2016 at 12:43 AM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

On Thu, Dec 1, 2016 at 2:18 AM, Robert Haas <robertmhaas@gmail.com>

wrote:

On Wed, Nov 23, 2016 at 8:50 AM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

[ new patch ]

Committed with some further cosmetic changes.

Thank you very much.

I think it would be worth testing this code with very long overflow
chains by hacking the fill factor up to 1000

1000 is not a valid value for fill factor. Do you intend to say 100?

No. IIUC, 100 would mean split when the average bucket contains 1
page worth of tuples.

I also think so.

I want to split when the average bucket
contains 10 pages worth of tuples.

Oh, I think what you mean to say is to hack the code to bump the fill
factor and then test it. I was confused about how a user can do that
from a SQL command.

I just occasionally insert a bunch of equal tuples, which have to be in
overflow pages no matter how much splitting happens.

I am getting vacuum errors against HEAD, after about 20 minutes or so (8
cores).

49233 XX002 2016-12-05 23:06:44.087 PST:ERROR: index "foo_index_idx"
contains unexpected zero page at block 64941
49233 XX002 2016-12-05 23:06:44.087 PST:HINT: Please REINDEX it.
49233 XX002 2016-12-05 23:06:44.087 PST:CONTEXT: automatic vacuum of
table "jjanes.public.foo"

Testing harness is attached. It includes a lot of code to test crash
recovery, but all of that stuff is turned off in this instance. No patches
need to be applied to the server to get this one to run.

With the latest HASH WAL patch applied, I get different but apparently
related errors

41993 UPDATE XX002 2016-12-05 22:28:45.333 PST:ERROR: index
"foo_index_idx" contains corrupted page at block 27602
41993 UPDATE XX002 2016-12-05 22:28:45.333 PST:HINT: Please REINDEX it.
41993 UPDATE XX002 2016-12-05 22:28:45.333 PST:STATEMENT: update foo set
count=count+1 where index=$1

Cheers,

Jeff

Attachments:

count.pl (application/octet-stream)
do_nocrash.sh (application/x-sh)
#159Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Janes (#158)
Re: Hash Indexes

On Tue, Dec 6, 2016 at 1:29 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

I just occasionally insert a bunch of equal tuples, which have to be in
overflow pages no matter how much splitting happens.

I am getting vacuum errors against HEAD, after about 20 minutes or so (8
cores).

49233 XX002 2016-12-05 23:06:44.087 PST:ERROR: index "foo_index_idx"
contains unexpected zero page at block 64941
49233 XX002 2016-12-05 23:06:44.087 PST:HINT: Please REINDEX it.
49233 XX002 2016-12-05 23:06:44.087 PST:CONTEXT: automatic vacuum of table
"jjanes.public.foo"

Thanks for the report. This can happen due to vacuum trying to access
a new page. Vacuum can do so if (a) the calculation for maxbuckets
(in the metapage) is wrong or (b) it is trying to free the overflow page
twice. Offhand, I don't see how that can happen in the code. I will
investigate further to see if there is any other reason why vacuum
can access the new page. BTW, have you run the test after commit
2f4193c3? That doesn't appear to be the cause of this problem, but
still, it is better to test after that fix. I am trying to reproduce
the issue, but in the meantime, if by any chance you have a call stack,
please share it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#160Jeff Janes
jeff.janes@gmail.com
In reply to: Amit Kapila (#159)
1 attachment(s)
Re: Hash Indexes

On Tue, Dec 6, 2016 at 4:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 6, 2016 at 1:29 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

I just occasionally insert a bunch of equal tuples, which have to be in
overflow pages no matter how much splitting happens.

I am getting vacuum errors against HEAD, after about 20 minutes or so (8
cores).

49233 XX002 2016-12-05 23:06:44.087 PST:ERROR: index "foo_index_idx"
contains unexpected zero page at block 64941
49233 XX002 2016-12-05 23:06:44.087 PST:HINT: Please REINDEX it.
49233 XX002 2016-12-05 23:06:44.087 PST:CONTEXT: automatic vacuum of

table

"jjanes.public.foo"

Thanks for the report. This can happen due to vacuum trying to access
a new page. Vacuum can do so if (a) the calculation for maxbuckets
(in the metapage) is wrong or (b) it is trying to free the overflow page
twice. Offhand, I don't see how that can happen in the code. I will
investigate further to see if there is any other reason why vacuum
can access the new page. BTW, have you run the test after commit
2f4193c3? That doesn't appear to be the cause of this problem, but
still, it is better to test after that fix. I am trying to reproduce
the issue, but in the meantime, if by any chance you have a call stack,
please share it.

It looks like I compiled the code for testing a few minutes before 2f4193c3
went in.

I've run it again with cb9dcbc1eebd8, after promoting the ERROR in
_hash_checkpage, hashutil.c:174 to a PANIC so that I can get a coredump to
feed to gdb.

This time it was an INSERT, not autovac, that got the error:

35495 INSERT XX002 2016-12-06 09:25:09.517 PST:PANIC: XX002: index
"foo_index_idx" contains unexpected zero page at block 202328
35495 INSERT XX002 2016-12-06 09:25:09.517 PST:HINT: Please REINDEX it.
35495 INSERT XX002 2016-12-06 09:25:09.517 PST:LOCATION: _hash_checkpage,
hashutil.c:174
35495 INSERT XX002 2016-12-06 09:25:09.517 PST:STATEMENT: insert into foo
(index) select $1 from generate_series(1,10000)

#0 0x0000003838c325e5 in raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0 0x0000003838c325e5 in raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x0000003838c33dc5 in abort () at abort.c:92
#2 0x00000000007d6adf in errfinish (dummy=<value optimized out>) at
elog.c:557
#3 0x0000000000498d93 in _hash_checkpage (rel=0x7f4d030906a0, buf=<value
optimized out>, flags=<value optimized out>) at hashutil.c:169
#4 0x00000000004967cf in _hash_getbuf_with_strategy (rel=0x7f4d030906a0,
blkno=<value optimized out>, access=2, flags=1, bstrategy=<value optimized
out>)
at hashpage.c:234
#5 0x0000000000493dbb in hashbucketcleanup (rel=0x7f4d030906a0,
cur_bucket=14544, bucket_buf=7801, bucket_blkno=22864, bstrategy=0x0,
maxbucket=276687,
highmask=524287, lowmask=262143, tuples_removed=0x0,
num_index_tuples=0x0, split_cleanup=1 '\001', callback=0,
callback_state=0x0) at hash.c:799
#6 0x0000000000497921 in _hash_expandtable (rel=0x7f4d030906a0,
metabuf=7961) at hashpage.c:668
#7 0x0000000000495596 in _hash_doinsert (rel=0x7f4d030906a0,
itup=0x1f426b0) at hashinsert.c:236
#8 0x0000000000494830 in hashinsert (rel=0x7f4d030906a0, values=<value
optimized out>, isnull=<value optimized out>, ht_ctid=0x7f4d03076404,
heapRel=<value optimized out>, checkUnique=<value optimized out>) at
hash.c:247
#9 0x00000000005c81bc in ExecInsertIndexTuples (slot=0x1f28940,
tupleid=0x7f4d03076404, estate=0x1f28280, noDupErr=0 '\000',
specConflict=0x0,
arbiterIndexes=0x0) at execIndexing.c:389
#10 0x00000000005e74ad in ExecInsert (node=0x1f284d0) at
nodeModifyTable.c:496
#11 ExecModifyTable (node=0x1f284d0) at nodeModifyTable.c:1511
#12 0x00000000005cc9d8 in ExecProcNode (node=0x1f284d0) at
execProcnode.c:396
#13 0x00000000005ca53a in ExecutePlan (queryDesc=0x1f21a30,
direction=<value optimized out>, count=0) at execMain.c:1567
#14 standard_ExecutorRun (queryDesc=0x1f21a30, direction=<value optimized
out>, count=0) at execMain.c:338
#15 0x00007f4d0c1a6dfb in pgss_ExecutorRun (queryDesc=0x1f21a30,
direction=ForwardScanDirection, count=0) at pg_stat_statements.c:877
#16 0x00000000006dfc8a in ProcessQuery (plan=<value optimized out>,
sourceText=0x1f21990 "insert into foo (index) select $1 from
generate_series(1,10000)",
params=0x1f219e0, dest=0xc191c0, completionTag=0x7ffe82a959d0 "") at
pquery.c:185
#17 0x00000000006dfeda in PortalRunMulti (portal=0x1e86900, isTopLevel=1
'\001', setHoldSnapshot=0 '\000', dest=0xc191c0, altdest=0xc191c0,
completionTag=0x7ffe82a959d0 "") at pquery.c:1299
#18 0x00000000006e056c in PortalRun (portal=0x1e86900,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x1eec870,
altdest=0x1eec870,
completionTag=0x7ffe82a959d0 "") at pquery.c:813
#19 0x00000000006de832 in exec_execute_message (argc=<value optimized out>,
argv=<value optimized out>, dbname=0x1e933b8 "jjanes",
username=<value optimized out>) at postgres.c:1977
#20 PostgresMain (argc=<value optimized out>, argv=<value optimized out>,
dbname=0x1e933b8 "jjanes", username=<value optimized out>) at
postgres.c:4132
#21 0x000000000067e8a4 in BackendRun (argc=<value optimized out>,
argv=<value optimized out>) at postmaster.c:4274
#22 BackendStartup (argc=<value optimized out>, argv=<value optimized out>)
at postmaster.c:3946
#23 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at
postmaster.c:1704
#24 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>)
at postmaster.c:1312
#25 0x0000000000606388 in main (argc=2, argv=0x1e68320) at main.c:228

Attached is the 'bt full' output.

Cheers,

Jeff

Attachments:

gdb.out (application/octet-stream)
#161Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Janes (#160)
1 attachment(s)
Re: Hash Indexes

On Wed, Dec 7, 2016 at 2:02 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Tue, Dec 6, 2016 at 4:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 6, 2016 at 1:29 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

I just occasionally insert a bunch of equal tuples, which have to be in
overflow pages no matter how much splitting happens.

I am getting vacuum errors against HEAD, after about 20 minutes or so (8
cores).

49233 XX002 2016-12-05 23:06:44.087 PST:ERROR: index "foo_index_idx"
contains unexpected zero page at block 64941
49233 XX002 2016-12-05 23:06:44.087 PST:HINT: Please REINDEX it.
49233 XX002 2016-12-05 23:06:44.087 PST:CONTEXT: automatic vacuum of
table
"jjanes.public.foo"

Thanks for the report. This can happen due to vacuum trying to access
a new page. Vacuum can do so if (a) the calculation for maxbuckets
(in the metapage) is wrong or (b) it is trying to free the overflow page
twice. Offhand, I don't see how that can happen in the code. I will
investigate further to see if there is any other reason why vacuum
can access the new page. BTW, have you run the test after commit
2f4193c3? That doesn't appear to be the cause of this problem, but
still, it is better to test after that fix. I am trying to reproduce
the issue, but in the meantime, if by any chance you have a call stack,
please share it.

It looks like I compiled the code for testing a few minutes before 2f4193c3
went in.

I've run it again with cb9dcbc1eebd8, after promoting the ERROR in
_hash_checkpage, hashutil.c:174 to a PANIC so that I can get a coredump to
feed to gdb.

This time it was an INSERT, not autovac, that got the error:

The reason for this and the similar error in vacuum was that, in one of
the corner cases, after freeing the overflow page and updating the link
for the previous bucket we were not marking the buffer as dirty. So,
due to concurrent activity, the buffer containing the updated links
got evicted, and later, when we tried to access the same buffer, it
brought back the old copy, which contained a link to the freed overflow
page.
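
The failure mode described above (a modified page whose dirty flag was
never set gets evicted, and the stale on-disk copy comes back on the next
read) can be boiled down to a few lines. The following is a contrived
single-slot buffer cache, not PostgreSQL code, just to make the hazard
concrete.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PAGESZ 16

static char disk[PAGESZ] = "old-link";      /* on-"disk" copy of the page */

typedef struct
{
    char    data[PAGESZ];
    bool    dirty;
    bool    valid;
} BufferSlot;

static BufferSlot slot;

static char *
read_page(void)
{
    if (!slot.valid)
    {
        memcpy(slot.data, disk, PAGESZ);    /* fetch from "disk" */
        slot.valid = true;
        slot.dirty = false;
    }
    return slot.data;
}

static void
evict(void)
{
    if (slot.dirty)
        memcpy(disk, slot.data, PAGESZ);    /* write back only if dirty */
    slot.valid = false;
}

int
main(void)
{
    strcpy(read_page(), "new-link");        /* update the page ... */
    /* ... but forget to set slot.dirty = true */
    evict();                                /* concurrent activity evicts it */
    printf("after re-read: %s\n", read_page()); /* prints the stale "old-link" */
    return 0;
}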

Apart from the above issue, Kuntal noticed an assertion
failure (Assert(bucket == new_bucket);) in hashbucketcleanup with the
same test as provided by you. The reason for that problem was that,
after deleting tuples in hashbucketcleanup, we were not marking the
buffer as dirty, due to which the old copy of the overflow page was
re-appearing and causing that failure.

After fixing the above problem, we noticed another
assertion failure (Assert(bucket == obucket);) in
_hash_splitbucket_guts. The reason for this problem was that after
the split, vacuum failed to remove tuples from the old bucket that were
moved due to the split. Now, during the next split from the same old
bucket, we don't expect the old bucket to contain tuples from the
previous split. To fix this, whenever vacuum needs to perform split
cleanup, it should update the metapage values (the masks required to
calculate the bucket number), so that it doesn't miss cleaning the tuples.
I believe this is the same assertion that Andreas reported in
another thread [1]/messages/by-id/87y3zrzbu5.fsf_-_@ansel.ydns.eu.

The next problem we encountered is that, after running the same test
for somewhat longer, inserts were failing with the error "unexpected zero
page at block ..". After some analysis, I found that the lock
chain (lock the next overflow page before releasing the previous
one) was broken in one corner case in _hash_freeovflpage, due
to which an insert got ahead of the squeeze-bucket operation and accessed
the freed overflow page before the link to it had been updated.

With the above fixes, the test ran successfully for more than a day.

Many thanks to Kuntal and Dilip for helping me in analyzing and
testing the fixes for these problems.

[1]: /messages/by-id/87y3zrzbu5.fsf_-_@ansel.ydns.eu

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix_hashindex_issues_v1.patch (application/octet-stream)
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 6806e32..a8c446c 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -523,7 +523,8 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	orig_maxbucket = metap->hashm_maxbucket;
 	orig_ntuples = metap->hashm_ntuples;
 	memcpy(&local_metapage, metap, sizeof(local_metapage));
-	_hash_relbuf(rel, metabuf);
+	/* release the lock, but keep pin */
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
 
 	/* Scan the buckets that we know exist */
 	cur_bucket = 0;
@@ -563,8 +564,21 @@ loop_top:
 		 */
 		if (!H_BUCKET_BEING_SPLIT(bucket_opaque) &&
 			H_NEEDS_SPLIT_CLEANUP(bucket_opaque))
+		{
 			split_cleanup = true;
 
+			/*
+			 * To perform split cleanup, refresh the meta page values.  It is
+			 * done to ensure that values of hashm_maxbucket, hashm_highmask
+			 * and hashm_lowmask are corresponding to latest split of the
+			 * bucket.  Otherwise, it will fail to remove tuples that are
+			 * moved by latest split.
+			 */
+			_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+			memcpy(&local_metapage, metap, sizeof(local_metapage));
+			_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		}
+
 		bucket_buf = buf;
 
 		hashbucketcleanup(rel, cur_bucket, bucket_buf, blkno, info->strategy,
@@ -581,7 +595,7 @@ loop_top:
 	}
 
 	/* Write-lock metapage and check for split since we started */
-	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_WRITE, LH_META_PAGE);
+	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
 	metap = HashPageGetMeta(BufferGetPage(metabuf));
 
 	if (cur_maxbucket != metap->hashm_maxbucket)
@@ -589,7 +603,7 @@ loop_top:
 		/* There's been a split, so process the additional bucket(s) */
 		cur_maxbucket = metap->hashm_maxbucket;
 		memcpy(&local_metapage, metap, sizeof(local_metapage));
-		_hash_relbuf(rel, metabuf);
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
 		goto loop_top;
 	}
 
@@ -689,6 +703,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		curr_page_dirty;
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -708,7 +723,8 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable = 0;
 		bool		retain_pin = false;
-		bool		curr_page_dirty = false;
+
+		curr_page_dirty = false;
 
 		vacuum_delay_point();
 
@@ -827,7 +843,10 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	 */
 	if (buf != bucket_buf)
 	{
-		_hash_relbuf(rel, buf);
+		if (curr_page_dirty)
+			_hash_wrtbuf(rel, buf);
+		else
+			_hash_relbuf(rel, buf);
 		_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
 	}
 
@@ -849,6 +868,16 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	}
 
 	/*
+	 * we need to release and reacquire the lock on bucket buffer to ensure
+	 * that standby shouldn't see an intermediate state of it.  This is mainly
+	 * required once hash indexes are WAL logged, but without that also it
+	 * helps in simplifying the code as without that we need to pass the
+	 * information of bucket buffer being dirty to _hash_squeezebucket.
+	 */
+	_hash_chgbufaccess(rel, bucket_buf, HASH_WRITE, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+
+	/*
 	 * If we have deleted anything, try to compact free space.  For squeezing
 	 * the bucket, we must have a cleanup lock, else it can impact the
 	 * ordering of tuples for a scan that has started before it.
@@ -857,7 +886,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
 							bstrategy);
 	else
-		_hash_chgbufaccess(rel, bucket_buf, HASH_WRITE, HASH_NOLOCK);
+		_hash_chgbufaccess(rel, bucket_buf, HASH_READ, HASH_NOLOCK);
 }
 
 void
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index e2d208e..cc922a9 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -369,8 +369,8 @@ _hash_firstfreebit(uint32 map)
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
  *
- *	Returns the block number of the page that followed the given page
- *	in the bucket, or InvalidBlockNumber if no following page.
+ *	Returns the buffer that followed the given wbuf in the bucket, or
+ *	InvalidBuffer if no following page.
  *
  *	NB: caller must not hold lock on metapage, nor on page, that's next to
  *	ovflbuf in the bucket chain.  We don't acquire the lock on page that's
@@ -378,7 +378,7 @@ _hash_firstfreebit(uint32 map)
  *	has a lock on same.  This function releases the lock on wbuf and caller
  *	is responsible for releasing the pin on same.
  */
-BlockNumber
+Buffer
 _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 				   bool wbuf_dirty, BufferAccessStrategy bstrategy)
 {
@@ -386,14 +386,17 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	Buffer		metabuf;
 	Buffer		mapbuf;
 	Buffer		prevbuf = InvalidBuffer;
+	Buffer		next_wbuf = InvalidBuffer;
 	BlockNumber ovflblkno;
 	BlockNumber prevblkno;
 	BlockNumber blkno;
 	BlockNumber nextblkno;
 	BlockNumber writeblkno;
 	HashPageOpaque ovflopaque;
+	HashPageOpaque wopaque;
 	Page		ovflpage;
 	Page		mappage;
+	Page		wpage;
 	uint32	   *freep;
 	uint32		ovflbitno;
 	int32		bitmappage,
@@ -446,14 +449,13 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 
 		if (prevblkno != writeblkno)
 			_hash_wrtbuf(rel, prevbuf);
+		else
+		{
+			/* ensure to mark prevbuf as dirty */
+			wbuf_dirty = true;
+		}
 	}
 
-	/* write and unlock the write buffer */
-	if (wbuf_dirty)
-		_hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK);
-	else
-		_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
-
 	if (BlockNumberIsValid(nextblkno))
 	{
 		Buffer		nextbuf = _hash_getbuf_with_strategy(rel,
@@ -469,6 +471,38 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 		_hash_wrtbuf(rel, nextbuf);
 	}
 
+	/*
+	 * To maintain lock chaining as described atop hashbucketcleanup, we need
+	 * to lock next bucket buffer in chain before releasing current.  This is
+	 * required only if the next overflow page from which to read is not same
+	 * as page to which we need to write.
+	 *
+	 * XXX Here, we are moving to next overflow page for writing without
+	 * ensuring if the previous write page is full.  This is annoying, but
+	 * should not hurt much in practice as that space will anyway be consumed
+	 * by future inserts.
+	 */
+	if (prevblkno != writeblkno)
+	{
+		wpage = BufferGetPage(wbuf);
+		wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
+		Assert(wopaque->hasho_bucket == bucket);
+		writeblkno = wopaque->hasho_nextblkno;
+
+		if (BlockNumberIsValid(writeblkno));
+		next_wbuf = _hash_getbuf_with_strategy(rel,
+											   writeblkno,
+											   HASH_WRITE,
+											   LH_OVERFLOW_PAGE,
+											   bstrategy);
+	}
+
+	/* write and unlock the write buffer */
+	if (wbuf_dirty)
+		_hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
+
 	/* Note: bstrategy is intentionally not used for metapage and bitmap */
 
 	/* Read the metapage so we can determine which bitmap page to use */
@@ -511,7 +545,7 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 		_hash_relbuf(rel, metabuf);
 	}
 
-	return nextblkno;
+	return next_wbuf;
 }
 
 
@@ -676,6 +710,7 @@ _hash_squeezebucket(Relation rel,
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable = 0;
 		bool		retain_pin = false;
+		Buffer		next_wbuf = InvalidBuffer;
 
 		/* Scan each tuple in "read" page */
 		maxroffnum = PageGetMaxOffsetNumber(rpage);
@@ -701,8 +736,6 @@ _hash_squeezebucket(Relation rel,
 			 */
 			while (PageGetFreeSpace(wpage) < itemsz)
 			{
-				Buffer		next_wbuf = InvalidBuffer;
-
 				Assert(!PageIsEmpty(wpage));
 
 				if (wblkno == bucket_blkno)
@@ -789,19 +822,29 @@ _hash_squeezebucket(Relation rel,
 		Assert(BlockNumberIsValid(rblkno));
 
 		/* free this overflow page (releases rbuf) */
-		_hash_freeovflpage(rel, rbuf, wbuf, wbuf_dirty, bstrategy);
+		next_wbuf = _hash_freeovflpage(rel, rbuf, wbuf, wbuf_dirty, bstrategy);
+
+		/* retain the pin on primary bucket page till end of bucket scan */
+		if (wblkno != bucket_blkno)
+			_hash_dropbuf(rel, wbuf);
 
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
+			return;
+
+		/* are we freeing the page adjacent to next_wbuf? */
+		if (BufferIsValid(next_wbuf) &&
+			rblkno == BufferGetBlockNumber(next_wbuf))
 		{
-			/* retain the pin on primary bucket page till end of bucket scan */
-			if (wblkno != bucket_blkno)
-				_hash_dropbuf(rel, wbuf);
+			_hash_relbuf(rel, next_wbuf);
 			return;
 		}
 
-		/* lock the overflow page being written, then get the previous one */
-		_hash_chgbufaccess(rel, wbuf, HASH_NOLOCK, HASH_WRITE);
+		wbuf = next_wbuf;
+		wblkno = BufferGetBlockNumber(wbuf);
+		wpage = BufferGetPage(wbuf);
+		wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
+		Assert(wopaque->hasho_bucket == bucket);
 
 		rbuf = _hash_getbuf_with_strategy(rel,
 										  rblkno,
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 6dfc41f..bc63719 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -313,7 +313,7 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 
 /* hashovfl.c */
 extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
-extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
+extern Buffer _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 				   bool wbuf_dirty, BufferAccessStrategy bstrategy);
 extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
 				 BlockNumber blkno, ForkNumber forkNum);
#162Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#161)
1 attachment(s)
Re: Hash Indexes

On Sun, Dec 11, 2016 at 11:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 7, 2016 at 2:02 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

With the above fixes, the test ran successfully for more than a day.

There was a small typo in the previous patch, which is fixed in the
attached version. I don't think it will impact the test results if you
have already started the test with the previous patch, but if not, it
is better to test with the attached one.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix_hashindex_issues_v2.patch (application/octet-stream)
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 6806e32..a8c446c 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -523,7 +523,8 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	orig_maxbucket = metap->hashm_maxbucket;
 	orig_ntuples = metap->hashm_ntuples;
 	memcpy(&local_metapage, metap, sizeof(local_metapage));
-	_hash_relbuf(rel, metabuf);
+	/* release the lock, but keep pin */
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
 
 	/* Scan the buckets that we know exist */
 	cur_bucket = 0;
@@ -563,8 +564,21 @@ loop_top:
 		 */
 		if (!H_BUCKET_BEING_SPLIT(bucket_opaque) &&
 			H_NEEDS_SPLIT_CLEANUP(bucket_opaque))
+		{
 			split_cleanup = true;
 
+			/*
+			 * To perform split cleanup, refresh the meta page values.  It is
+			 * done to ensure that values of hashm_maxbucket, hashm_highmask
+			 * and hashm_lowmask are corresponding to latest split of the
+			 * bucket.  Otherwise, it will fail to remove tuples that are
+			 * moved by latest split.
+			 */
+			_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+			memcpy(&local_metapage, metap, sizeof(local_metapage));
+			_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		}
+
 		bucket_buf = buf;
 
 		hashbucketcleanup(rel, cur_bucket, bucket_buf, blkno, info->strategy,
@@ -581,7 +595,7 @@ loop_top:
 	}
 
 	/* Write-lock metapage and check for split since we started */
-	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_WRITE, LH_META_PAGE);
+	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
 	metap = HashPageGetMeta(BufferGetPage(metabuf));
 
 	if (cur_maxbucket != metap->hashm_maxbucket)
@@ -589,7 +603,7 @@ loop_top:
 		/* There's been a split, so process the additional bucket(s) */
 		cur_maxbucket = metap->hashm_maxbucket;
 		memcpy(&local_metapage, metap, sizeof(local_metapage));
-		_hash_relbuf(rel, metabuf);
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
 		goto loop_top;
 	}
 
@@ -689,6 +703,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		curr_page_dirty;
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -708,7 +723,8 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable = 0;
 		bool		retain_pin = false;
-		bool		curr_page_dirty = false;
+
+		curr_page_dirty = false;
 
 		vacuum_delay_point();
 
@@ -827,7 +843,10 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	 */
 	if (buf != bucket_buf)
 	{
-		_hash_relbuf(rel, buf);
+		if (curr_page_dirty)
+			_hash_wrtbuf(rel, buf);
+		else
+			_hash_relbuf(rel, buf);
 		_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
 	}
 
@@ -849,6 +868,16 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	}
 
 	/*
+	 * we need to release and reacquire the lock on bucket buffer to ensure
+	 * that standby shouldn't see an intermediate state of it.  This is mainly
+	 * required once hash indexes are WAL logged, but without that also it
+	 * helps in simplifying the code as without that we need to pass the
+	 * information of bucket buffer being dirty to _hash_squeezebucket.
+	 */
+	_hash_chgbufaccess(rel, bucket_buf, HASH_WRITE, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+
+	/*
 	 * If we have deleted anything, try to compact free space.  For squeezing
 	 * the bucket, we must have a cleanup lock, else it can impact the
 	 * ordering of tuples for a scan that has started before it.
@@ -857,7 +886,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
 							bstrategy);
 	else
-		_hash_chgbufaccess(rel, bucket_buf, HASH_WRITE, HASH_NOLOCK);
+		_hash_chgbufaccess(rel, bucket_buf, HASH_READ, HASH_NOLOCK);
 }
 
 void
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index e2d208e..16876f2 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -369,8 +369,8 @@ _hash_firstfreebit(uint32 map)
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
  *
- *	Returns the block number of the page that followed the given page
- *	in the bucket, or InvalidBlockNumber if no following page.
+ *	Returns the buffer that followed the given wbuf in the bucket, or
+ *	InvalidBuffer if no following page.
  *
  *	NB: caller must not hold lock on metapage, nor on page, that's next to
  *	ovflbuf in the bucket chain.  We don't acquire the lock on page that's
@@ -378,7 +378,7 @@ _hash_firstfreebit(uint32 map)
  *	has a lock on same.  This function releases the lock on wbuf and caller
  *	is responsible for releasing the pin on same.
  */
-BlockNumber
+Buffer
 _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 				   bool wbuf_dirty, BufferAccessStrategy bstrategy)
 {
@@ -386,14 +386,17 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	Buffer		metabuf;
 	Buffer		mapbuf;
 	Buffer		prevbuf = InvalidBuffer;
+	Buffer		next_wbuf = InvalidBuffer;
 	BlockNumber ovflblkno;
 	BlockNumber prevblkno;
 	BlockNumber blkno;
 	BlockNumber nextblkno;
 	BlockNumber writeblkno;
 	HashPageOpaque ovflopaque;
+	HashPageOpaque wopaque;
 	Page		ovflpage;
 	Page		mappage;
+	Page		wpage;
 	uint32	   *freep;
 	uint32		ovflbitno;
 	int32		bitmappage,
@@ -446,14 +449,13 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 
 		if (prevblkno != writeblkno)
 			_hash_wrtbuf(rel, prevbuf);
+		else
+		{
+			/* ensure to mark prevbuf as dirty */
+			wbuf_dirty = true;
+		}
 	}
 
-	/* write and unlock the write buffer */
-	if (wbuf_dirty)
-		_hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK);
-	else
-		_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
-
 	if (BlockNumberIsValid(nextblkno))
 	{
 		Buffer		nextbuf = _hash_getbuf_with_strategy(rel,
@@ -469,6 +471,38 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 		_hash_wrtbuf(rel, nextbuf);
 	}
 
+	/*
+	 * To maintain lock chaining as described atop hashbucketcleanup, we need
+	 * to lock next bucket buffer in chain before releasing current.  This is
+	 * required only if the next overflow page from which to read is not same
+	 * as page to which we need to write.
+	 *
+	 * XXX Here, we are moving to next overflow page for writing without
+	 * ensuring if the previous write page is full.  This is annoying, but
+	 * should not hurt much in practice as that space will anyway be consumed
+	 * by future inserts.
+	 */
+	if (prevblkno != writeblkno)
+	{
+		wpage = BufferGetPage(wbuf);
+		wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
+		Assert(wopaque->hasho_bucket == bucket);
+		writeblkno = wopaque->hasho_nextblkno;
+		Assert(BlockNumberIsValid(writeblkno));
+
+		next_wbuf = _hash_getbuf_with_strategy(rel,
+											   writeblkno,
+											   HASH_WRITE,
+											   LH_OVERFLOW_PAGE,
+											   bstrategy);
+	}
+
+	/* write and unlock the write buffer */
+	if (wbuf_dirty)
+		_hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
+
 	/* Note: bstrategy is intentionally not used for metapage and bitmap */
 
 	/* Read the metapage so we can determine which bitmap page to use */
@@ -511,7 +545,7 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 		_hash_relbuf(rel, metabuf);
 	}
 
-	return nextblkno;
+	return next_wbuf;
 }
 
 
@@ -676,6 +710,7 @@ _hash_squeezebucket(Relation rel,
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable = 0;
 		bool		retain_pin = false;
+		Buffer		next_wbuf = InvalidBuffer;
 
 		/* Scan each tuple in "read" page */
 		maxroffnum = PageGetMaxOffsetNumber(rpage);
@@ -701,8 +736,6 @@ _hash_squeezebucket(Relation rel,
 			 */
 			while (PageGetFreeSpace(wpage) < itemsz)
 			{
-				Buffer		next_wbuf = InvalidBuffer;
-
 				Assert(!PageIsEmpty(wpage));
 
 				if (wblkno == bucket_blkno)
@@ -789,19 +822,29 @@ _hash_squeezebucket(Relation rel,
 		Assert(BlockNumberIsValid(rblkno));
 
 		/* free this overflow page (releases rbuf) */
-		_hash_freeovflpage(rel, rbuf, wbuf, wbuf_dirty, bstrategy);
+		next_wbuf = _hash_freeovflpage(rel, rbuf, wbuf, wbuf_dirty, bstrategy);
+
+		/* retain the pin on primary bucket page till end of bucket scan */
+		if (wblkno != bucket_blkno)
+			_hash_dropbuf(rel, wbuf);
 
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
+			return;
+
+		/* are we freeing the page adjacent to next_wbuf? */
+		if (BufferIsValid(next_wbuf) &&
+			rblkno == BufferGetBlockNumber(next_wbuf))
 		{
-			/* retain the pin on primary bucket page till end of bucket scan */
-			if (wblkno != bucket_blkno)
-				_hash_dropbuf(rel, wbuf);
+			_hash_relbuf(rel, next_wbuf);
 			return;
 		}
 
-		/* lock the overflow page being written, then get the previous one */
-		_hash_chgbufaccess(rel, wbuf, HASH_NOLOCK, HASH_WRITE);
+		wbuf = next_wbuf;
+		wblkno = BufferGetBlockNumber(wbuf);
+		wpage = BufferGetPage(wbuf);
+		wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
+		Assert(wopaque->hasho_bucket == bucket);
 
 		rbuf = _hash_getbuf_with_strategy(rel,
 										  rblkno,
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 6dfc41f..bc63719 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -313,7 +313,7 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 
 /* hashovfl.c */
 extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
-extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
+extern Buffer _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 				   bool wbuf_dirty, BufferAccessStrategy bstrategy);
 extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
 				 BlockNumber blkno, ForkNumber forkNum);
#163Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Janes (#158)
Re: Hash Indexes

On Tue, Dec 6, 2016 at 1:29 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Thu, Dec 1, 2016 at 10:54 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

With the latest HASH WAL patch applied, I get different but apparently
related errors

41993 UPDATE XX002 2016-12-05 22:28:45.333 PST:ERROR: index "foo_index_idx"
contains corrupted page at block 27602
41993 UPDATE XX002 2016-12-05 22:28:45.333 PST:HINT: Please REINDEX it.
41993 UPDATE XX002 2016-12-05 22:28:45.333 PST:STATEMENT: update foo set
count=count+1 where index=$1

This is not a problem with the WAL patch per se. It should be fixed by
the hash index bug fix patch I sent upthread. However, after the bug
fix patch, the WAL patch needs to be rebased, which I will do and send
after verification. In the meantime, it would be great if you can
verify the posted fix.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#164Jeff Janes
jeff.janes@gmail.com
In reply to: Amit Kapila (#162)
Re: Hash Indexes

On Sun, Dec 11, 2016 at 8:37 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Sun, Dec 11, 2016 at 11:54 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Wed, Dec 7, 2016 at 2:02 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

With above fixes, the test ran successfully for more than a day.

There was a small typo in the previous patch which is fixed in
attached. I don't think it will impact the test results if you have
already started the test with the previous patch, but if not, then it
is better to test with attached.

Thanks, I've already been running the previous one for several hours, and
so far it looks good. I've tried forward porting it to the WAL patch to
test that as well, but didn't have any luck doing so.

Cheers,

Jeff

#165Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Janes (#164)
Re: Hash Indexes

On Mon, Dec 12, 2016 at 10:25 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Sun, Dec 11, 2016 at 8:37 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Sun, Dec 11, 2016 at 11:54 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Wed, Dec 7, 2016 at 2:02 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

With above fixes, the test ran successfully for more than a day.

There was a small typo in the previous patch which is fixed in
attached. I don't think it will impact the test results if you have
already started the test with the previous patch, but if not, then it
is better to test with attached.

Thanks, I've already been running the previous one for several hours, and
so far it looks good.

Thanks.

I've tried forward porting it to the WAL patch to
test that as well, but didn't have any luck doing so.

I think we can verify WAL patch separately. I am already working on it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#166Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#161)
Re: Hash Indexes

On Sun, Dec 11, 2016 at 1:24 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The reason for this and the similar error in vacuum was that in one of
the corner cases after freeing the overflow page and updating the link
for the previous bucket, we were not marking the buffer as dirty. So,
due to concurrent activity, the buffer containing the updated links
got evicted and then later when we tried to access the same buffer, it
brought back the old copy which contains a link to freed overflow
page.

Apart from above issue, Kuntal has noticed that there is assertion
failure (Assert(bucket == new_bucket);) in hashbucketcleanup with the
same test as provided by you. The reason for that problem was that
after deleting tuples in hashbucketcleanup, we were not marking the
buffer as dirty due to which the old copy of the overflow page was
re-appearing and causing that failure.

After fixing the above problem, it has been noticed that there is
another assertion failure (Assert(bucket == obucket);) in
_hash_splitbucket_guts. The reason for this problem was that after
the split, vacuum failed to remove tuples from the old bucket that are
moved due to split. Now, during next split from the same old bucket,
we don't expect old bucket to contain tuples from the previous split.
To fix this whenever vacuum needs to perform split cleanup, it should
update the metapage values (masks required to calculate bucket
number), so that it shouldn't miss cleaning the tuples.
I believe this is the same assertion what Andreas has reported in
another thread [1].

The next problem we encountered is that after running the same test
for somewhat longer, inserts were failing with error "unexpected zero
page at block ..". After some analysis, I have found that the lock
chain (lock next overflow bucket page before releasing the previous
bucket page) was broken in one corner case in _hash_freeovflpage due
to which insert went ahead than squeeze bucket operation and accessed
the freed overflow page before the link for the same has been updated.

With above fixes, the test ran successfully for more than a day.

Instead of doing this:

+    _hash_chgbufaccess(rel, bucket_buf, HASH_WRITE, HASH_NOLOCK);
+    _hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);

...wouldn't it be better to just do MarkBufferDirty()? There's no
real reason to release the lock only to reacquire it again, is there?
I don't think we should be afraid to call MarkBufferDirty() instead of
going through these (fairly stupid) hasham-specific APIs.
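
For illustration, a minimal sketch of what that simplification would look
like in hashbucketcleanup() (hypothetical, not necessarily the exact
committed change):

/*
 * Still holding the exclusive lock taken on the primary bucket page,
 * simply mark the buffer dirty after clearing the split-cleanup flag;
 * no unlock/relock dance is needed.
 */
MarkBufferDirty(bucket_buf);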

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#167Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#166)
Re: Hash Indexes

On Tue, Dec 13, 2016 at 2:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Dec 11, 2016 at 1:24 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

With above fixes, the test ran successfully for more than a day.

Instead of doing this:

+    _hash_chgbufaccess(rel, bucket_buf, HASH_WRITE, HASH_NOLOCK);
+    _hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);

...wouldn't it be better to just do MarkBufferDirty()? There's no
real reason to release the lock only to reacquire it again, is there?

The reason is to make the operations consistent on master and standby.
In the WAL patch, for clearing the SPLIT_CLEANUP flag, we write a WAL
record, and if we don't release the lock after writing it, the operation
can become visible on the standby even before it is on the master.
Currently, without WAL, there is no benefit to doing so and we could fix
it by using MarkBufferDirty; however, I thought it might be simpler to
keep it this way as this is required for enabling WAL. OTOH, we can
leave that for the WAL patch. Let me know which way you prefer.

I don't think we should be afraid to call MarkBufferDirty() instead of
going through these (fairly stupid) hasham-specific APIs.

Right and anyway we need to use it at many more call sites when we
enable WAL for hash index.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#168Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Amit Kapila (#162)
Re: Hash Indexes

On 12/11/2016 11:37 PM, Amit Kapila wrote:

On Sun, Dec 11, 2016 at 11:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 7, 2016 at 2:02 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

With above fixes, the test ran successfully for more than a day.

There was a small typo in the previous patch which is fixed in
attached. I don't think it will impact the test results if you have
already started the test with the previous patch, but if not, then it
is better to test with attached.

A mixed workload (INSERT, DELETE and VACUUM primarily) is successful
here too using _v2.

Thanks !

Best regards,
Jesper

#169Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#167)
1 attachment(s)
Re: Hash Indexes

On Mon, Dec 12, 2016 at 9:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The reason is to make the operations consistent on master and standby.
In the WAL patch, for clearing the SPLIT_CLEANUP flag, we write a WAL
record, and if we don't release the lock after writing it, the operation
can become visible on the standby even before it is on the master.
Currently, without WAL, there is no benefit to doing so and we could fix
it by using MarkBufferDirty; however, I thought it might be simpler to
keep it this way as this is required for enabling WAL. OTOH, we can
leave that for the WAL patch. Let me know which way you prefer.

It's not required for enabling WAL. You don't *have* to release and
reacquire the buffer lock; in fact, that just adds overhead. You *do*
have to be aware that the standby will perhaps see the intermediate
state, because it won't hold the lock throughout. But that does not
mean that the master must release the lock.

I don't think we should be afraid to call MarkBufferDirty() instead of
going through these (fairly stupid) hasham-specific APIs.

Right and anyway we need to use it at many more call sites when we
enable WAL for hash index.

I propose the attached patch, which removes _hash_wrtbuf() and causes
those functions which previously called it to do MarkBufferDirty()
directly. Aside from hopefully fixing the bug we're talking about
here, this makes the logic in several places noticeably less
contorted.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

remove-hash-wrtbuf.patchtext/x-patch; charset=US-ASCII; name=remove-hash-wrtbuf.patchDownload
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index f1511d0..0eeb37d 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -635,7 +635,8 @@ loop_top:
 		num_index_tuples = metap->hashm_ntuples;
 	}
 
-	_hash_wrtbuf(rel, metabuf);
+	MarkBufferDirty(metabuf);
+	_hash_relbuf(rel, metabuf);
 
 	/* return statistics */
 	if (stats == NULL)
@@ -724,7 +725,6 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable = 0;
 		bool		retain_pin = false;
-		bool		curr_page_dirty = false;
 
 		vacuum_delay_point();
 
@@ -805,7 +805,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		{
 			PageIndexMultiDelete(page, deletable, ndeletable);
 			bucket_dirty = true;
-			curr_page_dirty = true;
+			MarkBufferDirty(buf);
 		}
 
 		/* bail out if there are no more pages to scan. */
@@ -820,15 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 * release the lock on previous page after acquiring the lock on next
 		 * page
 		 */
-		if (curr_page_dirty)
-		{
-			if (retain_pin)
-				_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
-			else
-				_hash_wrtbuf(rel, buf);
-			curr_page_dirty = false;
-		}
-		else if (retain_pin)
+		if (retain_pin)
 			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 		else
 			_hash_relbuf(rel, buf);
@@ -862,6 +854,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
 		bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
+		MarkBufferDirty(bucket_buf);
 	}
 
 	/*
@@ -873,7 +866,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
 							bstrategy);
 	else
-		_hash_chgbufaccess(rel, bucket_buf, HASH_WRITE, HASH_NOLOCK);
+		_hash_chgbufaccess(rel, bucket_buf, HASH_READ, HASH_NOLOCK);
 }
 
 void
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 572146a..59c4213 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -208,11 +208,12 @@ restart_insert:
 	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
 
 	/*
-	 * write and release the modified page.  if the page we modified was an
+	 * dirty and release the modified page.  if the page we modified was an
 	 * overflow page, we also need to separately drop the pin we retained on
 	 * the primary bucket page.
 	 */
-	_hash_wrtbuf(rel, buf);
+	MarkBufferDirty(buf);
+	_hash_relbuf(rel, buf);
 	if (buf != bucket_buf)
 		_hash_dropbuf(rel, bucket_buf);
 
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index e2d208e..8fbf494 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -149,10 +149,11 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 	/* logically chain overflow page to previous page */
 	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
+	MarkBufferDirty(buf);
 	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
-		_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 	else
-		_hash_wrtbuf(rel, buf);
+		_hash_relbuf(rel, buf);
 
 	return ovflbuf;
 }
@@ -304,7 +305,8 @@ found:
 
 	/* mark page "in use" in the bitmap */
 	SETBIT(freep, bit);
-	_hash_wrtbuf(rel, mapbuf);
+	MarkBufferDirty(mapbuf);
+	_hash_relbuf(rel, mapbuf);
 
 	/* Reacquire exclusive lock on the meta page */
 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
@@ -416,7 +418,8 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	 * in _hash_pageinit() when the page is reused.)
 	 */
 	MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));
-	_hash_wrtbuf(rel, ovflbuf);
+	MarkBufferDirty(ovflbuf);
+	_hash_relbuf(rel, ovflbuf);
 
 	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
@@ -445,7 +448,10 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 		prevopaque->hasho_nextblkno = nextblkno;
 
 		if (prevblkno != writeblkno)
-			_hash_wrtbuf(rel, prevbuf);
+		{
+			MarkBufferDirty(prevbuf);
+			_hash_relbuf(rel, prevbuf);
+		}
 	}
 
 	/* write and unlock the write buffer */
@@ -466,7 +472,8 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 
 		Assert(nextopaque->hasho_bucket == bucket);
 		nextopaque->hasho_prevblkno = prevblkno;
-		_hash_wrtbuf(rel, nextbuf);
+		MarkBufferDirty(nextbuf);
+		_hash_relbuf(rel, nextbuf);
 	}
 
 	/* Note: bstrategy is intentionally not used for metapage and bitmap */
@@ -494,7 +501,8 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	freep = HashPageGetBitmap(mappage);
 	Assert(ISSET(freep, bitmapbit));
 	CLRBIT(freep, bitmapbit);
-	_hash_wrtbuf(rel, mapbuf);
+	MarkBufferDirty(mapbuf);
+	_hash_relbuf(rel, mapbuf);
 
 	/* Get write-lock on metapage to update firstfree */
 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
@@ -503,13 +511,9 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	if (ovflbitno < metap->hashm_firstfree)
 	{
 		metap->hashm_firstfree = ovflbitno;
-		_hash_wrtbuf(rel, metabuf);
-	}
-	else
-	{
-		/* no need to change metapage */
-		_hash_relbuf(rel, metabuf);
+		MarkBufferDirty(metabuf);
 	}
+	_hash_relbuf(rel, metabuf);
 
 	return nextblkno;
 }
@@ -559,8 +563,9 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
 	freep = HashPageGetBitmap(pg);
 	MemSet(freep, 0xFF, BMPGSZ_BYTE(metap));
 
-	/* write out the new bitmap page (releasing write lock and pin) */
-	_hash_wrtbuf(rel, buf);
+	/* dirty the new bitmap page, and release write lock and pin */
+	MarkBufferDirty(buf);
+	_hash_relbuf(rel, buf);
 
 	/* add the new bitmap page to the metapage's list of bitmaps */
 	/* metapage already has a write lock */
@@ -724,13 +729,8 @@ _hash_squeezebucket(Relation rel,
 				 * on next page
 				 */
 				if (wbuf_dirty)
-				{
-					if (retain_pin)
-						_hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK);
-					else
-						_hash_wrtbuf(rel, wbuf);
-				}
-				else if (retain_pin)
+					MarkBufferDirty(wbuf);
+				if (retain_pin)
 					_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
 				else
 					_hash_relbuf(rel, wbuf);
@@ -742,10 +742,9 @@ _hash_squeezebucket(Relation rel,
 					{
 						/* Delete tuples we already moved off read page */
 						PageIndexMultiDelete(rpage, deletable, ndeletable);
-						_hash_wrtbuf(rel, rbuf);
+						MarkBufferDirty(rbuf);
 					}
-					else
-						_hash_relbuf(rel, rbuf);
+					_hash_relbuf(rel, rbuf);
 					return;
 				}
 
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 44332e7..a3d2138 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -290,25 +290,6 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 }
 
 /*
- *	_hash_wrtbuf() -- write a hash page to disk.
- *
- *		This routine releases the lock held on the buffer and our refcount
- *		for it.  It is an error to call _hash_wrtbuf() without a write lock
- *		and a pin on the buffer.
- *
- * NOTE: this routine should go away when/if hash indexes are WAL-ified.
- * The correct sequence of operations is to mark the buffer dirty, then
- * write the WAL record, then release the lock and pin; so marking dirty
- * can't be combined with releasing.
- */
-void
-_hash_wrtbuf(Relation rel, Buffer buf)
-{
-	MarkBufferDirty(buf);
-	UnlockReleaseBuffer(buf);
-}
-
-/*
  * _hash_chgbufaccess() -- Change the lock type on a buffer, without
  *			dropping our pin on it.
  *
@@ -483,7 +464,8 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 		pageopaque->hasho_bucket = i;
 		pageopaque->hasho_flag = LH_BUCKET_PAGE;
 		pageopaque->hasho_page_id = HASHO_PAGE_ID;
-		_hash_wrtbuf(rel, buf);
+		MarkBufferDirty(buf);
+		_hash_relbuf(rel, buf);
 	}
 
 	/* Now reacquire buffer lock on metapage */
@@ -495,7 +477,8 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	_hash_initbitmap(rel, metap, num_buckets + 1, forkNum);
 
 	/* all done */
-	_hash_wrtbuf(rel, metabuf);
+	MarkBufferDirty(metabuf);
+	_hash_relbuf(rel, metabuf);
 
 	return num_buckets;
 }
@@ -1075,7 +1058,10 @@ _hash_splitbucket_guts(Relation rel,
 	if (nbuf == bucket_nbuf)
 		_hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
 	else
-		_hash_wrtbuf(rel, nbuf);
+	{
+		MarkBufferDirty(nbuf);
+		_hash_relbuf(rel, nbuf);
+	}
 
 	_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
 	opage = BufferGetPage(bucket_obuf);
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 6dfc41f..9ce44a7 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -336,7 +336,6 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
 extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
-extern void _hash_wrtbuf(Relation rel, Buffer buf);
 extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
 				   int to_access);
 extern uint32 _hash_metapinit(Relation rel, double num_tuples,
#170Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#169)
Re: Hash Indexes

On Tue, Dec 13, 2016 at 11:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Dec 12, 2016 at 9:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The reason is to make the operations consistent on master and standby.
In the WAL patch, for clearing the SPLIT_CLEANUP flag, we write a WAL
record, and if we don't release the lock after writing it, the operation
can become visible on the standby even before it is on the master.
Currently, without WAL, there is no benefit to doing so and we could fix
it by using MarkBufferDirty; however, I thought it might be simpler to
keep it this way as this is required for enabling WAL. OTOH, we can
leave that for the WAL patch. Let me know which way you prefer.

It's not required for enabling WAL. You don't *have* to release and
reacquire the buffer lock; in fact, that just adds overhead.

If we don't release the lock, then it will break the general coding
pattern for writing WAL (acquire pin and an exclusive lock,
MarkBufferDirty, write WAL, release lock). Basically, I think that is
to ensure that we don't hold the lock across multiple atomic operations,
or in this case to avoid calling MarkBufferDirty multiple times.
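
For reference, that general pattern looks roughly like the sketch below
(buf is assumed to be pinned already; the rmgr and record names are
placeholders, not taken from the actual WAL patch):

LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

START_CRIT_SECTION();
/* ... apply the change to the page ... */
MarkBufferDirty(buf);
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
PageSetLSN(BufferGetPage(buf), XLogInsert(RM_HASH_ID, XLOG_HASH_SOME_OP));
END_CRIT_SECTION();

LockBuffer(buf, BUFFER_LOCK_UNLOCK);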

You *do*
have to be aware that the standby will perhaps see the intermediate
state, because it won't hold the lock throughout. But that does not
mean that the master must release the lock.

Okay, but I think that will be avoided to a great extent because we do
follow the practice of releasing the lock immediately after writing
the WAL.

I don't think we should be afraid to call MarkBufferDirty() instead of
going through these (fairly stupid) hasham-specific APIs.

Right and anyway we need to use it at many more call sites when we
enable WAL for hash index.

I propose the attached patch, which removes _hash_wrtbuf() and causes
those functions which previously called it to do MarkBufferDirty()
directly.

It is possible that we can MarkBufferDirty multiple times (twice in
hashbucketcleanup and then in _hash_squeezebucket) while holding the
lock. I don't think that is a big problem as of now but wanted to
avoid it as I thought we need it for WAL patch.

Aside from hopefully fixing the bug we're talking about
here, this makes the logic in several places noticeably less
contorted.

Yeah, it will fix the problem in hashbucketcleanup, but there are two
other problems that need to be fixed, for which I can send a separate
patch. By the way, as mentioned to you earlier, the WAL patch already
takes care of removing _hash_wrtbuf and simplifies the logic wherever
possible without introducing multiple MarkBufferDirty calls under one
lock. However, if you want to proceed with the current patch, then I
can send you separate patches for the remaining problems as addressed
in the bug fix patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#171Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#170)
Re: Hash Indexes

On Wed, Dec 14, 2016 at 4:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

It's not required for enabling WAL. You don't *have* to release and
reacquire the buffer lock; in fact, that just adds overhead.

If we don't release the lock, then it will break the general coding
pattern for writing WAL (acquire pin and an exclusive lock,
MarkBufferDirty, write WAL, release lock). Basically, I think that is
to ensure that we don't hold the lock across multiple atomic operations,
or in this case to avoid calling MarkBufferDirty multiple times.

I think you're being too pedantic. Yes, the general rule is to
release the lock after each WAL record, but no harm comes if we take
the lock, emit TWO WAL records, and then release it.
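
In sketch form (details of the WAL record construction elided):

LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

START_CRIT_SECTION();
/* ... first change; MarkBufferDirty(buf), emit WAL record #1, PageSetLSN ... */
END_CRIT_SECTION();

START_CRIT_SECTION();
/* ... second change, lock still held; MarkBufferDirty(buf), record #2 ... */
END_CRIT_SECTION();

LockBuffer(buf, BUFFER_LOCK_UNLOCK);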

It is possible that we can MarkBufferDirty multiple times (twice in
hashbucketcleanup and then in _hash_squeezebucket) while holding the
lock. I don't think that is a big problem as of now but wanted to
avoid it as I thought we need it for WAL patch.

I see no harm in calling MarkBufferDirty multiple times, either now or
after the WAL patch. Of course we don't want to end up with tons of
extra calls - it's not totally free - but it's pretty cheap.

Aside from hopefully fixing the bug we're talking about
here, this makes the logic in several places noticeably less
contorted.

Yeah, it will fix the problem in hashbucketcleanup, but there are two
other problems that need to be fixed, for which I can send a separate
patch. By the way, as mentioned to you earlier, the WAL patch already
takes care of removing _hash_wrtbuf and simplifies the logic wherever
possible without introducing multiple MarkBufferDirty calls under one
lock. However, if you want to proceed with the current patch, then I
can send you separate patches for the remaining problems as addressed
in the bug fix patch.

That sounds good.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#172Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#171)
2 attachment(s)
Re: Hash Indexes

On Wed, Dec 14, 2016 at 10:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Dec 14, 2016 at 4:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, it will fix the problem in hashbucketcleanup, but there are two
other problems that need to be fixed, for which I can send a separate
patch. By the way, as mentioned to you earlier, the WAL patch already
takes care of removing _hash_wrtbuf and simplifies the logic wherever
possible without introducing multiple MarkBufferDirty calls under one
lock. However, if you want to proceed with the current patch, then I
can send you separate patches for the remaining problems as addressed
in the bug fix patch.

That sounds good.

Attached are the two patches on top of remove-hash-wrtbuf. Patch
fix_dirty_marking_v1.patch marks the buffer dirty in one of the corner
cases in _hash_freeovflpage() and avoids marking it dirty unnecessarily
in _hash_squeezebucket(). I think this can be combined with the
remove-hash-wrtbuf patch. fix_lock_chaining_v1.patch fixes the chaining
behavior (lock the next overflow bucket page before releasing the
previous one) that was broken in _hash_freeovflpage(). These patches
can be applied in series: first remove-hash-wrtbuf, then
fix_dirty_marking_v1.patch, and then fix_lock_chaining_v1.patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix_dirty_marking_v1.patchapplication/octet-stream; name=fix_dirty_marking_v1.patchDownload
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 8fbf494..5f1513b 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -452,6 +452,11 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 			MarkBufferDirty(prevbuf);
 			_hash_relbuf(rel, prevbuf);
 		}
+		else
+		{
+			/* ensure to mark prevbuf as dirty */
+			wbuf_dirty = true;
+		}
 	}
 
 	/* write and unlock the write buffer */
@@ -643,7 +648,7 @@ _hash_squeezebucket(Relation rel,
 	 */
 	if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
 	{
-		_hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK);
+		_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
 		return;
 	}
 
fix_lock_chaining_v1.patchapplication/octet-stream; name=fix_lock_chaining_v1.patchDownload
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 5f1513b..eaefd90 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -371,8 +371,8 @@ _hash_firstfreebit(uint32 map)
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
  *
- *	Returns the block number of the page that followed the given page
- *	in the bucket, or InvalidBlockNumber if no following page.
+ *	Returns the buffer that followed the given wbuf in the bucket, or
+ *	InvalidBuffer if no following page.
  *
  *	NB: caller must not hold lock on metapage, nor on page, that's next to
  *	ovflbuf in the bucket chain.  We don't acquire the lock on page that's
@@ -380,7 +380,7 @@ _hash_firstfreebit(uint32 map)
  *	has a lock on same.  This function releases the lock on wbuf and caller
  *	is responsible for releasing the pin on same.
  */
-BlockNumber
+Buffer
 _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 				   bool wbuf_dirty, BufferAccessStrategy bstrategy)
 {
@@ -388,14 +388,17 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	Buffer		metabuf;
 	Buffer		mapbuf;
 	Buffer		prevbuf = InvalidBuffer;
+	Buffer		next_wbuf = InvalidBuffer;
 	BlockNumber ovflblkno;
 	BlockNumber prevblkno;
 	BlockNumber blkno;
 	BlockNumber nextblkno;
 	BlockNumber writeblkno;
 	HashPageOpaque ovflopaque;
+	HashPageOpaque wopaque;
 	Page		ovflpage;
 	Page		mappage;
+	Page		wpage;
 	uint32	   *freep;
 	uint32		ovflbitno;
 	int32		bitmappage,
@@ -458,13 +461,6 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 			wbuf_dirty = true;
 		}
 	}
-
-	/* write and unlock the write buffer */
-	if (wbuf_dirty)
-		_hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK);
-	else
-		_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
-
 	if (BlockNumberIsValid(nextblkno))
 	{
 		Buffer		nextbuf = _hash_getbuf_with_strategy(rel,
@@ -481,6 +477,38 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 		_hash_relbuf(rel, nextbuf);
 	}
 
+	/*
+	 * To maintain lock chaining as described atop hashbucketcleanup, we need
+	 * to lock next bucket buffer in chain before releasing current.  This is
+	 * required only if the next overflow page from which to read is not same
+	 * as page to which we need to write.
+	 *
+	 * XXX Here, we are moving to next overflow page for writing without
+	 * ensuring if the previous write page is full.  This is annoying, but
+	 * should not hurt much in practice as that space will anyway be consumed
+	 * by future inserts.
+	 */
+	if (prevblkno != writeblkno)
+	{
+		wpage = BufferGetPage(wbuf);
+		wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
+		Assert(wopaque->hasho_bucket == bucket);
+		writeblkno = wopaque->hasho_nextblkno;
+		Assert(BlockNumberIsValid(writeblkno));
+
+		next_wbuf = _hash_getbuf_with_strategy(rel,
+											   writeblkno,
+											   HASH_WRITE,
+											   LH_OVERFLOW_PAGE,
+											   bstrategy);
+	}
+
+	/* write and unlock the write buffer */
+	if (wbuf_dirty)
+		_hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK);
+	else
+		_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
+
 	/* Note: bstrategy is intentionally not used for metapage and bitmap */
 
 	/* Read the metapage so we can determine which bitmap page to use */
@@ -520,7 +548,7 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	}
 	_hash_relbuf(rel, metabuf);
 
-	return nextblkno;
+	return next_wbuf;
 }
 
 
@@ -686,6 +714,7 @@ _hash_squeezebucket(Relation rel,
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable = 0;
 		bool		retain_pin = false;
+		Buffer		next_wbuf = InvalidBuffer;
 
 		/* Scan each tuple in "read" page */
 		maxroffnum = PageGetMaxOffsetNumber(rpage);
@@ -793,19 +822,29 @@ _hash_squeezebucket(Relation rel,
 		Assert(BlockNumberIsValid(rblkno));
 
 		/* free this overflow page (releases rbuf) */
-		_hash_freeovflpage(rel, rbuf, wbuf, wbuf_dirty, bstrategy);
+		next_wbuf = _hash_freeovflpage(rel, rbuf, wbuf, wbuf_dirty, bstrategy);
+
+		/* retain the pin on primary bucket page till end of bucket scan */
+		if (wblkno != bucket_blkno)
+			_hash_dropbuf(rel, wbuf);
 
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
+			return;
+
+		/* are we freeing the page adjacent to next_wbuf? */
+		if (BufferIsValid(next_wbuf) &&
+			rblkno == BufferGetBlockNumber(next_wbuf))
 		{
-			/* retain the pin on primary bucket page till end of bucket scan */
-			if (wblkno != bucket_blkno)
-				_hash_dropbuf(rel, wbuf);
+			_hash_relbuf(rel, next_wbuf);
 			return;
 		}
 
-		/* lock the overflow page being written, then get the previous one */
-		_hash_chgbufaccess(rel, wbuf, HASH_NOLOCK, HASH_WRITE);
+		wbuf = next_wbuf;
+		wblkno = BufferGetBlockNumber(wbuf);
+		wpage = BufferGetPage(wbuf);
+		wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
+		Assert(wopaque->hasho_bucket == bucket);
 
 		rbuf = _hash_getbuf_with_strategy(rel,
 										  rblkno,
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 9ce44a7..5691ee3 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -313,7 +313,7 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 
 /* hashovfl.c */
 extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
-extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
+extern Buffer _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 				   bool wbuf_dirty, BufferAccessStrategy bstrategy);
 extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
 				 BlockNumber blkno, ForkNumber forkNum);
#173Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#172)
Re: Hash Indexes

On Thu, Dec 15, 2016 at 11:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached are the two patches on top of remove-hash-wrtbuf. Patch
fix_dirty_marking_v1.patch marks the buffer dirty in one of the corner
cases in _hash_freeovflpage() and avoids marking it dirty unnecessarily
in _hash_squeezebucket(). I think this can be combined with the
remove-hash-wrtbuf patch. fix_lock_chaining_v1.patch fixes the chaining
behavior (lock the next overflow bucket page before releasing the
previous one) that was broken in _hash_freeovflpage(). These patches
can be applied in series: first remove-hash-wrtbuf, then
fix_dirty_marking_v1.patch, and then fix_lock_chaining_v1.patch.

I committed remove-hash-wrtbuf and fix_dirty_marking_v1 but I've got
some reservations about fix_lock_chaining_v1. ISTM that the natural
fix here would be to change the API contract for _hash_freeovflpage so
that it doesn't release the lock on the write buffer. Why does it
even do that? I think that the only reason why _hash_freeovflpage
should be getting wbuf as an argument is so that it can handle the
case where wbuf happens to be the previous block correctly. Aside
from that there's no reason for it to touch wbuf. If you fix it like
that then you should be able to avoid this rather ugly wart:

* XXX Here, we are moving to next overflow page for writing without
* ensuring if the previous write page is full. This is annoying, but
* should not hurt much in practice as that space will anyway be consumed
* by future inserts.
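
In other words, roughly this contract (a sketch only, not a finished
patch):

/*
 * _hash_freeovflpage() keeps the lock on wbuf; wbuf is passed only so
 * that the case where it is also the previous page in the chain can be
 * handled, and the caller remains responsible for its lock and pin.
 */
extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
				   BufferAccessStrategy bstrategy);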

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#174Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#173)
1 attachment(s)
Re: Hash Indexes

On Fri, Dec 16, 2016 at 9:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Dec 15, 2016 at 11:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached are the two patches on top of remove-hash-wrtbuf. Patch
fix_dirty_marking_v1.patch marks the buffer dirty in one of the corner
cases in _hash_freeovflpage() and avoids marking it dirty unnecessarily
in _hash_squeezebucket(). I think this can be combined with the
remove-hash-wrtbuf patch. fix_lock_chaining_v1.patch fixes the chaining
behavior (lock the next overflow bucket page before releasing the
previous one) that was broken in _hash_freeovflpage(). These patches
can be applied in series: first remove-hash-wrtbuf, then
fix_dirty_marking_v1.patch, and then fix_lock_chaining_v1.patch.

I committed remove-hash-wrtbuf and fix_dirty_marking_v1 but I've got
some reservations about fix_lock_chaining_v1. ISTM that the natural
fix here would be to change the API contract for _hash_freeovflpage so
that it doesn't release the lock on the write buffer. Why does it
even do that? I think that the only reason why _hash_freeovflpage
should be getting wbuf as an argument is so that it can handle the
case where wbuf happens to be the previous block correctly.

Yeah, as of now that is the only case, but for WAL patch, I think we
need to ensure that the action of moving all the tuples to the page
being written and the overflow page being freed needs to be logged
together as an atomic operation. Now apart from that, it is
theoretically possible that write page will remain locked for multiple
overflow pages being freed (when the page being written has enough
space that it can accommodate tuples from multiple overflow pages). I
am not sure if it is worth worrying about such a case because
practically it might happen rarely. So, I have prepared a patch to
retain a lock on wbuf in _hash_freeovflpage() as suggested by you.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix_lock_chaining_v2.patchapplication/octet-stream; name=fix_lock_chaining_v2.patchDownload
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 5f1513b..e29fe0c 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -382,7 +382,7 @@ _hash_firstfreebit(uint32 map)
  */
 BlockNumber
 _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
-				   bool wbuf_dirty, BufferAccessStrategy bstrategy)
+				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
 	Buffer		metabuf;
@@ -447,24 +447,10 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 		Assert(prevopaque->hasho_bucket == bucket);
 		prevopaque->hasho_nextblkno = nextblkno;
 
+		MarkBufferDirty(prevbuf);
 		if (prevblkno != writeblkno)
-		{
-			MarkBufferDirty(prevbuf);
 			_hash_relbuf(rel, prevbuf);
-		}
-		else
-		{
-			/* ensure to mark prevbuf as dirty */
-			wbuf_dirty = true;
-		}
 	}
-
-	/* write and unlock the write buffer */
-	if (wbuf_dirty)
-		_hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK);
-	else
-		_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
-
 	if (BlockNumberIsValid(nextblkno))
 	{
 		Buffer		nextbuf = _hash_getbuf_with_strategy(rel,
@@ -783,30 +769,28 @@ _hash_squeezebucket(Relation rel,
 		 * Tricky point here: if our read and write pages are adjacent in the
 		 * bucket chain, our write lock on wbuf will conflict with
 		 * _hash_freeovflpage's attempt to update the sibling links of the
-		 * removed page.  In that case, we don't need to lock it again and we
-		 * always release the lock on wbuf in _hash_freeovflpage and then
-		 * retake it again here.  This will not only simplify the code, but is
-		 * required to atomically log the changes which will be helpful when
-		 * we write WAL for hash indexes.
+		 * removed page.  In that case, we don't need to lock it again.
 		 */
 		rblkno = ropaque->hasho_prevblkno;
 		Assert(BlockNumberIsValid(rblkno));
 
 		/* free this overflow page (releases rbuf) */
-		_hash_freeovflpage(rel, rbuf, wbuf, wbuf_dirty, bstrategy);
+		_hash_freeovflpage(rel, rbuf, wbuf, bstrategy);
+
+		if (wbuf_dirty)
+			MarkBufferDirty(wbuf);
 
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
 		{
 			/* retain the pin on primary bucket page till end of bucket scan */
-			if (wblkno != bucket_blkno)
-				_hash_dropbuf(rel, wbuf);
+			if (wblkno == bucket_blkno)
+				_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
+			else
+				_hash_relbuf(rel, wbuf);
 			return;
 		}
 
-		/* lock the overflow page being written, then get the previous one */
-		_hash_chgbufaccess(rel, wbuf, HASH_NOLOCK, HASH_WRITE);
-
 		rbuf = _hash_getbuf_with_strategy(rel,
 										  rblkno,
 										  HASH_WRITE,
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 9ce44a7..627fa2c 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -314,7 +314,7 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 /* hashovfl.c */
 extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
 extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
-				   bool wbuf_dirty, BufferAccessStrategy bstrategy);
+				   BufferAccessStrategy bstrategy);
 extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
 				 BlockNumber blkno, ForkNumber forkNum);
 extern void _hash_squeezebucket(Relation rel,
#175Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#174)
Re: Hash Indexes

On Sun, Dec 18, 2016 at 8:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I committed remove-hash-wrtbuf and fix_dirty_marking_v1 but I've got
some reservations about fix_lock_chaining_v1. ISTM that the natural
fix here would be to change the API contract for _hash_freeovflpage so
that it doesn't release the lock on the write buffer. Why does it
even do that? I think that the only reason why _hash_freeovflpage
should be getting wbuf as an argument is so that it can handle the
case where wbuf happens to be the previous block correctly.

Yeah, as of now that is the only case, but for WAL patch, I think we
need to ensure that the action of moving all the tuples to the page
being written and the overflow page being freed needs to be logged
together as an atomic operation.

Not really. We can have one operation that empties the overflow page
and another that unlinks it and makes it free.

Now apart from that, it is
theoretically possible that write page will remain locked for multiple
overflow pages being freed (when the page being written has enough
space that it can accommodate tuples from multiple overflow pages). I
am not sure if it is worth worrying about such a case because
practically it might happen rarely. So, I have prepared a patch to
retain a lock on wbuf in _hash_freeovflpage() as suggested by you.

Committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#176Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#175)
Re: Hash Indexes

On Mon, Dec 19, 2016 at 11:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Dec 18, 2016 at 8:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I committed remove-hash-wrtbuf and fix_dirty_marking_v1 but I've got
some reservations about fix_lock_chaining_v1. ISTM that the natural
fix here would be to change the API contract for _hash_freeovflpage so
that it doesn't release the lock on the write buffer. Why does it
even do that? I think that the only reason why _hash_freeovflpage
should be getting wbuf as an argument is so that it can handle the
case where wbuf happens to be the previous block correctly.

Yeah, as of now that is the only case, but for WAL patch, I think we
need to ensure that the action of moving all the tuples to the page
being written and the overflow page being freed needs to be logged
together as an atomic operation.

Not really. We can have one operation that empties the overflow page
and another that unlinks it and makes it free.

We have mainly four actions for the squeeze operation: add tuples to the
write page, empty the overflow page, unlink the overflow page, and make
it free by updating the corresponding bit in the bitmap. Now, if we
don't log the changes to the write page and the freeing of the overflow
page as one operation, then won't a query on the standby either see
duplicate tuples or miss the tuples that were freed from the overflow
page?

Now apart from that, it is
theoretically possible that write page will remain locked for multiple
overflow pages being freed (when the page being written has enough
space that it can accommodate tuples from multiple overflow pages). I
am not sure if it is worth worrying about such a case because
practically it might happen rarely. So, I have prepared a patch to
retain a lock on wbuf in _hash_freeovflpage() as suggested by you.

Committed.

Thanks.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#177Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#176)
Re: Hash Indexes

On Tue, Dec 20, 2016 at 4:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

We have mainly four actions for the squeeze operation: add tuples to the
write page, empty the overflow page, unlink the overflow page, and make
it free by updating the corresponding bit in the bitmap. Now, if we
don't log the changes to the write page and the freeing of the overflow
page as one operation, then won't a query on the standby either see
duplicate tuples or miss the tuples that were freed from the overflow
page?

No, I think you could have two operations:

1. Move tuples from the "read" page to the "write" page.

2. Unlink the overflow page from the chain and mark it free.

If we fail after step 1, the bucket chain might end with an empty
overflow page, but that's OK.
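
A sketch of how the two operations could be logged independently (the
record names below are placeholders, not from any posted patch):

XLogRecPtr	recptr;

/* operation 1: move tuples from the "read" page to the "write" page */
START_CRIT_SECTION();
/* ... add the tuples to wpage and delete them from rpage ... */
MarkBufferDirty(wbuf);
MarkBufferDirty(rbuf);
XLogBeginInsert();
XLogRegisterBuffer(0, wbuf, REGBUF_STANDARD);
XLogRegisterBuffer(1, rbuf, REGBUF_STANDARD);
recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_MOVE_TUPLES);
PageSetLSN(BufferGetPage(wbuf), recptr);
PageSetLSN(BufferGetPage(rbuf), recptr);
END_CRIT_SECTION();

/*
 * operation 2: unlink the now-empty overflow page from the chain and
 * mark it free in the bitmap, logged as a separate record.  A crash
 * between the two just leaves an empty overflow page in the chain.
 */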

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#178Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#177)
Re: Hash Indexes

On Tue, Dec 20, 2016 at 7:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Dec 20, 2016 at 4:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

We have mainly four actions for the squeeze operation: add tuples to the
write page, empty the overflow page, unlink the overflow page, and make
it free by updating the corresponding bit in the bitmap. Now, if we
don't log the changes to the write page and the freeing of the overflow
page as one operation, then won't a query on the standby either see
duplicate tuples or miss the tuples that were freed from the overflow
page?

No, I think you could have two operations:

1. Move tuples from the "read" page to the "write" page.

2. Unlink the overflow page from the chain and mark it free.

If we fail after step 1, the bucket chain might end with an empty
overflow page, but that's OK.

If there is an empty page in bucket chain, access to that page will
give an error (In WAL patch we are initializing the page instead of
making it completely empty, so we might not see an error in such a
case). What advantage do you see by splitting the operation?
Anyway, I think it is better to discuss this in WAL patch thread.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#179Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#178)
Re: Hash Indexes

On Tue, Dec 20, 2016 at 9:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 20, 2016 at 7:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Dec 20, 2016 at 4:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

We have mainly four actions for the squeeze operation: add tuples to the
write page, empty the overflow page, unlink the overflow page, and make
it free by updating the corresponding bit in the bitmap. Now, if we
don't log the changes to the write page and the freeing of the overflow
page as one operation, then won't a query on the standby either see
duplicate tuples or miss the tuples that were freed from the overflow
page?

No, I think you could have two operations:

1. Move tuples from the "read" page to the "write" page.

2. Unlink the overflow page from the chain and mark it free.

If we fail after step 1, the bucket chain might end with an empty
overflow page, but that's OK.

If there is an empty page in bucket chain, access to that page will
give an error (In WAL patch we are initializing the page instead of
making it completely empty, so we might not see an error in such a
case).

It wouldn't be a new, uninitialized page. It would be empty of
tuples, not all-zeroes.

What advantage do you see by splitting the operation?

It's simpler. The code here is very complicated and trying to merge
too many things into a single operation may make it even more
complicated, increasing the risk of bugs and making the code hard to
maintain without necessarily buying much performance.

Anyway, I think it is better to discuss this in WAL patch thread.

OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#180Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#179)
Re: Hash Indexes

On Tue, Dec 20, 2016 at 7:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Dec 20, 2016 at 9:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 20, 2016 at 7:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Dec 20, 2016 at 4:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

We have mainly four actions for the squeeze operation: add tuples to the
write page, empty the overflow page, unlink the overflow page, and make
it free by updating the corresponding bit in the bitmap. Now, if we
don't log the changes to the write page and the freeing of the overflow
page as one operation, then won't a query on the standby either see
duplicate tuples or miss the tuples that were freed from the overflow
page?

No, I think you could have two operations:

1. Move tuples from the "read" page to the "write" page.

2. Unlink the overflow page from the chain and mark it free.

If we fail after step 1, the bucket chain might end with an empty
overflow page, but that's OK.

If there is an empty page in bucket chain, access to that page will
give an error (In WAL patch we are initializing the page instead of
making it completely empty, so we might not see an error in such a
case).

It wouldn't be a new, uninitialized page. It would be empty of
tuples, not all-zeroes.

AFAIU we initialize page as all-zeros, but I think you are envisioning
that we need to change it to a new uninitialized page.

What advantage do you see by splitting the operation?

It's simpler. The code here is very complicated and trying to merge
too many things into a single operation may make it even more
complicated, increasing the risk of bugs and making the code hard to
maintain without necessarily buying much performance.

Sure, if you find that way better, then we can change it, but the
current patch treats it as a single operation. If, after looking at
the patch, you find it is better to change it into two operations, I
will do so.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
