[POC] A better way to expand hash indexes.

Started by Mithun Cy · almost 9 years ago · 38 messages
#1 Mithun Cy
mithun.cy@enterprisedb.com
1 attachment(s)

Hi all,

As of now, we expand a hash index by doubling the number of bucket
blocks, but unfortunately those blocks will not be used immediately.
So I thought that if we defer part of the bucket block allocation,
the hash index size can grow much more efficiently. I have written a
POC patch which does the following -
Say at overflow point 'i' we need to add "x = 2^(i-1)" new buckets as
per the old code. I think we can do this addition of buckets in a
controlled way: instead of adding all 'x' bucket blocks at once, the
patch adds x/4 blocks at a time, and once those blocks are consumed
it adds the next installment of x/4 blocks. This continues until all
'x' blocks of overflow point 'i' are allocated. My test results show
that the index size grows in a much more efficient way with the above
patch.

Note: This patch is just a POC. It may have bugs, and I still have to
update the code comments and the README sections related to these changes.

Test:
create table t1(t int);
create index i1 on t1 using hash(t);
And then incrementally add rows as below:
insert into t1 select generate_series(1, 10000000);

records       base index     patched index
(millions)    size (MB)      size (MB)
     10           384            384
     20           768            768
     30          1417           1161
     40          1531           1531
     50          2556           1788
     60          2785           2273
     70          2963           2709
     80          3060           3061
     90          5111           3575
    100          5111           3575

To implement such an incremental addition of bucket blocks, I have to
increase the size of the hashm_spares array in the meta page by four
times. Also, the mapping methods which map a total number of buckets
to a split-point position in the hashm_spares array need to be changed.
These changes create backward-compatibility issues.

Implementation Details in brief:
=======================
Each element of hashm_spares (which we call an overflow point) is
expanded into 4 slots {0, 1, 2, 3}. If 'x' (a power of 2) is the total
number of buckets to be added before the overflow point, we add only a
quarter of them (x/4) per slot; once we have consumed the previously
added blocks, we add the next quarter of buckets, and so on.
As in the old code, each new hashm_spares[i] stores the total number of
overflow pages allocated before that slot's bucket allocation.
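
To make the allocation schedule concrete, here is a small standalone
sketch (illustration only, not part of the patch; it assumes x >= 4 so
that x/4 divides evenly, whereas the POC handles the first few split
points specially). It prints how many bucket blocks each installment
adds and the running total:

#include <stdio.h>

int
main(void)
{
	int			i;

	/* walk a few overflow points; the old code adds x = 2^(i-1) buckets at point i */
	for (i = 3; i <= 6; i++)
	{
		unsigned int x = 1u << (i - 1);	/* buckets the old code would add at once */
		unsigned int total = x;			/* buckets that exist before overflow point i */
		int			phase;

		/* the patch spreads the same x buckets over 4 installments of x/4 each */
		for (phase = 0; phase < 4; phase++)
		{
			total += x / 4;
			printf("overflow point %d, installment %d: add %u, total buckets %u\n",
				   i, phase, x / 4, total);
		}
	}
	return 0;
}

With this schedule the index still doubles per overflow point, but only
a quarter of the new bucket pages are materialized at a time.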

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Attachments:

expand_hashbucket_efficiently_01 (application/octet-stream)
commit b6374f84fb714951d88aeb4ac0fca0fdda3c3d7a
Author: mithun <mithun@localhost.localdomain>
Date:   Fri Feb 17 18:57:02 2017 +0530

    commit 1

diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 3334089..4a9a409 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -48,7 +48,7 @@ bitno_to_blkno(HashMetaPage metap, uint32 ovflbitnum)
 	 * Convert to absolute page number by adding the number of bucket pages
 	 * that exist before this split point.
 	 */
-	return (BlockNumber) ((1 << i) + ovflbitnum);
+	return (BlockNumber) (_hash_get_tbuckets(i) + ovflbitnum);
 }
 
 /*
@@ -66,9 +66,9 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
 	/* Determine the split number containing this page */
 	for (i = 1; i <= splitnum; i++)
 	{
-		if (ovflblkno <= (BlockNumber) (1 << i))
+		if (ovflblkno <= (BlockNumber) _hash_get_tbuckets(i))
 			break;				/* oops */
-		bitnum = ovflblkno - (1 << i);
+		bitnum = ovflblkno - _hash_get_tbuckets(i);
 
 		/*
 		 * bitnum has to be greater than number of overflow page added in
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 9485978..c460350 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -315,7 +315,7 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	int32		ffactor;
 	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
+	uint32		spare_index;
 	uint32		i;
 
 	/* safety check */
@@ -350,11 +350,10 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	else if (dnumbuckets >= (double) 0x40000000)
 		num_buckets = 0x40000000;
 	else
-		num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
+		num_buckets = _hash_get_tbuckets(_hash_spareindex(dnumbuckets));
 
-	log2_num_buckets = _hash_log2(num_buckets);
-	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
-	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
+	spare_index = _hash_spareindex(num_buckets);
+	Assert(spare_index < HASH_MAX_SPLITPOINTS);
 
 	/*
 	 * We initialize the metapage, the first N bucket pages, and the first
@@ -410,8 +409,8 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	MemSet(metap->hashm_mapp, 0, sizeof(metap->hashm_mapp));
 
 	/* Set up mapping for one spare page after the initial splitpoints */
-	metap->hashm_spares[log2_num_buckets] = 1;
-	metap->hashm_ovflpoint = log2_num_buckets;
+	metap->hashm_spares[spare_index] = 1;
+	metap->hashm_ovflpoint = spare_index;
 	metap->hashm_firstfree = 0;
 
 	/*
@@ -649,21 +648,23 @@ restart_expand:
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to allocate a new batch of bucket pages.
+	 * If the split point is increasing we need to allocate a new batch of
+	 * bucket pages.
 	 */
-	spare_ndx = _hash_log2(new_bucket + 1);
+	spare_ndx = _hash_spareindex(new_bucket + 1);
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
+		uint32		buckets_toadd = 0;
+
 		Assert(spare_ndx == metap->hashm_ovflpoint + 1);
 
 		/*
 		 * The number of buckets in the new splitpoint is equal to the total
-		 * number already in existence, i.e. new_bucket.  Currently this maps
-		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.
+		 * number already in existence, i.e. new_bucket. We add one fourth of
+		 * total buckets to be added in this split point.
 		 */
-		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
+		buckets_toadd = ((1 << ((spare_ndx >> 2) - 1)) >> 2);
+		if (!_hash_alloc_buckets(rel, start_nblkno, buckets_toadd))
 		{
 			/* can't split due to BlockNumber overflow */
 			_hash_relbuf(rel, buf_oblkno);
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index c705531..2c0e975 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -149,6 +149,57 @@ _hash_log2(uint32 num)
 }
 
 /*
+ * _hash_spareindex -- returns spare index of the bucket
+ */
+uint32
+_hash_spareindex(uint32 num_bucket)
+{
+	uint32		i,
+				tbuckets,
+				bucket_pos;
+
+	i = _hash_log2(num_bucket);
+
+	/*
+	 * Get to the buckets main spare index position and then add slot number
+	 * to where this bucket is allocated.
+	 */
+	tbuckets = 1 << (i - 1);
+	bucket_pos = num_bucket - tbuckets;
+	return ((i << 2) + ((tbuckets >> 2) ? ((bucket_pos - 1) / (tbuckets >> 2)) : 0));
+}
+
+/*
+ *	_hash_get_tbuckets -- returns total number of buckets for this split number.
+ */
+uint32
+_hash_get_tbuckets(uint32 split_num)
+{
+	uint32		tbuckets,
+				slot;
+
+	/*
+	 * For the first three groups of split_num we will not have enough buckets
+	 * to distribute among the 4 slots. So just return their total buckets
+	 * irrespective of the slot they occupy.
+	 */
+	if ((split_num >> 2) == 0)
+		return 1;
+	if ((split_num >> 2) == 1)
+		return 2;
+	if ((split_num >> 2) == 2)
+		return 4;
+	/*
+	 * total_buckets = total number of buckets in previous split point group +
+	 * number of buckets for our slot in the current split point group.
+	 */
+	tbuckets = (1 << ((split_num >> 2) - 1));
+	slot = ((3 & split_num) + 1);
+	tbuckets = tbuckets + (slot * (tbuckets >> 2));
+	return tbuckets;
+}
+
+/*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
  * If flags is not zero, it is a bitwise OR of the acceptable values of
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 3bf587b..3407659 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -36,7 +36,7 @@ typedef uint32 Bucket;
 #define InvalidBucket	((Bucket) 0xFFFFFFFF)
 
 #define BUCKET_TO_BLKNO(metap,B) \
-		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
+		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_spareindex((B)+1)-1] : 0)) + 1)
 
 /*
  * Special space for hash index pages.
@@ -168,7 +168,7 @@ typedef HashScanOpaqueData *HashScanOpaque;
  * needing to fit into the metapage.  (With 8K block size, 128 bitmaps
  * limit us to 64 GB of overflow space...)
  */
-#define HASH_MAX_SPLITPOINTS		32
+#define HASH_MAX_SPLITPOINTS		128
 #define HASH_MAX_BITMAPS			128
 
 typedef struct HashMetaPageData
@@ -364,6 +364,8 @@ extern uint32 _hash_datum2hashkey_type(Relation rel, Datum key, Oid keytype);
 extern Bucket _hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
 					 uint32 highmask, uint32 lowmask);
 extern uint32 _hash_log2(uint32 num);
+extern uint32 _hash_spareindex(uint32 num);
+extern uint32 _hash_get_tbuckets(uint32 num);
 extern void _hash_checkpage(Relation rel, Buffer buf, int flags);
 extern uint32 _hash_get_indextuple_hashkey(IndexTuple itup);
 extern bool _hash_convert_tuple(Relation index,
#2 Amit Kapila
amit.kapila16@gmail.com
In reply to: Mithun Cy (#1)
Re: [POC] A better way to expand hash indexes.

On Fri, Feb 17, 2017 at 7:21 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

To implement such an incremental addition of bucket blocks I have to
increase the size of array hashm_spares in meta page by four times.
Also, mapping methods which map a total number of buckets to a
split-point position of hashm_spares array, need to be changed. These
changes create backward compatibility issues.

How will the highmask and lowmask calculations work in this new strategy?
Till now they have always relied on the doubling strategy, and I don't
see that you have changed anything related to that code. Check the places below.

_hash_metapinit()
{
..
/*
* We initialize the index with N buckets, 0 .. N-1, occupying physical
* blocks 1 to N. The first freespace bitmap page is in block N+1. Since
* N is a power of 2, we can set the masks this way:
*/
metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
metap->hashm_highmask = (num_buckets << 1) - 1;
..
}

_hash_expandtable()
{
..
if (new_bucket > metap->hashm_highmask)
{
/* Starting a new doubling */
metap->hashm_lowmask = metap->hashm_highmask;
metap->hashm_highmask = new_bucket | metap->hashm_lowmask;
}
..
}

Till now, we have worked hard for not changing the page format in a
backward incompatible way, so it will be better if we could find some
way of doing this without changing the meta page format in a backward
incompatible way. Have you considered storing some information in
shared memory based on which we can decide what percentage of
buckets is allocated in the current table half? I think we might be
able to construct this information after crash recovery as well.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#3 Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Amit Kapila (#2)
1 attachment(s)
Re: [POC] A better way to expand hash indexes.

Thanks, Amit

On Mon, Feb 20, 2017 at 9:51 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

How will high and lowmask calculations work in this new strategy?
Till now they always work on doubling strategy and I don't see you
have changed anything related to that code. Check below places.

It is important that the masks remain of the form (2^x) - 1 if we have
to retain the same hash mapping function, so the mask variables will
take the same values as before. The only place I think we need a change
is _hash_metapinit(); unfortunately, I had not tested the case where we
build the hash index on already existing tuples. That is now fixed in
the latest patch.
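
(For context, the hash mapping function I mean is _hash_hashkey2bucket()
in hashutil.c, which works roughly as below; shown only as a reminder,
the patch does not touch it:)

Bucket
_hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
					 uint32 highmask, uint32 lowmask)
{
	Bucket		bucket;

	bucket = hashkey & highmask;	/* keep the low-order bits of the hash */
	if (bucket > maxbucket)
		bucket = bucket & lowmask;	/* bucket not created yet: fold back */

	return bucket;
}

This only maps keys correctly when highmask and lowmask are each one less
than a power of two, so the masks must keep that form regardless of how
the bucket pages are physically allocated.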

Till now, we have worked hard for not changing the page format in a
backward incompatible way, so it will be better if we could find some
way of doing this without changing the meta page format in a backward
incompatible way.

We are not adding or deleting any variables; we only increase the size
of hashm_spares, and hence the mapping functions have to be adjusted.
The problem is that block allocation, and its management, is based on
the fact that all of the buckets (2^x in number) belonging to a
particular split point are allocated at once and together. hashm_spares
is used to record those allocations, and that is used further by the
mapping functions to reach a particular block in the file. If we want to
change the way we allocate the buckets, then hashm_spares will change
and hence the mapping functions. So I do not think we can avoid the
incompatibility issue.

One thing I can think of is to increase the hashm_version of the hash
index; then for old indexes we can continue to use the doubling method
and its mapping, while for new indexes we can use the new way described above.
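
For illustration, the dispatch could look something like the sketch
below (purely hypothetical; the HASH_VERSION_PHASED constant and the
exact call sites are invented here, nothing like this is in the attached
patch):

/* hypothetical sketch only -- the version constant is invented for illustration */
if (metap->hashm_version >= HASH_VERSION_PHASED)
	spare_ndx = _hash_spareindex(new_bucket + 1);	/* new phased allocation mapping */
else
	spare_ndx = _hash_log2(new_bucket + 1);			/* old doubling mapping */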

Have you considered storing some information in shared memory based on
which we can decide what percentage of buckets is allocated in the
current table half? I think we might be able to construct this
information after crash recovery as well.

I think all of the above data has to be persistent. I am not able to
understand what should or can be stored in shared buffers. Can you
please correct me if I am wrong?

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Attachments:

expand_hashbucket_efficiently_02 (application/octet-stream)
commit 7448239668ca4fedd6c1f5cf3d1d434efcc3291f
Author: mithun <mithun@localhost.localdomain>
Date:   Tue Feb 21 12:08:11 2017 +0530

    Feature : expand hash table efficiently.

diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 3334089..4a9a409 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -48,7 +48,7 @@ bitno_to_blkno(HashMetaPage metap, uint32 ovflbitnum)
 	 * Convert to absolute page number by adding the number of bucket pages
 	 * that exist before this split point.
 	 */
-	return (BlockNumber) ((1 << i) + ovflbitnum);
+	return (BlockNumber) (_hash_get_tbuckets(i) + ovflbitnum);
 }
 
 /*
@@ -66,9 +66,9 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
 	/* Determine the split number containing this page */
 	for (i = 1; i <= splitnum; i++)
 	{
-		if (ovflblkno <= (BlockNumber) (1 << i))
+		if (ovflblkno <= (BlockNumber) _hash_get_tbuckets(i))
 			break;				/* oops */
-		bitnum = ovflblkno - (1 << i);
+		bitnum = ovflblkno - _hash_get_tbuckets(i);
 
 		/*
 		 * bitnum has to be greater than number of overflow page added in
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 9485978..1afcbd0 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -315,7 +315,7 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	int32		ffactor;
 	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
+	uint32		spare_index;
 	uint32		i;
 
 	/* safety check */
@@ -350,11 +350,10 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	else if (dnumbuckets >= (double) 0x40000000)
 		num_buckets = 0x40000000;
 	else
-		num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
+		num_buckets = _hash_get_tbuckets(_hash_spareindex(dnumbuckets));
 
-	log2_num_buckets = _hash_log2(num_buckets);
-	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
-	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
+	spare_index = _hash_spareindex(num_buckets);
+	Assert(spare_index < HASH_MAX_SPLITPOINTS);
 
 	/*
 	 * We initialize the metapage, the first N bucket pages, and the first
@@ -400,18 +399,20 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
-	 * blocks 1 to N.  The first freespace bitmap page is in block N+1. Since
-	 * N is a power of 2, we can set the masks this way:
+	 * blocks 1 to N.  The first freespace bitmap page is in block N+1.
 	 */
-	metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
-	metap->hashm_highmask = (num_buckets << 1) - 1;
+	metap->hashm_maxbucket = num_buckets - 1;
+
+	/* set hishmask, which should be sufficient to cover num_buckets. */
+	metap->hashm_highmask = (1 << (_hash_log2(num_buckets + 1))) - 1;
+	metap->hashm_lowmask = (metap->hashm_highmask >> 1);
 
 	MemSet(metap->hashm_spares, 0, sizeof(metap->hashm_spares));
 	MemSet(metap->hashm_mapp, 0, sizeof(metap->hashm_mapp));
 
 	/* Set up mapping for one spare page after the initial splitpoints */
-	metap->hashm_spares[log2_num_buckets] = 1;
-	metap->hashm_ovflpoint = log2_num_buckets;
+	metap->hashm_spares[spare_index] = 1;
+	metap->hashm_ovflpoint = spare_index;
 	metap->hashm_firstfree = 0;
 
 	/*
@@ -619,8 +620,8 @@ restart_expand:
 	{
 		/*
 		 * Copy bucket mapping info now; refer to the comment in code below
-		 * where we copy this information before calling _hash_splitbucket
-		 * to see why this is okay.
+		 * where we copy this information before calling _hash_splitbucket to
+		 * see why this is okay.
 		 */
 		maxbucket = metap->hashm_maxbucket;
 		highmask = metap->hashm_highmask;
@@ -649,21 +650,23 @@ restart_expand:
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to allocate a new batch of bucket pages.
+	 * If the split point is increasing we need to allocate a new batch of
+	 * bucket pages.
 	 */
-	spare_ndx = _hash_log2(new_bucket + 1);
+	spare_ndx = _hash_spareindex(new_bucket + 1);
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
+		uint32		buckets_toadd = 0;
+
 		Assert(spare_ndx == metap->hashm_ovflpoint + 1);
 
 		/*
 		 * The number of buckets in the new splitpoint is equal to the total
-		 * number already in existence, i.e. new_bucket.  Currently this maps
-		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.
+		 * number already in existence, i.e. new_bucket. We add one fourth of
+		 * total buckets to be added in this split point.
 		 */
-		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
+		buckets_toadd = ((1 << ((spare_ndx >> 2) - 1)) >> 2);
+		if (!_hash_alloc_buckets(rel, start_nblkno, buckets_toadd))
 		{
 			/* can't split due to BlockNumber overflow */
 			_hash_relbuf(rel, buf_oblkno);
@@ -847,10 +850,9 @@ _hash_splitbucket(Relation rel,
 
 	/*
 	 * Mark the old bucket to indicate that split is in progress.  (At
-	 * operation end, we will clear the split-in-progress flag.)  Also,
-	 * for a primary bucket page, hasho_prevblkno stores the number of
-	 * buckets that existed as of the last split, so we must update that
-	 * value here.
+	 * operation end, we will clear the split-in-progress flag.)  Also, for a
+	 * primary bucket page, hasho_prevblkno stores the number of buckets that
+	 * existed as of the last split, so we must update that value here.
 	 */
 	oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT;
 	oopaque->hasho_prevblkno = maxbucket;
@@ -1206,11 +1208,11 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
  *	_hash_getcachedmetap() -- Returns cached metapage data.
  *
  *	If metabuf is not InvalidBuffer, caller must hold a pin, but no lock, on
- *  the metapage.  If not set, we'll set it before returning if we have to
- *  refresh the cache, and return with a pin but no lock on it; caller is
- *  responsible for releasing the pin.
+ *	the metapage.  If not set, we'll set it before returning if we have to
+ *	refresh the cache, and return with a pin but no lock on it; caller is
+ *	responsible for releasing the pin.
  *
- *  We refresh the cache if it's not initialized yet or force_refresh is true.
+ *	We refresh the cache if it's not initialized yet or force_refresh is true.
  */
 HashMetaPage
 _hash_getcachedmetap(Relation rel, Buffer *metabuf, bool force_refresh)
@@ -1220,13 +1222,13 @@ _hash_getcachedmetap(Relation rel, Buffer *metabuf, bool force_refresh)
 	Assert(metabuf);
 	if (force_refresh || rel->rd_amcache == NULL)
 	{
-		char   *cache = NULL;
+		char	   *cache = NULL;
 
 		/*
-		 * It's important that we don't set rd_amcache to an invalid
-		 * value.  Either MemoryContextAlloc or _hash_getbuf could fail,
-		 * so don't install a pointer to the newly-allocated storage in the
-		 * actual relcache entry until both have succeeeded.
+		 * It's important that we don't set rd_amcache to an invalid value.
+		 * Either MemoryContextAlloc or _hash_getbuf could fail, so don't
+		 * install a pointer to the newly-allocated storage in the actual
+		 * relcache entry until both have succeeeded.
 		 */
 		if (rel->rd_amcache == NULL)
 			cache = MemoryContextAlloc(rel->rd_indexcxt,
@@ -1261,7 +1263,7 @@ _hash_getcachedmetap(Relation rel, Buffer *metabuf, bool force_refresh)
  *	us an opportunity to use the previously saved metapage contents to reach
  *	the target bucket buffer, instead of reading from the metapage every time.
  *	This saves one buffer access every time we want to reach the target bucket
- *  buffer, which is very helpful savings in bufmgr traffic and contention.
+ *	buffer, which is very helpful savings in bufmgr traffic and contention.
  *
  *	The access type parameter (HASH_READ or HASH_WRITE) indicates whether the
  *	bucket buffer has to be locked for reading or writing.
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index c705531..2c0e975 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -149,6 +149,57 @@ _hash_log2(uint32 num)
 }
 
 /*
+ * _hash_spareindex -- returns spare index of the bucket
+ */
+uint32
+_hash_spareindex(uint32 num_bucket)
+{
+	uint32		i,
+				tbuckets,
+				bucket_pos;
+
+	i = _hash_log2(num_bucket);
+
+	/*
+	 * Get to the buckets main spare index position and then add slot number
+	 * to where this bucket is allocated.
+	 */
+	tbuckets = 1 << (i - 1);
+	bucket_pos = num_bucket - tbuckets;
+	return ((i << 2) + ((tbuckets >> 2) ? ((bucket_pos - 1) / (tbuckets >> 2)) : 0));
+}
+
+/*
+ *	_hash_get_tbuckets -- returns total number of buckets for this split number.
+ */
+uint32
+_hash_get_tbuckets(uint32 split_num)
+{
+	uint32		tbuckets,
+				slot;
+
+	/*
+	 * For the first three groups of split_num we will not have enough buckets
+	 * to distribute among the 4 slots. So just return their total buckets
+	 * irrespective of the slot they occupy.
+	 */
+	if ((split_num >> 2) == 0)
+		return 1;
+	if ((split_num >> 2) == 1)
+		return 2;
+	if ((split_num >> 2) == 2)
+		return 4;
+	/*
+	 * total_buckets = total number of buckets in previous split point group +
+	 * number of buckets for our slot in the current split point group.
+	 */
+	tbuckets = (1 << ((split_num >> 2) - 1));
+	slot = ((3 & split_num) + 1);
+	tbuckets = tbuckets + (slot * (tbuckets >> 2));
+	return tbuckets;
+}
+
+/*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
  * If flags is not zero, it is a bitwise OR of the acceptable values of
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 3bf587b..3407659 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -36,7 +36,7 @@ typedef uint32 Bucket;
 #define InvalidBucket	((Bucket) 0xFFFFFFFF)
 
 #define BUCKET_TO_BLKNO(metap,B) \
-		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
+		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_spareindex((B)+1)-1] : 0)) + 1)
 
 /*
  * Special space for hash index pages.
@@ -168,7 +168,7 @@ typedef HashScanOpaqueData *HashScanOpaque;
  * needing to fit into the metapage.  (With 8K block size, 128 bitmaps
  * limit us to 64 GB of overflow space...)
  */
-#define HASH_MAX_SPLITPOINTS		32
+#define HASH_MAX_SPLITPOINTS		128
 #define HASH_MAX_BITMAPS			128
 
 typedef struct HashMetaPageData
@@ -364,6 +364,8 @@ extern uint32 _hash_datum2hashkey_type(Relation rel, Datum key, Oid keytype);
 extern Bucket _hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
 					 uint32 highmask, uint32 lowmask);
 extern uint32 _hash_log2(uint32 num);
+extern uint32 _hash_spareindex(uint32 num);
+extern uint32 _hash_get_tbuckets(uint32 num);
 extern void _hash_checkpage(Relation rel, Buffer buf, int flags);
 extern uint32 _hash_get_indextuple_hashkey(IndexTuple itup);
 extern bool _hash_convert_tuple(Relation index,
#4 David Steele
david@pgmasters.net
In reply to: Mithun Cy (#3)
Re: [POC] A better way to expand hash indexes.

On 2/21/17 4:58 AM, Mithun Cy wrote:

Thanks, Amit

On Mon, Feb 20, 2017 at 9:51 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

How will high and lowmask calculations work in this new strategy?
Till now they always work on doubling strategy and I don't see you
have changed anything related to that code. Check below places.

It is important that the mask has to be (2^x) -1, if we have to retain
the same hash map function. So mask variables will take same values as
before. Only place I think we need change is _hash_metapinit();
unfortunately, I did not test for the case where we build the hash
index on already existing tuples. Now I have fixed in the latest
patch.

Till now, we have worked hard for not changing the page format in a
backward incompatible way, so it will be better if we could find some
way of doing this without changing the meta page format in a backward
incompatible way.

We are not adding any new variable/ deleting some, we increase the
size of hashm_spares and hence mapping functions should be adjusted.
The problem is the block allocation, and its management is based on
the fact that all of the buckets(will be 2^x in number) belonging to a
particular split-point is allocated at once and together. The
hashm_spares is used to record those allocations and that will be used
further by map functions to reach a particular block in the file. If
we want to change the way we allocate the buckets then hashm_spares
will change and hence mapping function. So I do not think we can avoid
incompatibility issue.

One thing I can think of is if we can increase the hashm_version of
hash index; then for old indexes, we can continue to use doubling
method and its mapping. For new indexes, we can use new way as above.

Have you considered storing some information in shared memory based on
which we can decide what percentage of buckets is allocated in the
current table half? I think we might be able to construct this
information after crash recovery as well.

I think all of above data has to be persistent. I am not able to
understand what should be/can be stored in shared buffers. Can you
please correct me if I am wrong?

This patch does not apply at cccbdde:

$ patch -p1 < ../other/expand_hashbucket_efficiently_02
patching file src/backend/access/hash/hashovfl.c
Hunk #1 succeeded at 49 (offset 1 line).
Hunk #2 succeeded at 67 (offset 1 line).
patching file src/backend/access/hash/hashpage.c
Hunk #1 succeeded at 502 with fuzz 1 (offset 187 lines).
Hunk #2 succeeded at 518 with fuzz 2 (offset 168 lines).
Hunk #3 succeeded at 562 (offset 163 lines).
Hunk #4 succeeded at 744 (offset 124 lines).
Hunk #5 FAILED at 774.
Hunk #6 succeeded at 869 (offset 19 lines).
Hunk #7 succeeded at 1450 (offset 242 lines).
Hunk #8 succeeded at 1464 (offset 242 lines).
Hunk #9 succeeded at 1505 (offset 242 lines).
1 out of 9 hunks FAILED -- saving rejects to file
src/backend/access/hash/hashpage.c.rej
patching file src/backend/access/hash/hashutil.c
Hunk #1 succeeded at 150 (offset 1 line).
patching file src/include/access/hash.h
Hunk #2 succeeded at 180 (offset 12 lines).
Hunk #3 succeeded at 382 (offset 18 lines).

It does apply with fuzz on 2b32ac2, so it looks like c11453c and
subsequent commits are the cause. They represent a fairly substantial
change to hash indexes by introducing WAL logging so I think you should
reevaluate your patches to be sure they still function as expected.

Marked "Waiting on Author".

--
-David
david@pgmasters.net


#5 Mithun Cy
mithun.cy@enterprisedb.com
In reply to: David Steele (#4)
1 attachment(s)
Re: [POC] A better way to expand hash indexes.

On Thu, Mar 16, 2017 at 10:55 PM, David Steele <david@pgmasters.net> wrote:

It does apply with fuzz on 2b32ac2, so it looks like c11453c and
subsequent commits are the cause. They represent a fairly substantial
change to hash indexes by introducing WAL logging so I think you should
reevaluate your patches to be sure they still function as expected.

Thanks, David. Here is the new, improved patch. I have also corrected
pageinspect's test output and added notes in the README regarding the
new way of adding bucket pages efficiently in a hash index. I also did
some more tests, pgbench read-only and read-write: there is no
performance impact due to the patch, and the growth of the index size
has become much more efficient, as in the numbers posted in the initial
proposal mail.

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Attachments:

expand_hashbucket_efficiently_03.patch (application/octet-stream)
commit dc358886a1cbdad3ce5f1b2bebdb79fcb88a9286
Author: mithun <mithun@localhost.localdomain>
Date:   Sat Mar 18 22:12:45 2017 +0530

    Expand the bucket efficiently
    -----------------------------
    Mithun C Y

diff --git a/contrib/pageinspect/expected/hash.out b/contrib/pageinspect/expected/hash.out
index 3ba01f6..c97b279 100644
--- a/contrib/pageinspect/expected/hash.out
+++ b/contrib/pageinspect/expected/hash.out
@@ -51,13 +51,13 @@ bsize     | 8152
 bmsize    | 4096
 bmshift   | 15
 maxbucket | 3
-highmask  | 7
-lowmask   | 3
-ovflpoint | 2
+highmask  | 3
+lowmask   | 1
+ovflpoint | 3
 firstfree | 0
 nmaps     | 1
 procid    | 450
-spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 
 SELECT magic, version, ntuples, bsize, bmsize, bmshift, maxbucket, highmask,
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..8789805 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -58,35 +58,50 @@ rules to support a variable number of overflow pages while not having to
 move primary bucket pages around after they are created.
 
 Primary bucket pages (henceforth just "bucket pages") are allocated in
-power-of-2 groups, called "split points" in the code.  Buckets 0 and 1
-are created when the index is initialized.  At the first split, buckets 2
-and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
-when bucket 8 is needed, buckets 8-15 are allocated; etc.  All the bucket
-pages of a power-of-2 group appear consecutively in the index.  This
-addressing scheme allows the physical location of a bucket page to be
-computed from the bucket number relatively easily, using only a small
-amount of control information.  We take the log2() of the bucket number
-to determine which split point S the bucket belongs to, and then simply
-add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
-metapage) to compute the physical address.  hashm_spares[S] can be
-interpreted as the total number of overflow pages that have been allocated
-before the bucket pages of splitpoint S.  hashm_spares[0] is always 0,
-so that buckets 0 and 1 (which belong to splitpoint 0) always appear at
-block numbers 1 and 2, just after the meta page.  We always have
+power-of-2 groups, called "split points" in the code.  That means on every new
+split points we double the existing number of buckets.  And, it seems bad to
+allocate huge chunks of bucket pages all at once and we take ages to consume it.
+To avoid this exponential growth of index size, we did a trick to breakup
+allocation of buckets at splitpoint into 4 equal phases.  If 2^x is the total
+buckets need to be allocated at a splitpoint (from now on we shall call this
+as splitpoint group), then we allocate 1/4th (2^(x - 2)) of total buckets at
+each phase of splitpoint group. Next quarter of allocation will only happen if
+buckets of previous phase has been already consumed.  Since for buckets number
+< 4 we cannot further divide it in to multiple phases, the first splitpoint
+group 0's allocation is done as follows {1, 1, 1, 1} = 4 buckets in total, the
+numbers in curly braces indicate number of buckets allocated within each phase
+of splitpoint group 0.  In next splitpoint group 1 the allocation phases will
+be as follow {1, 1, 1, 1} = 8 buckets in total.  And, for splitpoint group 2
+and 3 allocation phase will be {2, 2, 2, 2} = 16 buckets in total and
+{4, 4, 4, 4} = 32 buckets in total.  We can see that at each splitpoint group
+we double the total number of buckets from previous group but in a incremental
+phase.  The bucket pages allocated within one phase of a splitpoint group will
+appear consecutively in the index.  This addressing scheme allows the physical
+location of a bucket page to be computed from the bucket number relatively
+easily, using only a small amount of control information.  If we look at the
+function _hash_spareindex for a given bucket number we first compute splitpoint
+group it belongs to and then the phase with in it to which the bucket belongs
+to.  Adding them we get the global splitpoint phase number S to which the
+bucket belongs and then simply add "hashm_spares[S] + 1" (where hashm_spares[] is
+an array stored in the metapage) with given bucket number to compute its
+physical address.  The hashm_spares[S] can be interpreted as the total number
+of overflow pages that have been allocated before the bucket pages of
+splitpoint phase S.  The hashm_spares[0] is always 0, so that buckets 0 and 1
+(which belong to splitpoint group 0's phase 1 and phase 2 respectively) always
+appear at block numbers 1 and 2, just after the meta page.  We always have
 hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
-former.  The difference between the two represents the number of overflow
-pages appearing between the bucket page groups of splitpoints N and N+1.
-
+former. The difference between the two represents the number of overflow
+pages appearing between the bucket page groups of splitpoints phase N and N+1.
 (Note: the above describes what happens when filling an initially minimally
 sized hash index.  In practice, we try to estimate the required index size
-and allocate a suitable number of splitpoints immediately, to avoid
+and allocate a suitable number of splitpoints phases immediately, to avoid
 expensive re-splitting during initial index build.)
 
 When S splitpoints exist altogether, the array entries hashm_spares[0]
 through hashm_spares[S] are valid; hashm_spares[S] records the current
 total number of overflow pages.  New overflow pages are created as needed
 at the end of the index, and recorded by incrementing hashm_spares[S].
-When it is time to create a new splitpoint's worth of bucket pages, we
+When it is time to create a new splitpoint phase's worth of bucket pages, we
 copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
 stored in the hashm_ovflpoint field of the meta page).  This has the
 effect of reserving the correct number of bucket pages at the end of the
@@ -101,7 +116,7 @@ We have to allow the case "greater than" because it's possible that during
 an index extension we crash after allocating filesystem space and before
 updating the metapage.  Note that on filesystems that allow "holes" in
 files, it's entirely likely that pages before the logical EOF are not yet
-allocated: when we allocate a new splitpoint's worth of bucket pages, we
+allocated: when we allocate a new splitpoint phase's worth of bucket pages, we
 physically zero the last such page to force the EOF up, and the first such
 page will be used immediately, but the intervening pages are not written
 until needed.
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index a3cae21..d14516f 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -49,7 +49,7 @@ bitno_to_blkno(HashMetaPage metap, uint32 ovflbitnum)
 	 * Convert to absolute page number by adding the number of bucket pages
 	 * that exist before this split point.
 	 */
-	return (BlockNumber) ((1 << i) + ovflbitnum);
+	return (BlockNumber) (_hash_get_tbuckets(i) + ovflbitnum);
 }
 
 /*
@@ -67,9 +67,9 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
 	/* Determine the split number containing this page */
 	for (i = 1; i <= splitnum; i++)
 	{
-		if (ovflblkno <= (BlockNumber) (1 << i))
+		if (ovflblkno <= (BlockNumber) _hash_get_tbuckets(i))
 			break;				/* oops */
-		bitnum = ovflblkno - (1 << i);
+		bitnum = ovflblkno - _hash_get_tbuckets(i);
 
 		/*
 		 * bitnum has to be greater than number of overflow page added in
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 622cc4b..6dec432 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -502,14 +502,15 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	Page		page;
 	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
+	uint32		spare_index;
 	uint32		i;
 
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
-	 * given the estimated number of tuples.  We round up the result to the
-	 * next power of 2, however, and always force at least 2 bucket pages. The
-	 * upper limit is determined by considerations explained in
+	 * given the estimated number of tuples.  We round up the result to total
+	 * the number of buckets which has to be allocated before using its
+	 * _hashm_spares index slot, however, and always force at least 2 bucket
+	 * pages. The upper limit is determined by considerations explained in
 	 * _hash_expandtable().
 	 */
 	dnumbuckets = num_tuples / ffactor;
@@ -518,11 +519,10 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	else if (dnumbuckets >= (double) 0x40000000)
 		num_buckets = 0x40000000;
 	else
-		num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
+		num_buckets = _hash_get_tbuckets(_hash_spareindex(dnumbuckets));
 
-	log2_num_buckets = _hash_log2(num_buckets);
-	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
-	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
+	spare_index = _hash_spareindex(num_buckets);
+	Assert(spare_index < HASH_MAX_SPLITPOINTS);
 
 	page = BufferGetPage(buf);
 	if (initpage)
@@ -563,18 +563,20 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
-	 * blocks 1 to N.  The first freespace bitmap page is in block N+1. Since
-	 * N is a power of 2, we can set the masks this way:
+	 * blocks 1 to N.  The first freespace bitmap page is in block N+1.
 	 */
-	metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
-	metap->hashm_highmask = (num_buckets << 1) - 1;
+	metap->hashm_maxbucket = num_buckets - 1;
+
+	/* set hishmask, which should be sufficient to cover num_buckets. */
+	metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;
+	metap->hashm_lowmask = (metap->hashm_highmask >> 1);
 
 	MemSet(metap->hashm_spares, 0, sizeof(metap->hashm_spares));
 	MemSet(metap->hashm_mapp, 0, sizeof(metap->hashm_mapp));
 
 	/* Set up mapping for one spare page after the initial splitpoints */
-	metap->hashm_spares[log2_num_buckets] = 1;
-	metap->hashm_ovflpoint = log2_num_buckets;
+	metap->hashm_spares[spare_index] = 1;
+	metap->hashm_ovflpoint = spare_index;
 	metap->hashm_firstfree = 0;
 
 	/*
@@ -773,25 +775,40 @@ restart_expand:
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to allocate a new batch of bucket pages.
+	 * If the split point is increasing we need to allocate a new batch of
+	 * bucket pages.
 	 */
-	spare_ndx = _hash_log2(new_bucket + 1);
+	spare_ndx = _hash_spareindex(new_bucket + 1);
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
+		uint32		buckets_toadd = 0;
+		uint32		splitpoint_group = 0;
+
 		Assert(spare_ndx == metap->hashm_ovflpoint + 1);
 
 		/*
-		 * The number of buckets in the new splitpoint is equal to the total
-		 * number already in existence, i.e. new_bucket.  Currently this maps
-		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.  We treat allocation of buckets as a
-		 * separate WAL-logged action.  Even if we fail after this operation,
+		 * The number of buckets in the new splitpoint group is equal to the
+		 * total number already in existence, i.e. new_bucket. But we do not
+		 * allocate them at once. Each splitpoint group will have 4 slots, we
+		 * distribute the buckets equally among them. So we allocate only one
+		 * forth of total buckets in new splitpoint group at time to consume
+		 * one phase after another. We treat allocation of buckets as a
+		 * separate WAL-logged action. Even if we fail after this operation,
 		 * won't leak bucket pages; rather, the next split will consume this
 		 * space. In any case, even without failure we don't use all the space
 		 * in one split operation.
 		 */
-		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
+		splitpoint_group = (spare_ndx >> 2);
+
+		/*
+		 * Each phase in the splitpoint_group will allocate
+		 * 2^(splitpoint_group - 1) buckets if we divide buckets among 4
+		 * slots. The 0th group is a special case where we allocate 1 bucket
+		 * per slot as we cannot reduce it any further. See README for more
+		 * details.
+		 */
+		buckets_toadd = (splitpoint_group) ? (1 << (splitpoint_group - 1)) : 1;
+		if (!_hash_alloc_buckets(rel, start_nblkno, buckets_toadd))
 		{
 			/* can't split due to BlockNumber overflow */
 			_hash_relbuf(rel, buf_oblkno);
@@ -836,10 +853,9 @@ restart_expand:
 	}
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to adjust the hashm_spares[] array and
-	 * hashm_ovflpoint so that future overflow pages will be created beyond
-	 * this new batch of bucket pages.
+	 * If the split point is increasing we need to adjust the hashm_spares[]
+	 * array and hashm_ovflpoint so that future overflow pages will be created
+	 * beyond this new batch of bucket pages.
 	 */
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index 0e0f393..8aa8769 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -56,9 +56,8 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 	 * num_buckets buckets in the index, the appropriate mask can be computed
 	 * as follows.
 	 *
-	 * Note: at present, the passed-in num_buckets is always a power of 2, so
-	 * we could just compute num_buckets - 1.  We prefer not to assume that
-	 * here, though.
+	 * NOTE : This hash_mask calculation should be in sync with similar
+	 * calculation in _hash_init_metabuffer.
 	 */
 	hspool->hash_mask = (((uint32) 1) << _hash_log2(num_buckets)) - 1;
 
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 2e99719..e818e16 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -150,6 +150,81 @@ _hash_log2(uint32 num)
 }
 
 /*
+ * _hash_spareindex -- returns spare index / global splitpoint phase of the
+ *					   bucket
+ */
+uint32
+_hash_spareindex(uint32 num_bucket)
+{
+	uint32		splitpoint_group,
+				tbuckets,
+				phases_beyond_bucket;
+
+	/*
+	 * The first 4 bucket belongs to corresponding first 4 splitpoint phases.
+	 */
+	if (num_bucket <= 4)
+		return (num_bucket - 1);	/* converted to base 0. */
+	splitpoint_group = _hash_log2(num_bucket) - 2;		/* The are 4 buckets in
+														 * splitpoint group 0
+														 * itself so subtracting
+														 * - 2 to get right
+														 * splitspoint group of
+														 * the bucket */
+	/*
+	 * bucket's global splitpoint phase = total number of split point phases
+	 * until its splitpoint group - splitpoint phase within this splitpoint
+	 * group but after buckets own splitpoint phase.
+	 */
+	tbuckets = (1 << (splitpoint_group + 2));
+	phases_beyond_bucket =
+		(tbuckets - num_bucket) / (1 << (splitpoint_group - 1));
+	return (((splitpoint_group + 1) << 2) - phases_beyond_bucket) - 1;
+}
+
+/*
+ *	_hash_get_tbuckets -- returns total number of buckets allocated till the
+ *						  given splitpoint phase.
+ */
+uint32
+_hash_get_tbuckets(uint32 splitpoint_phase)
+{
+	uint32		splitpoint_group,
+				tbuckets,
+				phases_beyond_bucket;
+
+	/*
+	 * First 4 splitpoint phases allocate 1 bucket each.
+	 */
+	if (splitpoint_phase < 4)
+		return (splitpoint_phase + 1);
+
+	/*
+	 * total_buckets = total number of buckets upto the corresponding
+	 * splitpoint group - buckets of splitpoint phases of this group which are
+	 * beyond the given splitpoint_phase
+	 */
+	splitpoint_group = (splitpoint_phase >> 2); /* Every 4 consecutive phases
+												 * makes one group and group's
+												 * are numbered from 0. */
+	tbuckets = (1 << (splitpoint_group + 2));	/* Total buckets allocated
+												 * upto splitpoint_group is
+												 * 2^(splitpoint_group + 2).
+												 * See README to check the
+												 * pattern. */
+	phases_beyond_bucket =
+		((splitpoint_group + 1) << 2) - (splitpoint_phase + 1);
+
+	/*
+	 * each splitpoint phase in a group will allocate 1 << (splitpoint_group -
+	 * 1) number of buckets, see pattern in README.
+	 */
+	tbuckets =
+		tbuckets - (phases_beyond_bucket * (1 << (splitpoint_group - 1)));
+	return tbuckets;
+}
+
+/*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
  * If flags is not zero, it is a bitwise OR of the acceptable values of
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index eb1df57..c9665c4 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -36,7 +36,7 @@ typedef uint32 Bucket;
 #define InvalidBucket	((Bucket) 0xFFFFFFFF)
 
 #define BUCKET_TO_BLKNO(metap,B) \
-		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
+		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_spareindex((B)+1)-1] : 0)) + 1)
 
 /*
  * Special space for hash index pages.
@@ -180,7 +180,7 @@ typedef HashScanOpaqueData *HashScanOpaque;
  * needing to fit into the metapage.  (With 8K block size, 128 bitmaps
  * limit us to 64 GB of overflow space...)
  */
-#define HASH_MAX_SPLITPOINTS		32
+#define HASH_MAX_SPLITPOINTS		128
 #define HASH_MAX_BITMAPS			128
 
 typedef struct HashMetaPageData
@@ -382,6 +382,8 @@ extern uint32 _hash_datum2hashkey_type(Relation rel, Datum key, Oid keytype);
 extern Bucket _hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
 					 uint32 highmask, uint32 lowmask);
 extern uint32 _hash_log2(uint32 num);
+extern uint32 _hash_spareindex(uint32 num_bucket);
+extern uint32 _hash_get_tbuckets(uint32 splitpoint_phase);
 extern void _hash_checkpage(Relation rel, Buffer buf, int flags);
 extern uint32 _hash_get_indextuple_hashkey(IndexTuple itup);
 extern bool _hash_convert_tuple(Relation index,
#6 Amit Kapila
amit.kapila16@gmail.com
In reply to: Mithun Cy (#5)
Re: [POC] A better way to expand hash indexes.

On Sat, Mar 18, 2017 at 10:59 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

On Thu, Mar 16, 2017 at 10:55 PM, David Steele <david@pgmasters.net> wrote:

It does apply with fuzz on 2b32ac2, so it looks like c11453c and
subsequent commits are the cause. They represent a fairly substantial
change to hash indexes by introducing WAL logging so I think you should
reevaluate your patches to be sure they still function as expected.

Thanks, David here is the new improved patch I have also corrected the
pageinspect's test output. Also, added notes in README regarding the
new way of adding bucket pages efficiently in hash index. I also did
some more tests pgbench read only and read write;

To make this work, I think the calculations you have introduced are
not so easy to understand. For example, it took me quite some time to
understand how the below function works to compute the location in
hash spares.

+uint32
+_hash_spareindex(uint32 num_bucket)
+{
..
+ /*
+ * The first 4 bucket belongs to corresponding first 4 splitpoint phases.
+ */
+ if (num_bucket <= 4)
+ return (num_bucket - 1); /* converted to base 0. */
+ splitpoint_group = _hash_log2(num_bucket) - 2; /* The are 4 buckets in
..
+ /*
+ * bucket's global splitpoint phase = total number of split point phases
+ * until its splitpoint group - splitpoint phase within this splitpoint
+ * group but after buckets own splitpoint phase.
+ */
+ tbuckets = (1 << (splitpoint_group + 2));
+ phases_beyond_bucket =
+ (tbuckets - num_bucket) / (1 << (splitpoint_group - 1));
+ return (((splitpoint_group + 1) << 2) - phases_beyond_bucket) - 1;
+}

I am not sure if it is just a matter of better comments to explain it
in a simpler way, or whether we can try to find some simpler mechanism
to group the split into four (or more) equal parts. It could be helpful
if someone else can read it and share their opinion. Another idea could
be to make hashm_spares a two-dimensional array, hashm_spares[32][4],
where the first dimension will indicate the split point and the second
will indicate the sub-split number. I am not sure whether it will be
simpler or more complex than the method used in the proposed patch, but
I think we should think a bit more to see if we can come up with some
simple technique to solve this problem.
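
Just to make the idea concrete, the shape I have in mind is roughly the
following (an untested sketch; the phase macro name is invented):

/* untested sketch of the suggested layout */
#define HASH_SPLITPOINT_PHASES	4

uint32		hashm_spares[32][HASH_SPLITPOINT_PHASES];
						/* [split point][phase within that split point] */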

+ * allocate them at once. Each splitpoint group will have 4 slots, we
+ * distribute the buckets equally among them. So we allocate only one
+ * forth of total buckets in new splitpoint group at time to consume
+ * one phase after another.

spelling.
/forth/fourth
/at time/at a time

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#7 Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Amit Kapila (#6)
Re: [POC] A better way to expand hash indexes.

Hi Amit, Thanks for the review,

On Mon, Mar 20, 2017 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

idea could be to make hashm_spares a two-dimensional array
hashm_spares[32][4] where the first dimension will indicate the split
point and second will indicate the sub-split number. I am not sure
whether it will be simpler or complex than the method used in the
proposed patch, but I think we should think a bit more to see if we
can come up with some simple technique to solve this problem.

I think making it a 2-dimensional array will not be very useful; in
fact, we already treat the given array's elements as 2-dimensional.
Your main concern, I think, is the calculation steps to find the phase
of the splitpoint group that the bucket belongs to.
+ tbuckets = (1 << (splitpoint_group + 2));
+ phases_beyond_bucket =
+ (tbuckets - num_bucket) / (1 << (splitpoint_group - 1));
+ return (((splitpoint_group + 1) << 2) - phases_beyond_bucket) - 1;

Thinking about this further: we allocate 2^x buckets in each phase of
any splitpoint group, and we have 4 such phases in each group, so each
splitpoint group as a whole again allocates 2^y buckets. So bucket
numbers have a pattern in their bit representation from which we can
find out which phase of allocation they belong to.

As below
===========
Group 0 -- bits 0, 1 define which phase of the group each bucket belongs to.
0 -- 00000000
1 -- 00000001
2 -- 00000010
3 -- 00000011
===========
Group 1 -- bits 0, 1 define which phase of the group each bucket belongs to.
4 -- 00000100
5 -- 00000101
6 -- 00000110
7 -- 00000111
===========
Group 2 -- bits 1, 2 define which phase of the group each bucket belongs to.
8 -- 00001000
9 -- 00001001
10 -- 00001010
11 -- 00001011
12 -- 00001100
13 -- 00001101
14 -- 00001110
15 -- 00001111
===========
Group 3 -- bits 2, 3 define which phase of the group each bucket belongs to.
16 -- 00010000
17 -- 00010001
18 -- 00010010
19 -- 00010011
20 -- 00010100
21 -- 00010101
22 -- 00010110
23 -- 00010111
24 -- 00011000
25 -- 00011001
26 -- 00011010
27 -- 00011011
28 -- 00011100
29 -- 00011101
30 -- 00011110
31 -- 00011111
============

So we can say that, given a bucket x of group n > 0, bits (n-1, n)
define which phase of the group it belongs to. I see an opportunity
here to completely simplify the above calculation into a simple bitwise
AND of the bucket number with the (n-1, n) bitmask to get the phase of
allocation of bucket x.

The formula can be: (x >> (splitpoint_group - 1)) & 0x3 = phase of bucket
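
A quick throwaway program to check the formula against the table above
(just a standalone sanity test, not part of the patch; floor_log2 here
is a local helper, and the group numbering matches the table for
buckets >= 4):

#include <stdio.h>

/* floor(log2(n)) for n > 0; local helper for this demo only */
static unsigned int
floor_log2(unsigned int n)
{
	unsigned int r = 0;

	while (n >>= 1)
		r++;
	return r;
}

int
main(void)
{
	unsigned int bucket;

	for (bucket = 4; bucket < 32; bucket++)
	{
		unsigned int group = floor_log2(bucket) - 1;	/* groups 1, 2, 3 as in the table */
		unsigned int phase = (bucket >> (group - 1)) & 0x3;

		printf("bucket %2u: group %u, phase %u\n", bucket, group, phase);
	}
	return 0;
}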

Does this satisfy your concern?

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


#8 Amit Kapila
amit.kapila16@gmail.com
In reply to: Mithun Cy (#7)
Re: [POC] A better way to expand hash indexes.

On Mon, Mar 20, 2017 at 8:58 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

Hi Amit, Thanks for the review,

On Mon, Mar 20, 2017 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

idea could be to make hashm_spares a two-dimensional array
hashm_spares[32][4] where the first dimension will indicate the split
point and second will indicate the sub-split number. I am not sure
whether it will be simpler or complex than the method used in the
proposed patch, but I think we should think a bit more to see if we
can come up with some simple technique to solve this problem.

I think making it a 2-dimensional array will not be any useful in fact
we really treat the given array 2-dimensional elements now.

Sure, I was telling you based on that. If you are implicitly treating
it as 2-dimensional array, it might be easier to compute the array
offsets.

The main concern of yours I think is the calculation steps to find the
phase of the splitpoint group the bucket belongs to.
+ tbuckets = (1 << (splitpoint_group + 2));
+ phases_beyond_bucket =
+ (tbuckets - num_bucket) / (1 << (splitpoint_group - 1));
+ return (((splitpoint_group + 1) << 2) - phases_beyond_bucket) - 1;

It is not only about the above calculation, but also about what the
patch is doing in the function _hash_get_tbuckets(). By the way, the
function name also seems unclear (mainly *tbuckets* in the name).

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#9 Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#8)
Re: [POC] A better way to expand hash indexes.

On Tue, Mar 21, 2017 at 8:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Mar 20, 2017 at 8:58 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

Hi Amit, Thanks for the review,

On Mon, Mar 20, 2017 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

idea could be to make hashm_spares a two-dimensional array
hashm_spares[32][4] where the first dimension will indicate the split
point and second will indicate the sub-split number. I am not sure
whether it will be simpler or complex than the method used in the
proposed patch, but I think we should think a bit more to see if we
can come up with some simple technique to solve this problem.

I think making it a 2-dimensional array will not be any useful in fact
we really treat the given array 2-dimensional elements now.

Sure, I was telling you based on that. If you are implicitly treating
it as 2-dimensional array, it might be easier to compute the array
offsets.

The above sentence looks incomplete.
If you are implicitly treating it as a 2-dimensional array, it might
be easier to compute the array offsets if you explicitly also treat it
as a 2-dimensional array.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#10Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Amit Kapila (#8)
1 attachment(s)
Re: [POC] A better way to expand hash indexes.

Hi Amit, please find the new patch.

On Tue, Mar 21, 2017 at 8:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

It is not only about the above calculation, but also about what the patch
is doing in the function _hash_get_tbuckets(). By the way, the function
name also seems unclear (mainly *tbuckets* in the name).

Fixed. I have introduced some macros for readability and added more
comments to explain why some calculations are done the way they are.
Please let me know if you think more improvements are needed.

spelling.
/forth/fourth
/at time/at a time

Thanks, fixed.

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Attachments:

expand_hashbucket_efficiently_04.patchapplication/octet-stream; name=expand_hashbucket_efficiently_04.patchDownload
commit 6dcc4fee1668c1df88ad63e2b1857694a4e2ee9c
Author: mithun <mithun@localhost.localdomain>
Date:   Fri Mar 24 01:15:45 2017 +0530

    Expand the bucket efficiently
    -----------------------------
    Mithun C Y

diff --git a/contrib/pageinspect/expected/hash.out b/contrib/pageinspect/expected/hash.out
index 3ba01f6..c97b279 100644
--- a/contrib/pageinspect/expected/hash.out
+++ b/contrib/pageinspect/expected/hash.out
@@ -51,13 +51,13 @@ bsize     | 8152
 bmsize    | 4096
 bmshift   | 15
 maxbucket | 3
-highmask  | 7
-lowmask   | 3
-ovflpoint | 2
+highmask  | 3
+lowmask   | 1
+ovflpoint | 3
 firstfree | 0
 nmaps     | 1
 procid    | 450
-spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 
 SELECT magic, version, ntuples, bsize, bmsize, bmshift, maxbucket, highmask,
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..8789805 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -58,35 +58,50 @@ rules to support a variable number of overflow pages while not having to
 move primary bucket pages around after they are created.
 
 Primary bucket pages (henceforth just "bucket pages") are allocated in
-power-of-2 groups, called "split points" in the code.  Buckets 0 and 1
-are created when the index is initialized.  At the first split, buckets 2
-and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
-when bucket 8 is needed, buckets 8-15 are allocated; etc.  All the bucket
-pages of a power-of-2 group appear consecutively in the index.  This
-addressing scheme allows the physical location of a bucket page to be
-computed from the bucket number relatively easily, using only a small
-amount of control information.  We take the log2() of the bucket number
-to determine which split point S the bucket belongs to, and then simply
-add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
-metapage) to compute the physical address.  hashm_spares[S] can be
-interpreted as the total number of overflow pages that have been allocated
-before the bucket pages of splitpoint S.  hashm_spares[0] is always 0,
-so that buckets 0 and 1 (which belong to splitpoint 0) always appear at
-block numbers 1 and 2, just after the meta page.  We always have
+power-of-2 groups, called "split points" in the code.  That means on every new
+split points we double the existing number of buckets.  And, it seems bad to
+allocate huge chunks of bucket pages all at once and we take ages to consume it.
+To avoid this exponential growth of index size, we did a trick to breakup
+allocation of buckets at splitpoint into 4 equal phases.  If 2^x is the total
+buckets need to be allocated at a splitpoint (from now on we shall call this
+as splitpoint group), then we allocate 1/4th (2^(x - 2)) of total buckets at
+each phase of splitpoint group. Next quarter of allocation will only happen if
+buckets of previous phase has been already consumed.  Since for buckets number
+< 4 we cannot further divide it in to multiple phases, the first splitpoint
+group 0's allocation is done as follows {1, 1, 1, 1} = 4 buckets in total, the
+numbers in curly braces indicate number of buckets allocated within each phase
+of splitpoint group 0.  In next splitpoint group 1 the allocation phases will
+be as follow {1, 1, 1, 1} = 8 buckets in total.  And, for splitpoint group 2
+and 3 allocation phase will be {2, 2, 2, 2} = 16 buckets in total and
+{4, 4, 4, 4} = 32 buckets in total.  We can see that at each splitpoint group
+we double the total number of buckets from previous group but in a incremental
+phase.  The bucket pages allocated within one phase of a splitpoint group will
+appear consecutively in the index.  This addressing scheme allows the physical
+location of a bucket page to be computed from the bucket number relatively
+easily, using only a small amount of control information.  If we look at the
+function _hash_spareindex for a given bucket number we first compute splitpoint
+group it belongs to and then the phase with in it to which the bucket belongs
+to.  Adding them we get the global splitpoint phase number S to which the
+bucket belongs and then simply add "hashm_spares[S] + 1" (where hashm_spares[] is
+an array stored in the metapage) with given bucket number to compute its
+physical address.  The hashm_spares[S] can be interpreted as the total number
+of overflow pages that have been allocated before the bucket pages of
+splitpoint phase S.  The hashm_spares[0] is always 0, so that buckets 0 and 1
+(which belong to splitpoint group 0's phase 1 and phase 2 respectively) always
+appear at block numbers 1 and 2, just after the meta page.  We always have
 hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
-former.  The difference between the two represents the number of overflow
-pages appearing between the bucket page groups of splitpoints N and N+1.
-
+former. The difference between the two represents the number of overflow
+pages appearing between the bucket page groups of splitpoints phase N and N+1.
 (Note: the above describes what happens when filling an initially minimally
 sized hash index.  In practice, we try to estimate the required index size
-and allocate a suitable number of splitpoints immediately, to avoid
+and allocate a suitable number of splitpoints phases immediately, to avoid
 expensive re-splitting during initial index build.)
 
 When S splitpoints exist altogether, the array entries hashm_spares[0]
 through hashm_spares[S] are valid; hashm_spares[S] records the current
 total number of overflow pages.  New overflow pages are created as needed
 at the end of the index, and recorded by incrementing hashm_spares[S].
-When it is time to create a new splitpoint's worth of bucket pages, we
+When it is time to create a new splitpoint phase's worth of bucket pages, we
 copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
 stored in the hashm_ovflpoint field of the meta page).  This has the
 effect of reserving the correct number of bucket pages at the end of the
@@ -101,7 +116,7 @@ We have to allow the case "greater than" because it's possible that during
 an index extension we crash after allocating filesystem space and before
 updating the metapage.  Note that on filesystems that allow "holes" in
 files, it's entirely likely that pages before the logical EOF are not yet
-allocated: when we allocate a new splitpoint's worth of bucket pages, we
+allocated: when we allocate a new splitpoint phase's worth of bucket pages, we
 physically zero the last such page to force the EOF up, and the first such
 page will be used immediately, but the intervening pages are not written
 until needed.
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index a3cae21..fb8ea49 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -49,7 +49,7 @@ bitno_to_blkno(HashMetaPage metap, uint32 ovflbitnum)
 	 * Convert to absolute page number by adding the number of bucket pages
 	 * that exist before this split point.
 	 */
-	return (BlockNumber) ((1 << i) + ovflbitnum);
+	return (BlockNumber) (_hash_get_totalbuckets(i) + ovflbitnum);
 }
 
 /*
@@ -67,9 +67,9 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
 	/* Determine the split number containing this page */
 	for (i = 1; i <= splitnum; i++)
 	{
-		if (ovflblkno <= (BlockNumber) (1 << i))
+		if (ovflblkno <= (BlockNumber) _hash_get_totalbuckets(i))
 			break;				/* oops */
-		bitnum = ovflblkno - (1 << i);
+		bitnum = ovflblkno - _hash_get_totalbuckets(i);
 
 		/*
 		 * bitnum has to be greater than number of overflow page added in
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 622cc4b..bd16e56 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -502,14 +502,15 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	Page		page;
 	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
+	uint32		spare_index;
 	uint32		i;
 
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
-	 * given the estimated number of tuples.  We round up the result to the
-	 * next power of 2, however, and always force at least 2 bucket pages. The
-	 * upper limit is determined by considerations explained in
+	 * given the estimated number of tuples.  We round up the result to total
+	 * the number of buckets which has to be allocated before using its
+	 * _hashm_spares index slot, however, and always force at least 2 bucket
+	 * pages. The upper limit is determined by considerations explained in
 	 * _hash_expandtable().
 	 */
 	dnumbuckets = num_tuples / ffactor;
@@ -518,11 +519,10 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	else if (dnumbuckets >= (double) 0x40000000)
 		num_buckets = 0x40000000;
 	else
-		num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
+		num_buckets = _hash_get_totalbuckets(_hash_spareindex(dnumbuckets));
 
-	log2_num_buckets = _hash_log2(num_buckets);
-	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
-	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
+	spare_index = _hash_spareindex(num_buckets);
+	Assert(spare_index < HASH_MAX_SPLITPOINTS);
 
 	page = BufferGetPage(buf);
 	if (initpage)
@@ -563,18 +563,20 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
-	 * blocks 1 to N.  The first freespace bitmap page is in block N+1. Since
-	 * N is a power of 2, we can set the masks this way:
+	 * blocks 1 to N.  The first freespace bitmap page is in block N+1.
 	 */
-	metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
-	metap->hashm_highmask = (num_buckets << 1) - 1;
+	metap->hashm_maxbucket = num_buckets - 1;
+
+	/* set hishmask, which should be sufficient to cover num_buckets. */
+	metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;
+	metap->hashm_lowmask = (metap->hashm_highmask >> 1);
 
 	MemSet(metap->hashm_spares, 0, sizeof(metap->hashm_spares));
 	MemSet(metap->hashm_mapp, 0, sizeof(metap->hashm_mapp));
 
 	/* Set up mapping for one spare page after the initial splitpoints */
-	metap->hashm_spares[log2_num_buckets] = 1;
-	metap->hashm_ovflpoint = log2_num_buckets;
+	metap->hashm_spares[spare_index] = 1;
+	metap->hashm_ovflpoint = spare_index;
 	metap->hashm_firstfree = 0;
 
 	/*
@@ -773,25 +775,40 @@ restart_expand:
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to allocate a new batch of bucket pages.
+	 * If the split point is increasing we need to allocate a new batch of
+	 * bucket pages.
 	 */
-	spare_ndx = _hash_log2(new_bucket + 1);
+	spare_ndx = _hash_spareindex(new_bucket + 1);
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
+		uint32		buckets_toadd = 0;
+		uint32		splitpoint_group = 0;
+
 		Assert(spare_ndx == metap->hashm_ovflpoint + 1);
 
 		/*
-		 * The number of buckets in the new splitpoint is equal to the total
-		 * number already in existence, i.e. new_bucket.  Currently this maps
-		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.  We treat allocation of buckets as a
-		 * separate WAL-logged action.  Even if we fail after this operation,
+		 * The number of buckets in the new splitpoint group is equal to the
+		 * total number already in existence, i.e. new_bucket. But we do not
+		 * allocate them at once. Each splitpoint group will have 4 slots, we
+		 * distribute the buckets equally among them. So we allocate only one
+		 * fourth of total buckets in new splitpoint group at a time to consume
+		 * one phase after another. We treat allocation of buckets as a
+		 * separate WAL-logged action. Even if we fail after this operation,
 		 * won't leak bucket pages; rather, the next split will consume this
 		 * space. In any case, even without failure we don't use all the space
 		 * in one split operation.
 		 */
-		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
+		splitpoint_group = (spare_ndx >> 2);
+
+		/*
+		 * Each phase in the splitpoint_group will allocate
+		 * 2^(splitpoint_group - 1) buckets if we divide buckets among 4
+		 * slots. The 0th group is a special case where we allocate 1 bucket
+		 * per slot as we cannot reduce it any further. See README for more
+		 * details.
+		 */
+		buckets_toadd = (splitpoint_group) ? (1 << (splitpoint_group - 1)) : 1;
+		if (!_hash_alloc_buckets(rel, start_nblkno, buckets_toadd))
 		{
 			/* can't split due to BlockNumber overflow */
 			_hash_relbuf(rel, buf_oblkno);
@@ -836,10 +853,9 @@ restart_expand:
 	}
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to adjust the hashm_spares[] array and
-	 * hashm_ovflpoint so that future overflow pages will be created beyond
-	 * this new batch of bucket pages.
+	 * If the split point is increasing we need to adjust the hashm_spares[]
+	 * array and hashm_ovflpoint so that future overflow pages will be created
+	 * beyond this new batch of bucket pages.
 	 */
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index 0e0f393..8aa8769 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -56,9 +56,8 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 	 * num_buckets buckets in the index, the appropriate mask can be computed
 	 * as follows.
 	 *
-	 * Note: at present, the passed-in num_buckets is always a power of 2, so
-	 * we could just compute num_buckets - 1.  We prefer not to assume that
-	 * here, though.
+	 * NOTE : This hash_mask calculation should be in sync with similar
+	 * calculation in _hash_init_metabuffer.
 	 */
 	hspool->hash_mask = (((uint32) 1) << _hash_log2(num_buckets)) - 1;
 
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 2e99719..6f1943b 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -149,6 +149,87 @@ _hash_log2(uint32 num)
 	return i;
 }
 
+#define SPLITPOINT_PHASES_PER_GRP 4
+#define SPLITPOINT_PHASE_MASK (SPLITPOINT_PHASES_PER_GRP - 1)
+
+#define TOTAL_SPLITPOINT_PHASES_BEFORE_GROUP(sp_g) (sp_g > 0 ? (sp_g << 2) : 0)
+
+/*
+ * This is just a trick to save a division operation. If you look into the
+ * bitmap of 0-based bucket_num 2nd and 3rd most significant bit will indicate
+ * which phase of allocation the bucket_num belogs to with in the group. This
+ * is because at every splitpoint group we allocate 2^x buckets and we have
+ * divided the allocation process into 4 equal phases. This macro returns value
+ * from 0 to 3.
+ */
+#define SPLITPOINT_PHASES_WITHIN_GROUP(sp_g, bucket_num) \
+		((sp_g > 0) ? (((bucket_num) >> (sp_g - 1)) & SPLITPOINT_PHASE_MASK) : \
+						(bucket_num))
+
+/*
+ * At splitpoint group 0 we have 2^(0 + 2) = 4 buckets, then at splitpoint
+ * group 1 we have 2^(1 + 2) = 8 total buckets. As the doubling continues at
+ * splitpoint group "x" we will have 2^(x + 2) total buckets. Total buckets
+ * before x splitpoint group will be 2^(x + 1). At each phase of allocation
+ * within splitpoint group we add 2^(x - 1) buckets, as we have to divide the
+ * task of allocation of 2^(x + 1) buckets among 4 phases.
+ */
+#define BUCKETS_BEFORE_SP_GRP(sp_g) ((sp_g == 0) ? 0 : (1 << (sp_g + 1)))
+#define BUCKETS_WITHIN_SP_GRP(sp_g, nphase) \
+						((nphase) * ((sp_g == 0) ? 1 : (1 << (sp_g - 1))))
+
+/*
+ * _hash_spareindex -- returns spare index / global splitpoint phase of the
+ *					   bucket
+ */
+uint32
+_hash_spareindex(uint32 num_bucket)
+{
+	uint32		splitpoint_group;
+
+	/*
+	 * The first 4 bucket belongs to first splitpoint group 0. And since group
+	 * 0 have 4 = 2^2 buckets, we double them in group 1. So total buckets
+	 * after group 1 is 8 = 2^3. Then again at group 2 we add another 2^3
+	 * buckets to double the total number of buckets, which will become 2^4. I
+	 * think by this time we can see a pattern which say if num_bucket > 4
+	 * splitpoint group = log2(num_bucket) - 2
+	 */
+	if (num_bucket <= 4)
+		splitpoint_group = 0;	/* converted to base 0. */
+	else
+		splitpoint_group = _hash_log2(num_bucket) - 2;
+
+	return TOTAL_SPLITPOINT_PHASES_BEFORE_GROUP(splitpoint_group) +
+		SPLITPOINT_PHASES_WITHIN_GROUP(splitpoint_group,
+									   num_bucket - 1); /* make it 0-based */
+}
+
+/*
+ *	_hash_get_totalbuckets -- returns total number of buckets allocated till
+ *							the given splitpoint phase.
+ */
+uint32
+_hash_get_totalbuckets(uint32 splitpoint_phase)
+{
+	uint32		splitpoint_group;
+
+	/*
+	 * Every 4 consecutive phases makes one group and group's are numbered
+	 * from 0
+	 */
+	splitpoint_group = (splitpoint_phase / SPLITPOINT_PHASES_PER_GRP);
+
+	/*
+	 * total_buckets = total number of buckets before its splitpoint group +
+	 * total buckets within its splitpoint group until given splitpoint_phase.
+	 */
+	return BUCKETS_BEFORE_SP_GRP(splitpoint_group) +
+		BUCKETS_WITHIN_SP_GRP(splitpoint_group,
+							  (splitpoint_phase % SPLITPOINT_PHASES_PER_GRP)
+							  + 1);
+}
+
 /*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index eb1df57..ac502ef 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -36,7 +36,7 @@ typedef uint32 Bucket;
 #define InvalidBucket	((Bucket) 0xFFFFFFFF)
 
 #define BUCKET_TO_BLKNO(metap,B) \
-		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
+		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_spareindex((B)+1)-1] : 0)) + 1)
 
 /*
  * Special space for hash index pages.
@@ -180,7 +180,7 @@ typedef HashScanOpaqueData *HashScanOpaque;
  * needing to fit into the metapage.  (With 8K block size, 128 bitmaps
  * limit us to 64 GB of overflow space...)
  */
-#define HASH_MAX_SPLITPOINTS		32
+#define HASH_MAX_SPLITPOINTS		128
 #define HASH_MAX_BITMAPS			128
 
 typedef struct HashMetaPageData
@@ -382,6 +382,8 @@ extern uint32 _hash_datum2hashkey_type(Relation rel, Datum key, Oid keytype);
 extern Bucket _hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
 					 uint32 highmask, uint32 lowmask);
 extern uint32 _hash_log2(uint32 num);
+extern uint32 _hash_spareindex(uint32 num_bucket);
+extern uint32 _hash_get_totalbuckets(uint32 splitpoint_phase);
 extern void _hash_checkpage(Relation rel, Buffer buf, int flags);
 extern uint32 _hash_get_indextuple_hashkey(IndexTuple itup);
 extern bool _hash_convert_tuple(Relation index,
#11Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Mithun Cy (#10)
1 attachment(s)
Re: [POC] A better way to expand hash indexes.

On Fri, Mar 24, 2017 at 1:22 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

Hi Amit please find the new patch

pageinspect.sgml has an example which shows the output of
"hash_metapage_info()". Since we increase the spares array and,
consequently, ovflpoint, I have updated the example with the
corresponding values.

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Attachments:

expand_hashbucket_efficiently_05.patchapplication/octet-stream; name=expand_hashbucket_efficiently_05.patchDownload
commit 59a3f9ce6a2f979be2f2e0ce1fcb7ff3afa46861
Author: mithun <mithun@localhost.localdomain>
Date:   Fri Mar 24 12:02:47 2017 +0530

    expand_hash_efficiently_05
    
    Mithun C Y

diff --git a/contrib/pageinspect/expected/hash.out b/contrib/pageinspect/expected/hash.out
index 3ba01f6..c97b279 100644
--- a/contrib/pageinspect/expected/hash.out
+++ b/contrib/pageinspect/expected/hash.out
@@ -51,13 +51,13 @@ bsize     | 8152
 bmsize    | 4096
 bmshift   | 15
 maxbucket | 3
-highmask  | 7
-lowmask   | 3
-ovflpoint | 2
+highmask  | 3
+lowmask   | 1
+ovflpoint | 3
 firstfree | 0
 nmaps     | 1
 procid    | 450
-spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 
 SELECT magic, version, ntuples, bsize, bmsize, bmshift, maxbucket, highmask,
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 9f41bb2..f19066f 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -667,11 +667,11 @@ bmshift   | 15
 maxbucket | 12512
 highmask  | 16383
 lowmask   | 8191
-ovflpoint | 14
+ovflpoint | 50
 firstfree | 1204
 nmaps     | 1
 procid    | 450
-spares    | {0,0,0,0,0,0,1,1,1,1,1,4,59,704,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,4,4,4,45,55,58,59,508,567,628,704,1193,1202,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {65,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 </screen>
      </para>
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..8789805 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -58,35 +58,50 @@ rules to support a variable number of overflow pages while not having to
 move primary bucket pages around after they are created.
 
 Primary bucket pages (henceforth just "bucket pages") are allocated in
-power-of-2 groups, called "split points" in the code.  Buckets 0 and 1
-are created when the index is initialized.  At the first split, buckets 2
-and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
-when bucket 8 is needed, buckets 8-15 are allocated; etc.  All the bucket
-pages of a power-of-2 group appear consecutively in the index.  This
-addressing scheme allows the physical location of a bucket page to be
-computed from the bucket number relatively easily, using only a small
-amount of control information.  We take the log2() of the bucket number
-to determine which split point S the bucket belongs to, and then simply
-add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
-metapage) to compute the physical address.  hashm_spares[S] can be
-interpreted as the total number of overflow pages that have been allocated
-before the bucket pages of splitpoint S.  hashm_spares[0] is always 0,
-so that buckets 0 and 1 (which belong to splitpoint 0) always appear at
-block numbers 1 and 2, just after the meta page.  We always have
+power-of-2 groups, called "split points" in the code.  That means on every new
+split points we double the existing number of buckets.  And, it seems bad to
+allocate huge chunks of bucket pages all at once and we take ages to consume it.
+To avoid this exponential growth of index size, we did a trick to breakup
+allocation of buckets at splitpoint into 4 equal phases.  If 2^x is the total
+buckets need to be allocated at a splitpoint (from now on we shall call this
+as splitpoint group), then we allocate 1/4th (2^(x - 2)) of total buckets at
+each phase of splitpoint group. Next quarter of allocation will only happen if
+buckets of previous phase has been already consumed.  Since for buckets number
+< 4 we cannot further divide it in to multiple phases, the first splitpoint
+group 0's allocation is done as follows {1, 1, 1, 1} = 4 buckets in total, the
+numbers in curly braces indicate number of buckets allocated within each phase
+of splitpoint group 0.  In next splitpoint group 1 the allocation phases will
+be as follow {1, 1, 1, 1} = 8 buckets in total.  And, for splitpoint group 2
+and 3 allocation phase will be {2, 2, 2, 2} = 16 buckets in total and
+{4, 4, 4, 4} = 32 buckets in total.  We can see that at each splitpoint group
+we double the total number of buckets from previous group but in a incremental
+phase.  The bucket pages allocated within one phase of a splitpoint group will
+appear consecutively in the index.  This addressing scheme allows the physical
+location of a bucket page to be computed from the bucket number relatively
+easily, using only a small amount of control information.  If we look at the
+function _hash_spareindex for a given bucket number we first compute splitpoint
+group it belongs to and then the phase with in it to which the bucket belongs
+to.  Adding them we get the global splitpoint phase number S to which the
+bucket belongs and then simply add "hashm_spares[S] + 1" (where hashm_spares[] is
+an array stored in the metapage) with given bucket number to compute its
+physical address.  The hashm_spares[S] can be interpreted as the total number
+of overflow pages that have been allocated before the bucket pages of
+splitpoint phase S.  The hashm_spares[0] is always 0, so that buckets 0 and 1
+(which belong to splitpoint group 0's phase 1 and phase 2 respectively) always
+appear at block numbers 1 and 2, just after the meta page.  We always have
 hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
-former.  The difference between the two represents the number of overflow
-pages appearing between the bucket page groups of splitpoints N and N+1.
-
+former. The difference between the two represents the number of overflow
+pages appearing between the bucket page groups of splitpoints phase N and N+1.
 (Note: the above describes what happens when filling an initially minimally
 sized hash index.  In practice, we try to estimate the required index size
-and allocate a suitable number of splitpoints immediately, to avoid
+and allocate a suitable number of splitpoints phases immediately, to avoid
 expensive re-splitting during initial index build.)
 
 When S splitpoints exist altogether, the array entries hashm_spares[0]
 through hashm_spares[S] are valid; hashm_spares[S] records the current
 total number of overflow pages.  New overflow pages are created as needed
 at the end of the index, and recorded by incrementing hashm_spares[S].
-When it is time to create a new splitpoint's worth of bucket pages, we
+When it is time to create a new splitpoint phase's worth of bucket pages, we
 copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
 stored in the hashm_ovflpoint field of the meta page).  This has the
 effect of reserving the correct number of bucket pages at the end of the
@@ -101,7 +116,7 @@ We have to allow the case "greater than" because it's possible that during
 an index extension we crash after allocating filesystem space and before
 updating the metapage.  Note that on filesystems that allow "holes" in
 files, it's entirely likely that pages before the logical EOF are not yet
-allocated: when we allocate a new splitpoint's worth of bucket pages, we
+allocated: when we allocate a new splitpoint phase's worth of bucket pages, we
 physically zero the last such page to force the EOF up, and the first such
 page will be used immediately, but the intervening pages are not written
 until needed.
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index a3cae21..fb8ea49 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -49,7 +49,7 @@ bitno_to_blkno(HashMetaPage metap, uint32 ovflbitnum)
 	 * Convert to absolute page number by adding the number of bucket pages
 	 * that exist before this split point.
 	 */
-	return (BlockNumber) ((1 << i) + ovflbitnum);
+	return (BlockNumber) (_hash_get_totalbuckets(i) + ovflbitnum);
 }
 
 /*
@@ -67,9 +67,9 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
 	/* Determine the split number containing this page */
 	for (i = 1; i <= splitnum; i++)
 	{
-		if (ovflblkno <= (BlockNumber) (1 << i))
+		if (ovflblkno <= (BlockNumber) _hash_get_totalbuckets(i))
 			break;				/* oops */
-		bitnum = ovflblkno - (1 << i);
+		bitnum = ovflblkno - _hash_get_totalbuckets(i);
 
 		/*
 		 * bitnum has to be greater than number of overflow page added in
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 622cc4b..bd16e56 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -502,14 +502,15 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	Page		page;
 	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
+	uint32		spare_index;
 	uint32		i;
 
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
-	 * given the estimated number of tuples.  We round up the result to the
-	 * next power of 2, however, and always force at least 2 bucket pages. The
-	 * upper limit is determined by considerations explained in
+	 * given the estimated number of tuples.  We round up the result to total
+	 * the number of buckets which has to be allocated before using its
+	 * _hashm_spares index slot, however, and always force at least 2 bucket
+	 * pages. The upper limit is determined by considerations explained in
 	 * _hash_expandtable().
 	 */
 	dnumbuckets = num_tuples / ffactor;
@@ -518,11 +519,10 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	else if (dnumbuckets >= (double) 0x40000000)
 		num_buckets = 0x40000000;
 	else
-		num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
+		num_buckets = _hash_get_totalbuckets(_hash_spareindex(dnumbuckets));
 
-	log2_num_buckets = _hash_log2(num_buckets);
-	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
-	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
+	spare_index = _hash_spareindex(num_buckets);
+	Assert(spare_index < HASH_MAX_SPLITPOINTS);
 
 	page = BufferGetPage(buf);
 	if (initpage)
@@ -563,18 +563,20 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
-	 * blocks 1 to N.  The first freespace bitmap page is in block N+1. Since
-	 * N is a power of 2, we can set the masks this way:
+	 * blocks 1 to N.  The first freespace bitmap page is in block N+1.
 	 */
-	metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
-	metap->hashm_highmask = (num_buckets << 1) - 1;
+	metap->hashm_maxbucket = num_buckets - 1;
+
+	/* set hishmask, which should be sufficient to cover num_buckets. */
+	metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;
+	metap->hashm_lowmask = (metap->hashm_highmask >> 1);
 
 	MemSet(metap->hashm_spares, 0, sizeof(metap->hashm_spares));
 	MemSet(metap->hashm_mapp, 0, sizeof(metap->hashm_mapp));
 
 	/* Set up mapping for one spare page after the initial splitpoints */
-	metap->hashm_spares[log2_num_buckets] = 1;
-	metap->hashm_ovflpoint = log2_num_buckets;
+	metap->hashm_spares[spare_index] = 1;
+	metap->hashm_ovflpoint = spare_index;
 	metap->hashm_firstfree = 0;
 
 	/*
@@ -773,25 +775,40 @@ restart_expand:
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to allocate a new batch of bucket pages.
+	 * If the split point is increasing we need to allocate a new batch of
+	 * bucket pages.
 	 */
-	spare_ndx = _hash_log2(new_bucket + 1);
+	spare_ndx = _hash_spareindex(new_bucket + 1);
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
+		uint32		buckets_toadd = 0;
+		uint32		splitpoint_group = 0;
+
 		Assert(spare_ndx == metap->hashm_ovflpoint + 1);
 
 		/*
-		 * The number of buckets in the new splitpoint is equal to the total
-		 * number already in existence, i.e. new_bucket.  Currently this maps
-		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.  We treat allocation of buckets as a
-		 * separate WAL-logged action.  Even if we fail after this operation,
+		 * The number of buckets in the new splitpoint group is equal to the
+		 * total number already in existence, i.e. new_bucket. But we do not
+		 * allocate them at once. Each splitpoint group will have 4 slots, we
+		 * distribute the buckets equally among them. So we allocate only one
+		 * fourth of total buckets in new splitpoint group at a time to consume
+		 * one phase after another. We treat allocation of buckets as a
+		 * separate WAL-logged action. Even if we fail after this operation,
 		 * won't leak bucket pages; rather, the next split will consume this
 		 * space. In any case, even without failure we don't use all the space
 		 * in one split operation.
 		 */
-		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
+		splitpoint_group = (spare_ndx >> 2);
+
+		/*
+		 * Each phase in the splitpoint_group will allocate
+		 * 2^(splitpoint_group - 1) buckets if we divide buckets among 4
+		 * slots. The 0th group is a special case where we allocate 1 bucket
+		 * per slot as we cannot reduce it any further. See README for more
+		 * details.
+		 */
+		buckets_toadd = (splitpoint_group) ? (1 << (splitpoint_group - 1)) : 1;
+		if (!_hash_alloc_buckets(rel, start_nblkno, buckets_toadd))
 		{
 			/* can't split due to BlockNumber overflow */
 			_hash_relbuf(rel, buf_oblkno);
@@ -836,10 +853,9 @@ restart_expand:
 	}
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to adjust the hashm_spares[] array and
-	 * hashm_ovflpoint so that future overflow pages will be created beyond
-	 * this new batch of bucket pages.
+	 * If the split point is increasing we need to adjust the hashm_spares[]
+	 * array and hashm_ovflpoint so that future overflow pages will be created
+	 * beyond this new batch of bucket pages.
 	 */
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index 0e0f393..8aa8769 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -56,9 +56,8 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 	 * num_buckets buckets in the index, the appropriate mask can be computed
 	 * as follows.
 	 *
-	 * Note: at present, the passed-in num_buckets is always a power of 2, so
-	 * we could just compute num_buckets - 1.  We prefer not to assume that
-	 * here, though.
+	 * NOTE : This hash_mask calculation should be in sync with similar
+	 * calculation in _hash_init_metabuffer.
 	 */
 	hspool->hash_mask = (((uint32) 1) << _hash_log2(num_buckets)) - 1;
 
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 2e99719..6f1943b 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -149,6 +149,87 @@ _hash_log2(uint32 num)
 	return i;
 }
 
+#define SPLITPOINT_PHASES_PER_GRP 4
+#define SPLITPOINT_PHASE_MASK (SPLITPOINT_PHASES_PER_GRP - 1)
+
+#define TOTAL_SPLITPOINT_PHASES_BEFORE_GROUP(sp_g) (sp_g > 0 ? (sp_g << 2) : 0)
+
+/*
+ * This is just a trick to save a division operation. If you look into the
+ * bitmap of 0-based bucket_num 2nd and 3rd most significant bit will indicate
+ * which phase of allocation the bucket_num belogs to with in the group. This
+ * is because at every splitpoint group we allocate 2^x buckets and we have
+ * divided the allocation process into 4 equal phases. This macro returns value
+ * from 0 to 3.
+ */
+#define SPLITPOINT_PHASES_WITHIN_GROUP(sp_g, bucket_num) \
+		((sp_g > 0) ? (((bucket_num) >> (sp_g - 1)) & SPLITPOINT_PHASE_MASK) : \
+						(bucket_num))
+
+/*
+ * At splitpoint group 0 we have 2^(0 + 2) = 4 buckets, then at splitpoint
+ * group 1 we have 2^(1 + 2) = 8 total buckets. As the doubling continues at
+ * splitpoint group "x" we will have 2^(x + 2) total buckets. Total buckets
+ * before x splitpoint group will be 2^(x + 1). At each phase of allocation
+ * within splitpoint group we add 2^(x - 1) buckets, as we have to divide the
+ * task of allocation of 2^(x + 1) buckets among 4 phases.
+ */
+#define BUCKETS_BEFORE_SP_GRP(sp_g) ((sp_g == 0) ? 0 : (1 << (sp_g + 1)))
+#define BUCKETS_WITHIN_SP_GRP(sp_g, nphase) \
+						((nphase) * ((sp_g == 0) ? 1 : (1 << (sp_g - 1))))
+
+/*
+ * _hash_spareindex -- returns spare index / global splitpoint phase of the
+ *					   bucket
+ */
+uint32
+_hash_spareindex(uint32 num_bucket)
+{
+	uint32		splitpoint_group;
+
+	/*
+	 * The first 4 bucket belongs to first splitpoint group 0. And since group
+	 * 0 have 4 = 2^2 buckets, we double them in group 1. So total buckets
+	 * after group 1 is 8 = 2^3. Then again at group 2 we add another 2^3
+	 * buckets to double the total number of buckets, which will become 2^4. I
+	 * think by this time we can see a pattern which say if num_bucket > 4
+	 * splitpoint group = log2(num_bucket) - 2
+	 */
+	if (num_bucket <= 4)
+		splitpoint_group = 0;	/* converted to base 0. */
+	else
+		splitpoint_group = _hash_log2(num_bucket) - 2;
+
+	return TOTAL_SPLITPOINT_PHASES_BEFORE_GROUP(splitpoint_group) +
+		SPLITPOINT_PHASES_WITHIN_GROUP(splitpoint_group,
+									   num_bucket - 1); /* make it 0-based */
+}
+
+/*
+ *	_hash_get_totalbuckets -- returns total number of buckets allocated till
+ *							the given splitpoint phase.
+ */
+uint32
+_hash_get_totalbuckets(uint32 splitpoint_phase)
+{
+	uint32		splitpoint_group;
+
+	/*
+	 * Every 4 consecutive phases makes one group and group's are numbered
+	 * from 0
+	 */
+	splitpoint_group = (splitpoint_phase / SPLITPOINT_PHASES_PER_GRP);
+
+	/*
+	 * total_buckets = total number of buckets before its splitpoint group +
+	 * total buckets within its splitpoint group until given splitpoint_phase.
+	 */
+	return BUCKETS_BEFORE_SP_GRP(splitpoint_group) +
+		BUCKETS_WITHIN_SP_GRP(splitpoint_group,
+							  (splitpoint_phase % SPLITPOINT_PHASES_PER_GRP)
+							  + 1);
+}
+
 /*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index eb1df57..ac502ef 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -36,7 +36,7 @@ typedef uint32 Bucket;
 #define InvalidBucket	((Bucket) 0xFFFFFFFF)
 
 #define BUCKET_TO_BLKNO(metap,B) \
-		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
+		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_spareindex((B)+1)-1] : 0)) + 1)
 
 /*
  * Special space for hash index pages.
@@ -180,7 +180,7 @@ typedef HashScanOpaqueData *HashScanOpaque;
  * needing to fit into the metapage.  (With 8K block size, 128 bitmaps
  * limit us to 64 GB of overflow space...)
  */
-#define HASH_MAX_SPLITPOINTS		32
+#define HASH_MAX_SPLITPOINTS		128
 #define HASH_MAX_BITMAPS			128
 
 typedef struct HashMetaPageData
@@ -382,6 +382,8 @@ extern uint32 _hash_datum2hashkey_type(Relation rel, Datum key, Oid keytype);
 extern Bucket _hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
 					 uint32 highmask, uint32 lowmask);
 extern uint32 _hash_log2(uint32 num);
+extern uint32 _hash_spareindex(uint32 num_bucket);
+extern uint32 _hash_get_totalbuckets(uint32 splitpoint_phase);
 extern void _hash_checkpage(Relation rel, Buffer buf, int flags);
 extern uint32 _hash_get_indextuple_hashkey(IndexTuple itup);
 extern bool _hash_convert_tuple(Relation index,
#12Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Mithun Cy (#11)
2 attachment(s)
Re: [POC] A better way to expand hash indexes.

On Tue, Mar 21, 2017 at 8:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Sure, I was telling you based on that. If you are implicitly treating
it as 2-dimensional array, it might be easier to compute the array
offsets.

I think calculating the spares offset will not become much simpler
anyway; we still need to calculate the split group and split phase
separately. I have added a patch to show how 2-dimensional spares code
would look and where all the changes are needed.
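
For illustration, that decomposition amounts to something like the
following (a sketch only; the exact definitions of the SP_GRP and
SP_PHASE macros live in the attached patch, which shows only their use
sites):

#define HASH_SPLITPOINT_PHASES	4
#define SP_GRP(sp_phase)	((sp_phase) / HASH_SPLITPOINT_PHASES)
#define SP_PHASE(sp_phase)	((sp_phase) % HASH_SPLITPOINT_PHASES)

so that, for example, the overflow-page count for the current splitpoint
phase is read as
metap->hashm_spares[SP_GRP(metap->hashm_ovflpoint)][SP_PHASE(metap->hashm_ovflpoint)],
which is exactly the same group/phase split the 1-dimensional version
has to compute.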

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Attachments:

expand_hashbucket_efficiently_06_spares_2dimesion.patchapplication/octet-stream; name=expand_hashbucket_efficiently_06_spares_2dimesion.patchDownload
diff --git a/contrib/pageinspect/expected/hash.out b/contrib/pageinspect/expected/hash.out
index 3ba01f6..c97b279 100644
--- a/contrib/pageinspect/expected/hash.out
+++ b/contrib/pageinspect/expected/hash.out
@@ -51,13 +51,13 @@ bsize     | 8152
 bmsize    | 4096
 bmshift   | 15
 maxbucket | 3
-highmask  | 7
-lowmask   | 3
-ovflpoint | 2
+highmask  | 3
+lowmask   | 1
+ovflpoint | 3
 firstfree | 0
 nmaps     | 1
 procid    | 450
-spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 
 SELECT magic, version, ntuples, bsize, bmsize, bmshift, maxbucket, highmask,
diff --git a/contrib/pageinspect/hashfuncs.c b/contrib/pageinspect/hashfuncs.c
index 812a03f..3149029 100644
--- a/contrib/pageinspect/hashfuncs.c
+++ b/contrib/pageinspect/hashfuncs.c
@@ -502,10 +502,11 @@ hash_metapage_info(PG_FUNCTION_ARGS)
 	TupleDesc	tupleDesc;
 	HeapTuple	tuple;
 	int			i,
-				j;
+				j,
+				p;
 	Datum		values[16];
 	bool		nulls[16];
-	Datum       spares[HASH_MAX_SPLITPOINTS];
+	Datum       spares[HASH_MAX_SPLITPOINTS * HASH_SPLITPOINT_PHASES];
 	Datum       mapp[HASH_MAX_BITMAPS];
 
 	if (!superuser())
@@ -541,9 +542,11 @@ hash_metapage_info(PG_FUNCTION_ARGS)
 	values[j++] = ObjectIdGetDatum((Oid) metad->hashm_procid);
 
 	for (i = 0; i < HASH_MAX_SPLITPOINTS; i++)
-		spares[i] = Int64GetDatum((int64) metad->hashm_spares[i]);
+		for (p = 0; p < HASH_SPLITPOINT_PHASES; p++)
+			spares[(i * HASH_SPLITPOINT_PHASES) + p] =
+				Int64GetDatum((int64) metad->hashm_spares[i][p]);
 	values[j++] = PointerGetDatum(construct_array(spares,
-												  HASH_MAX_SPLITPOINTS,
+												  HASH_MAX_SPLITPOINTS * HASH_SPLITPOINT_PHASES,
 												  INT8OID,
 												  8, FLOAT8PASSBYVAL, 'd'));
 
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 9f41bb2..f19066f 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -667,11 +667,11 @@ bmshift   | 15
 maxbucket | 12512
 highmask  | 16383
 lowmask   | 8191
-ovflpoint | 14
+ovflpoint | 50
 firstfree | 1204
 nmaps     | 1
 procid    | 450
-spares    | {0,0,0,0,0,0,1,1,1,1,1,4,59,704,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,4,4,4,45,55,58,59,508,567,628,704,1193,1202,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {65,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 </screen>
      </para>
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..8789805 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -58,35 +58,50 @@ rules to support a variable number of overflow pages while not having to
 move primary bucket pages around after they are created.
 
 Primary bucket pages (henceforth just "bucket pages") are allocated in
-power-of-2 groups, called "split points" in the code.  Buckets 0 and 1
-are created when the index is initialized.  At the first split, buckets 2
-and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
-when bucket 8 is needed, buckets 8-15 are allocated; etc.  All the bucket
-pages of a power-of-2 group appear consecutively in the index.  This
-addressing scheme allows the physical location of a bucket page to be
-computed from the bucket number relatively easily, using only a small
-amount of control information.  We take the log2() of the bucket number
-to determine which split point S the bucket belongs to, and then simply
-add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
-metapage) to compute the physical address.  hashm_spares[S] can be
-interpreted as the total number of overflow pages that have been allocated
-before the bucket pages of splitpoint S.  hashm_spares[0] is always 0,
-so that buckets 0 and 1 (which belong to splitpoint 0) always appear at
-block numbers 1 and 2, just after the meta page.  We always have
+power-of-2 groups, called "split points" in the code.  That means on every new
+split points we double the existing number of buckets.  And, it seems bad to
+allocate huge chunks of bucket pages all at once and we take ages to consume it.
+To avoid this exponential growth of index size, we did a trick to breakup
+allocation of buckets at splitpoint into 4 equal phases.  If 2^x is the total
+buckets need to be allocated at a splitpoint (from now on we shall call this
+as splitpoint group), then we allocate 1/4th (2^(x - 2)) of total buckets at
+each phase of splitpoint group. Next quarter of allocation will only happen if
+buckets of previous phase has been already consumed.  Since for buckets number
+< 4 we cannot further divide it in to multiple phases, the first splitpoint
+group 0's allocation is done as follows {1, 1, 1, 1} = 4 buckets in total, the
+numbers in curly braces indicate number of buckets allocated within each phase
+of splitpoint group 0.  In next splitpoint group 1 the allocation phases will
+be as follow {1, 1, 1, 1} = 8 buckets in total.  And, for splitpoint group 2
+and 3 allocation phase will be {2, 2, 2, 2} = 16 buckets in total and
+{4, 4, 4, 4} = 32 buckets in total.  We can see that at each splitpoint group
+we double the total number of buckets from previous group but in a incremental
+phase.  The bucket pages allocated within one phase of a splitpoint group will
+appear consecutively in the index.  This addressing scheme allows the physical
+location of a bucket page to be computed from the bucket number relatively
+easily, using only a small amount of control information.  If we look at the
+function _hash_spareindex for a given bucket number we first compute splitpoint
+group it belongs to and then the phase with in it to which the bucket belongs
+to.  Adding them we get the global splitpoint phase number S to which the
+bucket belongs and then simply add "hashm_spares[S] + 1" (where hashm_spares[] is
+an array stored in the metapage) with given bucket number to compute its
+physical address.  The hashm_spares[S] can be interpreted as the total number
+of overflow pages that have been allocated before the bucket pages of
+splitpoint phase S.  The hashm_spares[0] is always 0, so that buckets 0 and 1
+(which belong to splitpoint group 0's phase 1 and phase 2 respectively) always
+appear at block numbers 1 and 2, just after the meta page.  We always have
 hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
-former.  The difference between the two represents the number of overflow
-pages appearing between the bucket page groups of splitpoints N and N+1.
-
+former. The difference between the two represents the number of overflow
+pages appearing between the bucket page groups of splitpoints phase N and N+1.
 (Note: the above describes what happens when filling an initially minimally
 sized hash index.  In practice, we try to estimate the required index size
-and allocate a suitable number of splitpoints immediately, to avoid
+and allocate a suitable number of splitpoints phases immediately, to avoid
 expensive re-splitting during initial index build.)
 
 When S splitpoints exist altogether, the array entries hashm_spares[0]
 through hashm_spares[S] are valid; hashm_spares[S] records the current
 total number of overflow pages.  New overflow pages are created as needed
 at the end of the index, and recorded by incrementing hashm_spares[S].
-When it is time to create a new splitpoint's worth of bucket pages, we
+When it is time to create a new splitpoint phase's worth of bucket pages, we
 copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
 stored in the hashm_ovflpoint field of the meta page).  This has the
 effect of reserving the correct number of bucket pages at the end of the
@@ -101,7 +116,7 @@ We have to allow the case "greater than" because it's possible that during
 an index extension we crash after allocating filesystem space and before
 updating the metapage.  Note that on filesystems that allow "holes" in
 files, it's entirely likely that pages before the logical EOF are not yet
-allocated: when we allocate a new splitpoint's worth of bucket pages, we
+allocated: when we allocate a new splitpoint phase's worth of bucket pages, we
 physically zero the last such page to force the EOF up, and the first such
 page will be used immediately, but the intervening pages are not written
 until needed.
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 34cc08f..6b162a7 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -548,7 +548,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	num_index_tuples = 0;
 
 	/*
-	 * We need a copy of the metapage so that we can use its hashm_spares[]
+	 * We need a copy of the metapage so that we can use its hashm_spares[][]
 	 * values to compute bucket page addresses, but a cached copy should be
 	 * good enough.  (If not, we'll detect that further down and refresh the
 	 * cache as necessary.)
@@ -575,7 +575,7 @@ loop_top:
 		bool		split_cleanup = false;
 
 		/* Get address of bucket's start page */
-		bucket_blkno = BUCKET_TO_BLKNO(cachedmetap, cur_bucket);
+		bucket_blkno = bucket_to_blkno(cachedmetap, cur_bucket);
 
 		blkno = bucket_blkno;
 
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index de7522e..02f3328 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -263,7 +263,9 @@ hash_xlog_add_ovfl_page(XLogReaderState *record)
 
 		if (!xlrec->bmpage_found)
 		{
-			metap->hashm_spares[metap->hashm_ovflpoint]++;
+			uint32		split_grp = SP_GRP(metap->hashm_ovflpoint);
+			uint32 		split_phase = SP_PHASE(metap->hashm_ovflpoint);
+			metap->hashm_spares[split_grp][split_phase]++;
 
 			if (new_bmpage)
 			{
@@ -271,7 +273,7 @@ hash_xlog_add_ovfl_page(XLogReaderState *record)
 
 				metap->hashm_mapp[metap->hashm_nmaps] = newmapblk;
 				metap->hashm_nmaps++;
-				metap->hashm_spares[metap->hashm_ovflpoint]++;
+				metap->hashm_spares[split_grp][split_phase]++;
 			}
 		}
 
@@ -388,7 +390,8 @@ hash_xlog_split_allocate_page(XLogReaderState *record)
 			ovflpages = (uint32 *) ((char *) data + sizeof(uint32));
 
 			/* update metapage */
-			metap->hashm_spares[ovflpoint] = *ovflpages;
+			metap->hashm_spares[SP_GRP(ovflpoint)][SP_PHASE(ovflpoint)] =
+																	*ovflpages;
 			metap->hashm_ovflpoint = ovflpoint;
 		}
 
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index a3cae21..41cba14 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -41,7 +41,8 @@ bitno_to_blkno(HashMetaPage metap, uint32 ovflbitnum)
 
 	/* Determine the split number for this page (must be >= 1) */
 	for (i = 1;
-		 i < splitnum && ovflbitnum > metap->hashm_spares[i];
+		 i < splitnum &&
+		 ovflbitnum > metap->hashm_spares[SP_GRP(i)][SP_PHASE(i)];
 		 i++)
 		 /* loop */ ;
 
@@ -49,7 +50,7 @@ bitno_to_blkno(HashMetaPage metap, uint32 ovflbitnum)
 	 * Convert to absolute page number by adding the number of bucket pages
 	 * that exist before this split point.
 	 */
-	return (BlockNumber) ((1 << i) + ovflbitnum);
+	return (BlockNumber) (_hash_get_totalbuckets(i) + ovflbitnum);
 }
 
 /*
@@ -67,17 +68,18 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
 	/* Determine the split number containing this page */
 	for (i = 1; i <= splitnum; i++)
 	{
-		if (ovflblkno <= (BlockNumber) (1 << i))
+		if (ovflblkno <= (BlockNumber) _hash_get_totalbuckets(i))
 			break;				/* oops */
-		bitnum = ovflblkno - (1 << i);
+		bitnum = ovflblkno - _hash_get_totalbuckets(i);
 
 		/*
 		 * bitnum has to be greater than number of overflow page added in
 		 * previous split point. The overflow page at this splitnum (i) if any
-		 * should start from ((2 ^ i) + metap->hashm_spares[i - 1] + 1).
+		 * should start from (_hash_get_totalbuckets(i) +
+		 * metap->hashm_spares[SP_GRP(i - 1)][SP_GRP(i -1)] + 1).
 		 */
-		if (bitnum > metap->hashm_spares[i - 1] &&
-			bitnum <= metap->hashm_spares[i])
+		if (bitnum > metap->hashm_spares[SP_GRP(i - 1)][SP_PHASE(i -1)] &&
+			bitnum <= metap->hashm_spares[SP_GRP(i)][SP_PHASE(i)])
 			return bitnum - 1;	/* -1 to convert 1-based to 0-based */
 	}
 
@@ -120,6 +122,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 	BlockNumber blkno;
 	uint32		orig_firstfree;
 	uint32		splitnum;
+	uint32		split_grp,
+				split_phase;
 	uint32	   *freep = NULL;
 	uint32		max_ovflpg;
 	uint32		bit;
@@ -201,7 +205,9 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 		/* want to end search with the last existing overflow page */
 		splitnum = metap->hashm_ovflpoint;
-		max_ovflpg = metap->hashm_spares[splitnum] - 1;
+		split_grp = SP_GRP(splitnum);
+		split_phase = SP_PHASE(splitnum);
+		max_ovflpg = metap->hashm_spares[split_grp][split_phase] - 1;
 		last_page = max_ovflpg >> BMPG_SHIFT(metap);
 		last_bit = max_ovflpg & BMPG_MASK(metap);
 
@@ -273,7 +279,7 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 		 * marked "in use".  Subsequent pages do not exist yet, but it is
 		 * convenient to pre-mark them as "in use" too.
 		 */
-		bit = metap->hashm_spares[splitnum];
+		bit = metap->hashm_spares[split_grp][split_phase];
 
 		/* metapage already has a write lock */
 		if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
@@ -294,7 +300,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 	/* Calculate address of the new overflow page */
 	bit = BufferIsValid(newmapbuf) ?
-		metap->hashm_spares[splitnum] + 1 : metap->hashm_spares[splitnum];
+		metap->hashm_spares[split_grp][split_phase] + 1 :
+		metap->hashm_spares[split_grp][split_phase];
 	blkno = bitno_to_blkno(metap, bit);
 
 	/*
@@ -329,7 +336,7 @@ found:
 	else
 	{
 		/* update the count to indicate new overflow page is added */
-		metap->hashm_spares[splitnum]++;
+		metap->hashm_spares[split_grp][split_phase]++;
 
 		if (BufferIsValid(newmapbuf))
 		{
@@ -339,7 +346,7 @@ found:
 			/* add the new bitmap page to the metapage's list of bitmaps */
 			metap->hashm_mapp[metap->hashm_nmaps] = BufferGetBlockNumber(newmapbuf);
 			metap->hashm_nmaps++;
-			metap->hashm_spares[splitnum]++;
+			metap->hashm_spares[split_grp][split_phase]++;
 			MarkBufferDirty(metabuf);
 		}
 
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 622cc4b..0d17e90 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -422,7 +422,7 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 		/* Allow interrupts, in case N is huge */
 		CHECK_FOR_INTERRUPTS();
 
-		blkno = BUCKET_TO_BLKNO(metap, i);
+		blkno = bucket_to_blkno(metap, i);
 		buf = _hash_getnewbuf(rel, blkno, forkNum);
 		_hash_initbuf(buf, metap->hashm_maxbucket, i, LH_BUCKET_PAGE, false);
 		MarkBufferDirty(buf);
@@ -502,14 +502,15 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	Page		page;
 	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
+	uint32		spare_index;
 	uint32		i;
 
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
-	 * given the estimated number of tuples.  We round up the result to the
-	 * next power of 2, however, and always force at least 2 bucket pages. The
-	 * upper limit is determined by considerations explained in
+	 * given the estimated number of tuples.  We round up the result to the
+	 * total number of buckets which have to be allocated before using its
+	 * _hashm_spares index slot, however, and always force at least 2 bucket
+	 * pages. The upper limit is determined by considerations explained in
 	 * _hash_expandtable().
 	 */
 	dnumbuckets = num_tuples / ffactor;
@@ -518,11 +519,10 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	else if (dnumbuckets >= (double) 0x40000000)
 		num_buckets = 0x40000000;
 	else
-		num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
+		num_buckets = _hash_get_totalbuckets(_hash_spareindex(dnumbuckets));
 
-	log2_num_buckets = _hash_log2(num_buckets);
-	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
-	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
+	spare_index = _hash_spareindex(num_buckets);
+	Assert(spare_index < (HASH_MAX_SPLITPOINTS * HASH_SPLITPOINT_PHASES));
 
 	page = BufferGetPage(buf);
 	if (initpage)
@@ -563,18 +563,20 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
-	 * blocks 1 to N.  The first freespace bitmap page is in block N+1. Since
-	 * N is a power of 2, we can set the masks this way:
+	 * blocks 1 to N.  The first freespace bitmap page is in block N+1.
 	 */
-	metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
-	metap->hashm_highmask = (num_buckets << 1) - 1;
+	metap->hashm_maxbucket = num_buckets - 1;
+
+	/* set highmask, which should be sufficient to cover num_buckets. */
+	metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;
+	metap->hashm_lowmask = (metap->hashm_highmask >> 1);
 
 	MemSet(metap->hashm_spares, 0, sizeof(metap->hashm_spares));
 	MemSet(metap->hashm_mapp, 0, sizeof(metap->hashm_mapp));
 
 	/* Set up mapping for one spare page after the initial splitpoints */
-	metap->hashm_spares[log2_num_buckets] = 1;
-	metap->hashm_ovflpoint = log2_num_buckets;
+	metap->hashm_spares[SP_GRP(spare_index)][SP_PHASE(spare_index)] = 1;
+	metap->hashm_ovflpoint = spare_index;
 	metap->hashm_firstfree = 0;
 
 	/*
@@ -655,7 +657,7 @@ restart_expand:
 	 * Ideally we'd allow bucket numbers up to UINT_MAX-1 (no higher because
 	 * the calculation maxbucket+1 mustn't overflow).  Currently we restrict
 	 * to half that because of overflow looping in _hash_log2() and
-	 * insufficient space in hashm_spares[].  It's moot anyway because an
+	 * insufficient space in hashm_spares[][].  It's moot anyway because an
 	 * index with 2^32 buckets would certainly overflow BlockNumber and hence
 	 * _hash_alloc_buckets() would fail, but if we supported buckets smaller
 	 * than a disk block then this would be an independent constraint.
@@ -682,7 +684,7 @@ restart_expand:
 
 	old_bucket = (new_bucket & metap->hashm_lowmask);
 
-	start_oblkno = BUCKET_TO_BLKNO(metap, old_bucket);
+	start_oblkno = bucket_to_blkno(metap, old_bucket);
 
 	buf_oblkno = _hash_getbuf_with_condlock_cleanup(rel, start_oblkno, LH_BUCKET_PAGE);
 	if (!buf_oblkno)
@@ -766,32 +768,49 @@ restart_expand:
 	 * There shouldn't be any active scan on new bucket.
 	 *
 	 * Note: it is safe to compute the new bucket's blkno here, even though we
-	 * may still need to update the BUCKET_TO_BLKNO mapping.  This is because
-	 * the current value of hashm_spares[hashm_ovflpoint] correctly shows
-	 * where we are going to put a new splitpoint's worth of buckets.
+	 * may still need to update the bucket_to_blkno mapping.  This is because
+	 * the current value of
+	 * hashm_spares[SP_GRP(hashm_ovflpoint)][SP_PHASE(hashm_ovflpoint)]
+	 * correctly shows where we are going to put a new splitpoint's worth of
+	 * buckets.
 	 */
-	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
+	start_nblkno = bucket_to_blkno(metap, new_bucket);
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to allocate a new batch of bucket pages.
+	 * If the split point is increasing we need to allocate a new batch of
+	 * bucket pages.
 	 */
-	spare_ndx = _hash_log2(new_bucket + 1);
+	spare_ndx = _hash_spareindex(new_bucket + 1);
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
+		uint32		buckets_toadd = 0;
+		uint32		splitpoint_group = 0;
+
 		Assert(spare_ndx == metap->hashm_ovflpoint + 1);
 
 		/*
-		 * The number of buckets in the new splitpoint is equal to the total
-		 * number already in existence, i.e. new_bucket.  Currently this maps
-		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.  We treat allocation of buckets as a
-		 * separate WAL-logged action.  Even if we fail after this operation,
+		 * The number of buckets in the new splitpoint group is equal to the
+		 * total number already in existence, i.e. new_bucket. But we do not
+		 * allocate them at once. Each splitpoint group will have 4 slots, we
+		 * distribute the buckets equally among them. So we allocate only one
+		 * fourth of total buckets in new splitpoint group at a time to consume
+		 * one phase after another. We treat allocation of buckets as a
+		 * separate WAL-logged action. Even if we fail after this operation,
 		 * won't leak bucket pages; rather, the next split will consume this
 		 * space. In any case, even without failure we don't use all the space
 		 * in one split operation.
 		 */
-		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
+		splitpoint_group = (spare_ndx >> 2);
+
+		/*
+		 * Each phase in the splitpoint_group will allocate
+		 * 2^(splitpoint_group - 1) buckets if we divide buckets among 4
+		 * slots. The 0th group is a special case where we allocate 1 bucket
+		 * per slot as we cannot reduce it any further. See README for more
+		 * details.
+		 */
+		buckets_toadd = (splitpoint_group) ? (1 << (splitpoint_group - 1)) : 1;
+		if (!_hash_alloc_buckets(rel, start_nblkno, buckets_toadd))
 		{
 			/* can't split due to BlockNumber overflow */
 			_hash_relbuf(rel, buf_oblkno);
@@ -836,14 +855,15 @@ restart_expand:
 	}
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to adjust the hashm_spares[] array and
-	 * hashm_ovflpoint so that future overflow pages will be created beyond
-	 * this new batch of bucket pages.
+	 * If the split point is increasing we need to adjust the hashm_spares[]
+	 * array and hashm_ovflpoint so that future overflow pages will be created
+	 * beyond this new batch of bucket pages.
 	 */
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
-		metap->hashm_spares[spare_ndx] = metap->hashm_spares[metap->hashm_ovflpoint];
+		metap->hashm_spares[SP_GRP(spare_ndx)][SP_PHASE(spare_ndx)] =
+			metap->hashm_spares[SP_GRP(metap->hashm_ovflpoint)]
+							   [SP_PHASE(metap->hashm_ovflpoint)];
 		metap->hashm_ovflpoint = spare_ndx;
 		metap_update_splitpoint = true;
 	}
@@ -917,12 +937,15 @@ restart_expand:
 
 		if (metap_update_splitpoint)
 		{
+			uint32 splitpoint_grp = SP_GRP(metap->hashm_ovflpoint);
+			uint32 splitpoint_phase = SP_PHASE(metap->hashm_ovflpoint);
+
 			xlrec.flags |= XLH_SPLIT_META_UPDATE_SPLITPOINT;
 			XLogRegisterBufData(2, (char *) &metap->hashm_ovflpoint,
 								sizeof(uint32));
 			XLogRegisterBufData(2,
-					   (char *) &metap->hashm_spares[metap->hashm_ovflpoint],
-								sizeof(uint32));
+				(char *) &metap->hashm_spares[splitpoint_grp][splitpoint_phase],
+				sizeof(uint32));
 		}
 
 		XLogRegisterData((char *) &xlrec, SizeOfHashSplitAllocPage);
@@ -1543,7 +1566,7 @@ _hash_getbucketbuf_from_hashkey(Relation rel, uint32 hashkey, int access,
 									  metap->hashm_highmask,
 									  metap->hashm_lowmask);
 
-		blkno = BUCKET_TO_BLKNO(metap, bucket);
+		blkno = bucket_to_blkno(metap, bucket);
 
 		/* Fetch the primary bucket page for the bucket */
 		buf = _hash_getbuf(rel, blkno, access, LH_BUCKET_PAGE);
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index 0e0f393..8aa8769 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -56,9 +56,8 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 	 * num_buckets buckets in the index, the appropriate mask can be computed
 	 * as follows.
 	 *
-	 * Note: at present, the passed-in num_buckets is always a power of 2, so
-	 * we could just compute num_buckets - 1.  We prefer not to assume that
-	 * here, though.
+	 * NOTE : This hash_mask calculation should be in sync with similar
+	 * calculation in _hash_init_metabuffer.
 	 */
 	hspool->hash_mask = (((uint32) 1) << _hash_log2(num_buckets)) - 1;
 
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 2e99719..30d6a6a 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -25,6 +25,19 @@
 			old_bucket | (lowmask + 1)
 
 /*
+ * bucket_to_blkno -- given the bucket returns its block number in index.
+ */
+BlockNumber
+bucket_to_blkno(HashMetaPage metap, Bucket B)
+{
+	uint32 prev_spare_idx = _hash_spareindex(B + 1) - 1;
+
+	return	((BlockNumber) ((B) +
+		((B) ? (metap)->hashm_spares[SP_GRP(prev_spare_idx)]
+									[SP_PHASE(prev_spare_idx)] : 0)) + 1);
+}
+
+/*
  * _hash_checkqual -- does the index tuple satisfy the scan conditions?
  */
 bool
@@ -149,6 +162,87 @@ _hash_log2(uint32 num)
 	return i;
 }
 
+#define SPLITPOINT_PHASES_PER_GRP 4
+#define SPLITPOINT_PHASE_MASK (SPLITPOINT_PHASES_PER_GRP - 1)
+
+#define TOTAL_SPLITPOINT_PHASES_BEFORE_GROUP(sp_g) (sp_g > 0 ? (sp_g << 2) : 0)
+
+/*
+ * This is just a trick to save a division operation. If you look into the
+ * bitmap of 0-based bucket_num 2nd and 3rd most significant bit will indicate
+ * which phase of allocation the bucket_num belogs to with in the group. This
+ * is because at every splitpoint group we allocate 2^x buckets and we have
+ * divided the allocation process into 4 equal phases. This macro returns value
+ * from 0 to 3.
+ */
+#define SPLITPOINT_PHASES_WITHIN_GROUP(sp_g, bucket_num) \
+		((sp_g > 0) ? (((bucket_num) >> (sp_g - 1)) & SPLITPOINT_PHASE_MASK) : \
+						(bucket_num))
+
+/*
+ * At splitpoint group 0 we have 2^(0 + 2) = 4 buckets, then at splitpoint
+ * group 1 we have 2^(1 + 2) = 8 total buckets. As the doubling continues at
+ * splitpoint group "x" we will have 2^(x + 2) total buckets. Total buckets
+ * before x splitpoint group will be 2^(x + 1). At each phase of allocation
+ * within splitpoint group we add 2^(x - 1) buckets, as we have to divide the
+ * task of allocation of 2^(x + 1) buckets among 4 phases.
+ */
+#define BUCKETS_BEFORE_SP_GRP(sp_g) ((sp_g == 0) ? 0 : (1 << (sp_g + 1)))
+#define BUCKETS_WITHIN_SP_GRP(sp_g, nphase) \
+						((nphase) * ((sp_g == 0) ? 1 : (1 << (sp_g - 1))))
+
+/*
+ * _hash_spareindex -- returns spare index / global splitpoint phase of the
+ *					   bucket
+ */
+uint32
+_hash_spareindex(uint32 num_bucket)
+{
+	uint32		splitpoint_group;
+
+	/*
+	 * The first 4 bucket belongs to first splitpoint group 0. And since group
+	 * 0 have 4 = 2^2 buckets, we double them in group 1. So total buckets
+	 * after group 1 is 8 = 2^3. Then again at group 2 we add another 2^3
+	 * buckets to double the total number of buckets, which will become 2^4. I
+	 * think by this time we can see a pattern which say if num_bucket > 4
+	 * splitpoint group = log2(num_bucket) - 2
+	 */
+	if (num_bucket <= 4)
+		splitpoint_group = 0;	/* converted to base 0. */
+	else
+		splitpoint_group = _hash_log2(num_bucket) - 2;
+
+	return TOTAL_SPLITPOINT_PHASES_BEFORE_GROUP(splitpoint_group) +
+		SPLITPOINT_PHASES_WITHIN_GROUP(splitpoint_group,
+									   num_bucket - 1); /* make it 0-based */
+}
+
+/*
+ *	_hash_get_totalbuckets -- returns total number of buckets allocated till
+ *							the given splitpoint phase.
+ */
+uint32
+_hash_get_totalbuckets(uint32 splitpoint_phase)
+{
+	uint32		splitpoint_group;
+
+	/*
+	 * Every 4 consecutive phases make one group and groups are numbered
+	 * from 0
+	 */
+	splitpoint_group = (splitpoint_phase / SPLITPOINT_PHASES_PER_GRP);
+
+	/*
+	 * total_buckets = total number of buckets before its splitpoint group +
+	 * total buckets within its splitpoint group until given splitpoint_phase.
+	 */
+	return BUCKETS_BEFORE_SP_GRP(splitpoint_group) +
+		BUCKETS_WITHIN_SP_GRP(splitpoint_group,
+							  (splitpoint_phase % SPLITPOINT_PHASES_PER_GRP)
+							  + 1);
+}
+
 /*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
@@ -383,7 +477,7 @@ _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bucket)
 	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
 	metap = HashPageGetMeta(BufferGetPage(metabuf));
 
-	blkno = BUCKET_TO_BLKNO(metap, old_bucket);
+	blkno = bucket_to_blkno(metap, old_bucket);
 
 	_hash_relbuf(rel, metabuf);
 
@@ -413,7 +507,7 @@ _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket)
 	new_bucket = _hash_get_newbucket_from_oldbucket(rel, old_bucket,
 													metap->hashm_lowmask,
 													metap->hashm_maxbucket);
-	blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+	blkno = bucket_to_blkno(metap, new_bucket);
 
 	_hash_relbuf(rel, metabuf);
 
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index eb1df57..35c2807 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -35,9 +35,6 @@ typedef uint32 Bucket;
 
 #define InvalidBucket	((Bucket) 0xFFFFFFFF)
 
-#define BUCKET_TO_BLKNO(metap,B) \
-		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
-
 /*
  * Special space for hash index pages.
  *
@@ -161,13 +158,14 @@ typedef HashScanOpaqueData *HashScanOpaque;
 #define HASH_VERSION	2		/* 2 signifies only hash key value is stored */
 
 /*
- * spares[] holds the number of overflow pages currently allocated at or
- * before a certain splitpoint. For example, if spares[3] = 7 then there are
- * 7 ovflpages before splitpoint 3 (compare BUCKET_TO_BLKNO macro).  The
- * value in spares[ovflpoint] increases as overflow pages are added at the
- * end of the index.  Once ovflpoint increases (ie, we have actually allocated
- * the bucket pages belonging to that splitpoint) the number of spares at the
- * prior splitpoint cannot change anymore.
+ * spares[][] holds the number of overflow pages currently allocated at or
+ * before a certain splitpoint phase. For example, if spares[3][0] = 7 then
+ * there are 7 ovflpages before splitpoint phase 12.  The value in
+ * spares[ovflpoint / HASH_SPLITPOINT_PHASES][ovflpoint % HASH_SPLITPOINT_PHASES]
+ * increases as overflow pages are added at the end of the index.  Once
+ * ovflpoint increases (ie, we have actually allocated the bucket pages
+ * belonging to that splitpoint phase) the number of spares at the
+ * prior splitpoint phases cannot change anymore.
  *
  * ovflpages that have been recycled for reuse can be found by looking at
  * bitmaps that are stored within ovflpages dedicated for the purpose.
@@ -181,6 +179,9 @@ typedef HashScanOpaqueData *HashScanOpaque;
  * limit us to 64 GB of overflow space...)
  */
 #define HASH_MAX_SPLITPOINTS		32
+#define HASH_SPLITPOINT_PHASES		4
+#define SP_GRP(splitpoint)			(splitpoint / HASH_SPLITPOINT_PHASES)
+#define SP_PHASE(splitpoint)		(splitpoint % HASH_SPLITPOINT_PHASES)
 #define HASH_MAX_BITMAPS			128
 
 typedef struct HashMetaPageData
@@ -201,8 +202,9 @@ typedef struct HashMetaPageData
 	uint32		hashm_firstfree;	/* lowest-number free ovflpage (bit#) */
 	uint32		hashm_nmaps;	/* number of bitmap pages */
 	RegProcedure hashm_procid;	/* hash procedure id from pg_proc */
-	uint32		hashm_spares[HASH_MAX_SPLITPOINTS];		/* spare pages before
-														 * each splitpoint */
+
+	/* spare pages before each splitpoint phase */
+	uint32		hashm_spares[HASH_MAX_SPLITPOINTS][HASH_SPLITPOINT_PHASES];
 	BlockNumber hashm_mapp[HASH_MAX_BITMAPS];	/* blknos of ovfl bitmaps */
 } HashMetaPageData;
 
@@ -283,7 +285,7 @@ typedef HashMetaPageData *HashMetaPage;
 
 
 /* public routines */
-
+BlockNumber bucket_to_blkno(HashMetaPage metap, Bucket B);
 extern IndexBuildResult *hashbuild(Relation heap, Relation index,
 		  struct IndexInfo *indexInfo);
 extern void hashbuildempty(Relation index);
@@ -382,6 +384,8 @@ extern uint32 _hash_datum2hashkey_type(Relation rel, Datum key, Oid keytype);
 extern Bucket _hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
 					 uint32 highmask, uint32 lowmask);
 extern uint32 _hash_log2(uint32 num);
+extern uint32 _hash_spareindex(uint32 num_bucket);
+extern uint32 _hash_get_totalbuckets(uint32 splitpoint_phase);
 extern void _hash_checkpage(Relation rel, Buffer buf, int flags);
 extern uint32 _hash_get_indextuple_hashkey(IndexTuple itup);
 extern bool _hash_convert_tuple(Relation index,
expand_hashbucket_efficiently_06_spares_1dimension.patchapplication/octet-stream; name=expand_hashbucket_efficiently_06_spares_1dimension.patchDownload
diff --git a/contrib/pageinspect/expected/hash.out b/contrib/pageinspect/expected/hash.out
index 3ba01f6..c97b279 100644
--- a/contrib/pageinspect/expected/hash.out
+++ b/contrib/pageinspect/expected/hash.out
@@ -51,13 +51,13 @@ bsize     | 8152
 bmsize    | 4096
 bmshift   | 15
 maxbucket | 3
-highmask  | 7
-lowmask   | 3
-ovflpoint | 2
+highmask  | 3
+lowmask   | 1
+ovflpoint | 3
 firstfree | 0
 nmaps     | 1
 procid    | 450
-spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 
 SELECT magic, version, ntuples, bsize, bmsize, bmshift, maxbucket, highmask,
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 9f41bb2..f19066f 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -667,11 +667,11 @@ bmshift   | 15
 maxbucket | 12512
 highmask  | 16383
 lowmask   | 8191
-ovflpoint | 14
+ovflpoint | 50
 firstfree | 1204
 nmaps     | 1
 procid    | 450
-spares    | {0,0,0,0,0,0,1,1,1,1,1,4,59,704,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,4,4,4,45,55,58,59,508,567,628,704,1193,1202,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {65,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 </screen>
      </para>
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..8789805 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -58,35 +58,50 @@ rules to support a variable number of overflow pages while not having to
 move primary bucket pages around after they are created.
 
 Primary bucket pages (henceforth just "bucket pages") are allocated in
-power-of-2 groups, called "split points" in the code.  Buckets 0 and 1
-are created when the index is initialized.  At the first split, buckets 2
-and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
-when bucket 8 is needed, buckets 8-15 are allocated; etc.  All the bucket
-pages of a power-of-2 group appear consecutively in the index.  This
-addressing scheme allows the physical location of a bucket page to be
-computed from the bucket number relatively easily, using only a small
-amount of control information.  We take the log2() of the bucket number
-to determine which split point S the bucket belongs to, and then simply
-add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
-metapage) to compute the physical address.  hashm_spares[S] can be
-interpreted as the total number of overflow pages that have been allocated
-before the bucket pages of splitpoint S.  hashm_spares[0] is always 0,
-so that buckets 0 and 1 (which belong to splitpoint 0) always appear at
-block numbers 1 and 2, just after the meta page.  We always have
+power-of-2 groups, called "split points" in the code.  That means on every new
+split points we double the existing number of buckets.  And, it seems bad to
+allocate huge chunks of bucket pages all at once and we take ages to consume it.
+To avoid this exponential growth of index size, we did a trick to breakup
+allocation of buckets at splitpoint into 4 equal phases.  If 2^x is the total
+buckets need to be allocated at a splitpoint (from now on we shall call this
+as splitpoint group), then we allocate 1/4th (2^(x - 2)) of total buckets at
+each phase of splitpoint group. Next quarter of allocation will only happen if
+buckets of previous phase has been already consumed.  Since for buckets number
+< 4 we cannot further divide it in to multiple phases, the first splitpoint
+group 0's allocation is done as follows {1, 1, 1, 1} = 4 buckets in total, the
+numbers in curly braces indicate number of buckets allocated within each phase
+of splitpoint group 0.  In next splitpoint group 1 the allocation phases will
+be as follow {1, 1, 1, 1} = 8 buckets in total.  And, for splitpoint group 2
+and 3 allocation phase will be {2, 2, 2, 2} = 16 buckets in total and
+{4, 4, 4, 4} = 32 buckets in total.  We can see that at each splitpoint group
+we double the total number of buckets from previous group but in a incremental
+phase.  The bucket pages allocated within one phase of a splitpoint group will
+appear consecutively in the index.  This addressing scheme allows the physical
+location of a bucket page to be computed from the bucket number relatively
+easily, using only a small amount of control information.  If we look at the
+function _hash_spareindex for a given bucket number we first compute splitpoint
+group it belongs to and then the phase with in it to which the bucket belongs
+to.  Adding them we get the global splitpoint phase number S to which the
+bucket belongs and then simply add "hashm_spares[S] + 1" (where hashm_spares[] is
+an array stored in the metapage) with given bucket number to compute its
+physical address.  The hashm_spares[S] can be interpreted as the total number
+of overflow pages that have been allocated before the bucket pages of
+splitpoint phase S.  The hashm_spares[0] is always 0, so that buckets 0 and 1
+(which belong to splitpoint group 0's phase 1 and phase 2 respectively) always
+appear at block numbers 1 and 2, just after the meta page.  We always have
 hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
-former.  The difference between the two represents the number of overflow
-pages appearing between the bucket page groups of splitpoints N and N+1.
-
+former. The difference between the two represents the number of overflow
+pages appearing between the bucket page groups of splitpoints phase N and N+1.
 (Note: the above describes what happens when filling an initially minimally
 sized hash index.  In practice, we try to estimate the required index size
-and allocate a suitable number of splitpoints immediately, to avoid
+and allocate a suitable number of splitpoints phases immediately, to avoid
 expensive re-splitting during initial index build.)
 
 When S splitpoints exist altogether, the array entries hashm_spares[0]
 through hashm_spares[S] are valid; hashm_spares[S] records the current
 total number of overflow pages.  New overflow pages are created as needed
 at the end of the index, and recorded by incrementing hashm_spares[S].
-When it is time to create a new splitpoint's worth of bucket pages, we
+When it is time to create a new splitpoint phase's worth of bucket pages, we
 copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
 stored in the hashm_ovflpoint field of the meta page).  This has the
 effect of reserving the correct number of bucket pages at the end of the
@@ -101,7 +116,7 @@ We have to allow the case "greater than" because it's possible that during
 an index extension we crash after allocating filesystem space and before
 updating the metapage.  Note that on filesystems that allow "holes" in
 files, it's entirely likely that pages before the logical EOF are not yet
-allocated: when we allocate a new splitpoint's worth of bucket pages, we
+allocated: when we allocate a new splitpoint phase's worth of bucket pages, we
 physically zero the last such page to force the EOF up, and the first such
 page will be used immediately, but the intervening pages are not written
 until needed.
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index a3cae21..fe0b4ef 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -49,7 +49,7 @@ bitno_to_blkno(HashMetaPage metap, uint32 ovflbitnum)
 	 * Convert to absolute page number by adding the number of bucket pages
 	 * that exist before this split point.
 	 */
-	return (BlockNumber) ((1 << i) + ovflbitnum);
+	return (BlockNumber) (_hash_get_totalbuckets(i) + ovflbitnum);
 }
 
 /*
@@ -67,14 +67,15 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
 	/* Determine the split number containing this page */
 	for (i = 1; i <= splitnum; i++)
 	{
-		if (ovflblkno <= (BlockNumber) (1 << i))
+		if (ovflblkno <= (BlockNumber) _hash_get_totalbuckets(i))
 			break;				/* oops */
-		bitnum = ovflblkno - (1 << i);
+		bitnum = ovflblkno - _hash_get_totalbuckets(i);
 
 		/*
 		 * bitnum has to be greater than number of overflow page added in
 		 * previous split point. The overflow page at this splitnum (i) if any
-		 * should start from ((2 ^ i) + metap->hashm_spares[i - 1] + 1).
+		 * should start from
+		 * (_hash_get_totalbuckets(i) + metap->hashm_spares[i - 1] + 1).
 		 */
 		if (bitnum > metap->hashm_spares[i - 1] &&
 			bitnum <= metap->hashm_spares[i])
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 622cc4b..bd16e56 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -502,14 +502,15 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	Page		page;
 	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
+	uint32		spare_index;
 	uint32		i;
 
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
-	 * given the estimated number of tuples.  We round up the result to the
-	 * next power of 2, however, and always force at least 2 bucket pages. The
-	 * upper limit is determined by considerations explained in
+	 * given the estimated number of tuples.  We round up the result to the
+	 * total number of buckets which have to be allocated before using its
+	 * _hashm_spares index slot, however, and always force at least 2 bucket
+	 * pages. The upper limit is determined by considerations explained in
 	 * _hash_expandtable().
 	 */
 	dnumbuckets = num_tuples / ffactor;
@@ -518,11 +519,10 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	else if (dnumbuckets >= (double) 0x40000000)
 		num_buckets = 0x40000000;
 	else
-		num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
+		num_buckets = _hash_get_totalbuckets(_hash_spareindex(dnumbuckets));
 
-	log2_num_buckets = _hash_log2(num_buckets);
-	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
-	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
+	spare_index = _hash_spareindex(num_buckets);
+	Assert(spare_index < HASH_MAX_SPLITPOINTS);
 
 	page = BufferGetPage(buf);
 	if (initpage)
@@ -563,18 +563,20 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
-	 * blocks 1 to N.  The first freespace bitmap page is in block N+1. Since
-	 * N is a power of 2, we can set the masks this way:
+	 * blocks 1 to N.  The first freespace bitmap page is in block N+1.
 	 */
-	metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
-	metap->hashm_highmask = (num_buckets << 1) - 1;
+	metap->hashm_maxbucket = num_buckets - 1;
+
+	/* set highmask, which should be sufficient to cover num_buckets. */
+	metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;
+	metap->hashm_lowmask = (metap->hashm_highmask >> 1);
 
 	MemSet(metap->hashm_spares, 0, sizeof(metap->hashm_spares));
 	MemSet(metap->hashm_mapp, 0, sizeof(metap->hashm_mapp));
 
 	/* Set up mapping for one spare page after the initial splitpoints */
-	metap->hashm_spares[log2_num_buckets] = 1;
-	metap->hashm_ovflpoint = log2_num_buckets;
+	metap->hashm_spares[spare_index] = 1;
+	metap->hashm_ovflpoint = spare_index;
 	metap->hashm_firstfree = 0;
 
 	/*
@@ -773,25 +775,40 @@ restart_expand:
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to allocate a new batch of bucket pages.
+	 * If the split point is increasing we need to allocate a new batch of
+	 * bucket pages.
 	 */
-	spare_ndx = _hash_log2(new_bucket + 1);
+	spare_ndx = _hash_spareindex(new_bucket + 1);
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
+		uint32		buckets_toadd = 0;
+		uint32		splitpoint_group = 0;
+
 		Assert(spare_ndx == metap->hashm_ovflpoint + 1);
 
 		/*
-		 * The number of buckets in the new splitpoint is equal to the total
-		 * number already in existence, i.e. new_bucket.  Currently this maps
-		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.  We treat allocation of buckets as a
-		 * separate WAL-logged action.  Even if we fail after this operation,
+		 * The number of buckets in the new splitpoint group is equal to the
+		 * total number already in existence, i.e. new_bucket. But we do not
+		 * allocate them at once. Each splitpoint group will have 4 slots, we
+		 * distribute the buckets equally among them. So we allocate only one
+		 * fourth of total buckets in new splitpoint group at a time to consume
+		 * one phase after another. We treat allocation of buckets as a
+		 * separate WAL-logged action. Even if we fail after this operation,
 		 * won't leak bucket pages; rather, the next split will consume this
 		 * space. In any case, even without failure we don't use all the space
 		 * in one split operation.
 		 */
-		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
+		splitpoint_group = (spare_ndx >> 2);
+
+		/*
+		 * Each phase in the splitpoint_group will allocate
+		 * 2^(splitpoint_group - 1) buckets if we divide buckets among 4
+		 * slots. The 0th group is a special case where we allocate 1 bucket
+		 * per slot as we cannot reduce it any further. See README for more
+		 * details.
+		 */
+		buckets_toadd = (splitpoint_group) ? (1 << (splitpoint_group - 1)) : 1;
+		if (!_hash_alloc_buckets(rel, start_nblkno, buckets_toadd))
 		{
 			/* can't split due to BlockNumber overflow */
 			_hash_relbuf(rel, buf_oblkno);
@@ -836,10 +853,9 @@ restart_expand:
 	}
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to adjust the hashm_spares[] array and
-	 * hashm_ovflpoint so that future overflow pages will be created beyond
-	 * this new batch of bucket pages.
+	 * If the split point is increasing we need to adjust the hashm_spares[]
+	 * array and hashm_ovflpoint so that future overflow pages will be created
+	 * beyond this new batch of bucket pages.
 	 */
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index 0e0f393..8aa8769 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -56,9 +56,8 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 	 * num_buckets buckets in the index, the appropriate mask can be computed
 	 * as follows.
 	 *
-	 * Note: at present, the passed-in num_buckets is always a power of 2, so
-	 * we could just compute num_buckets - 1.  We prefer not to assume that
-	 * here, though.
+	 * NOTE : This hash_mask calculation should be in sync with similar
+	 * calculation in _hash_init_metabuffer.
 	 */
 	hspool->hash_mask = (((uint32) 1) << _hash_log2(num_buckets)) - 1;
 
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 2e99719..6f1943b 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -149,6 +149,87 @@ _hash_log2(uint32 num)
 	return i;
 }
 
+#define SPLITPOINT_PHASES_PER_GRP 4
+#define SPLITPOINT_PHASE_MASK (SPLITPOINT_PHASES_PER_GRP - 1)
+
+#define TOTAL_SPLITPOINT_PHASES_BEFORE_GROUP(sp_g) (sp_g > 0 ? (sp_g << 2) : 0)
+
+/*
+ * This is just a trick to save a division operation. If you look into the
+ * bitmap of 0-based bucket_num 2nd and 3rd most significant bit will indicate
+ * which phase of allocation the bucket_num belogs to with in the group. This
+ * is because at every splitpoint group we allocate 2^x buckets and we have
+ * divided the allocation process into 4 equal phases. This macro returns value
+ * from 0 to 3.
+ */
+#define SPLITPOINT_PHASES_WITHIN_GROUP(sp_g, bucket_num) \
+		((sp_g > 0) ? (((bucket_num) >> (sp_g - 1)) & SPLITPOINT_PHASE_MASK) : \
+						(bucket_num))
+
+/*
+ * At splitpoint group 0 we have 2^(0 + 2) = 4 buckets, then at splitpoint
+ * group 1 we have 2^(1 + 2) = 8 total buckets. As the doubling continues at
+ * splitpoint group "x" we will have 2^(x + 2) total buckets. Total buckets
+ * before x splitpoint group will be 2^(x + 1). At each phase of allocation
+ * within splitpoint group we add 2^(x - 1) buckets, as we have to divide the
+ * task of allocation of 2^(x + 1) buckets among 4 phases.
+ */
+#define BUCKETS_BEFORE_SP_GRP(sp_g) ((sp_g == 0) ? 0 : (1 << (sp_g + 1)))
+#define BUCKETS_WITHIN_SP_GRP(sp_g, nphase) \
+						((nphase) * ((sp_g == 0) ? 1 : (1 << (sp_g - 1))))
+
+/*
+ * _hash_spareindex -- returns spare index / global splitpoint phase of the
+ *					   bucket
+ */
+uint32
+_hash_spareindex(uint32 num_bucket)
+{
+	uint32		splitpoint_group;
+
+	/*
+	 * The first 4 bucket belongs to first splitpoint group 0. And since group
+	 * 0 have 4 = 2^2 buckets, we double them in group 1. So total buckets
+	 * after group 1 is 8 = 2^3. Then again at group 2 we add another 2^3
+	 * buckets to double the total number of buckets, which will become 2^4. I
+	 * think by this time we can see a pattern which say if num_bucket > 4
+	 * splitpoint group = log2(num_bucket) - 2
+	 */
+	if (num_bucket <= 4)
+		splitpoint_group = 0;	/* converted to base 0. */
+	else
+		splitpoint_group = _hash_log2(num_bucket) - 2;
+
+	return TOTAL_SPLITPOINT_PHASES_BEFORE_GROUP(splitpoint_group) +
+		SPLITPOINT_PHASES_WITHIN_GROUP(splitpoint_group,
+									   num_bucket - 1); /* make it 0-based */
+}
+
+/*
+ *	_hash_get_totalbuckets -- returns total number of buckets allocated till
+ *							the given splitpoint phase.
+ */
+uint32
+_hash_get_totalbuckets(uint32 splitpoint_phase)
+{
+	uint32		splitpoint_group;
+
+	/*
+	 * Every 4 consecutive phases make one group and groups are numbered
+	 * from 0
+	 */
+	splitpoint_group = (splitpoint_phase / SPLITPOINT_PHASES_PER_GRP);
+
+	/*
+	 * total_buckets = total number of buckets before its splitpoint group +
+	 * total buckets within its splitpoint group until given splitpoint_phase.
+	 */
+	return BUCKETS_BEFORE_SP_GRP(splitpoint_group) +
+		BUCKETS_WITHIN_SP_GRP(splitpoint_group,
+							  (splitpoint_phase % SPLITPOINT_PHASES_PER_GRP)
+							  + 1);
+}
+
 /*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index eb1df57..ac502ef 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -36,7 +36,7 @@ typedef uint32 Bucket;
 #define InvalidBucket	((Bucket) 0xFFFFFFFF)
 
 #define BUCKET_TO_BLKNO(metap,B) \
-		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
+		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_spareindex((B)+1)-1] : 0)) + 1)
 
 /*
  * Special space for hash index pages.
@@ -180,7 +180,7 @@ typedef HashScanOpaqueData *HashScanOpaque;
  * needing to fit into the metapage.  (With 8K block size, 128 bitmaps
  * limit us to 64 GB of overflow space...)
  */
-#define HASH_MAX_SPLITPOINTS		32
+#define HASH_MAX_SPLITPOINTS		128
 #define HASH_MAX_BITMAPS			128
 
 typedef struct HashMetaPageData
@@ -382,6 +382,8 @@ extern uint32 _hash_datum2hashkey_type(Relation rel, Datum key, Oid keytype);
 extern Bucket _hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
 					 uint32 highmask, uint32 lowmask);
 extern uint32 _hash_log2(uint32 num);
+extern uint32 _hash_spareindex(uint32 num_bucket);
+extern uint32 _hash_get_totalbuckets(uint32 splitpoint_phase);
 extern void _hash_checkpage(Relation rel, Buffer buf, int flags);
 extern uint32 _hash_get_indextuple_hashkey(IndexTuple itup);
 extern bool _hash_convert_tuple(Relation index,
#13Amit Kapila
amit.kapila16@gmail.com
In reply to: Mithun Cy (#12)
Re: [POC] A better way to expand hash indexes.

On Sat, Mar 25, 2017 at 10:13 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

On Tue, Mar 21, 2017 at 8:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Sure, I was telling you based on that. If you are implicitly treating
it as 2-dimensional array, it might be easier to compute the array
offsets.

I think calculating spares offset will not become anyway much simpler
we still need to calculate split group and split phase separately. I
have added a patch to show how a 2-dimensional spares code looks like
and where all we need changes.

I think the one-dimensional patch has fewer places to touch, so that
looks better to me. However, I think there are still hard-coded values
and assumptions in the code which we should try to improve.

1.
+ /*
+ * The first 4 bucket belongs to first splitpoint group 0. And since group
+ * 0 have 4 = 2^2 buckets, we double them in group 1. So total buckets
+ * after group 1 is 8 = 2^3. Then again at group 2 we add another 2^3
+ * buckets to double the total number of buckets, which will become 2^4. I
+ * think by this time we can see a pattern which say if num_bucket > 4
+ * splitpoint group = log2(num_bucket) - 2
+ */
+ if (num_bucket <= 4)
+ splitpoint_group = 0; /* converted to base 0. */
+ else
+ splitpoint_group = _hash_log2(num_bucket) - 2;

This patch defines split point group zero as having four buckets, and
the above calculation is based on that. I feel you can define it like
#define Buckets_First_Split_Group 4 and then use it in the above code.
Also, in the else part the number 2 looks awkward; can we define it as
log2_buckets_first_group = _hash_log2(Buckets_First_Split_Group) or
something like that? I think that way the code will look neat. I don't
like the way the above comment is worded, even though it is helpful
for understanding the calculation. If you want, you can add such a
comment in the file header; here it looks out of place.
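
For illustration, roughly what I have in mind is something like the
below (only a sketch, not tested; the result is unchanged because
_hash_log2(Buckets_First_Split_Group) is 2):

#define Buckets_First_Split_Group 4

	uint32		log2_buckets_first_group =
					_hash_log2(Buckets_First_Split_Group);

	if (num_bucket <= Buckets_First_Split_Group)
		splitpoint_group = 0;
	else
		splitpoint_group = _hash_log2(num_bucket) - log2_buckets_first_group;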

2.
+power-of-2 groups, called "split points" in the code.  That means on every new
+split points we double the existing number of buckets.

split points/split point

3.
+ * which phase of allocation the bucket_num belogs to with in the group.

/belogs/belongs

I have still not completely reviewed the patch, as I have run out of
time for today.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Amit Kapila (#13)
1 attachment(s)
Re: [POC] A better way to expand hash indexes.

Thanks, Amit, for the review.
On Sat, Mar 25, 2017 at 7:03 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think the one-dimensional patch has fewer places to touch, so that
looks better to me. However, I think there are still hard-coded values
and assumptions in the code which we should try to improve.

Great! I will continue with the 1-dimensional spares improvement.

1.
+ /*
+ * The first 4 bucket belongs to first splitpoint group 0. And since group
+ * 0 have 4 = 2^2 buckets, we double them in group 1. So total buckets
+ * after group 1 is 8 = 2^3. Then again at group 2 we add another 2^3
+ * buckets to double the total number of buckets, which will become 2^4. I
+ * think by this time we can see a pattern which say if num_bucket > 4
+ * splitpoint group = log2(num_bucket) - 2
+ */
+ if (num_bucket <= 4)
+ splitpoint_group = 0; /* converted to base 0. */
+ else
+ splitpoint_group = _hash_log2(num_bucket) - 2;

This patch defines split point group zero as having four buckets, and
the above calculation is based on that. I feel you can define it like
#define Buckets_First_Split_Group 4 and then use it in the above code.
Also, in the else part the number 2 looks awkward; can we define it as
log2_buckets_first_group = _hash_log2(Buckets_First_Split_Group) or
something like that? I think that way the code will look neat. I don't
like the way the above comment is worded, even though it is helpful
for understanding the calculation. If you want, you can add such a
comment in the file header; here it looks out of place.

I have removed the comments and defined a new macro which maps a
bucket to its splitpoint group:

#define BUCKET_TO_SPLITPOINT_GRP(num_bucket) \
((num_bucket <= Buckets_First_Split_Group) ? 0 : \
(_hash_log2(num_bucket) - 2))

I could not find a way to explain the "minus 2" better than: "The
splitpoint group of a given bucket can be taken as
(_hash_log2(bucket) - 2); we subtract 2 because at group x we have
2^(x + 2) buckets in total." I have now added that alongside the
existing comments, which I think should make them a little better.
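
To make the arithmetic easy to verify, here is a small standalone
sketch (not part of the patch; local_log2 below just mimics
_hash_log2, i.e. it returns the smallest i such that 2^i >= num) which
prints the splitpoint group a few bucket counts map to:

#include <stdio.h>

typedef unsigned int uint32;

#define Buckets_First_Split_Group 4

/* stand-in for _hash_log2(): smallest i such that 2^i >= num */
static uint32
local_log2(uint32 num)
{
	uint32		i;
	uint32		limit;

	for (i = 0, limit = 1; limit < num; i++, limit <<= 1)
		;
	return i;
}

#define BUCKET_TO_SPLITPOINT_GRP(num_bucket) \
	(((num_bucket) <= Buckets_First_Split_Group) ? 0 : \
	 (local_log2(num_bucket) - 2))

int
main(void)
{
	/* group x ends at 2^(x + 2) buckets, hence the "minus 2" above */
	uint32		samples[] = {1, 4, 5, 8, 9, 16, 17, 32, 33, 64};
	int			i;

	for (i = 0; i < 10; i++)
		printf("num_bucket = %2u -> splitpoint group %u\n",
			   samples[i], BUCKET_TO_SPLITPOINT_GRP(samples[i]));
	return 0;
}

Running it prints group 0 for buckets 1..4, group 1 for 5..8, group 2
for 9..16 and so on, which matches the 2^(x + 2) group boundaries.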

Adding comments about the spares array to hashutil.c's file header did
not seem right to me; I think the README already covers some of those
details.

2.
+power-of-2 groups, called "split points" in the code.  That means on every new
+split points we double the existing number of buckets.

split points/split point

Fixed.

3.
+ * which phase of allocation the bucket_num belogs to with in the group.

/belogs/belongs

Fixed

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Attachments:

expand_hashbucket_efficiently_07.patchapplication/octet-stream; name=expand_hashbucket_efficiently_07.patchDownload
diff --git a/contrib/pageinspect/expected/hash.out b/contrib/pageinspect/expected/hash.out
index 3ba01f6..c97b279 100644
--- a/contrib/pageinspect/expected/hash.out
+++ b/contrib/pageinspect/expected/hash.out
@@ -51,13 +51,13 @@ bsize     | 8152
 bmsize    | 4096
 bmshift   | 15
 maxbucket | 3
-highmask  | 7
-lowmask   | 3
-ovflpoint | 2
+highmask  | 3
+lowmask   | 1
+ovflpoint | 3
 firstfree | 0
 nmaps     | 1
 procid    | 450
-spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 
 SELECT magic, version, ntuples, bsize, bmsize, bmshift, maxbucket, highmask,
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 9f41bb2..f19066f 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -667,11 +667,11 @@ bmshift   | 15
 maxbucket | 12512
 highmask  | 16383
 lowmask   | 8191
-ovflpoint | 14
+ovflpoint | 50
 firstfree | 1204
 nmaps     | 1
 procid    | 450
-spares    | {0,0,0,0,0,0,1,1,1,1,1,4,59,704,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,4,4,4,45,55,58,59,508,567,628,704,1193,1202,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {65,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 </screen>
      </para>
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..6721ee1 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -58,35 +58,50 @@ rules to support a variable number of overflow pages while not having to
 move primary bucket pages around after they are created.
 
 Primary bucket pages (henceforth just "bucket pages") are allocated in
-power-of-2 groups, called "split points" in the code.  Buckets 0 and 1
-are created when the index is initialized.  At the first split, buckets 2
-and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
-when bucket 8 is needed, buckets 8-15 are allocated; etc.  All the bucket
-pages of a power-of-2 group appear consecutively in the index.  This
-addressing scheme allows the physical location of a bucket page to be
-computed from the bucket number relatively easily, using only a small
-amount of control information.  We take the log2() of the bucket number
-to determine which split point S the bucket belongs to, and then simply
-add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
-metapage) to compute the physical address.  hashm_spares[S] can be
-interpreted as the total number of overflow pages that have been allocated
-before the bucket pages of splitpoint S.  hashm_spares[0] is always 0,
-so that buckets 0 and 1 (which belong to splitpoint 0) always appear at
-block numbers 1 and 2, just after the meta page.  We always have
+power-of-2 groups, called "split points" in the code.  That means on every new
+split point we double the existing number of buckets.  And, it seems bad to
+allocate huge chunks of bucket pages all at once and we take ages to consume it.
+To avoid this exponential growth of index size, we did a trick to breakup
+allocation of buckets at splitpoint into 4 equal phases.  If 2^x is the total
+buckets need to be allocated at a splitpoint (from now on we shall call this
+as splitpoint group), then we allocate 1/4th (2^(x - 2)) of total buckets at
+each phase of splitpoint group. Next quarter of allocation will only happen if
+buckets of previous phase has been already consumed.  Since for buckets number
+< 4 we cannot further divide it in to multiple phases, the first splitpoint
+group 0's allocation is done as follows {1, 1, 1, 1} = 4 buckets in total, the
+numbers in curly braces indicate number of buckets allocated within each phase
+of splitpoint group 0.  In next splitpoint group 1 the allocation phases will
+be as follow {1, 1, 1, 1} = 8 buckets in total.  And, for splitpoint group 2
+and 3 allocation phase will be {2, 2, 2, 2} = 16 buckets in total and
+{4, 4, 4, 4} = 32 buckets in total.  We can see that at each splitpoint group
+we double the total number of buckets from previous group but in a incremental
+phase.  The bucket pages allocated within one phase of a splitpoint group will
+appear consecutively in the index.  This addressing scheme allows the physical
+location of a bucket page to be computed from the bucket number relatively
+easily, using only a small amount of control information.  If we look at the
+function _hash_spareindex for a given bucket number we first compute splitpoint
+group it belongs to and then the phase with in it to which the bucket belongs
+to.  Adding them we get the global splitpoint phase number S to which the
+bucket belongs and then simply add "hashm_spares[S] + 1" (where hashm_spares[] is
+an array stored in the metapage) with given bucket number to compute its
+physical address.  The hashm_spares[S] can be interpreted as the total number
+of overflow pages that have been allocated before the bucket pages of
+splitpoint phase S.  The hashm_spares[0] is always 0, so that buckets 0 and 1
+(which belong to splitpoint group 0's phase 1 and phase 2 respectively) always
+appear at block numbers 1 and 2, just after the meta page.  We always have
 hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
-former.  The difference between the two represents the number of overflow
-pages appearing between the bucket page groups of splitpoints N and N+1.
-
+former. The difference between the two represents the number of overflow
+pages appearing between the bucket page groups of splitpoints phase N and N+1.
 (Note: the above describes what happens when filling an initially minimally
 sized hash index.  In practice, we try to estimate the required index size
-and allocate a suitable number of splitpoints immediately, to avoid
+and allocate a suitable number of splitpoints phases immediately, to avoid
 expensive re-splitting during initial index build.)
 
 When S splitpoints exist altogether, the array entries hashm_spares[0]
 through hashm_spares[S] are valid; hashm_spares[S] records the current
 total number of overflow pages.  New overflow pages are created as needed
 at the end of the index, and recorded by incrementing hashm_spares[S].
-When it is time to create a new splitpoint's worth of bucket pages, we
+When it is time to create a new splitpoint phase's worth of bucket pages, we
 copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
 stored in the hashm_ovflpoint field of the meta page).  This has the
 effect of reserving the correct number of bucket pages at the end of the
@@ -101,7 +116,7 @@ We have to allow the case "greater than" because it's possible that during
 an index extension we crash after allocating filesystem space and before
 updating the metapage.  Note that on filesystems that allow "holes" in
 files, it's entirely likely that pages before the logical EOF are not yet
-allocated: when we allocate a new splitpoint's worth of bucket pages, we
+allocated: when we allocate a new splitpoint phase's worth of bucket pages, we
 physically zero the last such page to force the EOF up, and the first such
 page will be used immediately, but the intervening pages are not written
 until needed.
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index a3cae21..fe0b4ef 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -49,7 +49,7 @@ bitno_to_blkno(HashMetaPage metap, uint32 ovflbitnum)
 	 * Convert to absolute page number by adding the number of bucket pages
 	 * that exist before this split point.
 	 */
-	return (BlockNumber) ((1 << i) + ovflbitnum);
+	return (BlockNumber) (_hash_get_totalbuckets(i) + ovflbitnum);
 }
 
 /*
@@ -67,14 +67,15 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
 	/* Determine the split number containing this page */
 	for (i = 1; i <= splitnum; i++)
 	{
-		if (ovflblkno <= (BlockNumber) (1 << i))
+		if (ovflblkno <= (BlockNumber) _hash_get_totalbuckets(i))
 			break;				/* oops */
-		bitnum = ovflblkno - (1 << i);
+		bitnum = ovflblkno - _hash_get_totalbuckets(i);
 
 		/*
 		 * bitnum has to be greater than number of overflow page added in
 		 * previous split point. The overflow page at this splitnum (i) if any
-		 * should start from ((2 ^ i) + metap->hashm_spares[i - 1] + 1).
+		 * should start from
+		 * (_hash_get_totalbuckets(i) + metap->hashm_spares[i - 1] + 1).
 		 */
 		if (bitnum > metap->hashm_spares[i - 1] &&
 			bitnum <= metap->hashm_spares[i])
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 622cc4b..bd16e56 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -502,14 +502,15 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	Page		page;
 	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
+	uint32		spare_index;
 	uint32		i;
 
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
-	 * given the estimated number of tuples.  We round up the result to the
-	 * next power of 2, however, and always force at least 2 bucket pages. The
-	 * upper limit is determined by considerations explained in
+	 * given the estimated number of tuples.  We round up the result to the
+	 * total number of buckets which have to be allocated before using its
+	 * _hashm_spares index slot, however, and always force at least 2 bucket
+	 * pages. The upper limit is determined by considerations explained in
 	 * _hash_expandtable().
 	 */
 	dnumbuckets = num_tuples / ffactor;
@@ -518,11 +519,10 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	else if (dnumbuckets >= (double) 0x40000000)
 		num_buckets = 0x40000000;
 	else
-		num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
+		num_buckets = _hash_get_totalbuckets(_hash_spareindex(dnumbuckets));
 
-	log2_num_buckets = _hash_log2(num_buckets);
-	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
-	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
+	spare_index = _hash_spareindex(num_buckets);
+	Assert(spare_index < HASH_MAX_SPLITPOINTS);
 
 	page = BufferGetPage(buf);
 	if (initpage)
@@ -563,18 +563,20 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
-	 * blocks 1 to N.  The first freespace bitmap page is in block N+1. Since
-	 * N is a power of 2, we can set the masks this way:
+	 * blocks 1 to N.  The first freespace bitmap page is in block N+1.
 	 */
-	metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
-	metap->hashm_highmask = (num_buckets << 1) - 1;
+	metap->hashm_maxbucket = num_buckets - 1;
+
+	/* set highmask, which should be sufficient to cover num_buckets. */
+	metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;
+	metap->hashm_lowmask = (metap->hashm_highmask >> 1);
 
 	MemSet(metap->hashm_spares, 0, sizeof(metap->hashm_spares));
 	MemSet(metap->hashm_mapp, 0, sizeof(metap->hashm_mapp));
 
 	/* Set up mapping for one spare page after the initial splitpoints */
-	metap->hashm_spares[log2_num_buckets] = 1;
-	metap->hashm_ovflpoint = log2_num_buckets;
+	metap->hashm_spares[spare_index] = 1;
+	metap->hashm_ovflpoint = spare_index;
 	metap->hashm_firstfree = 0;
 
 	/*
@@ -773,25 +775,40 @@ restart_expand:
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to allocate a new batch of bucket pages.
+	 * If the split point is increasing we need to allocate a new batch of
+	 * bucket pages.
 	 */
-	spare_ndx = _hash_log2(new_bucket + 1);
+	spare_ndx = _hash_spareindex(new_bucket + 1);
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
+		uint32		buckets_toadd = 0;
+		uint32		splitpoint_group = 0;
+
 		Assert(spare_ndx == metap->hashm_ovflpoint + 1);
 
 		/*
-		 * The number of buckets in the new splitpoint is equal to the total
-		 * number already in existence, i.e. new_bucket.  Currently this maps
-		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.  We treat allocation of buckets as a
-		 * separate WAL-logged action.  Even if we fail after this operation,
+		 * The number of buckets in the new splitpoint group is equal to the
+		 * total number already in existence, i.e. new_bucket. But we do not
+		 * allocate them at once. Each splitpoint group will have 4 slots, we
+		 * distribute the buckets equally among them. So we allocate only one
+		 * fourth of total buckets in new splitpoint group at a time to consume
+		 * one phase after another. We treat allocation of buckets as a
+		 * separate WAL-logged action. Even if we fail after this operation,
 		 * won't leak bucket pages; rather, the next split will consume this
 		 * space. In any case, even without failure we don't use all the space
 		 * in one split operation.
 		 */
-		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
+		splitpoint_group = (spare_ndx >> 2);
+
+		/*
+		 * Each phase in the splitpoint_group will allocate
+		 * 2^(splitpoint_group - 1) buckets if we divide buckets among 4
+		 * slots. The 0th group is a special case where we allocate 1 bucket
+		 * per slot as we cannot reduce it any further. See README for more
+		 * details.
+		 */
+		buckets_toadd = (splitpoint_group) ? (1 << (splitpoint_group - 1)) : 1;
+		if (!_hash_alloc_buckets(rel, start_nblkno, buckets_toadd))
 		{
 			/* can't split due to BlockNumber overflow */
 			_hash_relbuf(rel, buf_oblkno);
@@ -836,10 +853,9 @@ restart_expand:
 	}
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to adjust the hashm_spares[] array and
-	 * hashm_ovflpoint so that future overflow pages will be created beyond
-	 * this new batch of bucket pages.
+	 * If the split point is increasing we need to adjust the hashm_spares[]
+	 * array and hashm_ovflpoint so that future overflow pages will be created
+	 * beyond this new batch of bucket pages.
 	 */
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index 0e0f393..8aa8769 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -56,9 +56,8 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 	 * num_buckets buckets in the index, the appropriate mask can be computed
 	 * as follows.
 	 *
-	 * Note: at present, the passed-in num_buckets is always a power of 2, so
-	 * we could just compute num_buckets - 1.  We prefer not to assume that
-	 * here, though.
+	 * NOTE : This hash_mask calculation should be in sync with similar
+	 * calculation in _hash_init_metabuffer.
 	 */
 	hspool->hash_mask = (((uint32) 1) << _hash_log2(num_buckets)) - 1;
 
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 2e99719..98975e1 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -149,6 +149,84 @@ _hash_log2(uint32 num)
 	return i;
 }
 
+#define SPLITPOINT_PHASES_PER_GRP 4
+#define SPLITPOINT_PHASE_MASK (SPLITPOINT_PHASES_PER_GRP - 1)
+#define Buckets_First_Split_Group 4
+
+#define TOTAL_SPLITPOINT_PHASES_BEFORE_GROUP(sp_g) (sp_g > 0 ? (sp_g << 2) : 0)
+
+/*
+ * This is just a trick to save a division operation. If you look into the
+ * bitmap of 0-based bucket_num 2nd and 3rd most significant bit will indicate
+ * which phase of allocation the bucket_num belongs to with in the group. This
+ * is because at every splitpoint group we allocate (2 ^ x) buckets and we have
+ * divided the allocation process into 4 equal phases. This macro returns value
+ * from 0 to 3.
+ */
+#define SPLITPOINT_PHASES_WITHIN_GROUP(sp_g, bucket_num) \
+		((sp_g > 0) ? (((bucket_num) >> (sp_g - 1)) & SPLITPOINT_PHASE_MASK) : \
+						(bucket_num))
+
+/*
+ * At splitpoint group 0 we have 2 ^ (0 + 2) = 4 buckets, then at splitpoint
+ * group 1 we have 2 ^ (1 + 2) = 8 total buckets. As the doubling continues at
+ * splitpoint group "x" we will have 2 ^ (x + 2) total buckets. Total buckets
+ * before x splitpoint group will be 2 ^ (x + 1). At each phase of allocation
+ * within splitpoint group we add 2 ^ (x - 1) buckets, as we have to divide the
+ * task of allocation of 2 ^ (x + 1) buckets among 4 phases.
+ *
+ * Also, splitpoint group of a given bucket can be taken as
+ * (_hash_log2(bucket) - 2). Subtracted by 2 because each group have
+ * 2 ^ (x + 2) buckets.
+ */
+#define BUCKETS_BEFORE_SP_GRP(sp_g) ((sp_g == 0) ? 0 : (1 << (sp_g + 1)))
+#define BUCKETS_WITHIN_SP_GRP(sp_g, nphase) \
+						((nphase) * ((sp_g == 0) ? 1 : (1 << (sp_g - 1))))
+#define BUCKET_TO_SPLITPOINT_GRP(num_bucket) \
+		((num_bucket <= Buckets_First_Split_Group) ? 0 : \
+												(_hash_log2(num_bucket) - 2))
+
+/*
+ * _hash_spareindex -- returns spare index / global splitpoint phase of the
+ *					   bucket
+ */
+uint32
+_hash_spareindex(uint32 num_bucket)
+{
+	uint32		splitpoint_group;
+
+	splitpoint_group = BUCKET_TO_SPLITPOINT_GRP(num_bucket);
+
+	return TOTAL_SPLITPOINT_PHASES_BEFORE_GROUP(splitpoint_group) +
+		SPLITPOINT_PHASES_WITHIN_GROUP(splitpoint_group,
+									   num_bucket - 1); /* make it 0-based */
+}
+
+/*
+ *	_hash_get_totalbuckets -- returns total number of buckets allocated till
+ *							the given splitpoint phase.
+ */
+uint32
+_hash_get_totalbuckets(uint32 splitpoint_phase)
+{
+	uint32		splitpoint_group;
+
+	/*
+	 * Every 4 consecutive phases makes one group and group's are numbered
+	 * from 0
+	 */
+	splitpoint_group = (splitpoint_phase / SPLITPOINT_PHASES_PER_GRP);
+
+	/*
+	 * total_buckets = total number of buckets before its splitpoint group +
+	 * total buckets within its splitpoint group until given splitpoint_phase.
+	 */
+	return BUCKETS_BEFORE_SP_GRP(splitpoint_group) +
+		BUCKETS_WITHIN_SP_GRP(splitpoint_group,
+							  (splitpoint_phase % SPLITPOINT_PHASES_PER_GRP)
+							  + 1);
+}
+
 /*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index eb1df57..ac502ef 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -36,7 +36,7 @@ typedef uint32 Bucket;
 #define InvalidBucket	((Bucket) 0xFFFFFFFF)
 
 #define BUCKET_TO_BLKNO(metap,B) \
-		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
+		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_spareindex((B)+1)-1] : 0)) + 1)
 
 /*
  * Special space for hash index pages.
@@ -180,7 +180,7 @@ typedef HashScanOpaqueData *HashScanOpaque;
  * needing to fit into the metapage.  (With 8K block size, 128 bitmaps
  * limit us to 64 GB of overflow space...)
  */
-#define HASH_MAX_SPLITPOINTS		32
+#define HASH_MAX_SPLITPOINTS		128
 #define HASH_MAX_BITMAPS			128
 
 typedef struct HashMetaPageData
@@ -382,6 +382,8 @@ extern uint32 _hash_datum2hashkey_type(Relation rel, Datum key, Oid keytype);
 extern Bucket _hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
 					 uint32 highmask, uint32 lowmask);
 extern uint32 _hash_log2(uint32 num);
+extern uint32 _hash_spareindex(uint32 num_bucket);
+extern uint32 _hash_get_totalbuckets(uint32 splitpoint_phase);
 extern void _hash_checkpage(Relation rel, Buffer buf, int flags);
 extern uint32 _hash_get_indextuple_hashkey(IndexTuple itup);
 extern bool _hash_convert_tuple(Relation index,
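
For reference, here is a minimal stand-alone sketch (plain C, not part of
the patch) of the mapping the hashutil.c hunk above implements; it mirrors
the BUCKETS_BEFORE_SP_GRP and BUCKETS_WITHIN_SP_GRP formulas and prints the
cumulative bucket count for the first few splitpoint phases, so each group
of four phases can be seen doubling the running total:

#include <stdio.h>
#include <stdint.h>

#define PHASES_PER_GRP 4

/* cumulative buckets allocated up to and including a splitpoint phase */
static uint32_t
totalbuckets(uint32_t phase)
{
	uint32_t	grp = phase / PHASES_PER_GRP;
	uint32_t	before = (grp == 0) ? 0 : ((uint32_t) 1 << (grp + 1));
	uint32_t	per_phase = (grp == 0) ? 1 : ((uint32_t) 1 << (grp - 1));

	return before + per_phase * (phase % PHASES_PER_GRP + 1);
}

int
main(void)
{
	for (uint32_t phase = 0; phase < 12; phase++)
		printf("phase %2u -> %u buckets\n", phase, totalbuckets(phase));
	return 0;					/* prints 1..4, 5..8, 10,12,14,16 */
}
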
#15Amit Kapila
amit.kapila16@gmail.com
In reply to: Mithun Cy (#14)
Re: [POC] A better way to expand hash indexes.

On Sun, Mar 26, 2017 at 11:26 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

Thanks, Amit for the review.
On Sat, Mar 25, 2017 at 7:03 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think the one-dimensional patch has fewer places to touch, so that looks
better to me. However, I think there are still hard-coded values and
assumptions in the code which we should try to improve.

Great! I will continue with the one-dimensional spares improvement.

@@ -563,18 +563,20 @@ _hash_init_metabuffer(Buffer buf, double
num_tuples, RegProcedure procid,\
{
..
  else
- num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
+ num_buckets = _hash_get_totalbuckets(_hash_spareindex(dnumbuckets));
..
..
- metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
- metap->hashm_highmask = (num_buckets << 1) - 1;
+ metap->hashm_maxbucket = num_buckets - 1;
+
+ /* set hishmask, which should be sufficient to cover num_buckets. */
+ metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;
+ metap->hashm_lowmask = (metap->hashm_highmask >> 1);
}

I think we can't change the number of buckets to be created or the lowmask
and highmask calculation here without modifying _h_spoolinit(), because it
sorts the data to be inserted based on the hashkey, which in turn depends
on the number of buckets that we are going to create during the create
index operation. We either need to allow the create index operation to
still always create buckets in power-of-two fashion, or we need to update
_h_spoolinit according to the new computation. One minor drawback of using
the power-of-two scheme for creating buckets during create index is that
it can lead to wasted space and will be inconsistent with what the patch
does during the split operation.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#16Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#15)
Re: [POC] A better way to expand hash indexes.

On Mon, Mar 27, 2017 at 11:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Mar 26, 2017 at 11:26 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

Thanks, Amit for the review.
On Sat, Mar 25, 2017 at 7:03 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think the one-dimensional patch has fewer places to touch, so that looks
better to me. However, I think there are still hard-coded values and
assumptions in the code which we should try to improve.

Great! I will continue with the one-dimensional spares improvement.

@@ -563,18 +563,20 @@ _hash_init_metabuffer(Buffer buf, double
num_tuples, RegProcedure procid,\
{
..
else
- num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
+ num_buckets = _hash_get_totalbuckets(_hash_spareindex(dnumbuckets));
..
..
- metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
- metap->hashm_highmask = (num_buckets << 1) - 1;
+ metap->hashm_maxbucket = num_buckets - 1;
+
+ /* set hishmask, which should be sufficient to cover num_buckets. */
+ metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;
+ metap->hashm_lowmask = (metap->hashm_highmask >> 1);
}

I think we can't change the number of buckets to be created or the lowmask
and highmask calculation here without modifying _h_spoolinit(), because it
sorts the data to be inserted based on the hashkey, which in turn depends
on the number of buckets that we are going to create during the create
index operation. We either need to allow the create index operation to
still always create buckets in power-of-two fashion, or we need to update
_h_spoolinit according to the new computation. One minor drawback of using
the power-of-two scheme for creating buckets during create index is that
it can lead to wasted space and will be inconsistent with what the patch
does during the split operation.

Few more comments:

1.
@@ -149,6 +149,84 @@ _hash_log2(uint32 num)
return i;
}

+#define SPLITPOINT_PHASES_PER_GRP 4
+#define SPLITPOINT_PHASE_MASK (SPLITPOINT_PHASES_PER_GRP - 1)
+#define Buckets_First_Split_Group 4

The defines should be at the beginning of the file.

2.
+/*
+ * At splitpoint group 0 we have 2 ^ (0 + 2) = 4 buckets, then at splitpoint
+ * group 1 we have 2 ^ (1 + 2) = 8 total buckets. As the doubling continues at
+ * splitpoint group "x" we will have 2 ^ (x + 2) total buckets. Total buckets
+ * before x splitpoint group will be 2 ^ (x + 1). At each phase of allocation
+ * within splitpoint group we add 2 ^ (x - 1) buckets, as we have to divide the
+ * task of allocation of 2 ^ (x + 1) buckets among 4 phases.
+ *
+ * Also, splitpoint group of a given bucket can be taken as
+ * (_hash_log2(bucket) - 2). Subtracted by 2 because each group have
+ * 2 ^ (x + 2) buckets.
+ */
..
+#define BUCKET_TO_SPLITPOINT_GRP(num_bucket) \
+ ((num_bucket <= Buckets_First_Split_Group) ? 0 : \
+ (_hash_log2(num_bucket) - 2))

In the above computation the +2 and -2 still bother me. I think you need
them because you have defined split point group zero to have four buckets;
how about not forcing that, and instead defining split point phases only
from the split point which has four or more buckets?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#17Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Mithun Cy (#14)
1 attachment(s)
Re: [POC] A better way to expand hash indexes.

Hi Mithun,

On 03/26/2017 01:56 AM, Mithun Cy wrote:

Thanks, Amit for the review.
On Sat, Mar 25, 2017 at 7:03 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think the one-dimensional patch has fewer places to touch, so that looks
better to me. However, I think there are still hard-coded values and
assumptions in the code which we should try to improve.

Great! I will continue with the one-dimensional spares improvement.

I ran some performance scenarios on the patch to see if the increased
'spares' allocation had an impact. I haven't found any regressions in
that regard.

The attached patch contains some small fixes, mainly to the documentation,
on top of v7.

Best regards,
Jesper

Attachments:

hashbucket_fixes.patch (text/x-patch)
From 5545e48ab7136f17b3d471e0ee679a6db6040865 Mon Sep 17 00:00:00 2001
From: jesperpedersen <jesper.pedersen@redhat.com>
Date: Mon, 27 Mar 2017 14:15:00 -0400
Subject: [PATCH] Small fixes

---
 src/backend/access/hash/README     | 50 +++++++++++++++++++-------------------
 src/backend/access/hash/hashpage.c | 26 ++++++++++----------
 src/backend/access/hash/hashutil.c | 24 +++++++++---------
 3 files changed, 50 insertions(+), 50 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 6721ee1..ca46de7 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -58,50 +58,50 @@ rules to support a variable number of overflow pages while not having to
 move primary bucket pages around after they are created.
 
 Primary bucket pages (henceforth just "bucket pages") are allocated in
-power-of-2 groups, called "split points" in the code.  That means on every new
-split point we double the existing number of buckets.  And, it seems bad to
-allocate huge chunks of bucket pages all at once and we take ages to consume it.
-To avoid this exponential growth of index size, we did a trick to breakup
-allocation of buckets at splitpoint into 4 equal phases.  If 2^x is the total
-buckets need to be allocated at a splitpoint (from now on we shall call this
-as splitpoint group), then we allocate 1/4th (2^(x - 2)) of total buckets at
-each phase of splitpoint group. Next quarter of allocation will only happen if
+power-of-2 groups, called "split points" in the code.  That means at every new
+split point we double the existing number of buckets.  Allocating huge chunks
+of bucket pages all at once isn't optimal and we will take ages to consume those.
+To avoid this exponential growth of index size, we did use a trick to breakup
+allocation of buckets at the split points into 4 equal phases.  If 2^x is the total
+buckets needed to be allocated at a split point (from now on we shall call this
+a split point group), then we allocate 1/4th (2^(x - 2)) of total buckets at
+each phase of the split point group. Next quarter of allocation will only happen if
 buckets of previous phase has been already consumed.  Since for buckets number
-< 4 we cannot further divide it in to multiple phases, the first splitpoint
+< 4 we cannot further divide it in to multiple phases, the first split point
 group 0's allocation is done as follows {1, 1, 1, 1} = 4 buckets in total, the
 numbers in curly braces indicate number of buckets allocated within each phase
-of splitpoint group 0.  In next splitpoint group 1 the allocation phases will
-be as follow {1, 1, 1, 1} = 8 buckets in total.  And, for splitpoint group 2
+of split point group 0.  In next split point group 1 the allocation phases will
+be as follow {1, 1, 1, 1} = 8 buckets in total.  And, for split point group 2
 and 3 allocation phase will be {2, 2, 2, 2} = 16 buckets in total and
-{4, 4, 4, 4} = 32 buckets in total.  We can see that at each splitpoint group
-we double the total number of buckets from previous group but in a incremental
-phase.  The bucket pages allocated within one phase of a splitpoint group will
+{4, 4, 4, 4} = 32 buckets in total.  We can see that at each split point group
+we double the total number of buckets from previous group but in an incremental
+phase.  The bucket pages allocated within one phase of a split point group will
 appear consecutively in the index.  This addressing scheme allows the physical
 location of a bucket page to be computed from the bucket number relatively
 easily, using only a small amount of control information.  If we look at the
-function _hash_spareindex for a given bucket number we first compute splitpoint
-group it belongs to and then the phase with in it to which the bucket belongs
-to.  Adding them we get the global splitpoint phase number S to which the
+function _hash_spareindex for a given bucket number we first compute the split point
+group it belongs to and then the phase to which the bucket belongs
+to.  Adding them we get the global split point phase number S to which the
 bucket belongs and then simply add "hashm_spares[S] + 1" (where hashm_spares[] is
 an array stored in the metapage) with given bucket number to compute its
 physical address.  The hashm_spares[S] can be interpreted as the total number
 of overflow pages that have been allocated before the bucket pages of
-splitpoint phase S.  The hashm_spares[0] is always 0, so that buckets 0 and 1
-(which belong to splitpoint group 0's phase 1 and phase 2 respectively) always
+split point phase S.  The hashm_spares[0] is always 0, so that buckets 0 and 1
+(which belong to split point group 0's phase 1 and phase 2 respectively) always
 appear at block numbers 1 and 2, just after the meta page.  We always have
 hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
 former. The difference between the two represents the number of overflow
-pages appearing between the bucket page groups of splitpoints phase N and N+1.
+pages appearing between the bucket page groups of split points phase N and N+1.
 (Note: the above describes what happens when filling an initially minimally
 sized hash index.  In practice, we try to estimate the required index size
-and allocate a suitable number of splitpoints phases immediately, to avoid
+and allocate a suitable number of split point phases immediately, to avoid
 expensive re-splitting during initial index build.)
 
-When S splitpoints exist altogether, the array entries hashm_spares[0]
-through hashm_spares[S] are valid; hashm_spares[S] records the current
+When S split point exists, the array entries hashm_spares[0]
+through hashm_spares[S] are all valid; hashm_spares[S] records the current
 total number of overflow pages.  New overflow pages are created as needed
 at the end of the index, and recorded by incrementing hashm_spares[S].
-When it is time to create a new splitpoint phase's worth of bucket pages, we
+When it is time to create a new split point phase's worth of bucket pages, we
 copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
 stored in the hashm_ovflpoint field of the meta page).  This has the
 effect of reserving the correct number of bucket pages at the end of the
@@ -116,7 +116,7 @@ We have to allow the case "greater than" because it's possible that during
 an index extension we crash after allocating filesystem space and before
 updating the metapage.  Note that on filesystems that allow "holes" in
 files, it's entirely likely that pages before the logical EOF are not yet
-allocated: when we allocate a new splitpoint phase's worth of bucket pages, we
+allocated: when we allocate a new split point phase's worth of bucket pages, we
 physically zero the last such page to force the EOF up, and the first such
 page will be used immediately, but the intervening pages are not written
 until needed.
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 9b63414..fcb9711 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -507,9 +507,9 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
-	 * given the estimated number of tuples.  We round up the result to total
-	 * the number of buckets which has to be allocated before using its
-	 * _hashm_spares index slot, however, and always force at least 2 bucket
+	 * given the estimated number of tuples.  We round up the result to the
+	 * total number of buckets which has to be allocated before using its
+	 * _hashm_spares index slot. However always force at least 2 bucket
 	 * pages. The upper limit is determined by considerations explained in
 	 * _hash_expandtable().
 	 */
@@ -567,14 +567,14 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	 */
 	metap->hashm_maxbucket = num_buckets - 1;
 
-	/* set hishmask, which should be sufficient to cover num_buckets. */
+	/* set highmask, which should be sufficient to cover num_buckets. */
 	metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;
 	metap->hashm_lowmask = (metap->hashm_highmask >> 1);
 
 	MemSet(metap->hashm_spares, 0, sizeof(metap->hashm_spares));
 	MemSet(metap->hashm_mapp, 0, sizeof(metap->hashm_mapp));
 
-	/* Set up mapping for one spare page after the initial splitpoints */
+	/* Set up mapping for one spare page after the initial split point */
 	metap->hashm_spares[spare_index] = 1;
 	metap->hashm_ovflpoint = spare_index;
 	metap->hashm_firstfree = 0;
@@ -770,7 +770,7 @@ restart_expand:
 	 * Note: it is safe to compute the new bucket's blkno here, even though we
 	 * may still need to update the BUCKET_TO_BLKNO mapping.  This is because
 	 * the current value of hashm_spares[hashm_ovflpoint] correctly shows
-	 * where we are going to put a new splitpoint's worth of buckets.
+	 * where we are going to put a new split point's worth of buckets.
 	 */
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
@@ -787,11 +787,11 @@ restart_expand:
 		Assert(spare_ndx == metap->hashm_ovflpoint + 1);
 
 		/*
-		 * The number of buckets in the new splitpoint group is equal to the
+		 * The number of buckets in the new split point group is equal to the
 		 * total number already in existence, i.e. new_bucket. But we do not
-		 * allocate them at once. Each splitpoint group will have 4 slots, we
+		 * allocate them at once. Each split point group will have 4 slots, we
 		 * distribute the buckets equally among them. So we allocate only one
-		 * fourth of total buckets in new splitpoint group at a time to consume
+		 * fourth of total buckets in new split point group at a time to consume
 		 * one phase after another. We treat allocation of buckets as a
 		 * separate WAL-logged action. Even if we fail after this operation,
 		 * won't leak bucket pages; rather, the next split will consume this
@@ -976,14 +976,14 @@ fail:
 
 
 /*
- * _hash_alloc_buckets -- allocate a new splitpoint's worth of bucket pages
+ * _hash_alloc_buckets -- allocate a new split point worth of bucket pages
  *
  * This does not need to initialize the new bucket pages; we'll do that as
  * each one is used by _hash_expandtable().  But we have to extend the logical
- * EOF to the end of the splitpoint; this keeps smgr's idea of the EOF in
+ * EOF to the end of the split point; this keeps smgr's idea of the EOF in
  * sync with ours, so that we don't get complaints from smgr.
  *
- * We do this by writing a page of zeroes at the end of the splitpoint range.
+ * We do this by writing a page of zeroes at the end of the split point range.
  * We expect that the filesystem will ensure that the intervening pages read
  * as zeroes too.  On many filesystems this "hole" will not be allocated
  * immediately, which means that the index file may end up more fragmented
@@ -993,7 +993,7 @@ fail:
  * XXX It's annoying that this code is executed with the metapage lock held.
  * We need to interlock against _hash_addovflpage() adding a new overflow page
  * concurrently, but it'd likely be better to use LockRelationForExtension
- * for the purpose.  OTOH, adding a splitpoint is a very infrequent operation,
+ * for the purpose.  OTOH, adding a split point is a very infrequent operation,
  * so it may not be worth worrying about.
  *
  * Returns TRUE if successful, or FALSE if allocation failed due to
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 98975e1..4b89fa3 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -151,7 +151,7 @@ _hash_log2(uint32 num)
 
 #define SPLITPOINT_PHASES_PER_GRP 4
 #define SPLITPOINT_PHASE_MASK (SPLITPOINT_PHASES_PER_GRP - 1)
-#define Buckets_First_Split_Group 4
+#define BUCKETS_FIRST_SPLIT_GROUP 4
 
 #define TOTAL_SPLITPOINT_PHASES_BEFORE_GROUP(sp_g) (sp_g > 0 ? (sp_g << 2) : 0)
 
@@ -159,7 +159,7 @@ _hash_log2(uint32 num)
  * This is just a trick to save a division operation. If you look into the
  * bitmap of 0-based bucket_num 2nd and 3rd most significant bit will indicate
  * which phase of allocation the bucket_num belongs to with in the group. This
- * is because at every splitpoint group we allocate (2 ^ x) buckets and we have
+ * is because at every split point group we allocate (2 ^ x) buckets and we have
  * divided the allocation process into 4 equal phases. This macro returns value
  * from 0 to 3.
  */
@@ -168,14 +168,14 @@ _hash_log2(uint32 num)
 						(bucket_num))
 
 /*
- * At splitpoint group 0 we have 2 ^ (0 + 2) = 4 buckets, then at splitpoint
+ * At split point group 0 we have 2 ^ (0 + 2) = 4 buckets, then at split point
  * group 1 we have 2 ^ (1 + 2) = 8 total buckets. As the doubling continues at
- * splitpoint group "x" we will have 2 ^ (x + 2) total buckets. Total buckets
- * before x splitpoint group will be 2 ^ (x + 1). At each phase of allocation
- * within splitpoint group we add 2 ^ (x - 1) buckets, as we have to divide the
+ * split point group "x" we will have 2 ^ (x + 2) total buckets. Total buckets
+ * before x split point group will be 2 ^ (x + 1). At each phase of allocation
+ * within split point group we add 2 ^ (x - 1) buckets, as we have to divide the
  * task of allocation of 2 ^ (x + 1) buckets among 4 phases.
  *
- * Also, splitpoint group of a given bucket can be taken as
+ * Also, the split point group of a given bucket can be taken as
  * (_hash_log2(bucket) - 2). Subtracted by 2 because each group have
  * 2 ^ (x + 2) buckets.
  */
@@ -183,11 +183,11 @@ _hash_log2(uint32 num)
 #define BUCKETS_WITHIN_SP_GRP(sp_g, nphase) \
 						((nphase) * ((sp_g == 0) ? 1 : (1 << (sp_g - 1))))
 #define BUCKET_TO_SPLITPOINT_GRP(num_bucket) \
-		((num_bucket <= Buckets_First_Split_Group) ? 0 : \
+		((num_bucket <= BUCKETS_FIRST_SPLIT_GROUP) ? 0 : \
 												(_hash_log2(num_bucket) - 2))
 
 /*
- * _hash_spareindex -- returns spare index / global splitpoint phase of the
+ * _hash_spareindex -- returns spare index / global split point phase of the
  *					   bucket
  */
 uint32
@@ -204,7 +204,7 @@ _hash_spareindex(uint32 num_bucket)
 
 /*
  *	_hash_get_totalbuckets -- returns total number of buckets allocated till
- *							the given splitpoint phase.
+ *							the given split point phase.
  */
 uint32
 _hash_get_totalbuckets(uint32 splitpoint_phase)
@@ -218,8 +218,8 @@ _hash_get_totalbuckets(uint32 splitpoint_phase)
 	splitpoint_group = (splitpoint_phase / SPLITPOINT_PHASES_PER_GRP);
 
 	/*
-	 * total_buckets = total number of buckets before its splitpoint group +
-	 * total buckets within its splitpoint group until given splitpoint_phase.
+	 * total_buckets = total number of buckets before its split point group +
+	 * total buckets within its split point group until given splitpoint_phase.
 	 */
 	return BUCKETS_BEFORE_SP_GRP(splitpoint_group) +
 		BUCKETS_WITHIN_SP_GRP(splitpoint_group,
-- 
2.7.4

#18Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Amit Kapila (#15)
4 attachment(s)
Re: [POC] A better way to expand hash indexes.

On Mon, Mar 27, 2017 at 11:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think we can't change the number of buckets to be created or the lowmask
and highmask calculation here without modifying _h_spoolinit(), because it
sorts the data to be inserted based on the hashkey, which in turn depends
on the number of buckets that we are going to create during the create
index operation. We either need to allow the create index operation to
still always create buckets in power-of-two fashion, or we need to update
_h_spoolinit according to the new computation. One minor drawback of using
the power-of-two scheme for creating buckets during create index is that
it can lead to wasted space and will be inconsistent with what the patch
does during the split operation.

Yes, this was a miss. The number of buckets allocated during metap_init
is now not always a power-of-two number. The hashbuild code, which uses
just the hash_mask to decide which bucket a hashkey belongs to, is no
longer sufficient: it can produce buckets beyond max_buckets, so the
sorting of index values based on their buckets will be out of order.
When we then actually insert the same tuples into the hash index we lose
the advantage of the spatial locality which existed before, and hence
index-build performance can degrade.
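
To make the failure mode concrete, here is a minimal sketch (ordinary C
with hypothetical values; num_buckets = 6 is just an example of a
non-power-of-two bucket count the patch can now produce). Sorting by
hash & highmask can yield sort keys 6 and 7 even though only buckets 0..5
exist, while at insert time _hash_hashkey2bucket folds those back into
buckets 2 and 3, so the sorted stream is no longer in insertion-bucket
order:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint32_t	max_bucket = 5;		/* num_buckets = 6 */
	uint32_t	highmask = 7;		/* next power-of-two mask */
	uint32_t	lowmask = 3;

	for (uint32_t hash = 0; hash < 8; hash++)
	{
		uint32_t	sortkey = hash & highmask;	/* what _h_spoolinit sorts by */
		uint32_t	bucket = sortkey;

		if (bucket > max_bucket)
			bucket &= lowmask;		/* what _hash_doinsert effectively does */
		printf("hash %u: sort key %u, insert bucket %u\n",
			   hash, sortkey, bucket);
	}
	return 0;
}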

As you said, we can solve it if we always allocate buckets in
power-of-two fashion when we do the hash index meta page init. On later
occasions, when we double the existing buckets, we can still do the
allocation in 4 equal phases.

But I think there are 2 more ways to solve the same problem:

A. Why not pass all 3 parameters high_mask, low_mask, max_buckets to
tuplesort and let it use _hash_hashkey2bucket to figure out which bucket
each key belongs to, and then sort on that. I think this way we make
sorting and insertion into the hash index consistent with each other.

B. In tuplesort we can use bucket = hash_key % num_buckets instead of
the existing computation, which does a bitwise "and" to determine the
bucket of a hash key. This way we will not wrongly assign buckets beyond
max_buckets, and the sorted hash keys will be in sync with the actual
insertion order of _hash_doinsert.

I am attaching both patches, Patch_A and Patch_B. My preference is
Patch_A, and I am open to suggestions.
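
For illustration only (hypothetical values, not the attached patches
themselves), the two bucket computations the sort path would use look
roughly like this: option A mirrors the mask-and-fold logic of
_hash_hashkey2bucket, which is what _hash_doinsert uses, while option B
is a plain modulo; note that for a given key the two rules need not pick
the same bucket.

#include <stdio.h>
#include <stdint.h>

/* Option A: mask-and-fold, mirroring _hash_hashkey2bucket(). */
static uint32_t
sort_bucket_a(uint32_t hash, uint32_t max_bucket,
			  uint32_t highmask, uint32_t lowmask)
{
	uint32_t	bucket = hash & highmask;

	if (bucket > max_bucket)
		bucket &= lowmask;		/* fold back into an existing bucket */
	return bucket;
}

/* Option B: plain modulo, always in 0 .. num_buckets - 1. */
static uint32_t
sort_bucket_b(uint32_t hash, uint32_t num_buckets)
{
	return hash % num_buckets;
}

int
main(void)
{
	/* hypothetical index with 6 buckets: highmask = 7, lowmask = 3 */
	printf("hash 7: A=%u B=%u\n",
		   sort_bucket_a(7, 5, 7, 3), sort_bucket_b(7, 6));
	return 0;					/* prints A=3 B=1 */
}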

+#define SPLITPOINT_PHASES_PER_GRP 4
+#define SPLITPOINT_PHASE_MASK (SPLITPOINT_PHASES_PER_GRP - 1)
+#define Buckets_First_Split_Group 4

Fixed.

In the above computation the +2 and -2 still bother me. I think you need
them because you have defined split point group zero to have four buckets;
how about not forcing that, and instead defining split point phases only
from the split point which has four or more buckets?

Okay, as suggested, instead of group zero having 4 phases of 1 bucket
each, I have recalculated the spare mapping as below.
Allocating huge chunks of bucket pages all at once isn't optimal, and
we would take ages to consume those. To avoid this exponential growth
of index size, we use a trick to break up the allocation of buckets at
a splitpoint into 4 equal phases. If (2 ^ x) is the total number of
buckets to be allocated at a splitpoint (from now on we shall call this
a splitpoint group), then we allocate 1/4th (2 ^ (x - 2)) of the total
buckets at each phase of the splitpoint group. The next quarter of the
allocation only happens once the buckets of the previous phase have
been consumed. Since for bucket counts < 4 we cannot further divide
the allocation into multiple phases, the first 3 groups have only one
phase of allocation each: groups 0, 1, 2 allocate 1, 1, 2 buckets
respectively, at once, in one phase. For groups > 2, where we allocate
4 or more buckets, the allocation process is distributed among four
equal phases. At group 3 we allocate 4 buckets in 4 different phases
{1, 1, 1, 1}; the numbers in curly braces indicate the number of
buckets allocated within each phase of splitpoint group 3. For
splitpoint groups 4 and 5 the allocation phases are {2, 2, 2, 2} and
{4, 4, 4, 4}, bringing the cumulative totals to 16 and 32 buckets. So
at each splitpoint group we double the total number of buckets of the
previous group, but in incremental phases. The bucket pages allocated
within one phase of a splitpoint group will appear consecutively in
the index.
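
A minimal stand-alone sketch of this revised scheme (ordinary C with
assumed values, not the attached patch): groups 0, 1 and 2 are filled in
a single phase of 1, 1 and 2 buckets, and every group g >= 3 doubles the
bucket count in four phases of 2^(g - 3) buckets each, so the cumulative
totals come out as 1, 2, 4, 8, 16, 32:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint32_t	total = 0;

	for (uint32_t g = 0; g <= 5; g++)
	{
		uint32_t	phases = (g < 3) ? 1 : 4;
		uint32_t	per_phase = (g < 3) ? ((g < 2) ? 1 : 2)
										: ((uint32_t) 1 << (g - 3));

		for (uint32_t p = 0; p < phases; p++)
			total += per_phase;
		printf("after group %u: %u buckets in total\n", g, total);
	}
	return 0;					/* totals: 1, 2, 4, 8, 16, 32 */
}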

Either of the sortbuild_hash_*.patch patches can be applied independently
on top of expand_hashbucket_efficiently_08.patch.
--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Attachments:

sortbuild_hash_B.patch (application/octet-stream)
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index 0e0f393..011f7e4 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -37,7 +37,7 @@ struct HSpool
 {
 	Tuplesortstate *sortstate;	/* state data for tuplesort.c */
 	Relation	index;
-	uint32		hash_mask;		/* bitmask for hash codes */
+	uint32		num_buckets;		/* number of buckets in the index */
 };
 
 
@@ -52,15 +52,12 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 	hspool->index = index;
 
 	/*
-	 * Determine the bitmask for hash code values.  Since there are currently
-	 * num_buckets buckets in the index, the appropriate mask can be computed
-	 * as follows.
-	 *
-	 * Note: at present, the passed-in num_buckets is always a power of 2, so
-	 * we could just compute num_buckets - 1.  We prefer not to assume that
-	 * here, though.
+	 * At this point max_buckets in hash index is num_buckets - 1.
+	 * The "hash key mod num_buckets" will indicate which bucket does the
+	 * hash key belongs to, and will be used to sort the index tuples based on
+	 * their bucket.
 	 */
-	hspool->hash_mask = (((uint32) 1) << _hash_log2(num_buckets)) - 1;
+	hspool->num_buckets = num_buckets;
 
 	/*
 	 * We size the sort area as maintenance_work_mem rather than work_mem to
@@ -69,7 +66,7 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 	 */
 	hspool->sortstate = tuplesort_begin_index_hash(heap,
 												   index,
-												   hspool->hash_mask,
+												   hspool->num_buckets,
 												   maintenance_work_mem,
 												   false);
 
@@ -122,7 +119,7 @@ _h_indexbuild(HSpool *hspool, Relation heapRel)
 #ifdef USE_ASSERT_CHECKING
 		uint32		lasthashkey = hashkey;
 
-		hashkey = _hash_get_indextuple_hashkey(itup) & hspool->hash_mask;
+		hashkey = _hash_get_indextuple_hashkey(itup) % hspool->num_buckets;
 		Assert(hashkey >= lasthashkey);
 #endif
 
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index e1e692d..675c6a8 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -126,6 +126,7 @@
 #include <limits.h>
 
 #include "access/htup_details.h"
+#include "access/hash.h"
 #include "access/nbtree.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -473,7 +474,7 @@ struct Tuplesortstate
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 
 	/* These are specific to the index_hash subcase: */
-	uint32		hash_mask;		/* mask for sortable part of hash code */
+	uint32		num_buckets;		/* to find the bucket of given hash key */
 
 	/*
 	 * These variables are specific to the Datum case; they are set by
@@ -991,7 +992,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 Tuplesortstate *
 tuplesort_begin_index_hash(Relation heapRel,
 						   Relation indexRel,
-						   uint32 hash_mask,
+						   uint32 num_buckets,
 						   int workMem, bool randomAccess)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
@@ -1002,8 +1003,8 @@ tuplesort_begin_index_hash(Relation heapRel,
 #ifdef TRACE_SORT
 	if (trace_sort)
 		elog(LOG,
-		"begin index sort: hash_mask = 0x%x, workMem = %d, randomAccess = %c",
-			 hash_mask,
+		"begin index sort: num_buckets = 0x%x, workMem = %d, randomAccess = %c",
+			 num_buckets,
 			 workMem, randomAccess ? 't' : 'f');
 #endif
 
@@ -1017,7 +1018,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	state->heapRel = heapRel;
 	state->indexRel = indexRel;
 
-	state->hash_mask = hash_mask;
+	state->num_buckets = num_buckets;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -4157,27 +4158,27 @@ static int
 comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 					  Tuplesortstate *state)
 {
-	uint32		hash1;
-	uint32		hash2;
+	Bucket		bucket1;
+	Bucket		bucket2;
 	IndexTuple	tuple1;
 	IndexTuple	tuple2;
 
 	/*
-	 * Fetch hash keys and mask off bits we don't want to sort by. We know
-	 * that the first column of the index tuple is the hash key.
+	 * Get the buckets of hash keys. We know that the first column of the index
+	 * tuple is the hash key.
 	 */
 	Assert(!a->isnull1);
-	hash1 = DatumGetUInt32(a->datum1) & state->hash_mask;
+	bucket1 = DatumGetUInt32(a->datum1) % state->num_buckets;
 	Assert(!b->isnull1);
-	hash2 = DatumGetUInt32(b->datum1) & state->hash_mask;
+	bucket2 = DatumGetUInt32(b->datum1) % state->num_buckets;
 
-	if (hash1 > hash2)
+	if (bucket1 > bucket2)
 		return 1;
-	else if (hash1 < hash2)
+	else if (bucket1 < bucket2)
 		return -1;
 
 	/*
-	 * If hash values are equal, we sort on ItemPointer.  This does not affect
+	 * If buckets are equal, we sort on ItemPointer.  This does not affect
 	 * validity of the finished index, but it may be useful to have index
 	 * scans in physical order.
 	 */
sortbuild_hash_A.patch (application/octet-stream)
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index 0e0f393..04d9c46 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -37,7 +37,15 @@ struct HSpool
 {
 	Tuplesortstate *sortstate;	/* state data for tuplesort.c */
 	Relation	index;
-	uint32		hash_mask;		/* bitmask for hash codes */
+
+	/*
+	 * We sort the hash keys based on the buckets they belong to. Below masks
+	 * are used in _hash_hashkey2bucket to determine the bucket of given hash
+	 * key.
+	 */
+	uint32		high_mask;
+	uint32		low_mask;
+	uint32		max_buckets;
 };
 
 
@@ -56,11 +64,12 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 	 * num_buckets buckets in the index, the appropriate mask can be computed
 	 * as follows.
 	 *
-	 * Note: at present, the passed-in num_buckets is always a power of 2, so
-	 * we could just compute num_buckets - 1.  We prefer not to assume that
-	 * here, though.
+	 * NOTE : This hash mask calculation should be in sync with similar
+	 * calculation in _hash_init_metabuffer.
 	 */
-	hspool->hash_mask = (((uint32) 1) << _hash_log2(num_buckets)) - 1;
+	hspool->high_mask = (((uint32) 1) << _hash_log2(num_buckets)) - 1;
+	hspool->low_mask = (hspool->high_mask >> 1);
+	hspool->max_buckets = num_buckets - 1;
 
 	/*
 	 * We size the sort area as maintenance_work_mem rather than work_mem to
@@ -69,7 +78,9 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 	 */
 	hspool->sortstate = tuplesort_begin_index_hash(heap,
 												   index,
-												   hspool->hash_mask,
+												   hspool->high_mask,
+												   hspool->low_mask,
+												   hspool->max_buckets,
 												   maintenance_work_mem,
 												   false);
 
@@ -122,7 +133,9 @@ _h_indexbuild(HSpool *hspool, Relation heapRel)
 #ifdef USE_ASSERT_CHECKING
 		uint32		lasthashkey = hashkey;
 
-		hashkey = _hash_get_indextuple_hashkey(itup) & hspool->hash_mask;
+		hashkey = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+									   hspool->max_buckets, hspool->high_mask,
+									   hspool->low_mask);
 		Assert(hashkey >= lasthashkey);
 #endif
 
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index e1e692d..5b8aad1 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -127,6 +127,7 @@
 
 #include "access/htup_details.h"
 #include "access/nbtree.h"
+#include "access/hash.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
 #include "commands/tablespace.h"
@@ -473,7 +474,9 @@ struct Tuplesortstate
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 
 	/* These are specific to the index_hash subcase: */
-	uint32		hash_mask;		/* mask for sortable part of hash code */
+	uint32		high_mask;		/* masks for sortable part of hash code */
+	uint32		low_mask;
+	uint32		max_buckets;
 
 	/*
 	 * These variables are specific to the Datum case; they are set by
@@ -991,7 +994,9 @@ tuplesort_begin_index_btree(Relation heapRel,
 Tuplesortstate *
 tuplesort_begin_index_hash(Relation heapRel,
 						   Relation indexRel,
-						   uint32 hash_mask,
+						   uint32 high_mask,
+						   uint32 low_mask,
+						   uint32 max_buckets,
 						   int workMem, bool randomAccess)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
@@ -1002,8 +1007,11 @@ tuplesort_begin_index_hash(Relation heapRel,
 #ifdef TRACE_SORT
 	if (trace_sort)
 		elog(LOG,
-		"begin index sort: hash_mask = 0x%x, workMem = %d, randomAccess = %c",
-			 hash_mask,
+		"begin index sort: high_mask = 0x%x, low_mask = 0x%x, "
+		"max_buckets = 0x%x, workMem = %d, randomAccess = %c",
+			 high_mask,
+			 low_mask,
+			 max_buckets,
 			 workMem, randomAccess ? 't' : 'f');
 #endif
 
@@ -1017,7 +1025,9 @@ tuplesort_begin_index_hash(Relation heapRel,
 	state->heapRel = heapRel;
 	state->indexRel = indexRel;
 
-	state->hash_mask = hash_mask;
+	state->high_mask = high_mask;
+	state->low_mask = low_mask;
+	state->max_buckets = max_buckets;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -4157,8 +4167,8 @@ static int
 comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 					  Tuplesortstate *state)
 {
-	uint32		hash1;
-	uint32		hash2;
+	Bucket		bucket1;
+	Bucket		bucket2;
 	IndexTuple	tuple1;
 	IndexTuple	tuple2;
 
@@ -4167,13 +4177,14 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 	 * that the first column of the index tuple is the hash key.
 	 */
 	Assert(!a->isnull1);
-	hash1 = DatumGetUInt32(a->datum1) & state->hash_mask;
+	bucket1 = _hash_hashkey2bucket(DatumGetUInt32(a->datum1), state->max_buckets,
+								 state->high_mask, state->low_mask);
 	Assert(!b->isnull1);
-	hash2 = DatumGetUInt32(b->datum1) & state->hash_mask;
-
-	if (hash1 > hash2)
+	bucket2 = _hash_hashkey2bucket(DatumGetUInt32(b->datum1), state->max_buckets,
+								 state->high_mask, state->low_mask);
+	if (bucket1 > bucket2)
 		return 1;
-	else if (hash1 < hash2)
+	else if (bucket1 < bucket2)
 		return -1;
 
 	/*
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 5b3f475..9719db4 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -72,7 +72,9 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 							int workMem, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
 						   Relation indexRel,
-						   uint32 hash_mask,
+						   uint32 high_mask,
+						   uint32 low_mask,
+						   uint32 max_buckets,
 						   int workMem, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 					  Oid sortOperator, Oid sortCollation,
yet_another_expand_hashbucket_efficiently_08.patch (application/octet-stream)
commit 617b371baf787c2c45f21d4b2aedd02d7d9031e0
Author: mithun <mithun@localhost.localdomain>
Date:   Tue Mar 28 10:26:34 2017 +0530

    Yet another distribution expand hash efficiently.

diff --git a/contrib/pageinspect/expected/hash.out b/contrib/pageinspect/expected/hash.out
index 3ba01f6..518bdbe 100644
--- a/contrib/pageinspect/expected/hash.out
+++ b/contrib/pageinspect/expected/hash.out
@@ -51,13 +51,13 @@ bsize     | 8152
 bmsize    | 4096
 bmshift   | 15
 maxbucket | 3
-highmask  | 7
-lowmask   | 3
+highmask  | 3
+lowmask   | 1
 ovflpoint | 2
 firstfree | 0
 nmaps     | 1
 procid    | 450
-spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 
 SELECT magic, version, ntuples, bsize, bmsize, bmshift, maxbucket, highmask,
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 9f41bb2..f19066f 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -667,11 +667,11 @@ bmshift   | 15
 maxbucket | 12512
 highmask  | 16383
 lowmask   | 8191
-ovflpoint | 14
+ovflpoint | 50
 firstfree | 1204
 nmaps     | 1
 procid    | 450
-spares    | {0,0,0,0,0,0,1,1,1,1,1,4,59,704,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,4,4,4,45,55,58,59,508,567,628,704,1193,1202,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {65,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 </screen>
      </para>
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..9b12acb 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -58,35 +58,52 @@ rules to support a variable number of overflow pages while not having to
 move primary bucket pages around after they are created.
 
 Primary bucket pages (henceforth just "bucket pages") are allocated in
-power-of-2 groups, called "split points" in the code.  Buckets 0 and 1
-are created when the index is initialized.  At the first split, buckets 2
-and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
-when bucket 8 is needed, buckets 8-15 are allocated; etc.  All the bucket
-pages of a power-of-2 group appear consecutively in the index.  This
-addressing scheme allows the physical location of a bucket page to be
-computed from the bucket number relatively easily, using only a small
-amount of control information.  We take the log2() of the bucket number
-to determine which split point S the bucket belongs to, and then simply
-add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
-metapage) to compute the physical address.  hashm_spares[S] can be
-interpreted as the total number of overflow pages that have been allocated
-before the bucket pages of splitpoint S.  hashm_spares[0] is always 0,
-so that buckets 0 and 1 (which belong to splitpoint 0) always appear at
-block numbers 1 and 2, just after the meta page.  We always have
+power-of-2 groups, called "split points" in the code.  That means at every new
+split point we double the existing number of buckets.  Allocating huge chunks
+of bucket pages all at once isn't optimal and we will take ages to consume
+those. To avoid this exponential growth of index size, we did use a trick to
+breakup allocation of buckets at the splitpoint into 4 equal phases.  If
+(2 ^ x) is the total buckets need to be allocated at a splitpoint (from now on
+we shall call this as a splitpoint group), then we allocate 1/4th (2 ^ (x - 2))
+of total buckets at each phase of splitpoint group. Next quarter of allocation
+will only happen if buckets of previous phase has been already consumed.  Since
+for buckets number < 4 we cannot further divide it in to multiple phases, the
+first 3 group will have only one phase of allocation. The groups 0, 1, 2 will
+allocate 1, 1, 2 buckets respectively at once in one phase. For the groups > 2
+where we allocate 4 or more buckets, the allocation process is distributed among four
+equal phases. At group 3 we allocate 4 buckets in 4 different phases
+{1, 1, 1, 1}, the numbers in curly braces indicate number of buckets
+allocated within each phase of splitpoint group 3. And, for splitpoint group 4
+and 5 allocation phase will be {2, 2, 2, 2} = 16 buckets in total and
+{4, 4, 4, 4} = 32 buckets in total.  We can see that at each splitpoint group
+we double the total number of buckets from previous group but in an incremental
+phase.  The bucket pages allocated within one phase of a splitpoint group will
+appear consecutively in the index.  This addressing scheme allows the physical
+location of a bucket page to be computed from the bucket number relatively
+easily, using only a small amount of control information.  If we look at the
+function _hash_spareindex for a given bucket number we first compute the
+splitpoint group it belongs to and then the phase  to which the bucket belongs
+to.  Adding them we get the global splitpoint phase number S to which the
+bucket belongs and then simply add "hashm_spares[S] + 1" (where hashm_spares[]
+is an array stored in the metapage) with given bucket number to compute its
+physical address.  The hashm_spares[S] can be interpreted as the total number
+of overflow pages that have been allocated before the bucket pages of
+splitpoint phase S.  The hashm_spares[0] is always 0, so that buckets 0 and 1
+(which belong to splitpoint group 0's phase 1 and phase 2 respectively) always
+appear at block numbers 1 and 2, just after the meta page. We always have
 hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
-former.  The difference between the two represents the number of overflow
-pages appearing between the bucket page groups of splitpoints N and N+1.
-
+former.  The difference between the two represents the number of overflow pages
+appearing between the bucket page groups of splitpoints phase N and N+1.
 (Note: the above describes what happens when filling an initially minimally
-sized hash index.  In practice, we try to estimate the required index size
-and allocate a suitable number of splitpoints immediately, to avoid
+sized hash index.  In practice, we try to estimate the required index size and
+allocate a suitable number of splitpoints phases immediately, to avoid
 expensive re-splitting during initial index build.)
 
 When S splitpoints exist altogether, the array entries hashm_spares[0]
 through hashm_spares[S] are valid; hashm_spares[S] records the current
 total number of overflow pages.  New overflow pages are created as needed
 at the end of the index, and recorded by incrementing hashm_spares[S].
-When it is time to create a new splitpoint's worth of bucket pages, we
+When it is time to create a new splitpoint phase's worth of bucket pages, we
 copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
 stored in the hashm_ovflpoint field of the meta page).  This has the
 effect of reserving the correct number of bucket pages at the end of the
@@ -101,7 +118,7 @@ We have to allow the case "greater than" because it's possible that during
 an index extension we crash after allocating filesystem space and before
 updating the metapage.  Note that on filesystems that allow "holes" in
 files, it's entirely likely that pages before the logical EOF are not yet
-allocated: when we allocate a new splitpoint's worth of bucket pages, we
+allocated: when we allocate a new splitpoint phase's worth of bucket pages, we
 physically zero the last such page to force the EOF up, and the first such
 page will be used immediately, but the intervening pages are not written
 until needed.
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index a3cae21..fe0b4ef 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -49,7 +49,7 @@ bitno_to_blkno(HashMetaPage metap, uint32 ovflbitnum)
 	 * Convert to absolute page number by adding the number of bucket pages
 	 * that exist before this split point.
 	 */
-	return (BlockNumber) ((1 << i) + ovflbitnum);
+	return (BlockNumber) (_hash_get_totalbuckets(i) + ovflbitnum);
 }
 
 /*
@@ -67,14 +67,15 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
 	/* Determine the split number containing this page */
 	for (i = 1; i <= splitnum; i++)
 	{
-		if (ovflblkno <= (BlockNumber) (1 << i))
+		if (ovflblkno <= (BlockNumber) _hash_get_totalbuckets(i))
 			break;				/* oops */
-		bitnum = ovflblkno - (1 << i);
+		bitnum = ovflblkno - _hash_get_totalbuckets(i);
 
 		/*
 		 * bitnum has to be greater than number of overflow page added in
 		 * previous split point. The overflow page at this splitnum (i) if any
-		 * should start from ((2 ^ i) + metap->hashm_spares[i - 1] + 1).
+		 * should start from
+		 * (_hash_get_totalbuckets(i) + metap->hashm_spares[i - 1] + 1).
 		 */
 		if (bitnum > metap->hashm_spares[i - 1] &&
 			bitnum <= metap->hashm_spares[i])
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 61ca2ec..a06cabb 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -502,14 +502,15 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	Page		page;
 	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
+	uint32		spare_index;
 	uint32		i;
 
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
-	 * next power of 2, however, and always force at least 2 bucket pages. The
-	 * upper limit is determined by considerations explained in
+	 * total number of buckets which has to be allocated before using its
+	 * _hashm_spares index slot. However always force at least 2 bucket pages.
+	 * The upper limit is determined by considerations explained in
 	 * _hash_expandtable().
 	 */
 	dnumbuckets = num_tuples / ffactor;
@@ -518,11 +519,10 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	else if (dnumbuckets >= (double) 0x40000000)
 		num_buckets = 0x40000000;
 	else
-		num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
+		num_buckets = _hash_get_totalbuckets(_hash_spareindex(dnumbuckets));
 
-	log2_num_buckets = _hash_log2(num_buckets);
-	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
-	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
+	spare_index = _hash_spareindex(num_buckets);
+	Assert(spare_index < HASH_MAX_SPLITPOINTS);
 
 	page = BufferGetPage(buf);
 	if (initpage)
@@ -563,18 +563,20 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
-	 * blocks 1 to N.  The first freespace bitmap page is in block N+1. Since
-	 * N is a power of 2, we can set the masks this way:
+	 * blocks 1 to N.  The first freespace bitmap page is in block N+1.
 	 */
-	metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
-	metap->hashm_highmask = (num_buckets << 1) - 1;
+	metap->hashm_maxbucket = num_buckets - 1;
+
+	/* set highmask, which should be sufficient to cover num_buckets. */
+	metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;
+	metap->hashm_lowmask = (metap->hashm_highmask >> 1);
 
 	MemSet(metap->hashm_spares, 0, sizeof(metap->hashm_spares));
 	MemSet(metap->hashm_mapp, 0, sizeof(metap->hashm_mapp));
 
 	/* Set up mapping for one spare page after the initial splitpoints */
-	metap->hashm_spares[log2_num_buckets] = 1;
-	metap->hashm_ovflpoint = log2_num_buckets;
+	metap->hashm_spares[spare_index] = 1;
+	metap->hashm_ovflpoint = spare_index;
 	metap->hashm_firstfree = 0;
 
 	/*
@@ -773,25 +775,42 @@ restart_expand:
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to allocate a new batch of bucket pages.
+	 * If the split point is increasing we need to allocate a new batch of
+	 * bucket pages.
 	 */
-	spare_ndx = _hash_log2(new_bucket + 1);
+	spare_ndx = _hash_spareindex(new_bucket + 1);
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
+		uint32		buckets_toadd = 0;
+		uint32		splitpoint_group = 0;
+
 		Assert(spare_ndx == metap->hashm_ovflpoint + 1);
 
 		/*
-		 * The number of buckets in the new splitpoint is equal to the total
-		 * number already in existence, i.e. new_bucket.  Currently this maps
-		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.  We treat allocation of buckets as a
-		 * separate WAL-logged action.  Even if we fail after this operation,
-		 * won't leak bucket pages; rather, the next split will consume this
-		 * space. In any case, even without failure we don't use all the space
-		 * in one split operation.
+		 * The number of buckets in the new splitpoint group is equal to the
+		 * total number already in existence. But we do not allocate them at
+		 * once. Each splitpoint group will have 4 slots, we distribute the
+		 * buckets equally among them. So we allocate only one fourth of total
+		 * buckets in new splitpoint group at a time to consume one phase after
+		 * another. We treat allocation of buckets as a separate WAL-logged
+		 * action. Even if we fail after this operation, won't leak bucket
+		 * pages; rather, the next split will consume this space. In any case,
+		 * even without failure we don't use all the space in one split
+		 * operation.
 		 */
-		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
+		splitpoint_group = SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(spare_ndx);
+
+		/*
+		 * Each phase in the splitpoint_group will allocate one fourth of total
+		 * buckets to be allocated in splitpoint_group. For
+		 * splitpoint_group < 3, there is only one phase of allocation, so we
+		 * allocate all of the buckets belonging to that group at once.
+		 */
+		buckets_toadd =
+			(splitpoint_group < 3) ?
+			(new_bucket) :
+			((1 << (splitpoint_group - 1)) / SPLITPOINT_PHASES_PER_GRP);
+		if (!_hash_alloc_buckets(rel, start_nblkno, buckets_toadd))
 		{
 			/* can't split due to BlockNumber overflow */
 			_hash_relbuf(rel, buf_oblkno);
@@ -836,10 +855,9 @@ restart_expand:
 	}
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to adjust the hashm_spares[] array and
-	 * hashm_ovflpoint so that future overflow pages will be created beyond
-	 * this new batch of bucket pages.
+	 * If the split point is increasing we need to adjust the hashm_spares[]
+	 * array and hashm_ovflpoint so that future overflow pages will be created
+	 * beyond this new batch of bucket pages.
 	 */
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 2e99719..9007013 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -150,6 +150,43 @@ _hash_log2(uint32 num)
 }
 
 /*
+ * _hash_spareindex -- returns spare index / global splitpoint phase of the
+ *					   bucket
+ */
+uint32
+_hash_spareindex(uint32 num_bucket)
+{
+	uint32		splitpoint_group;
+
+	splitpoint_group = BUCKET_TO_SPLITPOINT_GRP(num_bucket);
+
+	return TOTAL_SPLITPOINT_PHASES_BEFORE_GROUP(splitpoint_group) +
+		   SPLITPOINT_PHASES_WITHIN_GROUP(splitpoint_group,
+										  num_bucket - 1); /* to 0-based */
+}
+
+/*
+ *	_hash_get_totalbuckets -- returns total number of buckets allocated till
+ *							the given splitpoint phase.
+ */
+uint32
+_hash_get_totalbuckets(uint32 splitpoint_phase)
+{
+	uint32		splitpoint_group;
+
+	splitpoint_group = SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(splitpoint_phase);
+
+	/*
+	 * total_buckets = total number of buckets before its splitpoint group +
+	 * total buckets within its splitpoint group until given splitpoint_phase.
+	 */
+	return BUCKETS_BEFORE_SP_GRP(splitpoint_group) +
+		   BUCKETS_WITHIN_SP_GRP(splitpoint_group,
+				((splitpoint_phase - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) %
+				  SPLITPOINT_PHASES_PER_GRP) + 1);
+}
+
+/*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
  * If flags is not zero, it is a bitwise OR of the acceptable values of
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index eb1df57..880ed75 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -36,7 +36,7 @@ typedef uint32 Bucket;
 #define InvalidBucket	((Bucket) 0xFFFFFFFF)
 
 #define BUCKET_TO_BLKNO(metap,B) \
-		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
+		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_spareindex((B)+1)-1] : 0)) + 1)
 
 /*
  * Special space for hash index pages.
@@ -180,9 +180,51 @@ typedef HashScanOpaqueData *HashScanOpaque;
  * needing to fit into the metapage.  (With 8K block size, 128 bitmaps
  * limit us to 64 GB of overflow space...)
  */
-#define HASH_MAX_SPLITPOINTS		32
+#define HASH_MAX_SPLITPOINTS		128
 #define HASH_MAX_BITMAPS			128
 
+#define SPLITPOINT_PHASES_PER_GRP	4
+#define SPLITPOINT_PHASE_MASK		(SPLITPOINT_PHASES_PER_GRP - 1)
+#define SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE 3
+
+#define TOTAL_SPLITPOINT_PHASES_BEFORE_GROUP(sp_g) \
+		((sp_g <= SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) ? \
+			(sp_g) : (((sp_g - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) << 2) + \
+						SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE))
+
+/*
+ * This is just a trick to save a division operation.  If you look at the bit
+ * pattern of the 0-based bucket_num, the 2nd and 3rd most significant bits
+ * indicate which phase of allocation the bucket_num belongs to within the group. This
+ * is because at every splitpoint group we allocate (2 ^ x) buckets and we have
+ * divided the allocation process into 4 equal phases. This macro returns value
+ * from 0 to 3.
+ */
+#define SPLITPOINT_PHASES_WITHIN_GROUP(sp_g, bucket_num) \
+		((sp_g < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) ? 0 : \
+		(((bucket_num) >> (sp_g - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE)) & \
+													SPLITPOINT_PHASE_MASK))
+
+/*
+ * At every splitpoint group we double the total number of buckets. So at
+ * splitpoint group sp_g we allocate (1 << (sp_g -1)) buckets as we will have
+ * same number of buckets already allocated before this group. For splitpoint
+ * groups >= SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE we allocate buckets in 4
+ * equal phases hence we allocate ((1 << (sp_g - 1)) >> 2) buckets per phase.
+ */
+#define BUCKETS_BEFORE_SP_GRP(sp_g) ((sp_g == 0) ? 0 : (1 << (sp_g - 1)))
+#define BUCKETS_WITHIN_SP_GRP(sp_g, nphase) \
+	((sp_g < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) ? \
+	 (1 << (sp_g - 1)) : \
+	 ((nphase) * ((1 << (sp_g - 1)) >> 2)))
+
+#define BUCKET_TO_SPLITPOINT_GRP(num_bucket) (_hash_log2(num_bucket))
+
+#define SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(sp_phase) \
+	((sp_phase < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) ? (sp_phase) : \
+		(((sp_phase - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) >> 2) + \
+			SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE))
+
 typedef struct HashMetaPageData
 {
 	uint32		hashm_magic;	/* magic no. for hash tables */
@@ -382,6 +424,8 @@ extern uint32 _hash_datum2hashkey_type(Relation rel, Datum key, Oid keytype);
 extern Bucket _hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
 					 uint32 highmask, uint32 lowmask);
 extern uint32 _hash_log2(uint32 num);
+extern uint32 _hash_spareindex(uint32 num_bucket);
+extern uint32 _hash_get_totalbuckets(uint32 splitpoint_phase);
 extern void _hash_checkpage(Relation rel, Buffer buf, int flags);
 extern uint32 _hash_get_indextuple_hashkey(IndexTuple itup);
 extern bool _hash_convert_tuple(Relation index,
expand_hashbucket_efficiently_08.patchapplication/octet-stream; name=expand_hashbucket_efficiently_08.patchDownload
diff --git a/contrib/pageinspect/expected/hash.out b/contrib/pageinspect/expected/hash.out
index 3ba01f6..c97b279 100644
--- a/contrib/pageinspect/expected/hash.out
+++ b/contrib/pageinspect/expected/hash.out
@@ -51,13 +51,13 @@ bsize     | 8152
 bmsize    | 4096
 bmshift   | 15
 maxbucket | 3
-highmask  | 7
-lowmask   | 3
-ovflpoint | 2
+highmask  | 3
+lowmask   | 1
+ovflpoint | 3
 firstfree | 0
 nmaps     | 1
 procid    | 450
-spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 
 SELECT magic, version, ntuples, bsize, bmsize, bmshift, maxbucket, highmask,
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 9f41bb2..f19066f 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -667,11 +667,11 @@ bmshift   | 15
 maxbucket | 12512
 highmask  | 16383
 lowmask   | 8191
-ovflpoint | 14
+ovflpoint | 50
 firstfree | 1204
 nmaps     | 1
 procid    | 450
-spares    | {0,0,0,0,0,0,1,1,1,1,1,4,59,704,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,4,4,4,45,55,58,59,508,567,628,704,1193,1202,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {65,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 </screen>
      </para>
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..ff527c1 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -58,35 +58,51 @@ rules to support a variable number of overflow pages while not having to
 move primary bucket pages around after they are created.
 
 Primary bucket pages (henceforth just "bucket pages") are allocated in
-power-of-2 groups, called "split points" in the code.  Buckets 0 and 1
-are created when the index is initialized.  At the first split, buckets 2
-and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
-when bucket 8 is needed, buckets 8-15 are allocated; etc.  All the bucket
-pages of a power-of-2 group appear consecutively in the index.  This
-addressing scheme allows the physical location of a bucket page to be
-computed from the bucket number relatively easily, using only a small
-amount of control information.  We take the log2() of the bucket number
-to determine which split point S the bucket belongs to, and then simply
-add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
-metapage) to compute the physical address.  hashm_spares[S] can be
-interpreted as the total number of overflow pages that have been allocated
-before the bucket pages of splitpoint S.  hashm_spares[0] is always 0,
-so that buckets 0 and 1 (which belong to splitpoint 0) always appear at
-block numbers 1 and 2, just after the meta page.  We always have
-hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
-former.  The difference between the two represents the number of overflow
-pages appearing between the bucket page groups of splitpoints N and N+1.
-
+power-of-2 groups, called "split points" in the code.  That means at every new
+split point we double the existing number of buckets.  Allocating huge chunks
+of bucket pages all at once isn't optimal and we will take ages to consume
+those.  To avoid this exponential growth of index size, we use a trick to break
+up the allocation of buckets at a splitpoint into 4 equal phases.  If (2 ^ x)
+is the total number of buckets to be allocated at a splitpoint (from now on we
+shall call this a splitpoint group), then we allocate 1/4th (2 ^ (x - 2)) of
+the total buckets at each phase of the splitpoint group.  The next quarter is
+allocated only once the buckets of the previous phase have been consumed.
+Since for fewer than 4 buckets we cannot divide the allocation into phases, the
+first splitpoint group 0's allocation is done as follows {1, 1, 1, 1} = 4
+buckets in total; the numbers in curly braces indicate the number of buckets
+allocated within each phase of splitpoint group 0.  In the next splitpoint
+group 1 the allocation phases will be {1, 1, 1, 1} = 8 buckets in total.
+And, for splitpoint groups 2 and 3 the allocation phases will be {2, 2, 2, 2} = 16
+buckets in total and {4, 4, 4, 4} = 32 buckets in total.  We can see that at
+each splitpoint group we double the total number of buckets from previous group
+but in an incremental phase.  The bucket pages allocated within one phase of a
+splitpoint group will appear consecutively in the index.  This addressing
+scheme allows the physical location of a bucket page to be computed from the
+bucket number relatively easily, using only a small amount of control
+information.  If we look at the function _hash_spareindex for a given bucket
+number we first compute the splitpoint group it belongs to and then the phase
+to which the bucket belongs.  Adding them we get the global splitpoint phase
+number S to which the bucket belongs and then simply add "hashm_spares[S] + 1"
+(where hashm_spares[] is an array stored in the metapage) with given bucket
+number to compute its physical address.  The hashm_spares[S] can be interpreted
+as the total number of overflow pages that have been allocated before the
+bucket pages of splitpoint phase S.  The hashm_spares[0] is always 0, so that
+buckets 0 and 1 (which belong to splitpoint group 0's phase 1 and phase 2
+respectively) always appear at block numbers 1 and 2, just after the meta page.
+We always have hashm_spares[N] <= hashm_spares[N+1], since the latter count
+includes the former.  The difference between the two represents the number of
+overflow pages appearing between the bucket page groups of splitpoints phase N
+and N+1.
 (Note: the above describes what happens when filling an initially minimally
-sized hash index.  In practice, we try to estimate the required index size
-and allocate a suitable number of splitpoints immediately, to avoid
+sized hash index.  In practice, we try to estimate the required index size and
+allocate a suitable number of splitpoints phases immediately, to avoid
 expensive re-splitting during initial index build.)
 
 When S splitpoints exist altogether, the array entries hashm_spares[0]
 through hashm_spares[S] are valid; hashm_spares[S] records the current
 total number of overflow pages.  New overflow pages are created as needed
 at the end of the index, and recorded by incrementing hashm_spares[S].
-When it is time to create a new splitpoint's worth of bucket pages, we
+When it is time to create a new splitpoint phase's worth of bucket pages, we
 copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
 stored in the hashm_ovflpoint field of the meta page).  This has the
 effect of reserving the correct number of bucket pages at the end of the
@@ -101,7 +117,7 @@ We have to allow the case "greater than" because it's possible that during
 an index extension we crash after allocating filesystem space and before
 updating the metapage.  Note that on filesystems that allow "holes" in
 files, it's entirely likely that pages before the logical EOF are not yet
-allocated: when we allocate a new splitpoint's worth of bucket pages, we
+allocated: when we allocate a new splitpoint phase's worth of bucket pages, we
 physically zero the last such page to force the EOF up, and the first such
 page will be used immediately, but the intervening pages are not written
 until needed.
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index a3cae21..fe0b4ef 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -49,7 +49,7 @@ bitno_to_blkno(HashMetaPage metap, uint32 ovflbitnum)
 	 * Convert to absolute page number by adding the number of bucket pages
 	 * that exist before this split point.
 	 */
-	return (BlockNumber) ((1 << i) + ovflbitnum);
+	return (BlockNumber) (_hash_get_totalbuckets(i) + ovflbitnum);
 }
 
 /*
@@ -67,14 +67,15 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
 	/* Determine the split number containing this page */
 	for (i = 1; i <= splitnum; i++)
 	{
-		if (ovflblkno <= (BlockNumber) (1 << i))
+		if (ovflblkno <= (BlockNumber) _hash_get_totalbuckets(i))
 			break;				/* oops */
-		bitnum = ovflblkno - (1 << i);
+		bitnum = ovflblkno - _hash_get_totalbuckets(i);
 
 		/*
 		 * bitnum has to be greater than number of overflow page added in
 		 * previous split point. The overflow page at this splitnum (i) if any
-		 * should start from ((2 ^ i) + metap->hashm_spares[i - 1] + 1).
+		 * should start from
+		 * (_hash_get_totalbuckets(i) + metap->hashm_spares[i - 1] + 1).
 		 */
 		if (bitnum > metap->hashm_spares[i - 1] &&
 			bitnum <= metap->hashm_spares[i])
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 61ca2ec..b8d682f 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -502,14 +502,15 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	Page		page;
 	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
+	uint32		spare_index;
 	uint32		i;
 
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
-	 * next power of 2, however, and always force at least 2 bucket pages. The
-	 * upper limit is determined by considerations explained in
+	 * total number of buckets which has to be allocated before using its
+	 * _hashm_spares index slot. However always force at least 2 bucket pages.
+	 * The upper limit is determined by considerations explained in
 	 * _hash_expandtable().
 	 */
 	dnumbuckets = num_tuples / ffactor;
@@ -518,11 +519,10 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	else if (dnumbuckets >= (double) 0x40000000)
 		num_buckets = 0x40000000;
 	else
-		num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
+		num_buckets = _hash_get_totalbuckets(_hash_spareindex(dnumbuckets));
 
-	log2_num_buckets = _hash_log2(num_buckets);
-	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
-	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
+	spare_index = _hash_spareindex(num_buckets);
+	Assert(spare_index < HASH_MAX_SPLITPOINTS);
 
 	page = BufferGetPage(buf);
 	if (initpage)
@@ -563,18 +563,20 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
-	 * blocks 1 to N.  The first freespace bitmap page is in block N+1. Since
-	 * N is a power of 2, we can set the masks this way:
+	 * blocks 1 to N.  The first freespace bitmap page is in block N+1.
 	 */
-	metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
-	metap->hashm_highmask = (num_buckets << 1) - 1;
+	metap->hashm_maxbucket = num_buckets - 1;
+
+	/* set highmask, which should be sufficient to cover num_buckets. */
+	metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;
+	metap->hashm_lowmask = (metap->hashm_highmask >> 1);
 
 	MemSet(metap->hashm_spares, 0, sizeof(metap->hashm_spares));
 	MemSet(metap->hashm_mapp, 0, sizeof(metap->hashm_mapp));
 
 	/* Set up mapping for one spare page after the initial splitpoints */
-	metap->hashm_spares[log2_num_buckets] = 1;
-	metap->hashm_ovflpoint = log2_num_buckets;
+	metap->hashm_spares[spare_index] = 1;
+	metap->hashm_ovflpoint = spare_index;
 	metap->hashm_firstfree = 0;
 
 	/*
@@ -773,25 +775,40 @@ restart_expand:
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to allocate a new batch of bucket pages.
+	 * If the split point is increasing we need to allocate a new batch of
+	 * bucket pages.
 	 */
-	spare_ndx = _hash_log2(new_bucket + 1);
+	spare_ndx = _hash_spareindex(new_bucket + 1);
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
+		uint32		buckets_toadd = 0;
+		uint32		splitpoint_group = 0;
+
 		Assert(spare_ndx == metap->hashm_ovflpoint + 1);
 
 		/*
-		 * The number of buckets in the new splitpoint is equal to the total
-		 * number already in existence, i.e. new_bucket.  Currently this maps
-		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.  We treat allocation of buckets as a
-		 * separate WAL-logged action.  Even if we fail after this operation,
+		 * The number of buckets in the new splitpoint group is equal to the
+		 * total number already in existence, i.e. new_bucket. But we do not
+		 * allocate them at once. Each splitpoint group will have 4 slots, we
+		 * distribute the buckets equally among them. So we allocate only one
+		 * fourth of total buckets in new splitpoint group at a time to consume
+		 * one phase after another. We treat allocation of buckets as a
+		 * separate WAL-logged action. Even if we fail after this operation,
 		 * won't leak bucket pages; rather, the next split will consume this
 		 * space. In any case, even without failure we don't use all the space
 		 * in one split operation.
 		 */
-		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
+		splitpoint_group = (spare_ndx >> 2);
+
+		/*
+		 * Each phase in the splitpoint_group will allocate
+		 * 2^(splitpoint_group - 1) buckets if we divide buckets among 4
+		 * slots. The 0th group is a special case where we allocate 1 bucket
+		 * per slot as we cannot reduce it any further. See README for more
+		 * details.
+		 */
+		buckets_toadd = (splitpoint_group) ? (1 << (splitpoint_group - 1)) : 1;
+		if (!_hash_alloc_buckets(rel, start_nblkno, buckets_toadd))
 		{
 			/* can't split due to BlockNumber overflow */
 			_hash_relbuf(rel, buf_oblkno);
@@ -836,10 +853,9 @@ restart_expand:
 	}
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to adjust the hashm_spares[] array and
-	 * hashm_ovflpoint so that future overflow pages will be created beyond
-	 * this new batch of bucket pages.
+	 * If the split point is increasing we need to adjust the hashm_spares[]
+	 * array and hashm_ovflpoint so that future overflow pages will be created
+	 * beyond this new batch of bucket pages.
 	 */
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 2e99719..98975e1 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -149,6 +149,84 @@ _hash_log2(uint32 num)
 	return i;
 }
 
+#define SPLITPOINT_PHASES_PER_GRP 4
+#define SPLITPOINT_PHASE_MASK (SPLITPOINT_PHASES_PER_GRP - 1)
+#define Buckets_First_Split_Group 4
+
+#define TOTAL_SPLITPOINT_PHASES_BEFORE_GROUP(sp_g) (sp_g > 0 ? (sp_g << 2) : 0)
+
+/*
+ * This is just a trick to save a division operation.  If you look at the bit
+ * pattern of the 0-based bucket_num, the 2nd and 3rd most significant bits
+ * indicate which phase of allocation the bucket_num belongs to within the group. This
+ * is because at every splitpoint group we allocate (2 ^ x) buckets and we have
+ * divided the allocation process into 4 equal phases. This macro returns value
+ * from 0 to 3.
+ */
+#define SPLITPOINT_PHASES_WITHIN_GROUP(sp_g, bucket_num) \
+		((sp_g > 0) ? (((bucket_num) >> (sp_g - 1)) & SPLITPOINT_PHASE_MASK) : \
+						(bucket_num))
+
+/*
+ * At splitpoint group 0 we have 2 ^ (0 + 2) = 4 buckets, then at splitpoint
+ * group 1 we have 2 ^ (1 + 2) = 8 total buckets. As the doubling continues at
+ * splitpoint group "x" we will have 2 ^ (x + 2) total buckets. Total buckets
+ * before x splitpoint group will be 2 ^ (x + 1). At each phase of allocation
+ * within splitpoint group we add 2 ^ (x - 1) buckets, as we have to divide the
+ * task of allocation of 2 ^ (x + 1) buckets among 4 phases.
+ *
+ * Also, the splitpoint group of a given bucket can be taken as
+ * (_hash_log2(bucket) - 2).  We subtract 2 because each group has
+ * 2 ^ (x + 2) buckets.
+ */
+#define BUCKETS_BEFORE_SP_GRP(sp_g) ((sp_g == 0) ? 0 : (1 << (sp_g + 1)))
+#define BUCKETS_WITHIN_SP_GRP(sp_g, nphase) \
+						((nphase) * ((sp_g == 0) ? 1 : (1 << (sp_g - 1))))
+#define BUCKET_TO_SPLITPOINT_GRP(num_bucket) \
+		((num_bucket <= Buckets_First_Split_Group) ? 0 : \
+												(_hash_log2(num_bucket) - 2))
+
+/*
+ * _hash_spareindex -- returns spare index / global splitpoint phase of the
+ *					   bucket
+ */
+uint32
+_hash_spareindex(uint32 num_bucket)
+{
+	uint32		splitpoint_group;
+
+	splitpoint_group = BUCKET_TO_SPLITPOINT_GRP(num_bucket);
+
+	return TOTAL_SPLITPOINT_PHASES_BEFORE_GROUP(splitpoint_group) +
+		SPLITPOINT_PHASES_WITHIN_GROUP(splitpoint_group,
+									   num_bucket - 1); /* make it 0-based */
+}
+
+/*
+ *	_hash_get_totalbuckets -- returns total number of buckets allocated till
+ *							the given splitpoint phase.
+ */
+uint32
+_hash_get_totalbuckets(uint32 splitpoint_phase)
+{
+	uint32		splitpoint_group;
+
+	/*
+	 * Every 4 consecutive phases make one group, and groups are numbered
+	 * from 0.
+	 */
+	splitpoint_group = (splitpoint_phase / SPLITPOINT_PHASES_PER_GRP);
+
+	/*
+	 * total_buckets = total number of buckets before its splitpoint group +
+	 * total buckets within its splitpoint group until given splitpoint_phase.
+	 */
+	return BUCKETS_BEFORE_SP_GRP(splitpoint_group) +
+		BUCKETS_WITHIN_SP_GRP(splitpoint_group,
+							  (splitpoint_phase % SPLITPOINT_PHASES_PER_GRP)
+							  + 1);
+}
+
 /*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index eb1df57..ac502ef 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -36,7 +36,7 @@ typedef uint32 Bucket;
 #define InvalidBucket	((Bucket) 0xFFFFFFFF)
 
 #define BUCKET_TO_BLKNO(metap,B) \
-		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
+		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_spareindex((B)+1)-1] : 0)) + 1)
 
 /*
  * Special space for hash index pages.
@@ -180,7 +180,7 @@ typedef HashScanOpaqueData *HashScanOpaque;
  * needing to fit into the metapage.  (With 8K block size, 128 bitmaps
  * limit us to 64 GB of overflow space...)
  */
-#define HASH_MAX_SPLITPOINTS		32
+#define HASH_MAX_SPLITPOINTS		128
 #define HASH_MAX_BITMAPS			128
 
 typedef struct HashMetaPageData
@@ -382,6 +382,8 @@ extern uint32 _hash_datum2hashkey_type(Relation rel, Datum key, Oid keytype);
 extern Bucket _hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
 					 uint32 highmask, uint32 lowmask);
 extern uint32 _hash_log2(uint32 num);
+extern uint32 _hash_spareindex(uint32 num_bucket);
+extern uint32 _hash_get_totalbuckets(uint32 splitpoint_phase);
 extern void _hash_checkpage(Relation rel, Buffer buf, int flags);
 extern uint32 _hash_get_indextuple_hashkey(IndexTuple itup);
 extern bool _hash_convert_tuple(Relation index,
#19Amit Kapila
amit.kapila16@gmail.com
In reply to: Mithun Cy (#18)
Re: [POC] A better way to expand hash indexes.

On Tue, Mar 28, 2017 at 10:43 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

On Mon, Mar 27, 2017 at 11:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

As you have said, we can solve it if we always allocate buckets in
powers of 2 when we do the hash index meta page init. But on other
occasions, when we try to double the existing buckets, we can do the
allocation in 4 equal phases.

But I think there are 2 more ways to solve the same problem:

A. Why not pass all 3 parameters high_mask, low_mask, max-buckets to
tuplesort and let it use _hash_hashkey2bucket to figure out which
key belongs to which bucket, and then sort them. I think this way we
make both sorting and insertion into the hash index consistent with
each other.

B. In tuplesort we can use the hash function bucket = hash_key %
num_buckets instead of the existing one, which does a bitwise "and" to
determine the bucket of a hash key. This way we will not wrongly assign
buckets beyond max_buckets, and the sorted hash keys will be in sync
with the actual insertion order of _hash_doinsert.
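
To make the difference concrete, here is a small standalone sketch (not
part of either patch; the hashkey2bucket() helper only mirrors the logic
of _hash_hashkey2bucket for illustration, and the bucket count of 6 is
just an example of a non-power-of-2 index). With maxbucket = 5,
highmask = 7 and lowmask = 3, a hash key of 6 goes to bucket 6 under the
old sort mask (beyond maxbucket), to bucket 2 under option A, and to
bucket 0 under option B:

#include <stdio.h>

/* illustration only: mirrors the masking logic of _hash_hashkey2bucket */
static unsigned
hashkey2bucket(unsigned hashkey, unsigned maxbucket,
               unsigned highmask, unsigned lowmask)
{
    unsigned    bucket = hashkey & highmask;

    if (bucket > maxbucket)
        bucket &= lowmask;
    return bucket;
}

int
main(void)
{
    unsigned    num_buckets = 6;            /* not a power of 2 */
    unsigned    maxbucket = num_buckets - 1;
    unsigned    highmask = 7;               /* next power-of-2 boundary, minus 1 */
    unsigned    lowmask = 3;
    unsigned    key;

    for (key = 0; key < 10; key++)
        printf("key %2u: old mask %u, option A %u, option B %u\n",
               key,
               key & highmask,              /* old hash_mask; can exceed maxbucket */
               hashkey2bucket(key, maxbucket, highmask, lowmask),
               key % num_buckets);
    return 0;
}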

I am attaching both patches, Patch_A and Patch_B. My preference is
Patch_A, and I am open to suggestions.

I think it can work both ways. I feel there is no pressing need for
the computation in sorting to be the same as what you do in
_hash_doinsert. In patch_B, I don't think the new naming for the
variables is good.

  Assert(!a->isnull1);
- hash1 = DatumGetUInt32(a->datum1) & state->hash_mask;
+ bucket1 = DatumGetUInt32(a->datum1) % state->num_buckets;

Can we use hash_mod instead of num_buckets and retain hash1 in the
above code and in other similar places?

+#define SPLITPOINT_PHASES_PER_GRP 4
+#define SPLITPOINT_PHASE_MASK (SPLITPOINT_PHASES_PER_GRP - 1)
+#define Buckets_First_Split_Group 4

Fixed.

In the above computation, the +2 and -2 still bother me. I think you
need to do this because you have defined split group zero to have four
buckets. How about not forcing that, and instead defining split point
phases only from the split point which has four or more buckets?

Okay, as suggested: instead of group zero having 4 phases of 1 bucket
each, I have recalculated the spare mapping as below.

Few comments:
1.
+#define BUCKETS_WITHIN_SP_GRP(sp_g, nphase) \
+ ((sp_g < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) ? \
+ (1 << (sp_g - 1)) : \
+ ((nphase) * ((1 << (sp_g - 1)) >> 2)))

This will go wrong for split point group zero. In general, I feel that
if you handle the computation for split groups less than
SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE in the caller, then all your
macros will become much simpler and less error-prone.
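
(A minimal illustration of the group-zero problem, assuming the macro as
quoted: for sp_g = 0 the single-phase branch is taken and evaluates

    1 << (sp_g - 1)        /* i.e. 1 << -1, a shift by a negative count */

which is undefined behaviour in C, so the single-phase groups really do
need to be special-cased somewhere, e.g. in the caller.)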

2.
+#define BUCKET_TO_SPLITPOINT_GRP(num_bucket) (_hash_log2(num_bucket))

What is the use of such a define? Can't we directly use _hash_log2 in
the caller?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Amit Kapila (#19)
2 attachment(s)
Re: [POC] A better way to expand hash indexes.

On Tue, Mar 28, 2017 at 12:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think it can work both ways. I feel there is no pressing need for
the computation in sorting to be the same as what you do in
_hash_doinsert. In patch_B, I don't think the new naming for the
variables is good.

Assert(!a->isnull1);
- hash1 = DatumGetUInt32(a->datum1) & state->hash_mask;
+ bucket1 = DatumGetUInt32(a->datum1) % state->num_buckets;

Can we use hash_mod instead of num_buckets and retain hash1 in the
above code and in other similar places?

Yes, done; I have renamed it to hash_mod.

Few comments:
1.
+#define BUCKETS_WITHIN_SP_GRP(sp_g, nphase) \
+ ((sp_g < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) ? \
+ (1 << (sp_g - 1)) : \
+ ((nphase) * ((1 << (sp_g - 1)) >> 2)))

This will go wrong for split point group zero. In general, I feel that
if you handle the computation for split groups less than
SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE in the caller, then all your
macros will become much simpler and less error-prone.

Fixed. Apart from SPLITPOINT_PHASE_TO_SPLITPOINT_GRP, all the remaining
macros only handle multi-phase groups. SPLITPOINT_PHASE_TO_SPLITPOINT_GRP
is used in one more place in the index-expansion code, so I thought
keeping it as it is would be better.
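
For reference, a tiny standalone sketch of the resulting mapping
(illustrative only; it re-derives the arithmetic of _hash_spareindex and
_hash_get_totalbuckets rather than calling them). The running totals it
prints are 1, 2, 4 for the single-phase groups, then 5, 6, 7, 8 for
group 3, 10, 12, 14, 16 for group 4, and so on:

#include <stdio.h>

#define PHASES_PER_GRP        4
#define GROUPS_WITH_ONE_PHASE 3

int
main(void)
{
    unsigned    phase;

    for (phase = 0; phase < 15; phase++)
    {
        unsigned    group,
                    total;

        if (phase < GROUPS_WITH_ONE_PHASE)
        {
            /* single-phase groups 0..2: 1, 2 and 4 buckets in total */
            group = phase;
            total = 1U << group;
        }
        else
        {
            /* from phase 3 on, every four phases form one group */
            unsigned    nphase = (phase - GROUPS_WITH_ONE_PHASE) % PHASES_PER_GRP + 1;

            group = (phase - GROUPS_WITH_ONE_PHASE) / PHASES_PER_GRP +
                GROUPS_WITH_ONE_PHASE;
            /* buckets before this group, plus nphase quarters of the group */
            total = (1U << (group - 1)) + nphase * ((1U << (group - 1)) / 4);
        }
        printf("phase %2u: group %u, total buckets %u\n", phase, group, total);
    }
    return 0;
}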

2.
+#define BUCKET_TO_SPLITPOINT_GRP(num_bucket) (_hash_log2(num_bucket))

What is the use of such a define? Can't we directly use _hash_log2 in
the caller?

No real reason, just that a named macro appeared more readable than
plain _hash_log2. I have now removed it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Attachments:

yet_another_expand_hashbucket_efficiently_09.patchapplication/octet-stream; name=yet_another_expand_hashbucket_efficiently_09.patchDownload
diff --git a/contrib/pageinspect/expected/hash.out b/contrib/pageinspect/expected/hash.out
index 3ba01f6..518bdbe 100644
--- a/contrib/pageinspect/expected/hash.out
+++ b/contrib/pageinspect/expected/hash.out
@@ -51,13 +51,13 @@ bsize     | 8152
 bmsize    | 4096
 bmshift   | 15
 maxbucket | 3
-highmask  | 7
-lowmask   | 3
+highmask  | 3
+lowmask   | 1
 ovflpoint | 2
 firstfree | 0
 nmaps     | 1
 procid    | 450
-spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 
 SELECT magic, version, ntuples, bsize, bmsize, bmshift, maxbucket, highmask,
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 9f41bb2..682747e 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -667,11 +667,11 @@ bmshift   | 15
 maxbucket | 12512
 highmask  | 16383
 lowmask   | 8191
-ovflpoint | 14
+ovflpoint | 49
 firstfree | 1204
 nmaps     | 1
 procid    | 450
-spares    | {0,0,0,0,0,0,1,1,1,1,1,4,59,704,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,4,4,4,45,55,58,59,508,567,628,704,1193,1202,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {65,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 </screen>
      </para>
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..e0115de 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -58,35 +58,53 @@ rules to support a variable number of overflow pages while not having to
 move primary bucket pages around after they are created.
 
 Primary bucket pages (henceforth just "bucket pages") are allocated in
-power-of-2 groups, called "split points" in the code.  Buckets 0 and 1
-are created when the index is initialized.  At the first split, buckets 2
-and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
-when bucket 8 is needed, buckets 8-15 are allocated; etc.  All the bucket
-pages of a power-of-2 group appear consecutively in the index.  This
-addressing scheme allows the physical location of a bucket page to be
-computed from the bucket number relatively easily, using only a small
-amount of control information.  We take the log2() of the bucket number
-to determine which split point S the bucket belongs to, and then simply
-add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
-metapage) to compute the physical address.  hashm_spares[S] can be
-interpreted as the total number of overflow pages that have been allocated
-before the bucket pages of splitpoint S.  hashm_spares[0] is always 0,
-so that buckets 0 and 1 (which belong to splitpoint 0) always appear at
-block numbers 1 and 2, just after the meta page.  We always have
-hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
-former.  The difference between the two represents the number of overflow
-pages appearing between the bucket page groups of splitpoints N and N+1.
-
+power-of-2 groups, called "split points" in the code.  That means at every new
+splitpoint we double the existing number of buckets.  Allocating huge chunks
+of bucket pages all at once isn't optimal and we will take ages to consume
+those.  To avoid this exponential growth of index size, we use a trick to
+break up the allocation of buckets at a splitpoint into 4 equal phases.  If
+(2 ^ x) is the total number of buckets to be allocated at a splitpoint (from
+now on we shall call this a splitpoint group), then we allocate 1/4th
+(2 ^ (x - 2)) of the total buckets at each phase of the splitpoint group.  The
+next quarter is allocated only once the buckets of the previous phase have
+been consumed.  Since for fewer than 4 buckets we cannot divide the allocation
+further into phases, the first 3 groups have only one phase of allocation.
+Groups 0, 1 and 2 allocate 1, 1 and 2 buckets respectively, at once, in one
+phase.  For groups > 2, where we allocate 4 or more buckets, the work is spread
+among four equal phases.  At group 3 we allocate 4 buckets in 4 different
+phases {1, 1, 1, 1}, the numbers in curly braces indicate the number of buckets
+allocated within each phase of splitpoint group 3.  And, for splitpoint group 4
+and 5 allocation phase will be {2, 2, 2, 2} = 16 buckets in total and
+{4, 4, 4, 4} = 32 buckets in total.  We can see that at each splitpoint group
+we double the total number of buckets from the previous group but in an
+incremental phase.  The bucket pages allocated within one phase of a splitpoint
+group will appear consecutively in the index.  This addressing scheme allows
+the physical location of a bucket page to be computed from the bucket number
+relatively easily, using only a small amount of control information.  If we
+look at the function _hash_spareindex for a given bucket number we first
+compute the splitpoint group it belongs to and then the phase to which the
+bucket belongs.  Adding them we get the global splitpoint phase number S to
+which the bucket belongs and then simply add "hashm_spares[S] + 1"
+(where hashm_spares[] is an array stored in the metapage) with given bucket
+number to compute its physical address.  The hashm_spares[S] can be interpreted
+as the total number of overflow pages that have been allocated before the
+bucket pages of splitpoint phase S.  The hashm_spares[0] is always 0, so that
+buckets 0 and 1 (which belong to splitpoint phases 0 and 1 respectively) always
+appear at block numbers 1 and 2, just after the meta page.
+We always have hashm_spares[N] <= hashm_spares[N+1], since the latter count
+includes the former.  The difference between the two represents the number of
+overflow pages appearing between the bucket page groups of splitpoints phase N
+and N+1.
 (Note: the above describes what happens when filling an initially minimally
-sized hash index.  In practice, we try to estimate the required index size
-and allocate a suitable number of splitpoints immediately, to avoid
+sized hash index.  In practice, we try to estimate the required index size and
+allocate a suitable number of splitpoints phases immediately, to avoid
 expensive re-splitting during initial index build.)
 
 When S splitpoints exist altogether, the array entries hashm_spares[0]
 through hashm_spares[S] are valid; hashm_spares[S] records the current
 total number of overflow pages.  New overflow pages are created as needed
 at the end of the index, and recorded by incrementing hashm_spares[S].
-When it is time to create a new splitpoint's worth of bucket pages, we
+When it is time to create a new splitpoint phase's worth of bucket pages, we
 copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
 stored in the hashm_ovflpoint field of the meta page).  This has the
 effect of reserving the correct number of bucket pages at the end of the
@@ -101,7 +119,7 @@ We have to allow the case "greater than" because it's possible that during
 an index extension we crash after allocating filesystem space and before
 updating the metapage.  Note that on filesystems that allow "holes" in
 files, it's entirely likely that pages before the logical EOF are not yet
-allocated: when we allocate a new splitpoint's worth of bucket pages, we
+allocated: when we allocate a new splitpoint phase's worth of bucket pages, we
 physically zero the last such page to force the EOF up, and the first such
 page will be used immediately, but the intervening pages are not written
 until needed.
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index a3cae21..fe0b4ef 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -49,7 +49,7 @@ bitno_to_blkno(HashMetaPage metap, uint32 ovflbitnum)
 	 * Convert to absolute page number by adding the number of bucket pages
 	 * that exist before this split point.
 	 */
-	return (BlockNumber) ((1 << i) + ovflbitnum);
+	return (BlockNumber) (_hash_get_totalbuckets(i) + ovflbitnum);
 }
 
 /*
@@ -67,14 +67,15 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
 	/* Determine the split number containing this page */
 	for (i = 1; i <= splitnum; i++)
 	{
-		if (ovflblkno <= (BlockNumber) (1 << i))
+		if (ovflblkno <= (BlockNumber) _hash_get_totalbuckets(i))
 			break;				/* oops */
-		bitnum = ovflblkno - (1 << i);
+		bitnum = ovflblkno - _hash_get_totalbuckets(i);
 
 		/*
 		 * bitnum has to be greater than number of overflow page added in
 		 * previous split point. The overflow page at this splitnum (i) if any
-		 * should start from ((2 ^ i) + metap->hashm_spares[i - 1] + 1).
+		 * should start from
+		 * (_hash_get_totalbuckets(i) + metap->hashm_spares[i - 1] + 1).
 		 */
 		if (bitnum > metap->hashm_spares[i - 1] &&
 			bitnum <= metap->hashm_spares[i])
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 61ca2ec..d7374fa 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -502,14 +502,15 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	Page		page;
 	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
+	uint32		spare_index;
 	uint32		i;
 
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
-	 * next power of 2, however, and always force at least 2 bucket pages. The
-	 * upper limit is determined by considerations explained in
+	 * total number of buckets which has to be allocated before using its
+	 * _hashm_spares index slot. However always force at least 2 bucket pages.
+	 * The upper limit is determined by considerations explained in
 	 * _hash_expandtable().
 	 */
 	dnumbuckets = num_tuples / ffactor;
@@ -518,11 +519,10 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	else if (dnumbuckets >= (double) 0x40000000)
 		num_buckets = 0x40000000;
 	else
-		num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
+		num_buckets = _hash_get_totalbuckets(_hash_spareindex(dnumbuckets));
 
-	log2_num_buckets = _hash_log2(num_buckets);
-	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
-	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
+	spare_index = _hash_spareindex(num_buckets);
+	Assert(spare_index < HASH_MAX_SPLITPOINTS);
 
 	page = BufferGetPage(buf);
 	if (initpage)
@@ -563,18 +563,20 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
-	 * blocks 1 to N.  The first freespace bitmap page is in block N+1. Since
-	 * N is a power of 2, we can set the masks this way:
+	 * blocks 1 to N.  The first freespace bitmap page is in block N+1.
 	 */
-	metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
-	metap->hashm_highmask = (num_buckets << 1) - 1;
+	metap->hashm_maxbucket = num_buckets - 1;
+
+	/* set highmask, which should be sufficient to cover num_buckets. */
+	metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;
+	metap->hashm_lowmask = (metap->hashm_highmask >> 1);
 
 	MemSet(metap->hashm_spares, 0, sizeof(metap->hashm_spares));
 	MemSet(metap->hashm_mapp, 0, sizeof(metap->hashm_mapp));
 
 	/* Set up mapping for one spare page after the initial splitpoints */
-	metap->hashm_spares[log2_num_buckets] = 1;
-	metap->hashm_ovflpoint = log2_num_buckets;
+	metap->hashm_spares[spare_index] = 1;
+	metap->hashm_ovflpoint = spare_index;
 	metap->hashm_firstfree = 0;
 
 	/*
@@ -773,25 +775,44 @@ restart_expand:
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to allocate a new batch of bucket pages.
+	 * If the split point is increasing we need to allocate a new batch of
+	 * bucket pages.
 	 */
-	spare_ndx = _hash_log2(new_bucket + 1);
+	spare_ndx = _hash_spareindex(new_bucket + 1);
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
+		uint32		buckets_toadd = 0;
+		uint32		splitpoint_group = 0;
+
 		Assert(spare_ndx == metap->hashm_ovflpoint + 1);
 
 		/*
-		 * The number of buckets in the new splitpoint is equal to the total
-		 * number already in existence, i.e. new_bucket.  Currently this maps
-		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.  We treat allocation of buckets as a
-		 * separate WAL-logged action.  Even if we fail after this operation,
-		 * won't leak bucket pages; rather, the next split will consume this
-		 * space. In any case, even without failure we don't use all the space
-		 * in one split operation.
+		 * The number of buckets in the new splitpoint group is equal to the
+		 * total number already in existence. But we do not allocate them at
+		 * once. Each splitpoint group will have 4 slots, we distribute the
+		 * buckets equally among them. So we allocate only one fourth of total
+		 * buckets in new splitpoint group at a time to consume one phase after
+		 * another. We treat allocation of buckets as a separate WAL-logged
+		 * action. Even if we fail after this operation, won't leak bucket
+		 * pages; rather, the next split will consume this space. In any case,
+		 * even without failure we don't use all the space in one split
+		 * operation.
+		 */
+
+		splitpoint_group = SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(spare_ndx);
+
+		/*
+		 * Each phase in the splitpoint_group will allocate one fourth of total
+		 * buckets to be allocated in splitpoint_group. For
+		 * splitpoint_group < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE, there is
+		 * only one phase of allocation, so we allocate all of the buckets
+		 * belonging to that group at once.
 		 */
-		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
+		buckets_toadd =
+			(splitpoint_group < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) ?
+			(new_bucket) :
+			((1 << (splitpoint_group - 1)) / SPLITPOINT_PHASES_PER_GRP);
+		if (!_hash_alloc_buckets(rel, start_nblkno, buckets_toadd))
 		{
 			/* can't split due to BlockNumber overflow */
 			_hash_relbuf(rel, buf_oblkno);
@@ -836,10 +857,9 @@ restart_expand:
 	}
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to adjust the hashm_spares[] array and
-	 * hashm_ovflpoint so that future overflow pages will be created beyond
-	 * this new batch of bucket pages.
+	 * If the split point is increasing we need to adjust the hashm_spares[]
+	 * array and hashm_ovflpoint so that future overflow pages will be created
+	 * beyond this new batch of bucket pages.
 	 */
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 2e99719..c2f2c71 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -150,6 +150,49 @@ _hash_log2(uint32 num)
 }
 
 /*
+ * _hash_spareindex -- returns spare index / global splitpoint phase of the
+ *					   bucket
+ */
+uint32
+_hash_spareindex(uint32 num_bucket)
+{
+	uint32		splitpoint_group;
+
+	splitpoint_group = _hash_log2(num_bucket);
+
+	if (splitpoint_group < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE)
+		return splitpoint_group;
+
+	return TOTAL_SPLITPOINT_PHASES_BEFORE_GROUP(splitpoint_group) +
+		   SPLITPOINT_PHASES_WITHIN_GROUP(splitpoint_group,
+										  num_bucket - 1); /* to 0-based */
+}
+
+/*
+ *	_hash_get_totalbuckets -- returns total number of buckets allocated till
+ *							the given splitpoint phase.
+ */
+uint32
+_hash_get_totalbuckets(uint32 splitpoint_phase)
+{
+	uint32		splitpoint_group;
+
+	splitpoint_group = SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(splitpoint_phase);
+
+	if (splitpoint_group < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE)
+		return (1 << splitpoint_group);
+
+	/*
+	 * total_buckets = total number of buckets before its splitpoint group +
+	 * total buckets within its splitpoint group until given splitpoint_phase.
+	 */
+	return BUCKETS_BEFORE_SP_GRP(splitpoint_group) +
+		   BUCKETS_WITHIN_SP_GRP(splitpoint_group,
+				((splitpoint_phase - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) %
+				  SPLITPOINT_PHASES_PER_GRP) + 1);
+}
+
+/*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
  * If flags is not zero, it is a bitwise OR of the acceptable values of
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index eb1df57..64e98c2 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -36,7 +36,7 @@ typedef uint32 Bucket;
 #define InvalidBucket	((Bucket) 0xFFFFFFFF)
 
 #define BUCKET_TO_BLKNO(metap,B) \
-		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
+		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_spareindex((B)+1)-1] : 0)) + 1)
 
 /*
  * Special space for hash index pages.
@@ -180,9 +180,46 @@ typedef HashScanOpaqueData *HashScanOpaque;
  * needing to fit into the metapage.  (With 8K block size, 128 bitmaps
  * limit us to 64 GB of overflow space...)
  */
-#define HASH_MAX_SPLITPOINTS		32
+#define HASH_MAX_SPLITPOINTS		128
 #define HASH_MAX_BITMAPS			128
 
+#define SPLITPOINT_PHASES_PER_GRP	4
+#define SPLITPOINT_PHASE_MASK		(SPLITPOINT_PHASES_PER_GRP - 1)
+#define SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE 3
+
+#define TOTAL_SPLITPOINT_PHASES_BEFORE_GROUP(sp_g) \
+		((((sp_g - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) << 2) + \
+		  SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE))
+
+/*
+ * This is just a trick to save a division operation.  If you look at the bit
+ * pattern of the 0-based bucket_num, the 2nd and 3rd most significant bits
+ * indicate which phase of allocation the bucket_num belongs to within the group. This
+ * is because at every splitpoint group we allocate (2 ^ x) buckets and we have
+ * divided the allocation process into 4 equal phases. This macro returns value
+ * from 0 to 3.
+ */
+#define SPLITPOINT_PHASES_WITHIN_GROUP(sp_g, bucket_num) \
+		(((bucket_num) >> (sp_g - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE)) & \
+													SPLITPOINT_PHASE_MASK)
+
+/*
+ * At every splitpoint group we double the total number of buckets. So at
+ * splitpoint group sp_g we allocate (1 << (sp_g -1)) buckets as we will have
+ * same number of buckets already allocated before this group. For splitpoint
+ * groups >= SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE we allocate buckets in 4
+ * equal phases hence we allocate ((1 << (sp_g - 1)) >> 2) buckets per phase.
+ */
+#define BUCKETS_BEFORE_SP_GRP(sp_g) (1 << (sp_g - 1))
+#define BUCKETS_WITHIN_SP_GRP(sp_g, nphase) \
+									((nphase) * ((1 << (sp_g - 1)) >> 2))
+
+#define SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(sp_phase) \
+ 	((sp_phase < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) ? \
+	 (sp_phase) : \
+	 (((sp_phase - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) >> 2) + \
+	  SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE))
+
 typedef struct HashMetaPageData
 {
 	uint32		hashm_magic;	/* magic no. for hash tables */
@@ -382,6 +419,8 @@ extern uint32 _hash_datum2hashkey_type(Relation rel, Datum key, Oid keytype);
 extern Bucket _hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
 					 uint32 highmask, uint32 lowmask);
 extern uint32 _hash_log2(uint32 num);
+extern uint32 _hash_spareindex(uint32 num_bucket);
+extern uint32 _hash_get_totalbuckets(uint32 splitpoint_phase);
 extern void _hash_checkpage(Relation rel, Buffer buf, int flags);
 extern uint32 _hash_get_indextuple_hashkey(IndexTuple itup);
 extern bool _hash_convert_tuple(Relation index,
sortbuild_hash_B_2.patchapplication/octet-stream; name=sortbuild_hash_B_2.patchDownload
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index 0e0f393..18a788f 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -37,7 +37,7 @@ struct HSpool
 {
 	Tuplesortstate *sortstate;	/* state data for tuplesort.c */
 	Relation	index;
-	uint32		hash_mask;		/* bitmask for hash codes */
+	uint32		hash_mod;		/* modulus for hash codes */
 };
 
 
@@ -52,15 +52,12 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 	hspool->index = index;
 
 	/*
-	 * Determine the bitmask for hash code values.  Since there are currently
-	 * num_buckets buckets in the index, the appropriate mask can be computed
-	 * as follows.
-	 *
-	 * Note: at present, the passed-in num_buckets is always a power of 2, so
-	 * we could just compute num_buckets - 1.  We prefer not to assume that
-	 * here, though.
+	 * At this point max_buckets in hash index is num_buckets - 1.
+	 * The "hash key mod num_buckets" will indicate which bucket the
+	 * hash key belongs to, and will be used to sort the index tuples based on
+	 * their bucket.
 	 */
-	hspool->hash_mask = (((uint32) 1) << _hash_log2(num_buckets)) - 1;
+	hspool->hash_mod = num_buckets;
 
 	/*
 	 * We size the sort area as maintenance_work_mem rather than work_mem to
@@ -69,7 +66,7 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 	 */
 	hspool->sortstate = tuplesort_begin_index_hash(heap,
 												   index,
-												   hspool->hash_mask,
+												   hspool->hash_mod,
 												   maintenance_work_mem,
 												   false);
 
@@ -122,7 +119,7 @@ _h_indexbuild(HSpool *hspool, Relation heapRel)
 #ifdef USE_ASSERT_CHECKING
 		uint32		lasthashkey = hashkey;
 
-		hashkey = _hash_get_indextuple_hashkey(itup) & hspool->hash_mask;
+		hashkey = _hash_get_indextuple_hashkey(itup) % hspool->hash_mod;
 		Assert(hashkey >= lasthashkey);
 #endif
 
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index e1e692d..8ff50a1 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -473,7 +473,7 @@ struct Tuplesortstate
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 
 	/* These are specific to the index_hash subcase: */
-	uint32		hash_mask;		/* mask for sortable part of hash code */
+	uint32		hash_mod;		/* modulus for sortable part of hash code */
 
 	/*
 	 * These variables are specific to the Datum case; they are set by
@@ -991,7 +991,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 Tuplesortstate *
 tuplesort_begin_index_hash(Relation heapRel,
 						   Relation indexRel,
-						   uint32 hash_mask,
+						   uint32 hash_mod,
 						   int workMem, bool randomAccess)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
@@ -1002,8 +1002,8 @@ tuplesort_begin_index_hash(Relation heapRel,
 #ifdef TRACE_SORT
 	if (trace_sort)
 		elog(LOG,
-		"begin index sort: hash_mask = 0x%x, workMem = %d, randomAccess = %c",
-			 hash_mask,
+		"begin index sort: hash_mod = 0x%x, workMem = %d, randomAccess = %c",
+			 hash_mod,
 			 workMem, randomAccess ? 't' : 'f');
 #endif
 
@@ -1017,7 +1017,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	state->heapRel = heapRel;
 	state->indexRel = indexRel;
 
-	state->hash_mask = hash_mask;
+	state->hash_mod = hash_mod;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -4167,9 +4167,9 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 	 * that the first column of the index tuple is the hash key.
 	 */
 	Assert(!a->isnull1);
-	hash1 = DatumGetUInt32(a->datum1) & state->hash_mask;
+	hash1 = DatumGetUInt32(a->datum1) % state->hash_mod;
 	Assert(!b->isnull1);
-	hash2 = DatumGetUInt32(b->datum1) & state->hash_mask;
+	hash2 = DatumGetUInt32(b->datum1) % state->hash_mod;
 
 	if (hash1 > hash2)
 		return 1;
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 5b3f475..03594e7 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -72,7 +72,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 							int workMem, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
 						   Relation indexRel,
-						   uint32 hash_mask,
+						   uint32 hash_mod,
 						   int workMem, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 					  Oid sortOperator, Oid sortCollation,
#21Robert Haas
robertmhaas@gmail.com
In reply to: Mithun Cy (#20)
Re: [POC] A better way to expand hash indexes.

On Tue, Mar 28, 2017 at 5:00 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

This will go wrong for split point group zero. In general, I feel if
you handle computation for split groups lesser than
SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE in the caller, then all your
macros will become much simpler and less error prone.

Fixed; apart from SPLITPOINT_PHASE_TO_SPLITPOINT_GRP, all the other macros
only handle multi-phase groups. SPLITPOINT_PHASE_TO_SPLITPOINT_GRP is used
in one more place, at expand index, so I thought keeping it as it is is
better.

I wonder if we should consider increasing
SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE somewhat. For example, split
point 4 is responsible for allocating only 16 new buckets = 128kB;
doing those in four groups of two (16kB) seems fairly pointless.
Suppose we start applying this technique beginning around splitpoint 9
or 10. Breaking 1024 new buckets * 8kB = 8MB of index growth into 4
phases might save enough to be worthwhile.

Of course, even if we decide to apply this even for very small
splitpoints, it probably doesn't cost us anything other than some
space in the metapage. But maybe saving space in the metapage isn't
such a bad idea anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#21)
Re: [POC] A better way to expand hash indexes.

On Tue, Mar 28, 2017 at 8:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 28, 2017 at 5:00 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

This will go wrong for split point group zero. In general, I feel if
you handle computation for split groups lesser than
SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE in the caller, then all your
macros will become much simpler and less error prone.

Fixed; apart from SPLITPOINT_PHASE_TO_SPLITPOINT_GRP, all the other macros
only handle multi-phase groups. SPLITPOINT_PHASE_TO_SPLITPOINT_GRP is used
in one more place, at expand index, so I thought keeping it as it is is
better.

I wonder if we should consider increasing
SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE somewhat. For example, split
point 4 is responsible for allocating only 16 new buckets = 128kB;
doing those in four groups of two (16kB) seems fairly pointless.
Suppose we start applying this technique beginning around splitpoint 9
or 10. Breaking 1024 new buckets * 8kB = 8MB of index growth into 4
phases might save enough to be worthwhile.

10 sounds like a better point to start allocating in phases.

Of course, even if we decide to apply this even for very small
splitpoints, it probably doesn't cost us anything other than some
space in the metapage. But maybe saving space in the metapage isn't
such a bad idea anyway.

Yeah, metapage space is scarce, so let's try to save as much of it as possible.

Few other comments:
+/*
+ * This is just a trick to save a division operation. If you look into the
+ * bitmap of 0-based bucket_num 2nd and 3rd most significant bit will indicate
+ * which phase of allocation the bucket_num belongs to with in the group. This
+ * is because at every splitpoint group we allocate (2 ^ x) buckets and we have
+ * divided the allocation process into 4 equal phases. This macro returns value
+ * from 0 to 3.
+ */
+#define SPLITPOINT_PHASES_WITHIN_GROUP(sp_g, bucket_num) \
+ (((bucket_num) >> (sp_g - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE)) & \
+ SPLITPOINT_PHASE_MASK)

This won't work if we change SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE to a
number other than 3. I think you should change it so that it can work
with any value of SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE.
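
For illustration, a division-based version that does not depend on that
constant could look something like the sketch below. This is only an
illustrative sketch of mine, not code from the patch; the function name is
made up, and it assumes the patch's convention that a multi-phase group sp_g
begins at 0-based bucket number 2^(sp_g - 1).

#include <assert.h>

#define SPLITPOINT_PHASES_PER_GRP	4

/* phase (0..3) of a 0-based bucket_num inside multi-phase group sp_g */
static unsigned
phase_within_group(unsigned sp_g, unsigned bucket_num)
{
	unsigned	group_start = 1U << (sp_g - 1);	/* buckets before this group */
	unsigned	phase_size = group_start / SPLITPOINT_PHASES_PER_GRP;

	return (bucket_num - group_start) / phase_size;
}

int
main(void)
{
	unsigned	sp_g;

	/* agrees with the intended bit trick, (bucket_num >> (sp_g - 3)) & 3 */
	for (sp_g = 10; sp_g <= 12; sp_g++)
	{
		unsigned	b;

		for (b = 1U << (sp_g - 1); b < (1U << sp_g); b++)
			assert(phase_within_group(sp_g, b) == ((b >> (sp_g - 3)) & 3));
	}
	return 0;
}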

I think you should name this define SPLITPOINT_PHASE_WITHIN_GROUP,
as it refers to only one particular phase within the group.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#23Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Amit Kapila (#22)
1 attachment(s)
Re: [POC] A better way to expand hash indexes.

On Wed, Mar 29, 2017 at 10:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I wonder if we should consider increasing
SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE somewhat. For example, split
point 4 is responsible for allocating only 16 new buckets = 128kB;
doing those in four groups of two (16kB) seems fairly pointless.
Suppose we start applying this technique beginning around splitpoint 9
or 10. Breaking 1024 new buckets * 8kB = 8MB of index growth into 4
phases might save enough to be worthwhile.

10 sounds like a better point to start allocating in phases.

+1. At splitpoint group 10 we will allocate (2 ^ 9) buckets = 4MB in
total and each phase will allocate 2 ^ 7 buckets = 128 * 8kB = 1MB.
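
Spelling that arithmetic out as a tiny standalone check (my own sketch, not
part of the patch, assuming the default 8kB block size):

#include <stdio.h>

int
main(void)
{
	unsigned	group = 10;
	unsigned	new_buckets = 1U << (group - 1);	/* 512 buckets added by group 10 */
	unsigned	per_phase = new_buckets / 4;		/* 128 buckets per phase */

	/* prints: group 10: 4096 kB total, 1024 kB per phase */
	printf("group %u: %u kB total, %u kB per phase\n",
		   group, new_buckets * 8, per_phase * 8);
	return 0;
}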

Few other comments:
+/*
+ * This is just a trick to save a division operation. If you look into the
+ * bitmap of 0-based bucket_num 2nd and 3rd most significant bit will indicate
+ * which phase of allocation the bucket_num belongs to with in the group. This
+ * is because at every splitpoint group we allocate (2 ^ x) buckets and we have
+ * divided the allocation process into 4 equal phases. This macro returns value
+ * from 0 to 3.
+ */
+#define SPLITPOINT_PHASES_WITHIN_GROUP(sp_g, bucket_num) \
+ (((bucket_num) >> (sp_g - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE)) & \
+ SPLITPOINT_PHASE_MASK)
This won't work if we change SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE to a
number other than 3.  I think you should change it so that it can work
with any value of SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE.

Fixed; using SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE there was accidental. All
I need are the 3 most significant bits, hence the shift amount should always
be (splitpoint group - 3).
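
To make that concrete with a worked example of my own: 0-based bucket 700
falls in splitpoint group 10 (512 <= 700 < 1024); 700 >> (10 - 3) = 5
(binary 101), and 5 & 3 = 1, which is the same phase as the division form
(700 - 512) / 128 = 1.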

I think you should name this define SPLITPOINT_PHASE_WITHIN_GROUP,
as it refers to only one particular phase within the group.

Fixed.

Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Attachments:

yet_another_expand_hashbucket_efficiently_10.patch (application/octet-stream)
diff --git a/contrib/pageinspect/expected/hash.out b/contrib/pageinspect/expected/hash.out
index 3ba01f6..e287093 100644
--- a/contrib/pageinspect/expected/hash.out
+++ b/contrib/pageinspect/expected/hash.out
@@ -51,13 +51,13 @@ bsize     | 8152
 bmsize    | 4096
 bmshift   | 15
 maxbucket | 3
-highmask  | 7
-lowmask   | 3
+highmask  | 3
+lowmask   | 1
 ovflpoint | 2
 firstfree | 0
 nmaps     | 1
 procid    | 450
-spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 
 SELECT magic, version, ntuples, bsize, bmsize, bmshift, maxbucket, highmask,
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 9f41bb2..b3d0056 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -667,11 +667,11 @@ bmshift   | 15
 maxbucket | 12512
 highmask  | 16383
 lowmask   | 8191
-ovflpoint | 14
+ovflpoint | 28
 firstfree | 1204
 nmaps     | 1
 procid    | 450
-spares    | {0,0,0,0,0,0,1,1,1,1,1,4,59,704,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,0,0,0,0,1,1,1,1,1,1,1,1,3,4,4,4,45,55,58,59,508,567,628,704,1193,1202,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {65,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 </screen>
      </para>
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..bc339bc 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -58,35 +58,52 @@ rules to support a variable number of overflow pages while not having to
 move primary bucket pages around after they are created.
 
 Primary bucket pages (henceforth just "bucket pages") are allocated in
-power-of-2 groups, called "split points" in the code.  Buckets 0 and 1
-are created when the index is initialized.  At the first split, buckets 2
-and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
-when bucket 8 is needed, buckets 8-15 are allocated; etc.  All the bucket
-pages of a power-of-2 group appear consecutively in the index.  This
-addressing scheme allows the physical location of a bucket page to be
-computed from the bucket number relatively easily, using only a small
-amount of control information.  We take the log2() of the bucket number
-to determine which split point S the bucket belongs to, and then simply
-add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
-metapage) to compute the physical address.  hashm_spares[S] can be
-interpreted as the total number of overflow pages that have been allocated
-before the bucket pages of splitpoint S.  hashm_spares[0] is always 0,
-so that buckets 0 and 1 (which belong to splitpoint 0) always appear at
-block numbers 1 and 2, just after the meta page.  We always have
+power-of-2 groups, called "split points" in the code.  That means at every new
+splitpoint we double the existing number of buckets.  Allocating huge chunks
+of bucket pages all at once isn't optimal and we will take ages to consume
+those.  To avoid this exponential growth of index size, we did use a trick to
+break up allocation of buckets at the splitpoint into 4 equal phases.  If
+(2 ^ x) are the total buckets need to be allocated at a splitpoint (from now on
+we shall call this as a splitpoint group), then we allocate 1/4th (2 ^ (x - 2))
+of total buckets at each phase of splitpoint group.  Next quarter of allocation
+will only happen if buckets of the previous phase have been already consumed.
+For the initial splitpoint groups < 10 we will allocate all of their buckets in
+single phase only, as number of buckets allocated at initial groups are small
+in numbers.  And for the groups >= 10 the allocation process is distributed
+among four equal phases.  At group 10 we allocate (2 ^ 9) buckets in 4
+different phases {2 ^ 7, 2 ^ 7, 2 ^ 7, 2 ^ 7}, the numbers in curly braces
+indicate the number of buckets allocated within each phase of splitpoint group
+10.  And, for splitpoint group 11 and 12 allocation phases will be
+{2 ^ 8, 2 ^ 8, 2 ^ 8, 2 ^ 8} and {2 ^ 9, 2 ^ 9, 2 ^ 9, 2 ^ 9} respectively.  We
+can see that at each splitpoint group we double the total number of buckets
+from the previous group but in an incremental phase.  The bucket pages
+allocated within one phase of a splitpoint group will appear consecutively in
+the index.  This addressing scheme allows the physical location of a bucket
+page to be computed from the bucket number relatively easily, using only a
+small amount of control information.  If we look at the function
+_hash_spareindex for a given bucket number we first compute the
+splitpoint group it belongs to and then the phase to which the bucket belongs
+to.  Adding them we get the global splitpoint phase number S to which the
+bucket belongs and then simply add "hashm_spares[S] + 1" (where hashm_spares[]
+is an array stored in the metapage) with given bucket number to compute its
+physical address.  The hashm_spares[S] can be interpreted as the total number
+of overflow pages that have been allocated before the bucket pages of
+splitpoint phase S.  The hashm_spares[0] is always 0, so that buckets 0 and 1
+(which belong to splitpoint group 0's phase 1 and phase 2 respectively) always
+appear at block numbers 1 and 2, just after the meta page.  We always have
 hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
-former.  The difference between the two represents the number of overflow
-pages appearing between the bucket page groups of splitpoints N and N+1.
-
+former.  The difference between the two represents the number of overflow pages
+appearing between the bucket page groups of splitpoints phase N and N+1.
 (Note: the above describes what happens when filling an initially minimally
-sized hash index.  In practice, we try to estimate the required index size
-and allocate a suitable number of splitpoints immediately, to avoid
+sized hash index.  In practice, we try to estimate the required index size and
+allocate a suitable number of splitpoints phases immediately, to avoid
 expensive re-splitting during initial index build.)
 
 When S splitpoints exist altogether, the array entries hashm_spares[0]
 through hashm_spares[S] are valid; hashm_spares[S] records the current
 total number of overflow pages.  New overflow pages are created as needed
 at the end of the index, and recorded by incrementing hashm_spares[S].
-When it is time to create a new splitpoint's worth of bucket pages, we
+When it is time to create a new splitpoint phase's worth of bucket pages, we
 copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
 stored in the hashm_ovflpoint field of the meta page).  This has the
 effect of reserving the correct number of bucket pages at the end of the
@@ -101,7 +118,7 @@ We have to allow the case "greater than" because it's possible that during
 an index extension we crash after allocating filesystem space and before
 updating the metapage.  Note that on filesystems that allow "holes" in
 files, it's entirely likely that pages before the logical EOF are not yet
-allocated: when we allocate a new splitpoint's worth of bucket pages, we
+allocated: when we allocate a new splitpoint phase's worth of bucket pages, we
 physically zero the last such page to force the EOF up, and the first such
 page will be used immediately, but the intervening pages are not written
 until needed.
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index a3cae21..fe0b4ef 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -49,7 +49,7 @@ bitno_to_blkno(HashMetaPage metap, uint32 ovflbitnum)
 	 * Convert to absolute page number by adding the number of bucket pages
 	 * that exist before this split point.
 	 */
-	return (BlockNumber) ((1 << i) + ovflbitnum);
+	return (BlockNumber) (_hash_get_totalbuckets(i) + ovflbitnum);
 }
 
 /*
@@ -67,14 +67,15 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
 	/* Determine the split number containing this page */
 	for (i = 1; i <= splitnum; i++)
 	{
-		if (ovflblkno <= (BlockNumber) (1 << i))
+		if (ovflblkno <= (BlockNumber) _hash_get_totalbuckets(i))
 			break;				/* oops */
-		bitnum = ovflblkno - (1 << i);
+		bitnum = ovflblkno - _hash_get_totalbuckets(i);
 
 		/*
 		 * bitnum has to be greater than number of overflow page added in
 		 * previous split point. The overflow page at this splitnum (i) if any
-		 * should start from ((2 ^ i) + metap->hashm_spares[i - 1] + 1).
+		 * should start from
+		 * (_hash_get_totalbuckets(i) + metap->hashm_spares[i - 1] + 1).
 		 */
 		if (bitnum > metap->hashm_spares[i - 1] &&
 			bitnum <= metap->hashm_spares[i])
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 61ca2ec..d7374fa 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -502,14 +502,15 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	Page		page;
 	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
+	uint32		spare_index;
 	uint32		i;
 
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
-	 * next power of 2, however, and always force at least 2 bucket pages. The
-	 * upper limit is determined by considerations explained in
+	 * total number of buckets which has to be allocated before using its
+	 * _hashm_spares index slot. However always force at least 2 bucket pages.
+	 * The upper limit is determined by considerations explained in
 	 * _hash_expandtable().
 	 */
 	dnumbuckets = num_tuples / ffactor;
@@ -518,11 +519,10 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	else if (dnumbuckets >= (double) 0x40000000)
 		num_buckets = 0x40000000;
 	else
-		num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
+		num_buckets = _hash_get_totalbuckets(_hash_spareindex(dnumbuckets));
 
-	log2_num_buckets = _hash_log2(num_buckets);
-	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
-	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
+	spare_index = _hash_spareindex(num_buckets);
+	Assert(spare_index < HASH_MAX_SPLITPOINTS);
 
 	page = BufferGetPage(buf);
 	if (initpage)
@@ -563,18 +563,20 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
-	 * blocks 1 to N.  The first freespace bitmap page is in block N+1. Since
-	 * N is a power of 2, we can set the masks this way:
+	 * blocks 1 to N.  The first freespace bitmap page is in block N+1.
 	 */
-	metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
-	metap->hashm_highmask = (num_buckets << 1) - 1;
+	metap->hashm_maxbucket = num_buckets - 1;
+
+	/* set highmask, which should be sufficient to cover num_buckets. */
+	metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;
+	metap->hashm_lowmask = (metap->hashm_highmask >> 1);
 
 	MemSet(metap->hashm_spares, 0, sizeof(metap->hashm_spares));
 	MemSet(metap->hashm_mapp, 0, sizeof(metap->hashm_mapp));
 
 	/* Set up mapping for one spare page after the initial splitpoints */
-	metap->hashm_spares[log2_num_buckets] = 1;
-	metap->hashm_ovflpoint = log2_num_buckets;
+	metap->hashm_spares[spare_index] = 1;
+	metap->hashm_ovflpoint = spare_index;
 	metap->hashm_firstfree = 0;
 
 	/*
@@ -773,25 +775,44 @@ restart_expand:
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to allocate a new batch of bucket pages.
+	 * If the split point is increasing we need to allocate a new batch of
+	 * bucket pages.
 	 */
-	spare_ndx = _hash_log2(new_bucket + 1);
+	spare_ndx = _hash_spareindex(new_bucket + 1);
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
+		uint32		buckets_toadd = 0;
+		uint32		splitpoint_group = 0;
+
 		Assert(spare_ndx == metap->hashm_ovflpoint + 1);
 
 		/*
-		 * The number of buckets in the new splitpoint is equal to the total
-		 * number already in existence, i.e. new_bucket.  Currently this maps
-		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.  We treat allocation of buckets as a
-		 * separate WAL-logged action.  Even if we fail after this operation,
-		 * won't leak bucket pages; rather, the next split will consume this
-		 * space. In any case, even without failure we don't use all the space
-		 * in one split operation.
+		 * The number of buckets in the new splitpoint group is equal to the
+		 * total number already in existence. But we do not allocate them at
+		 * once. Each splitpoint group will have 4 slots, we distribute the
+		 * buckets equally among them. So we allocate only one fourth of total
+		 * buckets in new splitpoint group at a time to consume one phase after
+		 * another. We treat allocation of buckets as a separate WAL-logged
+		 * action. Even if we fail after this operation, won't leak bucket
+		 * pages; rather, the next split will consume this space. In any case,
+		 * even without failure we don't use all the space in one split
+		 * operation.
+		 */
+
+		splitpoint_group = SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(spare_ndx);
+
+		/*
+		 * Each phase in the splitpoint_group will allocate one fourth of total
+		 * buckets to be allocated in splitpoint_group. For
+		 * splitpoint_group < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE, have only
+		 * one phase of allocation so we allocate all of the buckets belonging
+		 * to that buckets at once.
 		 */
-		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
+		buckets_toadd =
+			(splitpoint_group < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) ?
+			(new_bucket) :
+			((1 << (splitpoint_group - 1)) / SPLITPOINT_PHASES_PER_GRP);
+		if (!_hash_alloc_buckets(rel, start_nblkno, buckets_toadd))
 		{
 			/* can't split due to BlockNumber overflow */
 			_hash_relbuf(rel, buf_oblkno);
@@ -836,10 +857,9 @@ restart_expand:
 	}
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to adjust the hashm_spares[] array and
-	 * hashm_ovflpoint so that future overflow pages will be created beyond
-	 * this new batch of bucket pages.
+	 * If the split point is increasing we need to adjust the hashm_spares[]
+	 * array and hashm_ovflpoint so that future overflow pages will be created
+	 * beyond this new batch of bucket pages.
 	 */
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 2e99719..df8c74b 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -150,6 +150,49 @@ _hash_log2(uint32 num)
 }
 
 /*
+ * _hash_spareindex -- returns spare index / global splitpoint phase of the
+ *					   bucket
+ */
+uint32
+_hash_spareindex(uint32 num_bucket)
+{
+	uint32		splitpoint_group;
+
+	splitpoint_group = _hash_log2(num_bucket);
+
+	if (splitpoint_group < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE)
+		return splitpoint_group;
+
+	return TOTAL_SPLITPOINT_PHASES_BEFORE_GROUP(splitpoint_group) +
+		   SPLITPOINT_PHASE_WITHIN_GROUP(splitpoint_group,
+										  num_bucket - 1); /* to 0-based */
+}
+
+/*
+ *	_hash_get_totalbuckets -- returns total number of buckets allocated till
+ *							the given splitpoint phase.
+ */
+uint32
+_hash_get_totalbuckets(uint32 splitpoint_phase)
+{
+	uint32		splitpoint_group;
+
+	splitpoint_group = SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(splitpoint_phase);
+
+	if (splitpoint_group < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE)
+		return (1 << splitpoint_group);
+
+	/*
+	 * total_buckets = total number of buckets before its splitpoint group +
+	 * total buckets within its splitpoint group until given splitpoint_phase.
+	 */
+	return BUCKETS_BEFORE_SP_GRP(splitpoint_group) +
+		   BUCKETS_WITHIN_SP_GRP(splitpoint_group,
+				((splitpoint_phase - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) %
+				  SPLITPOINT_PHASES_PER_GRP) + 1);	/* to 1-based */
+}
+
+/*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
  * If flags is not zero, it is a bitwise OR of the acceptable values of
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index eb1df57..d186c69 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -36,7 +36,7 @@ typedef uint32 Bucket;
 #define InvalidBucket	((Bucket) 0xFFFFFFFF)
 
 #define BUCKET_TO_BLKNO(metap,B) \
-		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
+		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_spareindex((B)+1)-1] : 0)) + 1)
 
 /*
  * Special space for hash index pages.
@@ -180,9 +180,48 @@ typedef HashScanOpaqueData *HashScanOpaque;
  * needing to fit into the metapage.  (With 8K block size, 128 bitmaps
  * limit us to 64 GB of overflow space...)
  */
-#define HASH_MAX_SPLITPOINTS		32
 #define HASH_MAX_BITMAPS			128
 
+#define SPLITPOINT_PHASES_PER_GRP	4
+#define SPLITPOINT_PHASE_MASK		(SPLITPOINT_PHASES_PER_GRP - 1)
+#define SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE 10
+#define HASH_MAX_SPLITPOINTS \
+		((32 - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) * \
+		 SPLITPOINT_PHASES_PER_GRP) + \
+		 SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE
+
+#define TOTAL_SPLITPOINT_PHASES_BEFORE_GROUP(sp_g) \
+		((((sp_g - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) << 2) + \
+		  SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE))
+
+/*
+ * This is just a trick to save a division operation. If you look into the
+ * bitmap of 0-based bucket_num 2nd and 3rd most significant bit will indicate
+ * which phase of allocation the bucket_num belongs to with in the group. This
+ * is because at every splitpoint group we allocate (2 ^ x) buckets and we have
+ * divided the allocation process into 4 equal phases. This macro returns value
+ * from 0 to 3.
+ */
+#define SPLITPOINT_PHASE_WITHIN_GROUP(sp_g, bucket_num) \
+		(((bucket_num) >> (sp_g - 3)) & SPLITPOINT_PHASE_MASK)
+
+/*
+ * At every splitpoint group we double the total number of buckets. So at
+ * splitpoint group sp_g we allocate (1 << (sp_g -1)) buckets as we will have
+ * same number of buckets already allocated before this group. For spitpoint
+ * groups >= SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE we allocate buckets in 4
+ * equal phases hence we allocate ((1 << (sp_g - 1)) >> 2) buckets per phase.
+ */
+#define BUCKETS_BEFORE_SP_GRP(sp_g) (1 << (sp_g - 1))
+#define BUCKETS_WITHIN_SP_GRP(sp_g, nphase) \
+									((nphase) * ((1 << (sp_g - 1)) >> 2))
+
+#define SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(sp_phase) \
+		((sp_phase < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) ? \
+		 (sp_phase) : \
+		 (((sp_phase - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) >> 2) + \
+		  SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE))
+
 typedef struct HashMetaPageData
 {
 	uint32		hashm_magic;	/* magic no. for hash tables */
@@ -382,6 +421,8 @@ extern uint32 _hash_datum2hashkey_type(Relation rel, Datum key, Oid keytype);
 extern Bucket _hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
 					 uint32 highmask, uint32 lowmask);
 extern uint32 _hash_log2(uint32 num);
+extern uint32 _hash_spareindex(uint32 num_bucket);
+extern uint32 _hash_get_totalbuckets(uint32 splitpoint_phase);
 extern void _hash_checkpage(Relation rel, Buffer buf, int flags);
 extern uint32 _hash_get_indextuple_hashkey(IndexTuple itup);
 extern bool _hash_convert_tuple(Relation index,
#24Amit Kapila
amit.kapila16@gmail.com
In reply to: Mithun Cy (#23)
Re: [POC] A better way to expand hash indexes.

On Wed, Mar 29, 2017 at 12:51 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

On Wed, Mar 29, 2017 at 10:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Few other comments:
+/*
+ * This is just a trick to save a division operation. If you look into the
+ * bitmap of 0-based bucket_num 2nd and 3rd most significant bit will indicate
+ * which phase of allocation the bucket_num belongs to with in the group. This
+ * is because at every splitpoint group we allocate (2 ^ x) buckets and we have
+ * divided the allocation process into 4 equal phases. This macro returns value
+ * from 0 to 3.
+ */
+#define SPLITPOINT_PHASES_WITHIN_GROUP(sp_g, bucket_num) \
+ (((bucket_num) >> (sp_g - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE)) & \
+ SPLITPOINT_PHASE_MASK)
This won't work if we change SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE to a
number other than 3.  I think you should change it so that it can work
with any value of SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE.

Fixed; using SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE there was accidental. All
I need are the 3 most significant bits, hence the shift amount should always
be (splitpoint group - 3).

Okay, your current patch looks good to me apart from minor comments,
so I have marked it as Ready For Committer. Please either merge
sortbuild_hash_B_2.patch with the main patch or submit it along with the
next revision for easier reference.

Few minor comments:
1.
+splitpoint phase S.  The hashm_spares[0] is always 0, so that buckets 0 and 1
+(which belong to splitpoint group 0's phase 1 and phase 2 respectively) always
+appear at block numbers 1 and 2, just after the meta page.

This explanation doesn't seem right, as with the current patch we
start phased allocation only after splitpoint group 9.

2.
-#define HASH_MAX_SPLITPOINTS 32
#define HASH_MAX_BITMAPS 128

+#define SPLITPOINT_PHASES_PER_GRP 4
+#define SPLITPOINT_PHASE_MASK (SPLITPOINT_PHASES_PER_GRP - 1)
+#define SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE 10
+#define HASH_MAX_SPLITPOINTS \
+ ((32 - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) * \
+ SPLITPOINT_PHASES_PER_GRP) + \
+ SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE

You have changed the value of HASH_MAX_SPLITPOINTS, but the comments
explaining that value are still unchanged. Refer to the text below.
"The limitation on the size of spares[] comes from the fact that there's
* no point in having more than 2^32 buckets with only uint32 hashcodes."

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#25Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Jesper Pedersen (#17)
Re: [POC] A better way to expand hash indexes.

That means at every new
+split point we double the existing number of buckets. Allocating huge chucks

On Mon, Mar 27, 2017 at 11:56 PM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

I ran some performance scenarios on the patch to see if the increased
'spares' allocation had an impact. I haven't found any regressions in that
regard.

Thanks Jesper for testing the patch.

Attached patch contains some small fixes, mainly to the documentation - on
top of v7.

I have taken in some of the grammatical and spelling issues you
mentioned. One major thing I left as it is is the term "splitpoint",
which you had tried to change in many places to "split point". The
term "splitpoint" was not introduced by me; it was already used in
many places, so I think it is acceptable to keep it. I would rather
not add changes which are not part of the core issue; another patch on
top of this should be okay.

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#26Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Amit Kapila (#24)
3 attachment(s)
Re: [POC] A better way to expand hash indexes.

Thanks, Amit, for the detailed review.

On Wed, Mar 29, 2017 at 4:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, your current patch looks good to me apart from minor comments,
so I have marked it as Ready For Committer. Please either merge
sortbuild_hash_B_2.patch with the main patch or submit it along with the
next revision for easier reference.

I will keep it separate just in case the committer prefers
sortbuild_hash_A.patch. We can use either of the sortbuild_hash_*.patch
files on top of the main patch.

Few minor comments:
1.
+splitpoint phase S.  The hashm_spares[0] is always 0, so that buckets 0 and 1
+(which belong to splitpoint group 0's phase 1 and phase 2 respectively) always
+appear at block numbers 1 and 2, just after the meta page.

This explanation doesn't seem right, as with the current patch we
start phased allocation only after splitpoint group 9.

Again my mistake; I have removed the sentence in parentheses.

2.
-#define HASH_MAX_SPLITPOINTS 32
#define HASH_MAX_BITMAPS 128

+#define SPLITPOINT_PHASES_PER_GRP 4
+#define SPLITPOINT_PHASE_MASK (SPLITPOINT_PHASES_PER_GRP - 1)
+#define SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE 10
+#define HASH_MAX_SPLITPOINTS \
+ ((32 - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) * \
+ SPLITPOINT_PHASES_PER_GRP) + \
+ SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE

You have changed the value of HASH_MAX_SPLITPOINTS, but the comments
explaining that value are still unchanged. Refer to the text below.
"The limitation on the size of spares[] comes from the fact that there's
* no point in having more than 2^32 buckets with only uint32 hashcodes."

The limitation is still indirectly imposed by the fact that we can
have only 2^32 buckets. But I have also added a note that
HASH_MAX_SPLITPOINTS additionally accounts for the fact that, after
SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE, bucket allocation will be done
in multiple (exactly 4) phases.
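
To make the space cost concrete, plugging the constants into the new macro
(my own arithmetic, not text from the patch):

    HASH_MAX_SPLITPOINTS = (32 - 10) * 4 + 10 = 98

so hashm_spares[] grows from 32 to 98 uint32 slots, i.e. from 128 bytes to
392 bytes of metapage space, still well under the 512 bytes used by the
128-entry hashm_mapp[].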

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Attachments:

yet_another_expand_hashbucket_efficiently_11.patch (application/octet-stream)
diff --git a/contrib/pageinspect/expected/hash.out b/contrib/pageinspect/expected/hash.out
index 3ba01f6..e287093 100644
--- a/contrib/pageinspect/expected/hash.out
+++ b/contrib/pageinspect/expected/hash.out
@@ -51,13 +51,13 @@ bsize     | 8152
 bmsize    | 4096
 bmshift   | 15
 maxbucket | 3
-highmask  | 7
-lowmask   | 3
+highmask  | 3
+lowmask   | 1
 ovflpoint | 2
 firstfree | 0
 nmaps     | 1
 procid    | 450
-spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 
 SELECT magic, version, ntuples, bsize, bmsize, bmshift, maxbucket, highmask,
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 9f41bb2..b3d0056 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -667,11 +667,11 @@ bmshift   | 15
 maxbucket | 12512
 highmask  | 16383
 lowmask   | 8191
-ovflpoint | 14
+ovflpoint | 28
 firstfree | 1204
 nmaps     | 1
 procid    | 450
-spares    | {0,0,0,0,0,0,1,1,1,1,1,4,59,704,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,0,0,0,0,1,1,1,1,1,1,1,1,3,4,4,4,45,55,58,59,508,567,628,704,1193,1202,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {65,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 </screen>
      </para>
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..c8a0ec7 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -58,35 +58,51 @@ rules to support a variable number of overflow pages while not having to
 move primary bucket pages around after they are created.
 
 Primary bucket pages (henceforth just "bucket pages") are allocated in
-power-of-2 groups, called "split points" in the code.  Buckets 0 and 1
-are created when the index is initialized.  At the first split, buckets 2
-and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
-when bucket 8 is needed, buckets 8-15 are allocated; etc.  All the bucket
-pages of a power-of-2 group appear consecutively in the index.  This
-addressing scheme allows the physical location of a bucket page to be
-computed from the bucket number relatively easily, using only a small
-amount of control information.  We take the log2() of the bucket number
-to determine which split point S the bucket belongs to, and then simply
-add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
-metapage) to compute the physical address.  hashm_spares[S] can be
-interpreted as the total number of overflow pages that have been allocated
-before the bucket pages of splitpoint S.  hashm_spares[0] is always 0,
-so that buckets 0 and 1 (which belong to splitpoint 0) always appear at
-block numbers 1 and 2, just after the meta page.  We always have
-hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
-former.  The difference between the two represents the number of overflow
-pages appearing between the bucket page groups of splitpoints N and N+1.
-
+power-of-2 groups, called "split points" in the code.  That means at every new
+splitpoint we double the existing number of buckets.  Allocating huge chunks
+of bucket pages all at once isn't optimal and we will take ages to consume
+those.  To avoid this exponential growth of index size, we did use a trick to
+break up allocation of buckets at the splitpoint into 4 equal phases.  If
+(2 ^ x) are the total buckets need to be allocated at a splitpoint (from now on
+we shall call this as a splitpoint group), then we allocate 1/4th (2 ^ (x - 2))
+of total buckets at each phase of splitpoint group.  Next quarter of allocation
+will only happen if buckets of the previous phase have been already consumed.
+For the initial splitpoint groups < 10 we will allocate all of their buckets in
+single phase only, as number of buckets allocated at initial groups are small
+in numbers.  And for the groups >= 10 the allocation process is distributed
+among four equal phases.  At group 10 we allocate (2 ^ 9) buckets in 4
+different phases {2 ^ 7, 2 ^ 7, 2 ^ 7, 2 ^ 7}, the numbers in curly braces
+indicate the number of buckets allocated within each phase of splitpoint group
+10.  And, for splitpoint group 11 and 12 allocation phases will be
+{2 ^ 8, 2 ^ 8, 2 ^ 8, 2 ^ 8} and {2 ^ 9, 2 ^ 9, 2 ^ 9, 2 ^ 9} respectively.  We
+can see that at each splitpoint group we double the total number of buckets
+from the previous group but in an incremental phase.  The bucket pages
+allocated within one phase of a splitpoint group will appear consecutively in
+the index.  This addressing scheme allows the physical location of a bucket
+page to be computed from the bucket number relatively easily, using only a
+small amount of control information.  If we look at the function
+_hash_spareindex for a given bucket number we first compute the
+splitpoint group it belongs to and then the phase to which the bucket belongs
+to.  Adding them we get the global splitpoint phase number S to which the
+bucket belongs and then simply add "hashm_spares[S] + 1" (where hashm_spares[]
+is an array stored in the metapage) with given bucket number to compute its
+physical address.  The hashm_spares[S] can be interpreted as the total number
+of overflow pages that have been allocated before the bucket pages of
+splitpoint phase S.  The hashm_spares[0] is always 0, so that buckets 0 and 1
+always appear at block numbers 1 and 2, just after the meta page.  We always
+have hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
+former.  The difference between the two represents the number of overflow pages
+appearing between the bucket page groups of splitpoints phase N and N+1.
 (Note: the above describes what happens when filling an initially minimally
-sized hash index.  In practice, we try to estimate the required index size
-and allocate a suitable number of splitpoints immediately, to avoid
+sized hash index.  In practice, we try to estimate the required index size and
+allocate a suitable number of splitpoints phases immediately, to avoid
 expensive re-splitting during initial index build.)
 
 When S splitpoints exist altogether, the array entries hashm_spares[0]
 through hashm_spares[S] are valid; hashm_spares[S] records the current
 total number of overflow pages.  New overflow pages are created as needed
 at the end of the index, and recorded by incrementing hashm_spares[S].
-When it is time to create a new splitpoint's worth of bucket pages, we
+When it is time to create a new splitpoint phase's worth of bucket pages, we
 copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
 stored in the hashm_ovflpoint field of the meta page).  This has the
 effect of reserving the correct number of bucket pages at the end of the
@@ -101,7 +117,7 @@ We have to allow the case "greater than" because it's possible that during
 an index extension we crash after allocating filesystem space and before
 updating the metapage.  Note that on filesystems that allow "holes" in
 files, it's entirely likely that pages before the logical EOF are not yet
-allocated: when we allocate a new splitpoint's worth of bucket pages, we
+allocated: when we allocate a new splitpoint phase's worth of bucket pages, we
 physically zero the last such page to force the EOF up, and the first such
 page will be used immediately, but the intervening pages are not written
 until needed.
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index a3cae21..fe0b4ef 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -49,7 +49,7 @@ bitno_to_blkno(HashMetaPage metap, uint32 ovflbitnum)
 	 * Convert to absolute page number by adding the number of bucket pages
 	 * that exist before this split point.
 	 */
-	return (BlockNumber) ((1 << i) + ovflbitnum);
+	return (BlockNumber) (_hash_get_totalbuckets(i) + ovflbitnum);
 }
 
 /*
@@ -67,14 +67,15 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
 	/* Determine the split number containing this page */
 	for (i = 1; i <= splitnum; i++)
 	{
-		if (ovflblkno <= (BlockNumber) (1 << i))
+		if (ovflblkno <= (BlockNumber) _hash_get_totalbuckets(i))
 			break;				/* oops */
-		bitnum = ovflblkno - (1 << i);
+		bitnum = ovflblkno - _hash_get_totalbuckets(i);
 
 		/*
 		 * bitnum has to be greater than number of overflow page added in
 		 * previous split point. The overflow page at this splitnum (i) if any
-		 * should start from ((2 ^ i) + metap->hashm_spares[i - 1] + 1).
+		 * should start from
+		 * (_hash_get_totalbuckets(i) + metap->hashm_spares[i - 1] + 1).
 		 */
 		if (bitnum > metap->hashm_spares[i - 1] &&
 			bitnum <= metap->hashm_spares[i])
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 61ca2ec..d7374fa 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -502,14 +502,15 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	Page		page;
 	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
+	uint32		spare_index;
 	uint32		i;
 
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
-	 * next power of 2, however, and always force at least 2 bucket pages. The
-	 * upper limit is determined by considerations explained in
+	 * total number of buckets which has to be allocated before using its
+	 * _hashm_spares index slot. However always force at least 2 bucket pages.
+	 * The upper limit is determined by considerations explained in
 	 * _hash_expandtable().
 	 */
 	dnumbuckets = num_tuples / ffactor;
@@ -518,11 +519,10 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	else if (dnumbuckets >= (double) 0x40000000)
 		num_buckets = 0x40000000;
 	else
-		num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
+		num_buckets = _hash_get_totalbuckets(_hash_spareindex(dnumbuckets));
 
-	log2_num_buckets = _hash_log2(num_buckets);
-	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
-	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
+	spare_index = _hash_spareindex(num_buckets);
+	Assert(spare_index < HASH_MAX_SPLITPOINTS);
 
 	page = BufferGetPage(buf);
 	if (initpage)
@@ -563,18 +563,20 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
-	 * blocks 1 to N.  The first freespace bitmap page is in block N+1. Since
-	 * N is a power of 2, we can set the masks this way:
+	 * blocks 1 to N.  The first freespace bitmap page is in block N+1.
 	 */
-	metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
-	metap->hashm_highmask = (num_buckets << 1) - 1;
+	metap->hashm_maxbucket = num_buckets - 1;
+
+	/* set highmask, which should be sufficient to cover num_buckets. */
+	metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;
+	metap->hashm_lowmask = (metap->hashm_highmask >> 1);
 
 	MemSet(metap->hashm_spares, 0, sizeof(metap->hashm_spares));
 	MemSet(metap->hashm_mapp, 0, sizeof(metap->hashm_mapp));
 
 	/* Set up mapping for one spare page after the initial splitpoints */
-	metap->hashm_spares[log2_num_buckets] = 1;
-	metap->hashm_ovflpoint = log2_num_buckets;
+	metap->hashm_spares[spare_index] = 1;
+	metap->hashm_ovflpoint = spare_index;
 	metap->hashm_firstfree = 0;
 
 	/*
@@ -773,25 +775,44 @@ restart_expand:
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to allocate a new batch of bucket pages.
+	 * If the split point is increasing we need to allocate a new batch of
+	 * bucket pages.
 	 */
-	spare_ndx = _hash_log2(new_bucket + 1);
+	spare_ndx = _hash_spareindex(new_bucket + 1);
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
+		uint32		buckets_toadd = 0;
+		uint32		splitpoint_group = 0;
+
 		Assert(spare_ndx == metap->hashm_ovflpoint + 1);
 
 		/*
-		 * The number of buckets in the new splitpoint is equal to the total
-		 * number already in existence, i.e. new_bucket.  Currently this maps
-		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.  We treat allocation of buckets as a
-		 * separate WAL-logged action.  Even if we fail after this operation,
-		 * won't leak bucket pages; rather, the next split will consume this
-		 * space. In any case, even without failure we don't use all the space
-		 * in one split operation.
+		 * The number of buckets in the new splitpoint group is equal to the
+		 * total number already in existence. But we do not allocate them at
+		 * once. Each splitpoint group will have 4 slots, we distribute the
+		 * buckets equally among them. So we allocate only one fourth of total
+		 * buckets in new splitpoint group at a time to consume one phase after
+		 * another. We treat allocation of buckets as a separate WAL-logged
+		 * action. Even if we fail after this operation, won't leak bucket
+		 * pages; rather, the next split will consume this space. In any case,
+		 * even without failure we don't use all the space in one split
+		 * operation.
+		 */
+
+		splitpoint_group = SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(spare_ndx);
+
+		/*
+		 * Each phase in the splitpoint_group will allocate one fourth of total
+		 * buckets to be allocated in splitpoint_group. For
+		 * splitpoint_group < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE, have only
+		 * one phase of allocation so we allocate all of the buckets belonging
+		 * to that buckets at once.
 		 */
-		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
+		buckets_toadd =
+			(splitpoint_group < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) ?
+			(new_bucket) :
+			((1 << (splitpoint_group - 1)) / SPLITPOINT_PHASES_PER_GRP);
+		if (!_hash_alloc_buckets(rel, start_nblkno, buckets_toadd))
 		{
 			/* can't split due to BlockNumber overflow */
 			_hash_relbuf(rel, buf_oblkno);
@@ -836,10 +857,9 @@ restart_expand:
 	}
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to adjust the hashm_spares[] array and
-	 * hashm_ovflpoint so that future overflow pages will be created beyond
-	 * this new batch of bucket pages.
+	 * If the split point is increasing we need to adjust the hashm_spares[]
+	 * array and hashm_ovflpoint so that future overflow pages will be created
+	 * beyond this new batch of bucket pages.
 	 */
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 2e99719..df8c74b 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -150,6 +150,49 @@ _hash_log2(uint32 num)
 }
 
 /*
+ * _hash_spareindex -- returns spare index / global splitpoint phase of the
+ *					   bucket
+ */
+uint32
+_hash_spareindex(uint32 num_bucket)
+{
+	uint32		splitpoint_group;
+
+	splitpoint_group = _hash_log2(num_bucket);
+
+	if (splitpoint_group < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE)
+		return splitpoint_group;
+
+	return TOTAL_SPLITPOINT_PHASES_BEFORE_GROUP(splitpoint_group) +
+		   SPLITPOINT_PHASE_WITHIN_GROUP(splitpoint_group,
+										  num_bucket - 1); /* to 0-based */
+}
+
+/*
+ *	_hash_get_totalbuckets -- returns total number of buckets allocated till
+ *							the given splitpoint phase.
+ */
+uint32
+_hash_get_totalbuckets(uint32 splitpoint_phase)
+{
+	uint32		splitpoint_group;
+
+	splitpoint_group = SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(splitpoint_phase);
+
+	if (splitpoint_group < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE)
+		return (1 << splitpoint_group);
+
+	/*
+	 * total_buckets = total number of buckets before its splitpoint group +
+	 * total buckets within its splitpoint group until given splitpoint_phase.
+	 */
+	return BUCKETS_BEFORE_SP_GRP(splitpoint_group) +
+		   BUCKETS_WITHIN_SP_GRP(splitpoint_group,
+				((splitpoint_phase - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) %
+				  SPLITPOINT_PHASES_PER_GRP) + 1);	/* to 1-based */
+}
+
+/*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
  * If flags is not zero, it is a bitwise OR of the acceptable values of
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index eb1df57..99c291c 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -36,7 +36,7 @@ typedef uint32 Bucket;
 #define InvalidBucket	((Bucket) 0xFFFFFFFF)
 
 #define BUCKET_TO_BLKNO(metap,B) \
-		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
+		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_spareindex((B)+1)-1] : 0)) + 1)
 
 /*
  * Special space for hash index pages.
@@ -176,13 +176,56 @@ typedef HashScanOpaqueData *HashScanOpaque;
  *
  * The limitation on the size of spares[] comes from the fact that there's
  * no point in having more than 2^32 buckets with only uint32 hashcodes.
+ * (Note: The value of HASH_MAX_SPLITPOINTS which is the size of spares[] is
+ * adjusted in such a way to accommodate multi phased allocation of buckets
+ * after SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE).
+ *
  * There is no particular upper limit on the size of mapp[], other than
  * needing to fit into the metapage.  (With 8K block size, 128 bitmaps
  * limit us to 64 GB of overflow space...)
  */
-#define HASH_MAX_SPLITPOINTS		32
 #define HASH_MAX_BITMAPS			128
 
+#define SPLITPOINT_PHASES_PER_GRP	4
+#define SPLITPOINT_PHASE_MASK		(SPLITPOINT_PHASES_PER_GRP - 1)
+#define SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE 10
+#define HASH_MAX_SPLITPOINTS \
+		((32 - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) * \
+		 SPLITPOINT_PHASES_PER_GRP) + \
+		 SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE
+
+#define TOTAL_SPLITPOINT_PHASES_BEFORE_GROUP(sp_g) \
+		((((sp_g - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) << 2) + \
+		  SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE))
+
+/*
+ * This is just a trick to save a division operation. If you look into the
+ * bitmap of 0-based bucket_num 2nd and 3rd most significant bit will indicate
+ * which phase of allocation the bucket_num belongs to with in the group. This
+ * is because at every splitpoint group we allocate (2 ^ x) buckets and we have
+ * divided the allocation process into 4 equal phases. This macro returns value
+ * from 0 to 3.
+ */
+#define SPLITPOINT_PHASE_WITHIN_GROUP(sp_g, bucket_num) \
+		(((bucket_num) >> (sp_g - 3)) & SPLITPOINT_PHASE_MASK)
+
+/*
+ * At every splitpoint group we double the total number of buckets. So at
+ * splitpoint group sp_g we allocate (1 << (sp_g -1)) buckets as we will have
+ * same number of buckets already allocated before this group. For spitpoint
+ * groups >= SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE we allocate buckets in 4
+ * equal phases hence we allocate ((1 << (sp_g - 1)) >> 2) buckets per phase.
+ */
+#define BUCKETS_BEFORE_SP_GRP(sp_g) (1 << (sp_g - 1))
+#define BUCKETS_WITHIN_SP_GRP(sp_g, nphase) \
+									((nphase) * ((1 << (sp_g - 1)) >> 2))
+
+#define SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(sp_phase) \
+		((sp_phase < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) ? \
+		 (sp_phase) : \
+		 (((sp_phase - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) >> 2) + \
+		  SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE))
+
 typedef struct HashMetaPageData
 {
 	uint32		hashm_magic;	/* magic no. for hash tables */
@@ -382,6 +425,8 @@ extern uint32 _hash_datum2hashkey_type(Relation rel, Datum key, Oid keytype);
 extern Bucket _hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
 					 uint32 highmask, uint32 lowmask);
 extern uint32 _hash_log2(uint32 num);
+extern uint32 _hash_spareindex(uint32 num_bucket);
+extern uint32 _hash_get_totalbuckets(uint32 splitpoint_phase);
 extern void _hash_checkpage(Relation rel, Buffer buf, int flags);
 extern uint32 _hash_get_indextuple_hashkey(IndexTuple itup);
 extern bool _hash_convert_tuple(Relation index,
sortbuild_hash_B_2.patch (application/octet-stream)
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index 0e0f393..18a788f 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -37,7 +37,7 @@ struct HSpool
 {
 	Tuplesortstate *sortstate;	/* state data for tuplesort.c */
 	Relation	index;
-	uint32		hash_mask;		/* bitmask for hash codes */
+	uint32		hash_mod;		/* modulus for hash codes */
 };
 
 
@@ -52,15 +52,12 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 	hspool->index = index;
 
 	/*
-	 * Determine the bitmask for hash code values.  Since there are currently
-	 * num_buckets buckets in the index, the appropriate mask can be computed
-	 * as follows.
-	 *
-	 * Note: at present, the passed-in num_buckets is always a power of 2, so
-	 * we could just compute num_buckets - 1.  We prefer not to assume that
-	 * here, though.
+	 * At this point max_buckets in hash index is num_buckets - 1.
+	 * The "hash key mod num_buckets" will indicate which bucket does the
+	 * hash key belongs to, and will be used to sort the index tuples based on
+	 * their bucket.
 	 */
-	hspool->hash_mask = (((uint32) 1) << _hash_log2(num_buckets)) - 1;
+	hspool->hash_mod = num_buckets;
 
 	/*
 	 * We size the sort area as maintenance_work_mem rather than work_mem to
@@ -69,7 +66,7 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 	 */
 	hspool->sortstate = tuplesort_begin_index_hash(heap,
 												   index,
-												   hspool->hash_mask,
+												   hspool->hash_mod,
 												   maintenance_work_mem,
 												   false);
 
@@ -122,7 +119,7 @@ _h_indexbuild(HSpool *hspool, Relation heapRel)
 #ifdef USE_ASSERT_CHECKING
 		uint32		lasthashkey = hashkey;
 
-		hashkey = _hash_get_indextuple_hashkey(itup) & hspool->hash_mask;
+		hashkey = _hash_get_indextuple_hashkey(itup) % hspool->hash_mod;
 		Assert(hashkey >= lasthashkey);
 #endif
 
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index e1e692d..8ff50a1 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -473,7 +473,7 @@ struct Tuplesortstate
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 
 	/* These are specific to the index_hash subcase: */
-	uint32		hash_mask;		/* mask for sortable part of hash code */
+	uint32		hash_mod;		/* modulus for sortable part of hash code */
 
 	/*
 	 * These variables are specific to the Datum case; they are set by
@@ -991,7 +991,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 Tuplesortstate *
 tuplesort_begin_index_hash(Relation heapRel,
 						   Relation indexRel,
-						   uint32 hash_mask,
+						   uint32 hash_mod,
 						   int workMem, bool randomAccess)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
@@ -1002,8 +1002,8 @@ tuplesort_begin_index_hash(Relation heapRel,
 #ifdef TRACE_SORT
 	if (trace_sort)
 		elog(LOG,
-		"begin index sort: hash_mask = 0x%x, workMem = %d, randomAccess = %c",
-			 hash_mask,
+		"begin index sort: hash_mod = 0x%x, workMem = %d, randomAccess = %c",
+			 hash_mod,
 			 workMem, randomAccess ? 't' : 'f');
 #endif
 
@@ -1017,7 +1017,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	state->heapRel = heapRel;
 	state->indexRel = indexRel;
 
-	state->hash_mask = hash_mask;
+	state->hash_mod = hash_mod;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -4167,9 +4167,9 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 	 * that the first column of the index tuple is the hash key.
 	 */
 	Assert(!a->isnull1);
-	hash1 = DatumGetUInt32(a->datum1) & state->hash_mask;
+	hash1 = DatumGetUInt32(a->datum1) % state->hash_mod;
 	Assert(!b->isnull1);
-	hash2 = DatumGetUInt32(b->datum1) & state->hash_mask;
+	hash2 = DatumGetUInt32(b->datum1) % state->hash_mod;
 
 	if (hash1 > hash2)
 		return 1;
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 5b3f475..03594e7 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -72,7 +72,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 							int workMem, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
 						   Relation indexRel,
-						   uint32 hash_mask,
+						   uint32 hash_mod,
 						   int workMem, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 					  Oid sortOperator, Oid sortCollation,
sortbuild_hash_A.patchapplication/octet-stream; name=sortbuild_hash_A.patchDownload
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index 0e0f393..04d9c46 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -37,7 +37,15 @@ struct HSpool
 {
 	Tuplesortstate *sortstate;	/* state data for tuplesort.c */
 	Relation	index;
-	uint32		hash_mask;		/* bitmask for hash codes */
+
+	/*
+	 * We sort the hash keys based on the buckets they belong to. The fields
+	 * below are used by _hash_hashkey2bucket to determine the bucket of a
+	 * given hash key.
+	 */
+	uint32		high_mask;
+	uint32		low_mask;
+	uint32		max_buckets;
 };
 
 
@@ -56,11 +64,12 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 	 * num_buckets buckets in the index, the appropriate mask can be computed
 	 * as follows.
 	 *
-	 * Note: at present, the passed-in num_buckets is always a power of 2, so
-	 * we could just compute num_buckets - 1.  We prefer not to assume that
-	 * here, though.
+	 * Note: this hash mask calculation must stay in sync with the similar
+	 * calculation in _hash_init_metabuffer.
 	 */
-	hspool->hash_mask = (((uint32) 1) << _hash_log2(num_buckets)) - 1;
+	hspool->high_mask = (((uint32) 1) << _hash_log2(num_buckets)) - 1;
+	hspool->low_mask = (hspool->high_mask >> 1);
+	hspool->max_buckets = num_buckets - 1;
 
 	/*
 	 * We size the sort area as maintenance_work_mem rather than work_mem to
@@ -69,7 +78,9 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 	 */
 	hspool->sortstate = tuplesort_begin_index_hash(heap,
 												   index,
-												   hspool->hash_mask,
+												   hspool->high_mask,
+												   hspool->low_mask,
+												   hspool->max_buckets,
 												   maintenance_work_mem,
 												   false);
 
@@ -122,7 +133,9 @@ _h_indexbuild(HSpool *hspool, Relation heapRel)
 #ifdef USE_ASSERT_CHECKING
 		uint32		lasthashkey = hashkey;
 
-		hashkey = _hash_get_indextuple_hashkey(itup) & hspool->hash_mask;
+		hashkey = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+									   hspool->max_buckets, hspool->high_mask,
+									   hspool->low_mask);
 		Assert(hashkey >= lasthashkey);
 #endif
 
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index e1e692d..5b8aad1 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -127,6 +127,7 @@
 
 #include "access/htup_details.h"
 #include "access/nbtree.h"
+#include "access/hash.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
 #include "commands/tablespace.h"
@@ -473,7 +474,9 @@ struct Tuplesortstate
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 
 	/* These are specific to the index_hash subcase: */
-	uint32		hash_mask;		/* mask for sortable part of hash code */
+	uint32		high_mask;		/* masks for sortable part of hash code */
+	uint32		low_mask;
+	uint32		max_buckets;
 
 	/*
 	 * These variables are specific to the Datum case; they are set by
@@ -991,7 +994,9 @@ tuplesort_begin_index_btree(Relation heapRel,
 Tuplesortstate *
 tuplesort_begin_index_hash(Relation heapRel,
 						   Relation indexRel,
-						   uint32 hash_mask,
+						   uint32 high_mask,
+						   uint32 low_mask,
+						   uint32 max_buckets,
 						   int workMem, bool randomAccess)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
@@ -1002,8 +1007,11 @@ tuplesort_begin_index_hash(Relation heapRel,
 #ifdef TRACE_SORT
 	if (trace_sort)
 		elog(LOG,
-		"begin index sort: hash_mask = 0x%x, workMem = %d, randomAccess = %c",
-			 hash_mask,
+		"begin index sort: high_mask = 0x%x, low_mask = 0x%x, "
+		"max_buckets = 0x%x, workMem = %d, randomAccess = %c",
+			 high_mask,
+			 low_mask,
+			 max_buckets,
 			 workMem, randomAccess ? 't' : 'f');
 #endif
 
@@ -1017,7 +1025,9 @@ tuplesort_begin_index_hash(Relation heapRel,
 	state->heapRel = heapRel;
 	state->indexRel = indexRel;
 
-	state->hash_mask = hash_mask;
+	state->high_mask = high_mask;
+	state->low_mask = low_mask;
+	state->max_buckets = max_buckets;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -4157,8 +4167,8 @@ static int
 comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 					  Tuplesortstate *state)
 {
-	uint32		hash1;
-	uint32		hash2;
+	Bucket		bucket1;
+	Bucket		bucket2;
 	IndexTuple	tuple1;
 	IndexTuple	tuple2;
 
@@ -4167,13 +4177,14 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 	 * that the first column of the index tuple is the hash key.
 	 */
 	Assert(!a->isnull1);
-	hash1 = DatumGetUInt32(a->datum1) & state->hash_mask;
+	bucket1 = _hash_hashkey2bucket(DatumGetUInt32(a->datum1), state->max_buckets,
+								 state->high_mask, state->low_mask);
 	Assert(!b->isnull1);
-	hash2 = DatumGetUInt32(b->datum1) & state->hash_mask;
-
-	if (hash1 > hash2)
+	bucket2 = _hash_hashkey2bucket(DatumGetUInt32(b->datum1), state->max_buckets,
+								 state->high_mask, state->low_mask);
+	if (bucket1 > bucket2)
 		return 1;
-	else if (hash1 < hash2)
+	else if (bucket1 < bucket2)
 		return -1;
 
 	/*
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 5b3f475..9719db4 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -72,7 +72,9 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 							int workMem, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
 						   Relation indexRel,
-						   uint32 hash_mask,
+						   uint32 high_mask,
+						   uint32 low_mask,
+						   uint32 max_buckets,
 						   int workMem, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 					  Oid sortOperator, Oid sortCollation,
#27Robert Haas
robertmhaas@gmail.com
In reply to: Mithun Cy (#18)
Re: [POC] A better way to expand hash indexes.

On Tue, Mar 28, 2017 at 1:13 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

B. In tuple sort we can use hash function bucket = hash_key %
num_buckets instead of existing one which does bitwise "and" to
determine the bucket of hash key. This way we will not wrongly assign
buckets beyond max_buckets and sorted hash keys will be in sync with
actual insertion order of _hash_doinsert.

I think approach B is incorrect. Suppose we have 1536 buckets and
hash values 2048, 2049, 4096, 4097, 6144, 6145, 8192, and 8193. If I
understand correctly, each of these values should be mapped either to
bucket 0 or to bucket 1, and the goal of the sort is to put all of the
bucket 0 tuples before all of the bucket 1 tuples, so that we get
physical locality when inserting. With approach A, the sort keys will
match the bucket numbers -- we'll be sorting the list 0, 1, 0, 1, 0,
1, 0, 1 -- and we will end up doing all of the inserts to bucket 0
before any of the inserts to bucket 1. With approach B, we'll be
sorting 512, 513, 1024, 1025, 0, 1, 512, 513 and will end up
alternating inserts to bucket 0 with inserts to bucket 1.
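
For concreteness, here is a small standalone sketch (not PostgreSQL code)
that reproduces those bucket assignments.  It assumes 1536 buckets with
maxbucket = 1535, highmask = 2047 and lowmask = 1023, the values the
mask-based scheme behind approach A would use, and it mirrors the logic of
_hash_hashkey2bucket:

#include <stdio.h>

/* mask-based bucket assignment, mirroring _hash_hashkey2bucket (approach A) */
static unsigned
bucket_by_masks(unsigned hashkey, unsigned maxbucket,
                unsigned highmask, unsigned lowmask)
{
    unsigned    bucket = hashkey & highmask;

    if (bucket > maxbucket)
        bucket = bucket & lowmask;
    return bucket;
}

int
main(void)
{
    unsigned    keys[] = {2048, 2049, 4096, 4097, 6144, 6145, 8192, 8193};
    int         i;

    for (i = 0; i < 8; i++)
        printf("hash %u -> approach A bucket %u, approach B sort key %u\n",
               keys[i],
               bucket_by_masks(keys[i], 1535, 2047, 1023),
               keys[i] % 1536);     /* approach B: hash % num_buckets */
    return 0;
}

It prints 0, 1, 0, 1, 0, 1, 0, 1 for approach A and 512, 513, 1024, 1025,
0, 1, 512, 513 for approach B, which is exactly the mismatch described
above.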

To put that another way, see this comment at the top of hashsort.c:

* When building a very large hash index, we pre-sort the tuples by bucket
* number to improve locality of access to the index, and thereby avoid
* thrashing. We use tuplesort.c to sort the given index tuples into order.

So, you can't just decide to sort on a random number, which is what
approach B effectively does. Or, you can, but it completely misses
the point of sorting in the first place.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#28Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Robert Haas (#27)
Re: [POC] A better way to expand hash indexes.

On Thu, Mar 30, 2017 at 7:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I think approach B is incorrect. Suppose we have 1536 buckets and
hash values 2048, 2049, 4096, 4097, 6144, 6145, 8192, and 8193. If I
understand correctly, each of these values should be mapped either to
bucket 0 or to bucket 1, and the goal of the sort is to put all of the
bucket 0 tuples before all of the bucket 1 tuples, so that we get
physical locality when inserting. With approach A, the sort keys will
match the bucket numbers -- we'll be sorting the list 0, 1, 0, 1, 0,
1, 0, 1 -- and we will end up doing all of the inserts to bucket 0
before any of the inserts to bucket 1. With approach B, we'll be
sorting 512, 513, 1024, 1025, 0, 1, 512, 513 and will end up
alternating inserts to bucket 0 with inserts to bucket 1.

Oops, sorry, yes, the two denominators are different (one used on insert
and another used for the sort keys), so we end up with different bucket
numbers. In patch B I should have taken the next power of 2 above 1536 as
the denominator and computed the mod with that; if the result exceeds the
maximum bucket number (1535), halve the denominator and retake the mod to
get a bucket within 1536. Which is effectively what Patch A already does.
Approach B is a blunder, I apologize for that mistake. I think Patch A
should be considered. If adding the new members to struct Tuplesortstate
is a concern, I will rewrite Patch B as described above.
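
In other words, a rewritten Patch B would compute the sort key roughly like
this (only a sketch; the 1536/2048 values are illustrative and d is a
hypothetical local variable):

    uint32      d = 2048;          /* next power of 2 >= num_buckets (1536) */
    uint32      bucket = hashkey % d;

    if (bucket > 1535)             /* beyond maxbucket: halve the denominator */
        bucket = hashkey % (d / 2);

Since d is a power of two, "hashkey % d" is the same as "hashkey & (d - 1)",
so this ends up equivalent to what _hash_hashkey2bucket already does with
the masks in Patch A.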

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#29Robert Haas
robertmhaas@gmail.com
In reply to: Mithun Cy (#26)
Re: [POC] A better way to expand hash indexes.

On Wed, Mar 29, 2017 at 8:03 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

Thanks, Amit for a detailed review.

I think that the macros in hash.h need some more work:

- Pretty much any time you use the argument of a macro, you need to
parenthesize it in the macro definition to avoid surprises if the
macros is called using an expression. That isn't done consistently
here.

- The macros make extensive use of magic numbers like 1, 2, and 3. I
suggest something like:

#define SPLITPOINT_PHASE_BITS 2
#define SPLITPOINT_PHASES_PER_GROUP (1 << SPLITPOINT_PHASE_BITS)

And then use SPLITPOINT_PHASE_BITS any place where you're currently
saying 2. The reference to 3 is really SPLITPOINT_PHASE_BITS + 1.

- Many of these macros are only used in one place. Maybe just move
the computation to that place and get rid of the macro. For example,
_hash_spareindex() could be written like this:

if (splitpoint_group < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE)
return splitpoint_group;

/* account for single-phase groups */
splitpoint = SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE;

/* account for completed groups */
splitpoint += (splitpoint_group -
SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) << SPLITPOINT_PHASE_BITS;

/* account for phases within current group */
splitpoint += (bucket_num >> (SPLITPOINT_PHASE_BITS + 1)) &
SPLITPOINT_PHASE_MASK;

return splitpoint;

That eliminates the only use of two complicated macros and is in my
opinion more clear than what you've currently got.

- Some of these macros lack clear comments explaining their purpose.

- Some of them don't include HASH anywhere in the name, which is
essential for a header that may easily be included by non-hash index
code.

- The names don't all follow a consistent format. Maybe that's too
much to hope for at some level, but I think they could be more
consistent than they are.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#30Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Robert Haas (#29)
1 attachment(s)
Re: [POC] A better way to expand hash indexes.

Thanks Robert, I have tried to fix the comments given as below.

On Thu, Mar 30, 2017 at 9:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I think that the macros in hash.h need some more work:

- Pretty much any time you use the argument of a macro, you need to
parenthesize it in the macro definition to avoid surprises if the
macros is called using an expression. That isn't done consistently
here.

-- I have fixed this in the latest patch.

- The macros make extensive use of magic numbers like 1, 2, and 3. I
suggest something like:

#define SPLITPOINT_PHASE_BITS 2
#define SPLITPOINT_PHASES_PER_GROUP (1 << SPLITPOINT_PHASE_BITS)

And then use SPLITPOINT_PHASE_BITS any place where you're currently
saying 2. The reference to 3 is really SPLITPOINT_PHASE_BITS + 1.

-- Taken; modified accordingly in the latest patch.

- Many of these macros are only used in one place. Maybe just move
the computation to that place and get rid of the macro. For example,
_hash_spareindex() could be written like this:

if (splitpoint_group < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE)
return splitpoint_group;

/* account for single-phase groups */
splitpoint = SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE;

/* account for completed groups */
splitpoint += (splitpoint_group -
SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) << SPLITPOINT_PHASE_BITS;

/* account for phases within current group */
splitpoint += (bucket_num >> (SPLITPOINT_PHASE_BITS + 1)) &
SPLITPOINT_PHASE_MASK;

return splitpoint;

That eliminates the only use of two complicated macros and is in my
opinion more clear than what you've currently got.

-- Taken, also rewrote _hash_get_totalbuckets in similar lines.

With that, we end up with only 2 macros that contain any computation:
+/* defines max number of splitpoint phases a hash index can have */
+#define HASH_MAX_SPLITPOINT_GROUP 32
+#define HASH_MAX_SPLITPOINTS \
+ (((HASH_MAX_SPLITPOINT_GROUP - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) * \
+  HASH_SPLITPOINT_PHASES_PER_GRP) + \
+ HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE)
+
+/* given a splitpoint phase get its group */
+#define HASH_SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(sp_phase) \
+ (((sp_phase) < HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) ? \
+ (sp_phase) : \
+ ((((sp_phase) - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) >> \
+   HASH_SPLITPOINT_PHASE_BITS) + \
+  HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE))
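
For what it's worth, the arithmetic behind the first macro (not spelled out
in the patch itself) is:

    HASH_MAX_SPLITPOINTS = (32 - 10) * 4 + 10 = 98

using HASH_MAX_SPLITPOINT_GROUP = 32, HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE
= 10 and HASH_SPLITPOINT_PHASES_PER_GRP = 4, so the hashm_spares[] array
grows from 32 to 98 entries.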

- Some of these macros lack clear comments explaining their purpose.

-- I have written some comments to explain the use of the macros.

- Some of them don't include HASH anywhere in the name, which is
essential for a header that may easily be included by non-hash index
code.

-- Fixed, all MACROS are prefixed with HASH

- The names don't all follow a consistent format. Maybe that's too
much to hope for at some level, but I think they could be more
consistent than they are.

-- Fixed; apart from the old HASH_MAX_SPLITPOINTS, the rest all now have the
prefix HASH_SPLITPOINT.

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Attachments:

yet_another_expand_hashbucket_efficiently_12.patchapplication/octet-stream; name=yet_another_expand_hashbucket_efficiently_12.patchDownload
diff --git a/contrib/pageinspect/expected/hash.out b/contrib/pageinspect/expected/hash.out
index 3ba01f6..e287093 100644
--- a/contrib/pageinspect/expected/hash.out
+++ b/contrib/pageinspect/expected/hash.out
@@ -51,13 +51,13 @@ bsize     | 8152
 bmsize    | 4096
 bmshift   | 15
 maxbucket | 3
-highmask  | 7
-lowmask   | 3
+highmask  | 3
+lowmask   | 1
 ovflpoint | 2
 firstfree | 0
 nmaps     | 1
 procid    | 450
-spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 
 SELECT magic, version, ntuples, bsize, bmsize, bmshift, maxbucket, highmask,
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 9f41bb2..b3d0056 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -667,11 +667,11 @@ bmshift   | 15
 maxbucket | 12512
 highmask  | 16383
 lowmask   | 8191
-ovflpoint | 14
+ovflpoint | 28
 firstfree | 1204
 nmaps     | 1
 procid    | 450
-spares    | {0,0,0,0,0,0,1,1,1,1,1,4,59,704,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,0,0,0,0,1,1,1,1,1,1,1,1,3,4,4,4,45,55,58,59,508,567,628,704,1193,1202,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {65,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 </screen>
      </para>
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..c8a0ec7 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -58,35 +58,51 @@ rules to support a variable number of overflow pages while not having to
 move primary bucket pages around after they are created.
 
 Primary bucket pages (henceforth just "bucket pages") are allocated in
-power-of-2 groups, called "split points" in the code.  Buckets 0 and 1
-are created when the index is initialized.  At the first split, buckets 2
-and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
-when bucket 8 is needed, buckets 8-15 are allocated; etc.  All the bucket
-pages of a power-of-2 group appear consecutively in the index.  This
-addressing scheme allows the physical location of a bucket page to be
-computed from the bucket number relatively easily, using only a small
-amount of control information.  We take the log2() of the bucket number
-to determine which split point S the bucket belongs to, and then simply
-add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
-metapage) to compute the physical address.  hashm_spares[S] can be
-interpreted as the total number of overflow pages that have been allocated
-before the bucket pages of splitpoint S.  hashm_spares[0] is always 0,
-so that buckets 0 and 1 (which belong to splitpoint 0) always appear at
-block numbers 1 and 2, just after the meta page.  We always have
-hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
-former.  The difference between the two represents the number of overflow
-pages appearing between the bucket page groups of splitpoints N and N+1.
-
+power-of-2 groups, called "split points" in the code.  That means at every new
+splitpoint we double the existing number of buckets.  Allocating huge chunks
+of bucket pages all at once isn't optimal, and it would take ages to consume
+them.  To avoid this exponential growth of index size, we break up the
+allocation of buckets at a splitpoint into 4 equal phases.  If (2 ^ x) is the
+total number of buckets to be allocated at a splitpoint (from now on we shall
+call this a splitpoint group), then we allocate one fourth (2 ^ (x - 2)) of
+the total buckets in each phase of the splitpoint group.  The next quarter is
+allocated only once the buckets of the previous phase have been consumed.
+For the initial splitpoint groups < 10 we allocate all of their buckets in a
+single phase, as the number of buckets allocated in those groups is small.
+For groups >= 10 the allocation is distributed among four equal phases.  At
+group 10 we allocate (2 ^ 9) buckets in 4 phases {2 ^ 7, 2 ^ 7, 2 ^ 7, 2 ^ 7},
+where the numbers in curly braces indicate the number of buckets allocated
+within each phase of the splitpoint group.  For splitpoint groups 11 and 12
+the allocation phases are {2 ^ 8, 2 ^ 8, 2 ^ 8, 2 ^ 8} and
+{2 ^ 9, 2 ^ 9, 2 ^ 9, 2 ^ 9} respectively.  So at each splitpoint group we
+still double the total number of buckets relative to the previous group, but
+incrementally, one phase at a time.  The bucket pages
+allocated within one phase of a splitpoint group will appear consecutively in
+the index.  This addressing scheme allows the physical location of a bucket
+page to be computed from the bucket number relatively easily, using only a
+small amount of control information.  In the function _hash_spareindex, for
+a given bucket number we first compute the splitpoint group it belongs to and
+then the phase within that group to which the bucket belongs.  Adding the two
+gives the global splitpoint phase number S to which the bucket belongs; we
+then simply add "hashm_spares[S] + 1" (where hashm_spares[] is an array
+stored in the metapage) to the given bucket number to compute its physical
+address.  hashm_spares[S] can be interpreted as the total number of overflow
+pages that have been allocated before the bucket pages of splitpoint phase S.
+hashm_spares[0] is always 0, so that buckets 0 and 1 always appear at block
+numbers 1 and 2, just after the meta page.  We always have
+hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
+former.  The difference between the two represents the number of overflow
+pages appearing between the bucket page groups of splitpoint phases N and N+1.
 (Note: the above describes what happens when filling an initially minimally
-sized hash index.  In practice, we try to estimate the required index size
-and allocate a suitable number of splitpoints immediately, to avoid
+sized hash index.  In practice, we try to estimate the required index size and
+allocate a suitable number of splitpoint phases immediately, to avoid
 expensive re-splitting during initial index build.)
 
 When S splitpoints exist altogether, the array entries hashm_spares[0]
 through hashm_spares[S] are valid; hashm_spares[S] records the current
 total number of overflow pages.  New overflow pages are created as needed
 at the end of the index, and recorded by incrementing hashm_spares[S].
-When it is time to create a new splitpoint's worth of bucket pages, we
+When it is time to create a new splitpoint phase's worth of bucket pages, we
 copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
 stored in the hashm_ovflpoint field of the meta page).  This has the
 effect of reserving the correct number of bucket pages at the end of the
@@ -101,7 +117,7 @@ We have to allow the case "greater than" because it's possible that during
 an index extension we crash after allocating filesystem space and before
 updating the metapage.  Note that on filesystems that allow "holes" in
 files, it's entirely likely that pages before the logical EOF are not yet
-allocated: when we allocate a new splitpoint's worth of bucket pages, we
+allocated: when we allocate a new splitpoint phase's worth of bucket pages, we
 physically zero the last such page to force the EOF up, and the first such
 page will be used immediately, but the intervening pages are not written
 until needed.
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index a3cae21..fe0b4ef 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -49,7 +49,7 @@ bitno_to_blkno(HashMetaPage metap, uint32 ovflbitnum)
 	 * Convert to absolute page number by adding the number of bucket pages
 	 * that exist before this split point.
 	 */
-	return (BlockNumber) ((1 << i) + ovflbitnum);
+	return (BlockNumber) (_hash_get_totalbuckets(i) + ovflbitnum);
 }
 
 /*
@@ -67,14 +67,15 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
 	/* Determine the split number containing this page */
 	for (i = 1; i <= splitnum; i++)
 	{
-		if (ovflblkno <= (BlockNumber) (1 << i))
+		if (ovflblkno <= (BlockNumber) _hash_get_totalbuckets(i))
 			break;				/* oops */
-		bitnum = ovflblkno - (1 << i);
+		bitnum = ovflblkno - _hash_get_totalbuckets(i);
 
 		/*
 		 * bitnum has to be greater than number of overflow page added in
 		 * previous split point. The overflow page at this splitnum (i) if any
-		 * should start from ((2 ^ i) + metap->hashm_spares[i - 1] + 1).
+		 * should start from
+		 * (_hash_get_totalbuckets(i) + metap->hashm_spares[i - 1] + 1).
 		 */
 		if (bitnum > metap->hashm_spares[i - 1] &&
 			bitnum <= metap->hashm_spares[i])
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 61ca2ec..694ccd7 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -502,14 +502,15 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	Page		page;
 	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
+	uint32		spare_index;
 	uint32		i;
 
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
-	 * next power of 2, however, and always force at least 2 bucket pages. The
-	 * upper limit is determined by considerations explained in
+	 * total number of buckets which has to be allocated before using its
+	 * hashm_spares index slot.  However, always force at least 2 bucket pages.
+	 * The upper limit is determined by considerations explained in
 	 * _hash_expandtable().
 	 */
 	dnumbuckets = num_tuples / ffactor;
@@ -518,11 +519,10 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	else if (dnumbuckets >= (double) 0x40000000)
 		num_buckets = 0x40000000;
 	else
-		num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
+		num_buckets = _hash_get_totalbuckets(_hash_spareindex(dnumbuckets));
 
-	log2_num_buckets = _hash_log2(num_buckets);
-	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
-	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
+	spare_index = _hash_spareindex(num_buckets);
+	Assert(spare_index < HASH_MAX_SPLITPOINTS);
 
 	page = BufferGetPage(buf);
 	if (initpage)
@@ -563,18 +563,20 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
-	 * blocks 1 to N.  The first freespace bitmap page is in block N+1. Since
-	 * N is a power of 2, we can set the masks this way:
+	 * blocks 1 to N.  The first freespace bitmap page is in block N+1.
 	 */
-	metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
-	metap->hashm_highmask = (num_buckets << 1) - 1;
+	metap->hashm_maxbucket = num_buckets - 1;
+
+	/* set highmask, which should be sufficient to cover num_buckets. */
+	metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;
+	metap->hashm_lowmask = (metap->hashm_highmask >> 1);
 
 	MemSet(metap->hashm_spares, 0, sizeof(metap->hashm_spares));
 	MemSet(metap->hashm_mapp, 0, sizeof(metap->hashm_mapp));
 
 	/* Set up mapping for one spare page after the initial splitpoints */
-	metap->hashm_spares[log2_num_buckets] = 1;
-	metap->hashm_ovflpoint = log2_num_buckets;
+	metap->hashm_spares[spare_index] = 1;
+	metap->hashm_ovflpoint = spare_index;
 	metap->hashm_firstfree = 0;
 
 	/*
@@ -773,25 +775,44 @@ restart_expand:
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to allocate a new batch of bucket pages.
+	 * If the split point is increasing we need to allocate a new batch of
+	 * bucket pages.
 	 */
-	spare_ndx = _hash_log2(new_bucket + 1);
+	spare_ndx = _hash_spareindex(new_bucket + 1);
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
+		uint32		buckets_toadd = 0;
+		uint32		splitpoint_group = 0;
+
 		Assert(spare_ndx == metap->hashm_ovflpoint + 1);
 
 		/*
-		 * The number of buckets in the new splitpoint is equal to the total
-		 * number already in existence, i.e. new_bucket.  Currently this maps
-		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.  We treat allocation of buckets as a
-		 * separate WAL-logged action.  Even if we fail after this operation,
-		 * won't leak bucket pages; rather, the next split will consume this
-		 * space. In any case, even without failure we don't use all the space
-		 * in one split operation.
+		 * The number of buckets in the new splitpoint group is equal to the
+		 * total number already in existence. But we do not allocate them at
+		 * once. Each splitpoint group will have 4 slots, we distribute the
+		 * buckets equally among them. So we allocate only one fourth of total
+		 * buckets in new splitpoint group at a time to consume one phase after
+		 * another. We treat allocation of buckets as a separate WAL-logged
+		 * action. Even if we fail after this operation, won't leak bucket
+		 * pages; rather, the next split will consume this space. In any case,
+		 * even without failure we don't use all the space in one split
+		 * operation.
+		 */
+
+		splitpoint_group = HASH_SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(spare_ndx);
+
+		/*
+		 * Each phase in the splitpoint_group will allocate one fourth of total
+		 * buckets to be allocated in splitpoint_group. For
+		 * splitpoint_group < HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE, we have
+		 * only one phase of allocation, so we allocate all of the buckets
+		 * belonging to that group at once.
 		 */
-		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
+		buckets_toadd =
+			(splitpoint_group < HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) ?
+			(new_bucket) :
+			((1 << (splitpoint_group - 1)) / HASH_SPLITPOINT_PHASES_PER_GRP);
+		if (!_hash_alloc_buckets(rel, start_nblkno, buckets_toadd))
 		{
 			/* can't split due to BlockNumber overflow */
 			_hash_relbuf(rel, buf_oblkno);
@@ -836,10 +857,9 @@ restart_expand:
 	}
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to adjust the hashm_spares[] array and
-	 * hashm_ovflpoint so that future overflow pages will be created beyond
-	 * this new batch of bucket pages.
+	 * If the split point is increasing we need to adjust the hashm_spares[]
+	 * array and hashm_ovflpoint so that future overflow pages will be created
+	 * beyond this new batch of bucket pages.
 	 */
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 2e99719..0d99051 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -150,6 +150,68 @@ _hash_log2(uint32 num)
 }
 
 /*
+ * _hash_spareindex -- returns spare index / global splitpoint phase of the
+ *					   bucket
+ */
+uint32
+_hash_spareindex(uint32 num_bucket)
+{
+	uint32		splitpoint_group;
+	uint32		splitpoint_phases;
+
+	splitpoint_group = _hash_log2(num_bucket);
+
+	if (splitpoint_group < HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE)
+		return splitpoint_group;
+
+	/* account for single-phase groups */
+	splitpoint_phases = HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE;
+
+	/* account for multi-phase groups before splitpoint_group */
+	splitpoint_phases +=
+		((splitpoint_group - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) <<
+		 HASH_SPLITPOINT_PHASE_BITS);
+
+	/* account for phases within current group */
+	splitpoint_phases +=
+		(((num_bucket - 1) >> (HASH_SPLITPOINT_PHASE_BITS + 1)) &
+		 HASH_SPLITPOINT_PHASE_MASK);	/* to 0-based value. */
+
+	return splitpoint_phases;
+}
+
+/*
+ *	_hash_get_totalbuckets -- returns total number of buckets allocated till
+ *							the given splitpoint phase.
+ */
+uint32
+_hash_get_totalbuckets(uint32 splitpoint_phase)
+{
+	uint32		splitpoint_group;
+	uint32		total_buckets;
+	uint32		phases_within_splitpoint_group;
+
+	splitpoint_group =
+		HASH_SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(splitpoint_phase);
+
+	if (splitpoint_group < HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE)
+		return (1 << splitpoint_group);
+
+	/* account for buckets before splitpoint_group */
+	total_buckets = (1 << (splitpoint_group - 1));
+
+	/* account for buckets within splitpoint_group */
+	phases_within_splitpoint_group =
+		(((splitpoint_phase - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) %
+		  HASH_SPLITPOINT_PHASES_PER_GRP) + 1);	/* from 0-based to 1-based */
+	total_buckets +=
+		(((1 << (splitpoint_group - 1)) / HASH_SPLITPOINT_PHASES_PER_GRP) *
+		 phases_within_splitpoint_group);
+
+	return total_buckets;
+}
+
+/*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
  * If flags is not zero, it is a bitwise OR of the acceptable values of
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index eb1df57..6759c83 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -36,7 +36,7 @@ typedef uint32 Bucket;
 #define InvalidBucket	((Bucket) 0xFFFFFFFF)
 
 #define BUCKET_TO_BLKNO(metap,B) \
-		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
+		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_spareindex((B)+1)-1] : 0)) + 1)
 
 /*
  * Special space for hash index pages.
@@ -176,13 +176,36 @@ typedef HashScanOpaqueData *HashScanOpaque;
  *
  * The limitation on the size of spares[] comes from the fact that there's
  * no point in having more than 2^32 buckets with only uint32 hashcodes.
+ * (Note: The value of HASH_MAX_SPLITPOINTS, which is the size of spares[], is
+ * adjusted to accommodate the multi-phased allocation of buckets after
+ * HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE.)
+ *
  * There is no particular upper limit on the size of mapp[], other than
  * needing to fit into the metapage.  (With 8K block size, 128 bitmaps
  * limit us to 64 GB of overflow space...)
  */
-#define HASH_MAX_SPLITPOINTS		32
 #define HASH_MAX_BITMAPS			128
 
+#define HASH_SPLITPOINT_PHASE_BITS	2
+#define HASH_SPLITPOINT_PHASES_PER_GRP	(1 << HASH_SPLITPOINT_PHASE_BITS)
+#define HASH_SPLITPOINT_PHASE_MASK		(HASH_SPLITPOINT_PHASES_PER_GRP - 1)
+#define HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE	10
+
+/* defines max number of splitpoint phases a hash index can have */
+#define HASH_MAX_SPLITPOINT_GROUP	32
+#define HASH_MAX_SPLITPOINTS \
+	(((HASH_MAX_SPLITPOINT_GROUP - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) * \
+	  HASH_SPLITPOINT_PHASES_PER_GRP) + \
+	 HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE)
+
+/* given a splitpoint phase get its group */
+#define HASH_SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(sp_phase) \
+	(((sp_phase) < HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) ? \
+	 (sp_phase) : \
+	 ((((sp_phase) - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) >> \
+	   HASH_SPLITPOINT_PHASE_BITS) + \
+	  HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE))
+
 typedef struct HashMetaPageData
 {
 	uint32		hashm_magic;	/* magic no. for hash tables */
@@ -382,6 +405,8 @@ extern uint32 _hash_datum2hashkey_type(Relation rel, Datum key, Oid keytype);
 extern Bucket _hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
 					 uint32 highmask, uint32 lowmask);
 extern uint32 _hash_log2(uint32 num);
+extern uint32 _hash_spareindex(uint32 num_bucket);
+extern uint32 _hash_get_totalbuckets(uint32 splitpoint_phase);
 extern void _hash_checkpage(Relation rel, Buffer buf, int flags);
 extern uint32 _hash_get_indextuple_hashkey(IndexTuple itup);
 extern bool _hash_convert_tuple(Relation index,
#31Robert Haas
robertmhaas@gmail.com
In reply to: Mithun Cy (#30)
Re: [POC] A better way to expand hash indexes.

On Thu, Mar 30, 2017 at 2:36 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

Thanks Robert, I have tried to fix the comments given as below.

Thanks.

Since this changes the on-disk format of hash indexes in an
incompatible way, I think it should bump HASH_VERSION. (Hopefully
that doesn't preclude REINDEX?) pg_upgrade should probably also be
patched to issue a warning if upgrading from a version < 10 to a
version >= 10 whenever hash indexes are present; I thought we had
similar cases already, but I don't see them at the moment. Maybe we
can get Bruce or someone to give us some advice on exactly what should
be done here.

In a couple of places, you say that a splitpoint group has 4 slots
rather than 4 phases.

I think that in _hash_get_totalbuckets(), you should use blah &
HASH_SPLITPOINT_PHASE_MASK rather than blah %
HASH_SPLITPOINT_PHASES_PER_GRP for consistency with _hash_spareindex
and, perhaps, speed. Similarly, instead of blah /
HASH_SPLITPOINT_PHASES_PER_GRP, use blah >>
HASH_SPLITPOINT_PHASE_BITS.
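
Those substitutions are safe because HASH_SPLITPOINT_PHASES_PER_GRP is
(1 << HASH_SPLITPOINT_PHASE_BITS), a power of two, so for unsigned x:

    x % HASH_SPLITPOINT_PHASES_PER_GRP  ==  x & HASH_SPLITPOINT_PHASE_MASK
    x / HASH_SPLITPOINT_PHASES_PER_GRP  ==  x >> HASH_SPLITPOINT_PHASE_BITS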

buckets_toadd is punctuated oddly. buckets_to_add? Instead of
hand-calculating this, how about calculating it as
_hash_get_totalbuckets(spare_ndx) - _hash_get_totalbuckets(spare_ndx -
1)? That way you reuse the existing logic instead of writing a
slightly different thing in a new place and maybe making a mistake.
If you're going to calculate it, use & and >> rather than % and /, as
above, and drop the parentheses around new_bucket -- this isn't a
macro definition.
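
(As a sanity check of that expression, using the patch's definitions: for
the first phase of splitpoint group 10, i.e. spare_ndx = 10, we get
_hash_get_totalbuckets(10) = 640 and _hash_get_totalbuckets(9) = 512, so
128 = 2 ^ 7 buckets would be allocated, matching the {2 ^ 7, 2 ^ 7, 2 ^ 7,
2 ^ 7} phases described in the README.)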

+ uint32 splitpoint_group = 0;

Don't need the = 0 here; the next reference to this variable is an
unconditional initialization.

+         */
+
+        splitpoint_group = HASH_SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(spare_ndx);

I would delete the blank line.

-         * should start from ((2 ^ i) + metap->hashm_spares[i - 1] + 1).
+         * should start from
+         * (_hash_get_totalbuckets(i) + metap->hashm_spares[i - 1] + 1).

Won't survive pgindent.

-         * The number of buckets in the new splitpoint is equal to the total
-         * number already in existence, i.e. new_bucket.  Currently this maps
-         * one-to-one to blocks required, but someday we may need a more
-         * complicated calculation here.  We treat allocation of buckets as a
-         * separate WAL-logged action.  Even if we fail after this operation,
-         * won't leak bucket pages; rather, the next split will consume this
-         * space. In any case, even without failure we don't use all the space
-         * in one split operation.
+         * The number of buckets in the new splitpoint group is equal to the
+         * total number already in existence. But we do not allocate them at
+         * once. Each splitpoint group will have 4 slots, we distribute the
+         * buckets equally among them. So we allocate only one fourth of total
+         * buckets in new splitpoint group at a time to consume one phase after
+         * another. We treat allocation of buckets as a separate WAL-logged
+         * action. Even if we fail after this operation, won't leak bucket
+         * pages; rather, the next split will consume this space. In any case,
+         * even without failure we don't use all the space in one split
+         * operation.

I think here you should break this into two paragraphs -- start a new
paragraph with the sentence that begins "We treat..."

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#32Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Robert Haas (#31)
1 attachment(s)
Re: [POC] A better way to expand hash indexes.

Thanks, I have tried to address all of the comments.

On Fri, Mar 31, 2017 at 8:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 30, 2017 at 2:36 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

Thanks Robert, I have tried to fix the comments given as below.

Thanks.

Since this changes the on-disk format of hash indexes in an
incompatible way, I think it should bump HASH_VERSION. (Hopefully
that doesn't preclude REINDEX?) pg_upgrade should probably also be
patched to issue a warning if upgrading from a version < 10 to a
version >= 10 whenever hash indexes are present; I thought we had
similar cases already, but I don't see them at the moment. Maybe we
can get Bruce or someone to give us some advice on exactly what should
be done here.

As of now, increasing the version asks us to REINDEX (metapage access
verifies whether we are on the right version):
postgres=# set enable_seqscan= off;
SET
postgres=# select * from t1 where i = 10;
ERROR: index "hash2" has wrong hash version
HINT: Please REINDEX it.
postgres=# insert into t1 values(10);
ERROR: index "hash2" has wrong hash version
HINT: Please REINDEX it.

postgres=# REINDEX INDEX hash2;
REINDEX
postgres=# select * from t1 where i = 10;
i
----
10
(1 row)

Last time we changed this version from 1 to 2
(4adc2f72a4ccd6e55e594aca837f09130a6af62b); from the logs I see no
upgrade-specific changes.

Hi Bruce, can you please advise us on what should be done here?

In a couple of places, you say that a splitpoint group has 4 slots
rather than 4 phases.

--Fixed

I think that in _hash_get_totalbuckets(), you should use blah &
HASH_SPLITPOINT_PHASE_MASK rather than blah %
HASH_SPLITPOINT_PHASES_PER_GRP for consistency with _hash_spareindex
and, perhaps, speed. Similarly, instead of blah /
HASH_SPLITPOINT_PHASES_PER_GRP, use blah >>
HASH_SPLITPOINT_PHASE_BITS.

--Fixed

buckets_toadd is punctuated oddly. buckets_to_add? Instead of
hand-calculating this, how about calculating it as
_hash_get_totalbuckets(spare_ndx) - _hash_get_totalbuckets(spare_ndx -
1)?

I think this should do that, considering new_bucket is nothing but the
1-based max_buckets:
buckets_to_add = _hash_get_totalbuckets(spare_ndx) - new_bucket;

That makes me do away with

+#define HASH_SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(sp_phase) \
+ (((sp_phase) < HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) ? \
+ (sp_phase) : \
+ ((((sp_phase) - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) >> \
+   HASH_SPLITPOINT_PHASE_BITS) + \
+  HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE))

as this is now used in only one place, _hash_get_totalbuckets.
I also think the comments above can be removed now, as we have removed the
code related to multi-phased allocation there:

+         * The number of buckets in the new splitpoint group is equal to the
+         * total number already in existence. But we do not allocate them at
+         * once. Each splitpoint group will have 4 phases, we distribute the
+         * buckets equally among them. So we allocate only one fourth of total
+         * buckets in new splitpoint group at a time to consume one phase after
+         * another.

+ uint32 splitpoint_group = 0;

Don't need the = 0 here; the next reference to this variable is an
unconditional initialization.

-- Fixed; with the new code splitpoint_group is not needed.

+         */
+
+        splitpoint_group = HASH_SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(spare_ndx);

I would delete the blank line.

--Fixed.

-         * should start from ((2 ^ i) + metap->hashm_spares[i - 1] + 1).
+         * should start from
+         * (_hash_get_totalbuckets(i) + metap->hashm_spares[i - 1] + 1).

Won't survive pgindent.

--Fixed as pgindent has suggested.

-         * The number of buckets in the new splitpoint is equal to the total
-         * number already in existence, i.e. new_bucket.  Currently this maps
-         * one-to-one to blocks required, but someday we may need a more
-         * complicated calculation here.  We treat allocation of buckets as a
-         * separate WAL-logged action.  Even if we fail after this operation,
-         * won't leak bucket pages; rather, the next split will consume this
-         * space. In any case, even without failure we don't use all the space
-         * in one split operation.
+         * The number of buckets in the new splitpoint group is equal to the
+         * total number already in existence. But we do not allocate them at
+         * once. Each splitpoint group will have 4 slots, we distribute the
+         * buckets equally among them. So we allocate only one fourth of total
+         * buckets in new splitpoint group at a time to consume one phase after
+         * another. We treat allocation of buckets as a separate WAL-logged
+         * action. Even if we fail after this operation, won't leak bucket
+         * pages; rather, the next split will consume this space. In any case,
+         * even without failure we don't use all the space in one split
+         * operation.

I think here you should break this into two paragraphs -- start a new
paragraph with the sentence that begins "We treat..."

-- Fixed; I have removed the first paragraph, since it reads as redundant
information once we compute
buckets_to_add = _hash_get_totalbuckets(spare_ndx) - new_bucket;

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Attachments:

yet_another_expand_hashbucket_efficiently_14.patchapplication/octet-stream; name=yet_another_expand_hashbucket_efficiently_14.patchDownload
commit 6680fe401c0ab787bee8c59def3d3cf49c6e9f19
Author: mithun <mithun@localhost.localdomain>
Date:   Fri Mar 31 10:02:49 2017 +0530

    commit 13

diff --git a/contrib/pageinspect/expected/hash.out b/contrib/pageinspect/expected/hash.out
index 3ba01f6..e287093 100644
--- a/contrib/pageinspect/expected/hash.out
+++ b/contrib/pageinspect/expected/hash.out
@@ -51,13 +51,13 @@ bsize     | 8152
 bmsize    | 4096
 bmshift   | 15
 maxbucket | 3
-highmask  | 7
-lowmask   | 3
+highmask  | 3
+lowmask   | 1
 ovflpoint | 2
 firstfree | 0
 nmaps     | 1
 procid    | 450
-spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 
 SELECT magic, version, ntuples, bsize, bmsize, bmshift, maxbucket, highmask,
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 9f41bb2..b3d0056 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -667,11 +667,11 @@ bmshift   | 15
 maxbucket | 12512
 highmask  | 16383
 lowmask   | 8191
-ovflpoint | 14
+ovflpoint | 28
 firstfree | 1204
 nmaps     | 1
 procid    | 450
-spares    | {0,0,0,0,0,0,1,1,1,1,1,4,59,704,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,0,0,0,0,1,1,1,1,1,1,1,1,3,4,4,4,45,55,58,59,508,567,628,704,1193,1202,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {65,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 </screen>
      </para>
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..c8a0ec7 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -58,35 +58,51 @@ rules to support a variable number of overflow pages while not having to
 move primary bucket pages around after they are created.
 
 Primary bucket pages (henceforth just "bucket pages") are allocated in
-power-of-2 groups, called "split points" in the code.  Buckets 0 and 1
-are created when the index is initialized.  At the first split, buckets 2
-and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
-when bucket 8 is needed, buckets 8-15 are allocated; etc.  All the bucket
-pages of a power-of-2 group appear consecutively in the index.  This
-addressing scheme allows the physical location of a bucket page to be
-computed from the bucket number relatively easily, using only a small
-amount of control information.  We take the log2() of the bucket number
-to determine which split point S the bucket belongs to, and then simply
-add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
-metapage) to compute the physical address.  hashm_spares[S] can be
-interpreted as the total number of overflow pages that have been allocated
-before the bucket pages of splitpoint S.  hashm_spares[0] is always 0,
-so that buckets 0 and 1 (which belong to splitpoint 0) always appear at
-block numbers 1 and 2, just after the meta page.  We always have
-hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
-former.  The difference between the two represents the number of overflow
-pages appearing between the bucket page groups of splitpoints N and N+1.
-
+power-of-2 groups, called "split points" in the code.  That means at every new
+splitpoint we double the existing number of buckets.  Allocating huge chunks
+of bucket pages all at once isn't optimal, and it would take ages to consume
+them.  To avoid this exponential growth of index size, we break up the
+allocation of buckets at a splitpoint into 4 equal phases.  If (2 ^ x) is the
+total number of buckets to be allocated at a splitpoint (from now on we shall
+call this a splitpoint group), then we allocate one fourth (2 ^ (x - 2)) of
+the total buckets in each phase of the splitpoint group.  The next quarter is
+allocated only once the buckets of the previous phase have been consumed.
+For the initial splitpoint groups < 10 we allocate all of their buckets in a
+single phase, as the number of buckets allocated in those groups is small.
+For groups >= 10 the allocation is distributed among four equal phases.  At
+group 10 we allocate (2 ^ 9) buckets in 4 phases {2 ^ 7, 2 ^ 7, 2 ^ 7, 2 ^ 7},
+where the numbers in curly braces indicate the number of buckets allocated
+within each phase of the splitpoint group.  For splitpoint groups 11 and 12
+the allocation phases are {2 ^ 8, 2 ^ 8, 2 ^ 8, 2 ^ 8} and
+{2 ^ 9, 2 ^ 9, 2 ^ 9, 2 ^ 9} respectively.  So at each splitpoint group we
+still double the total number of buckets relative to the previous group, but
+incrementally, one phase at a time.  The bucket pages
+allocated within one phase of a splitpoint group will appear consecutively in
+the index.  This addressing scheme allows the physical location of a bucket
+page to be computed from the bucket number relatively easily, using only a
+small amount of control information.  In the function _hash_spareindex, for
+a given bucket number we first compute the splitpoint group it belongs to and
+then the phase within that group to which the bucket belongs.  Adding the two
+gives the global splitpoint phase number S to which the bucket belongs; we
+then simply add "hashm_spares[S] + 1" (where hashm_spares[] is an array
+stored in the metapage) to the given bucket number to compute its physical
+address.  hashm_spares[S] can be interpreted as the total number of overflow
+pages that have been allocated before the bucket pages of splitpoint phase S.
+hashm_spares[0] is always 0, so that buckets 0 and 1 always appear at block
+numbers 1 and 2, just after the meta page.  We always have
+hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
+former.  The difference between the two represents the number of overflow
+pages appearing between the bucket page groups of splitpoint phases N and N+1.
 (Note: the above describes what happens when filling an initially minimally
-sized hash index.  In practice, we try to estimate the required index size
-and allocate a suitable number of splitpoints immediately, to avoid
+sized hash index.  In practice, we try to estimate the required index size and
+allocate a suitable number of splitpoint phases immediately, to avoid
 expensive re-splitting during initial index build.)
 
 When S splitpoints exist altogether, the array entries hashm_spares[0]
 through hashm_spares[S] are valid; hashm_spares[S] records the current
 total number of overflow pages.  New overflow pages are created as needed
 at the end of the index, and recorded by incrementing hashm_spares[S].
-When it is time to create a new splitpoint's worth of bucket pages, we
+When it is time to create a new splitpoint phase's worth of bucket pages, we
 copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
 stored in the hashm_ovflpoint field of the meta page).  This has the
 effect of reserving the correct number of bucket pages at the end of the
@@ -101,7 +117,7 @@ We have to allow the case "greater than" because it's possible that during
 an index extension we crash after allocating filesystem space and before
 updating the metapage.  Note that on filesystems that allow "holes" in
 files, it's entirely likely that pages before the logical EOF are not yet
-allocated: when we allocate a new splitpoint's worth of bucket pages, we
+allocated: when we allocate a new splitpoint phase's worth of bucket pages, we
 physically zero the last such page to force the EOF up, and the first such
 page will be used immediately, but the intervening pages are not written
 until needed.
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index a3cae21..41ef654 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -49,7 +49,7 @@ bitno_to_blkno(HashMetaPage metap, uint32 ovflbitnum)
 	 * Convert to absolute page number by adding the number of bucket pages
 	 * that exist before this split point.
 	 */
-	return (BlockNumber) ((1 << i) + ovflbitnum);
+	return (BlockNumber) (_hash_get_totalbuckets(i) + ovflbitnum);
 }
 
 /*
@@ -67,14 +67,15 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
 	/* Determine the split number containing this page */
 	for (i = 1; i <= splitnum; i++)
 	{
-		if (ovflblkno <= (BlockNumber) (1 << i))
+		if (ovflblkno <= (BlockNumber) _hash_get_totalbuckets(i))
 			break;				/* oops */
-		bitnum = ovflblkno - (1 << i);
+		bitnum = ovflblkno - _hash_get_totalbuckets(i);
 
 		/*
 		 * bitnum has to be greater than number of overflow page added in
 		 * previous split point. The overflow page at this splitnum (i) if any
-		 * should start from ((2 ^ i) + metap->hashm_spares[i - 1] + 1).
+		 * should start from (_hash_get_totalbuckets(i) +
+		 * metap->hashm_spares[i - 1] + 1).
 		 */
 		if (bitnum > metap->hashm_spares[i - 1] &&
 			bitnum <= metap->hashm_spares[i])
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 61ca2ec..fb8a9f0 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -502,14 +502,15 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	Page		page;
 	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
+	uint32		spare_index;
 	uint32		i;
 
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
-	 * next power of 2, however, and always force at least 2 bucket pages. The
-	 * upper limit is determined by considerations explained in
+	 * total number of buckets which has to be allocated before using its
+	 * _hashm_spare element. However always force at least 2 bucket pages.
+	 * The upper limit is determined by considerations explained in
 	 * _hash_expandtable().
 	 */
 	dnumbuckets = num_tuples / ffactor;
@@ -518,11 +519,10 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	else if (dnumbuckets >= (double) 0x40000000)
 		num_buckets = 0x40000000;
 	else
-		num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
+		num_buckets = _hash_get_totalbuckets(_hash_spareindex(dnumbuckets));
 
-	log2_num_buckets = _hash_log2(num_buckets);
-	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
-	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
+	spare_index = _hash_spareindex(num_buckets);
+	Assert(spare_index < HASH_MAX_SPLITPOINTS);
 
 	page = BufferGetPage(buf);
 	if (initpage)
@@ -563,18 +563,20 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
-	 * blocks 1 to N.  The first freespace bitmap page is in block N+1. Since
-	 * N is a power of 2, we can set the masks this way:
+	 * blocks 1 to N.  The first freespace bitmap page is in block N+1.
 	 */
-	metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
-	metap->hashm_highmask = (num_buckets << 1) - 1;
+	metap->hashm_maxbucket = num_buckets - 1;
+
+	/* set highmask, which should be sufficient to cover num_buckets. */
+	metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;
+	metap->hashm_lowmask = (metap->hashm_highmask >> 1);
 
 	MemSet(metap->hashm_spares, 0, sizeof(metap->hashm_spares));
 	MemSet(metap->hashm_mapp, 0, sizeof(metap->hashm_mapp));
 
 	/* Set up mapping for one spare page after the initial splitpoints */
-	metap->hashm_spares[log2_num_buckets] = 1;
-	metap->hashm_ovflpoint = log2_num_buckets;
+	metap->hashm_spares[spare_index] = 1;
+	metap->hashm_ovflpoint = spare_index;
 	metap->hashm_firstfree = 0;
 
 	/*
@@ -773,25 +775,25 @@ restart_expand:
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to allocate a new batch of bucket pages.
+	 * If the split point is increasing we need to allocate a new batch of
+	 * bucket pages.
 	 */
-	spare_ndx = _hash_log2(new_bucket + 1);
+	spare_ndx = _hash_spareindex(new_bucket + 1);
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
+		uint32		buckets_to_add;
+
 		Assert(spare_ndx == metap->hashm_ovflpoint + 1);
 
 		/*
-		 * The number of buckets in the new splitpoint is equal to the total
-		 * number already in existence, i.e. new_bucket.  Currently this maps
-		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.  We treat allocation of buckets as a
-		 * separate WAL-logged action.  Even if we fail after this operation,
-		 * won't leak bucket pages; rather, the next split will consume this
-		 * space. In any case, even without failure we don't use all the space
-		 * in one split operation.
+		 * We treat allocation of buckets as a separate WAL-logged action.
+		 * Even if we fail after this operation, won't leak bucket pages;
+		 * rather, the next split will consume this space. In any case, even
+		 * without failure we don't use all the space in one split
+		 * operation.
 		 */
-		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
+		buckets_to_add = _hash_get_totalbuckets(spare_ndx) - new_bucket;
+		if (!_hash_alloc_buckets(rel, start_nblkno, buckets_to_add))
 		{
 			/* can't split due to BlockNumber overflow */
 			_hash_relbuf(rel, buf_oblkno);
@@ -836,10 +838,9 @@ restart_expand:
 	}
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to adjust the hashm_spares[] array and
-	 * hashm_ovflpoint so that future overflow pages will be created beyond
-	 * this new batch of bucket pages.
+	 * If the split point is increasing we need to adjust the hashm_spares[]
+	 * array and hashm_ovflpoint so that future overflow pages will be created
+	 * beyond this new batch of bucket pages.
 	 */
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 2e99719..d679cf0 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -150,6 +150,71 @@ _hash_log2(uint32 num)
 }
 
 /*
+ * _hash_spareindex -- returns spare index / global splitpoint phase of the
+ *					   bucket
+ */
+uint32
+_hash_spareindex(uint32 num_bucket)
+{
+	uint32		splitpoint_group;
+	uint32		splitpoint_phases;
+
+	splitpoint_group = _hash_log2(num_bucket);
+
+	if (splitpoint_group < HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE)
+		return splitpoint_group;
+
+	/* account for single-phase groups */
+	splitpoint_phases = HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE;
+
+	/* account for multi-phase groups before splitpoint_group */
+	splitpoint_phases +=
+		((splitpoint_group - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) <<
+		 HASH_SPLITPOINT_PHASE_BITS);
+
+	/* account for phases within current group */
+	splitpoint_phases +=
+		(((num_bucket - 1) >> (HASH_SPLITPOINT_PHASE_BITS + 1)) &
+		 HASH_SPLITPOINT_PHASE_MASK);	/* to 0-based value. */
+
+	return splitpoint_phases;
+}
+
+/*
+ *	_hash_get_totalbuckets -- returns total number of buckets allocated till
+ *							the given splitpoint phase.
+ */
+uint32
+_hash_get_totalbuckets(uint32 splitpoint_phase)
+{
+	uint32		splitpoint_group;
+	uint32		total_buckets;
+	uint32		phases_within_splitpoint_group;
+
+	if (splitpoint_phase < HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE)
+		return (1 << splitpoint_phase);
+
+	/* get splitpoint's group */
+	splitpoint_group = HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE;
+	splitpoint_group +=
+		((splitpoint_phase - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) >>
+		 HASH_SPLITPOINT_PHASE_BITS);
+
+	/* account for buckets before splitpoint_group */
+	total_buckets = (1 << (splitpoint_group - 1));
+
+	/* account for buckets within splitpoint_group */
+	phases_within_splitpoint_group =
+		(((splitpoint_phase - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) &
+		  HASH_SPLITPOINT_PHASE_MASK) + 1);		/* from 0-based to 1-based */
+	total_buckets +=
+		(((1 << (splitpoint_group - 1)) >> HASH_SPLITPOINT_PHASE_BITS) *
+		 phases_within_splitpoint_group);
+
+	return total_buckets;
+}
+
+/*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
  * If flags is not zero, it is a bitwise OR of the acceptable values of
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index eb1df57..fcc3957 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -36,7 +36,7 @@ typedef uint32 Bucket;
 #define InvalidBucket	((Bucket) 0xFFFFFFFF)
 
 #define BUCKET_TO_BLKNO(metap,B) \
-		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
+		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_spareindex((B)+1)-1] : 0)) + 1)
 
 /*
  * Special space for hash index pages.
@@ -158,7 +158,8 @@ typedef HashScanOpaqueData *HashScanOpaque;
 #define HASH_METAPAGE	0		/* metapage is always block 0 */
 
 #define HASH_MAGIC		0x6440640
-#define HASH_VERSION	2		/* 2 signifies only hash key value is stored */
+#define HASH_VERSION	3		/* 3 signifies multi-phased bucket allocation
+								 * to reduce doubling */
 
 /*
  * spares[] holds the number of overflow pages currently allocated at or
@@ -176,13 +177,28 @@ typedef HashScanOpaqueData *HashScanOpaque;
  *
  * The limitation on the size of spares[] comes from the fact that there's
  * no point in having more than 2^32 buckets with only uint32 hashcodes.
+ * (Note: The value of HASH_MAX_SPLITPOINTS which is the size of spares[] is
+ * adjusted in such a way to accommodate multi phased allocation of buckets
+ * after HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE).
+ *
  * There is no particular upper limit on the size of mapp[], other than
  * needing to fit into the metapage.  (With 8K block size, 128 bitmaps
  * limit us to 64 GB of overflow space...)
  */
-#define HASH_MAX_SPLITPOINTS		32
 #define HASH_MAX_BITMAPS			128
 
+#define HASH_SPLITPOINT_PHASE_BITS	2
+#define HASH_SPLITPOINT_PHASES_PER_GRP	(1 << HASH_SPLITPOINT_PHASE_BITS)
+#define HASH_SPLITPOINT_PHASE_MASK		(HASH_SPLITPOINT_PHASES_PER_GRP - 1)
+#define HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE	10
+
+/* defines max number of splitpoint phases a hash index can have */
+#define HASH_MAX_SPLITPOINT_GROUP	32
+#define HASH_MAX_SPLITPOINTS \
+	(((HASH_MAX_SPLITPOINT_GROUP - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) * \
+	  HASH_SPLITPOINT_PHASES_PER_GRP) + \
+	 HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE)
+
 typedef struct HashMetaPageData
 {
 	uint32		hashm_magic;	/* magic no. for hash tables */
@@ -382,6 +398,8 @@ extern uint32 _hash_datum2hashkey_type(Relation rel, Datum key, Oid keytype);
 extern Bucket _hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
 					 uint32 highmask, uint32 lowmask);
 extern uint32 _hash_log2(uint32 num);
+extern uint32 _hash_spareindex(uint32 num_bucket);
+extern uint32 _hash_get_totalbuckets(uint32 splitpoint_phase);
 extern void _hash_checkpage(Relation rel, Buffer buf, int flags);
 extern uint32 _hash_get_indextuple_hashkey(IndexTuple itup);
 extern bool _hash_convert_tuple(Relation index,
#33Robert Haas
robertmhaas@gmail.com
In reply to: Mithun Cy (#32)
Re: [POC] A better way to expand hash indexes.

On Fri, Mar 31, 2017 at 1:15 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

Thanks, I have tried to fix all of the comments.

Thanks.

Hmm, don't the changes to contrib/pageinspect/expected/hash.out
indicate that you've broken something? The hash index has only 4
buckets, so the new code shouldn't be doing anything differently, but
you've got highmask and lowmask changing for some reason.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#34Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Robert Haas (#33)
1 attachment(s)
Re: [POC] A better way to expand hash indexes.

On Sat, Apr 1, 2017 at 7:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Hmm, don't the changes to contrib/pageinspect/expected/hash.out
indicate that you've broken something? The hash index has only 4
buckets, so the new code shouldn't be doing anything differently, but
you've got highmask and lowmask changing for some reason.

The highmask calculation has changed a bit to accommodate num_buckets
values that are not powers of 2. If num_buckets is not a power of 2,
highmask should be the next ((2^x) - 1) above it, and lowmask should be
(highmask >> 1) so that it covers the first half of the buckets.
Generalizing the calculation this way changes the masks for power-of-2
num_buckets relative to the older implementation.

+ /* set highmask, which should be sufficient to cover num_buckets. */
+ metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;

But this does not cause any adverse effect; the high and low masks are
still sufficient to map hash values to the correct buckets. If we add
one more bucket, then in _hash_expandtable we immediately make the
masks bigger:
if (new_bucket > metap->hashm_highmask)
{
/* Starting a new doubling */
metap->hashm_lowmask = metap->hashm_highmask;
metap->hashm_highmask = new_bucket | metap->hashm_lowmask;

The state (metap->hashm_highmask == metap->hashm_maxbucket) occurs
naturally while the hash index is growing, just before a doubling.

Another choice I could have made is to bump the number by one, so that
power-of-2 num_buckets get the same highmask as the old code, while
non-power-of-2 num_buckets get the next immediate ((2^x) - 1):
+ /* set highmask, which should be sufficient to cover num_buckets. */
+ metap->hashm_highmask = (1 << (_hash_log2(num_buckets + 1))) - 1;

It was just a personal preference that I chose the first option, as it
appeared consistent with the running state of hash index expansion.
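
To make the difference concrete, below is a small standalone sketch
(not part of any attached patch; my_log2 is just a stand-in for
_hash_log2, returning the smallest x such that (1 << x) >= num, and the
driver is purely illustrative). It prints the masks each choice
produces; for a power-of-2 value such as num_buckets = 4 the first
choice gives high/low = 3/1 where the old code gave 7/3, while the
second choice reproduces the old 7/3.

#include <stdio.h>

/* stand-in for _hash_log2: smallest x such that (1 << x) >= num */
static unsigned
my_log2(unsigned num)
{
	unsigned	i,
				limit;

	for (i = 0, limit = 1; limit < num; i++, limit <<= 1)
		;
	return i;
}

int
main(void)
{
	unsigned	sizes[] = {4, 6, 10, 16};
	int			n;

	for (n = 0; n < 4; n++)
	{
		unsigned	num_buckets = sizes[n];

		/* choice 1: highmask is the smallest ((2^x) - 1) covering num_buckets - 1 */
		unsigned	high1 = (1 << my_log2(num_buckets)) - 1;
		unsigned	low1 = high1 >> 1;

		/*
		 * choice 2: same formula applied to num_buckets + 1, which matches
		 * the old masks whenever num_buckets is a power of 2
		 */
		unsigned	high2 = (1 << my_log2(num_buckets + 1)) - 1;
		unsigned	low2 = high2 >> 1;

		printf("num_buckets=%u  choice1 high/low=%u/%u  choice2 high/low=%u/%u\n",
			   num_buckets, high1, low1, high2, low2);
	}
	return 0;
}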

I am also attaching a patch which implements the second way.

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Attachments:

yet_another_expand_hashbucket_efficiently_15.patchapplication/octet-stream; name=yet_another_expand_hashbucket_efficiently_15.patchDownload
diff --git a/contrib/pageinspect/expected/hash.out b/contrib/pageinspect/expected/hash.out
index 3ba01f6..3937415 100644
--- a/contrib/pageinspect/expected/hash.out
+++ b/contrib/pageinspect/expected/hash.out
@@ -45,7 +45,7 @@ lowmask, ovflpoint, firstfree, nmaps, procid, spares, mapp FROM
 hash_metapage_info(get_raw_page('test_hash_a_idx', 0));
 -[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 magic     | 105121344
-version   | 2
+version   | 3
 ntuples   | 1
 bsize     | 8152
 bmsize    | 4096
@@ -57,7 +57,7 @@ ovflpoint | 2
 firstfree | 0
 nmaps     | 1
 procid    | 450
-spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 
 SELECT magic, version, ntuples, bsize, bmsize, bmshift, maxbucket, highmask,
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 9f41bb2..c87b160 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -658,7 +658,7 @@ test=# SELECT * FROM hash_bitmap_info('con_hash_index', 2052);
 test=# SELECT * FROM hash_metapage_info(get_raw_page('con_hash_index', 0));
 -[ RECORD 1 ]-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 magic     | 105121344
-version   | 2
+version   | 3
 ntuples   | 500500
 ffactor   | 40
 bsize     | 8152
@@ -667,11 +667,11 @@ bmshift   | 15
 maxbucket | 12512
 highmask  | 16383
 lowmask   | 8191
-ovflpoint | 14
+ovflpoint | 28
 firstfree | 1204
 nmaps     | 1
 procid    | 450
-spares    | {0,0,0,0,0,0,1,1,1,1,1,4,59,704,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,0,0,0,0,1,1,1,1,1,1,1,1,3,4,4,4,45,55,58,59,508,567,628,704,1193,1202,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {65,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 </screen>
      </para>
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..c8a0ec7 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -58,35 +58,51 @@ rules to support a variable number of overflow pages while not having to
 move primary bucket pages around after they are created.
 
 Primary bucket pages (henceforth just "bucket pages") are allocated in
-power-of-2 groups, called "split points" in the code.  Buckets 0 and 1
-are created when the index is initialized.  At the first split, buckets 2
-and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
-when bucket 8 is needed, buckets 8-15 are allocated; etc.  All the bucket
-pages of a power-of-2 group appear consecutively in the index.  This
-addressing scheme allows the physical location of a bucket page to be
-computed from the bucket number relatively easily, using only a small
-amount of control information.  We take the log2() of the bucket number
-to determine which split point S the bucket belongs to, and then simply
-add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
-metapage) to compute the physical address.  hashm_spares[S] can be
-interpreted as the total number of overflow pages that have been allocated
-before the bucket pages of splitpoint S.  hashm_spares[0] is always 0,
-so that buckets 0 and 1 (which belong to splitpoint 0) always appear at
-block numbers 1 and 2, just after the meta page.  We always have
-hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
-former.  The difference between the two represents the number of overflow
-pages appearing between the bucket page groups of splitpoints N and N+1.
-
+power-of-2 groups, called "split points" in the code.  That means at every new
+splitpoint we double the existing number of buckets.  Allocating huge chunks
+of bucket pages all at once isn't optimal and we will take ages to consume
+those.  To avoid this exponential growth of index size, we did use a trick to
+break up allocation of buckets at the splitpoint into 4 equal phases.  If
+(2 ^ x) are the total buckets need to be allocated at a splitpoint (from now on
+we shall call this as a splitpoint group), then we allocate 1/4th (2 ^ (x - 2))
+of total buckets at each phase of splitpoint group.  Next quarter of allocation
+will only happen if buckets of the previous phase have been already consumed.
+For the initial splitpoint groups < 10 we will allocate all of their buckets in
+single phase only, as number of buckets allocated at initial groups are small
+in numbers.  And for the groups >= 10 the allocation process is distributed
+among four equal phases.  At group 10 we allocate (2 ^ 9) buckets in 4
+different phases {2 ^ 7, 2 ^ 7, 2 ^ 7, 2 ^ 7}, the numbers in curly braces
+indicate the number of buckets allocated within each phase of splitpoint group
+10.  And, for splitpoint group 11 and 12 allocation phases will be
+{2 ^ 8, 2 ^ 8, 2 ^ 8, 2 ^ 8} and {2 ^ 9, 2 ^ 9, 2 ^ 9, 2 ^ 9} respectively.  We
+can see that at each splitpoint group we double the total number of buckets
+from the previous group but in an incremental phase.  The bucket pages
+allocated within one phase of a splitpoint group will appear consecutively in
+the index.  This addressing scheme allows the physical location of a bucket
+page to be computed from the bucket number relatively easily, using only a
+small amount of control information.  If we look at the function
+_hash_spareindex for a given bucket number we first compute the
+splitpoint group it belongs to and then the phase to which the bucket belongs
+to.  Adding them we get the global splitpoint phase number S to which the
+bucket belongs and then simply add "hashm_spares[S] + 1" (where hashm_spares[]
+is an array stored in the metapage) with given bucket number to compute its
+physical address.  The hashm_spares[S] can be interpreted as the total number
+of overflow pages that have been allocated before the bucket pages of
+splitpoint phase S.  The hashm_spares[0] is always 0, so that buckets 0 and 1
+always appear at block numbers 1 and 2, just after the meta page.  We always
+have hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
+former.  The difference between the two represents the number of overflow pages
+appearing between the bucket page groups of splitpoints phase N and N+1.
 (Note: the above describes what happens when filling an initially minimally
-sized hash index.  In practice, we try to estimate the required index size
-and allocate a suitable number of splitpoints immediately, to avoid
+sized hash index.  In practice, we try to estimate the required index size and
+allocate a suitable number of splitpoints phases immediately, to avoid
 expensive re-splitting during initial index build.)
 
 When S splitpoints exist altogether, the array entries hashm_spares[0]
 through hashm_spares[S] are valid; hashm_spares[S] records the current
 total number of overflow pages.  New overflow pages are created as needed
 at the end of the index, and recorded by incrementing hashm_spares[S].
-When it is time to create a new splitpoint's worth of bucket pages, we
+When it is time to create a new splitpoint phase's worth of bucket pages, we
 copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
 stored in the hashm_ovflpoint field of the meta page).  This has the
 effect of reserving the correct number of bucket pages at the end of the
@@ -101,7 +117,7 @@ We have to allow the case "greater than" because it's possible that during
 an index extension we crash after allocating filesystem space and before
 updating the metapage.  Note that on filesystems that allow "holes" in
 files, it's entirely likely that pages before the logical EOF are not yet
-allocated: when we allocate a new splitpoint's worth of bucket pages, we
+allocated: when we allocate a new splitpoint phase's worth of bucket pages, we
 physically zero the last such page to force the EOF up, and the first such
 page will be used immediately, but the intervening pages are not written
 until needed.
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index a3cae21..41ef654 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -49,7 +49,7 @@ bitno_to_blkno(HashMetaPage metap, uint32 ovflbitnum)
 	 * Convert to absolute page number by adding the number of bucket pages
 	 * that exist before this split point.
 	 */
-	return (BlockNumber) ((1 << i) + ovflbitnum);
+	return (BlockNumber) (_hash_get_totalbuckets(i) + ovflbitnum);
 }
 
 /*
@@ -67,14 +67,15 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
 	/* Determine the split number containing this page */
 	for (i = 1; i <= splitnum; i++)
 	{
-		if (ovflblkno <= (BlockNumber) (1 << i))
+		if (ovflblkno <= (BlockNumber) _hash_get_totalbuckets(i))
 			break;				/* oops */
-		bitnum = ovflblkno - (1 << i);
+		bitnum = ovflblkno - _hash_get_totalbuckets(i);
 
 		/*
 		 * bitnum has to be greater than number of overflow page added in
 		 * previous split point. The overflow page at this splitnum (i) if any
-		 * should start from ((2 ^ i) + metap->hashm_spares[i - 1] + 1).
+		 * should start from (_hash_get_totalbuckets(i) +
+		 * metap->hashm_spares[i - 1] + 1).
 		 */
 		if (bitnum > metap->hashm_spares[i - 1] &&
 			bitnum <= metap->hashm_spares[i])
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 61ca2ec..b5a1c7e 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -502,14 +502,15 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	Page		page;
 	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
+	uint32		spare_index;
 	uint32		i;
 
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
-	 * next power of 2, however, and always force at least 2 bucket pages. The
-	 * upper limit is determined by considerations explained in
+	 * total number of buckets which has to be allocated before using its
+	 * _hashm_spare element. However always force at least 2 bucket pages.
+	 * The upper limit is determined by considerations explained in
 	 * _hash_expandtable().
 	 */
 	dnumbuckets = num_tuples / ffactor;
@@ -518,11 +519,10 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	else if (dnumbuckets >= (double) 0x40000000)
 		num_buckets = 0x40000000;
 	else
-		num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
+		num_buckets = _hash_get_totalbuckets(_hash_spareindex(dnumbuckets));
 
-	log2_num_buckets = _hash_log2(num_buckets);
-	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
-	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
+	spare_index = _hash_spareindex(num_buckets);
+	Assert(spare_index < HASH_MAX_SPLITPOINTS);
 
 	page = BufferGetPage(buf);
 	if (initpage)
@@ -563,18 +563,23 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
-	 * blocks 1 to N.  The first freespace bitmap page is in block N+1. Since
-	 * N is a power of 2, we can set the masks this way:
+	 * blocks 1 to N.  The first freespace bitmap page is in block N+1.
 	 */
-	metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
-	metap->hashm_highmask = (num_buckets << 1) - 1;
+	metap->hashm_maxbucket = num_buckets - 1;
+
+	/*
+	 * Set highmask as next immediate ((2 ^ x) - 1), which should be sufficient
+	 * to cover num_buckets.
+	 */
+	metap->hashm_highmask = (1 << (_hash_log2(num_buckets + 1))) - 1;
+	metap->hashm_lowmask = (metap->hashm_highmask >> 1);
 
 	MemSet(metap->hashm_spares, 0, sizeof(metap->hashm_spares));
 	MemSet(metap->hashm_mapp, 0, sizeof(metap->hashm_mapp));
 
 	/* Set up mapping for one spare page after the initial splitpoints */
-	metap->hashm_spares[log2_num_buckets] = 1;
-	metap->hashm_ovflpoint = log2_num_buckets;
+	metap->hashm_spares[spare_index] = 1;
+	metap->hashm_ovflpoint = spare_index;
 	metap->hashm_firstfree = 0;
 
 	/*
@@ -773,25 +778,25 @@ restart_expand:
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to allocate a new batch of bucket pages.
+	 * If the split point is increasing we need to allocate a new batch of
+	 * bucket pages.
 	 */
-	spare_ndx = _hash_log2(new_bucket + 1);
+	spare_ndx = _hash_spareindex(new_bucket + 1);
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
+		uint32		buckets_to_add;
+
 		Assert(spare_ndx == metap->hashm_ovflpoint + 1);
 
 		/*
-		 * The number of buckets in the new splitpoint is equal to the total
-		 * number already in existence, i.e. new_bucket.  Currently this maps
-		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.  We treat allocation of buckets as a
-		 * separate WAL-logged action.  Even if we fail after this operation,
-		 * won't leak bucket pages; rather, the next split will consume this
-		 * space. In any case, even without failure we don't use all the space
-		 * in one split operation.
+		 * We treat allocation of buckets as a separate WAL-logged action.
+		 * Even if we fail after this operation, won't leak bucket pages;
+		 * rather, the next split will consume this space. In any case, even
+		 * without failure we don't use all the space in one split
+		 * operation.
 		 */
-		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
+		buckets_to_add = _hash_get_totalbuckets(spare_ndx) - new_bucket;
+		if (!_hash_alloc_buckets(rel, start_nblkno, buckets_to_add))
 		{
 			/* can't split due to BlockNumber overflow */
 			_hash_relbuf(rel, buf_oblkno);
@@ -836,10 +841,9 @@ restart_expand:
 	}
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to adjust the hashm_spares[] array and
-	 * hashm_ovflpoint so that future overflow pages will be created beyond
-	 * this new batch of bucket pages.
+	 * If the split point is increasing we need to adjust the hashm_spares[]
+	 * array and hashm_ovflpoint so that future overflow pages will be created
+	 * beyond this new batch of bucket pages.
 	 */
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 2e99719..d679cf0 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -150,6 +150,71 @@ _hash_log2(uint32 num)
 }
 
 /*
+ * _hash_spareindex -- returns spare index / global splitpoint phase of the
+ *					   bucket
+ */
+uint32
+_hash_spareindex(uint32 num_bucket)
+{
+	uint32		splitpoint_group;
+	uint32		splitpoint_phases;
+
+	splitpoint_group = _hash_log2(num_bucket);
+
+	if (splitpoint_group < HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE)
+		return splitpoint_group;
+
+	/* account for single-phase groups */
+	splitpoint_phases = HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE;
+
+	/* account for multi-phase groups before splitpoint_group */
+	splitpoint_phases +=
+		((splitpoint_group - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) <<
+		 HASH_SPLITPOINT_PHASE_BITS);
+
+	/* account for phases within current group */
+	splitpoint_phases +=
+		(((num_bucket - 1) >> (HASH_SPLITPOINT_PHASE_BITS + 1)) &
+		 HASH_SPLITPOINT_PHASE_MASK);	/* to 0-based value. */
+
+	return splitpoint_phases;
+}
+
+/*
+ *	_hash_get_totalbuckets -- returns total number of buckets allocated till
+ *							the given splitpoint phase.
+ */
+uint32
+_hash_get_totalbuckets(uint32 splitpoint_phase)
+{
+	uint32		splitpoint_group;
+	uint32		total_buckets;
+	uint32		phases_within_splitpoint_group;
+
+	if (splitpoint_phase < HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE)
+		return (1 << splitpoint_phase);
+
+	/* get splitpoint's group */
+	splitpoint_group = HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE;
+	splitpoint_group +=
+		((splitpoint_phase - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) >>
+		 HASH_SPLITPOINT_PHASE_BITS);
+
+	/* account for buckets before splitpoint_group */
+	total_buckets = (1 << (splitpoint_group - 1));
+
+	/* account for buckets within splitpoint_group */
+	phases_within_splitpoint_group =
+		(((splitpoint_phase - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) &
+		  HASH_SPLITPOINT_PHASE_MASK) + 1);		/* from 0-based to 1-based */
+	total_buckets +=
+		(((1 << (splitpoint_group - 1)) >> HASH_SPLITPOINT_PHASE_BITS) *
+		 phases_within_splitpoint_group);
+
+	return total_buckets;
+}
+
+/*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
  * If flags is not zero, it is a bitwise OR of the acceptable values of
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index eb1df57..fcc3957 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -36,7 +36,7 @@ typedef uint32 Bucket;
 #define InvalidBucket	((Bucket) 0xFFFFFFFF)
 
 #define BUCKET_TO_BLKNO(metap,B) \
-		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
+		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_spareindex((B)+1)-1] : 0)) + 1)
 
 /*
  * Special space for hash index pages.
@@ -158,7 +158,8 @@ typedef HashScanOpaqueData *HashScanOpaque;
 #define HASH_METAPAGE	0		/* metapage is always block 0 */
 
 #define HASH_MAGIC		0x6440640
-#define HASH_VERSION	2		/* 2 signifies only hash key value is stored */
+#define HASH_VERSION	3		/* 3 signifies multi-phased bucket allocation
+								 * to reduce doubling */
 
 /*
  * spares[] holds the number of overflow pages currently allocated at or
@@ -176,13 +177,28 @@ typedef HashScanOpaqueData *HashScanOpaque;
  *
  * The limitation on the size of spares[] comes from the fact that there's
  * no point in having more than 2^32 buckets with only uint32 hashcodes.
+ * (Note: The value of HASH_MAX_SPLITPOINTS which is the size of spares[] is
+ * adjusted in such a way to accommodate multi phased allocation of buckets
+ * after HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE).
+ *
  * There is no particular upper limit on the size of mapp[], other than
  * needing to fit into the metapage.  (With 8K block size, 128 bitmaps
  * limit us to 64 GB of overflow space...)
  */
-#define HASH_MAX_SPLITPOINTS		32
 #define HASH_MAX_BITMAPS			128
 
+#define HASH_SPLITPOINT_PHASE_BITS	2
+#define HASH_SPLITPOINT_PHASES_PER_GRP	(1 << HASH_SPLITPOINT_PHASE_BITS)
+#define HASH_SPLITPOINT_PHASE_MASK		(HASH_SPLITPOINT_PHASES_PER_GRP - 1)
+#define HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE	10
+
+/* defines max number of splitpoint phases a hash index can have */
+#define HASH_MAX_SPLITPOINT_GROUP	32
+#define HASH_MAX_SPLITPOINTS \
+	(((HASH_MAX_SPLITPOINT_GROUP - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) * \
+	  HASH_SPLITPOINT_PHASES_PER_GRP) + \
+	 HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE)
+
 typedef struct HashMetaPageData
 {
 	uint32		hashm_magic;	/* magic no. for hash tables */
@@ -382,6 +398,8 @@ extern uint32 _hash_datum2hashkey_type(Relation rel, Datum key, Oid keytype);
 extern Bucket _hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
 					 uint32 highmask, uint32 lowmask);
 extern uint32 _hash_log2(uint32 num);
+extern uint32 _hash_spareindex(uint32 num_bucket);
+extern uint32 _hash_get_totalbuckets(uint32 splitpoint_phase);
 extern void _hash_checkpage(Relation rel, Buffer buf, int flags);
 extern uint32 _hash_get_indextuple_hashkey(IndexTuple itup);
 extern bool _hash_convert_tuple(Relation index,
#35Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Mithun Cy (#34)
2 attachment(s)
Re: [POC] A better way to expand hash indexes.

On Sat, Apr 1, 2017 at 12:31 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

Also adding a patch which implements the 2nd way.

Sorry, I forgot to attach the sortbuild_hash patch, which also needs
similar changes for the hash_mask.
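
For reference, the reason a single hash_mask no longer works is that
with a non-power-of-2 number of buckets the bucket is determined by the
maxbucket/highmask/lowmask triple rather than by one mask. The sketch
below mirrors the _hash_hashkey2bucket logic; the surrounding driver
and the example mask values for 6 buckets (maxbucket = 5, highmask = 7,
lowmask = 3) are illustrative only. Hash keys that a plain mask would
order as 6 or 7 actually land in buckets 2 and 3, so the sort has to
compare bucket numbers, not masked hash codes.

#include <stdio.h>

/* mirrors the backend's _hash_hashkey2bucket() logic */
static unsigned
hashkey2bucket(unsigned hashkey, unsigned maxbucket,
			   unsigned highmask, unsigned lowmask)
{
	unsigned	bucket = hashkey & highmask;

	if (bucket > maxbucket)
		bucket &= lowmask;
	return bucket;
}

int
main(void)
{
	/* e.g. 6 buckets: maxbucket = 5, highmask = 7, lowmask = 3 */
	unsigned	maxbucket = 5,
				highmask = 7,
				lowmask = 3;
	unsigned	keys[] = {1, 6, 7, 13, 14};
	int			i;

	for (i = 0; i < 5; i++)
		printf("hashkey %2u -> bucket %u (plain highmask would give %u)\n",
			   keys[i],
			   hashkey2bucket(keys[i], maxbucket, highmask, lowmask),
			   keys[i] & highmask);
	return 0;
}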

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Attachments:

sortbuild_hash_A_2.patchapplication/octet-stream; name=sortbuild_hash_A_2.patchDownload
commit e73b27c9412ab916a7fb6e111238c9ecd1a72def
Author: mithun <mithun@localhost.localdomain>
Date:   Sat Apr 1 12:55:13 2017 +0530

    commit 2

diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index 0e0f393..41d615d 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -37,7 +37,15 @@ struct HSpool
 {
 	Tuplesortstate *sortstate;	/* state data for tuplesort.c */
 	Relation	index;
-	uint32		hash_mask;		/* bitmask for hash codes */
+
+	/*
+	 * We sort the hash keys based on the buckets they belong to. Below masks
+	 * are used in _hash_hashkey2bucket to determine the bucket of given hash
+	 * key.
+	 */
+	uint32		high_mask;
+	uint32		low_mask;
+	uint32		max_buckets;
 };
 
 
@@ -56,11 +64,12 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 	 * num_buckets buckets in the index, the appropriate mask can be computed
 	 * as follows.
 	 *
-	 * Note: at present, the passed-in num_buckets is always a power of 2, so
-	 * we could just compute num_buckets - 1.  We prefer not to assume that
-	 * here, though.
+	 * NOTE : This hash mask calculation should be in sync with similar
+	 * calculation in _hash_init_metabuffer.
 	 */
-	hspool->hash_mask = (((uint32) 1) << _hash_log2(num_buckets)) - 1;
+	hspool->high_mask = (((uint32) 1) << _hash_log2(num_buckets + 1)) - 1;
+	hspool->low_mask = (hspool->high_mask >> 1);
+	hspool->max_buckets = num_buckets - 1;
 
 	/*
 	 * We size the sort area as maintenance_work_mem rather than work_mem to
@@ -69,7 +78,9 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 	 */
 	hspool->sortstate = tuplesort_begin_index_hash(heap,
 												   index,
-												   hspool->hash_mask,
+												   hspool->high_mask,
+												   hspool->low_mask,
+												   hspool->max_buckets,
 												   maintenance_work_mem,
 												   false);
 
@@ -122,7 +133,9 @@ _h_indexbuild(HSpool *hspool, Relation heapRel)
 #ifdef USE_ASSERT_CHECKING
 		uint32		lasthashkey = hashkey;
 
-		hashkey = _hash_get_indextuple_hashkey(itup) & hspool->hash_mask;
+		hashkey = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+									   hspool->max_buckets, hspool->high_mask,
+									   hspool->low_mask);
 		Assert(hashkey >= lasthashkey);
 #endif
 
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index e1e692d..65cda52 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -127,6 +127,7 @@
 
 #include "access/htup_details.h"
 #include "access/nbtree.h"
+#include "access/hash.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
 #include "commands/tablespace.h"
@@ -473,7 +474,9 @@ struct Tuplesortstate
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 
 	/* These are specific to the index_hash subcase: */
-	uint32		hash_mask;		/* mask for sortable part of hash code */
+	uint32		high_mask;		/* masks for sortable part of hash code */
+	uint32		low_mask;
+	uint32		max_buckets;
 
 	/*
 	 * These variables are specific to the Datum case; they are set by
@@ -991,7 +994,9 @@ tuplesort_begin_index_btree(Relation heapRel,
 Tuplesortstate *
 tuplesort_begin_index_hash(Relation heapRel,
 						   Relation indexRel,
-						   uint32 hash_mask,
+						   uint32 high_mask,
+						   uint32 low_mask,
+						   uint32 max_buckets,
 						   int workMem, bool randomAccess)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
@@ -1002,8 +1007,11 @@ tuplesort_begin_index_hash(Relation heapRel,
 #ifdef TRACE_SORT
 	if (trace_sort)
 		elog(LOG,
-		"begin index sort: hash_mask = 0x%x, workMem = %d, randomAccess = %c",
-			 hash_mask,
+			 "begin index sort: high_mask = 0x%x, low_mask = 0x%x, "
+			 "max_buckets = 0x%x, workMem = %d, randomAccess = %c",
+			 high_mask,
+			 low_mask,
+			 max_buckets,
 			 workMem, randomAccess ? 't' : 'f');
 #endif
 
@@ -1017,7 +1025,9 @@ tuplesort_begin_index_hash(Relation heapRel,
 	state->heapRel = heapRel;
 	state->indexRel = indexRel;
 
-	state->hash_mask = hash_mask;
+	state->high_mask = high_mask;
+	state->low_mask = low_mask;
+	state->max_buckets = max_buckets;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -4157,8 +4167,8 @@ static int
 comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 					  Tuplesortstate *state)
 {
-	uint32		hash1;
-	uint32		hash2;
+	Bucket		bucket1;
+	Bucket		bucket2;
 	IndexTuple	tuple1;
 	IndexTuple	tuple2;
 
@@ -4167,13 +4177,16 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 	 * that the first column of the index tuple is the hash key.
 	 */
 	Assert(!a->isnull1);
-	hash1 = DatumGetUInt32(a->datum1) & state->hash_mask;
+	bucket1 = _hash_hashkey2bucket(DatumGetUInt32(a->datum1),
+								   state->max_buckets, state->high_mask,
+								   state->low_mask);
 	Assert(!b->isnull1);
-	hash2 = DatumGetUInt32(b->datum1) & state->hash_mask;
-
-	if (hash1 > hash2)
+	bucket2 = _hash_hashkey2bucket(DatumGetUInt32(b->datum1),
+								   state->max_buckets, state->high_mask,
+								   state->low_mask);
+	if (bucket1 > bucket2)
 		return 1;
-	else if (hash1 < hash2)
+	else if (bucket1 < bucket2)
 		return -1;
 
 	/*
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 5b3f475..9719db4 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -72,7 +72,9 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 							int workMem, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
 						   Relation indexRel,
-						   uint32 hash_mask,
+						   uint32 high_mask,
+						   uint32 low_mask,
+						   uint32 max_buckets,
 						   int workMem, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 					  Oid sortOperator, Oid sortCollation,
yet_another_expand_hashbucket_efficiently_15.patchapplication/octet-stream; name=yet_another_expand_hashbucket_efficiently_15.patchDownload
diff --git a/contrib/pageinspect/expected/hash.out b/contrib/pageinspect/expected/hash.out
index 3ba01f6..3937415 100644
--- a/contrib/pageinspect/expected/hash.out
+++ b/contrib/pageinspect/expected/hash.out
@@ -45,7 +45,7 @@ lowmask, ovflpoint, firstfree, nmaps, procid, spares, mapp FROM
 hash_metapage_info(get_raw_page('test_hash_a_idx', 0));
 -[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 magic     | 105121344
-version   | 2
+version   | 3
 ntuples   | 1
 bsize     | 8152
 bmsize    | 4096
@@ -57,7 +57,7 @@ ovflpoint | 2
 firstfree | 0
 nmaps     | 1
 procid    | 450
-spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 
 SELECT magic, version, ntuples, bsize, bmsize, bmshift, maxbucket, highmask,
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 9f41bb2..c87b160 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -658,7 +658,7 @@ test=# SELECT * FROM hash_bitmap_info('con_hash_index', 2052);
 test=# SELECT * FROM hash_metapage_info(get_raw_page('con_hash_index', 0));
 -[ RECORD 1 ]-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 magic     | 105121344
-version   | 2
+version   | 3
 ntuples   | 500500
 ffactor   | 40
 bsize     | 8152
@@ -667,11 +667,11 @@ bmshift   | 15
 maxbucket | 12512
 highmask  | 16383
 lowmask   | 8191
-ovflpoint | 14
+ovflpoint | 28
 firstfree | 1204
 nmaps     | 1
 procid    | 450
-spares    | {0,0,0,0,0,0,1,1,1,1,1,4,59,704,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
+spares    | {0,0,0,0,0,0,1,1,1,1,1,1,1,1,3,4,4,4,45,55,58,59,508,567,628,704,1193,1202,1204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 mapp      | {65,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
 </screen>
      </para>
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..c8a0ec7 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -58,35 +58,51 @@ rules to support a variable number of overflow pages while not having to
 move primary bucket pages around after they are created.
 
 Primary bucket pages (henceforth just "bucket pages") are allocated in
-power-of-2 groups, called "split points" in the code.  Buckets 0 and 1
-are created when the index is initialized.  At the first split, buckets 2
-and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
-when bucket 8 is needed, buckets 8-15 are allocated; etc.  All the bucket
-pages of a power-of-2 group appear consecutively in the index.  This
-addressing scheme allows the physical location of a bucket page to be
-computed from the bucket number relatively easily, using only a small
-amount of control information.  We take the log2() of the bucket number
-to determine which split point S the bucket belongs to, and then simply
-add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
-metapage) to compute the physical address.  hashm_spares[S] can be
-interpreted as the total number of overflow pages that have been allocated
-before the bucket pages of splitpoint S.  hashm_spares[0] is always 0,
-so that buckets 0 and 1 (which belong to splitpoint 0) always appear at
-block numbers 1 and 2, just after the meta page.  We always have
-hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
-former.  The difference between the two represents the number of overflow
-pages appearing between the bucket page groups of splitpoints N and N+1.
-
+power-of-2 groups, called "split points" in the code.  That means at every new
+splitpoint we double the existing number of buckets.  Allocating huge chunks
+of bucket pages all at once isn't optimal and we will take ages to consume
+those.  To avoid this exponential growth of index size, we did use a trick to
+break up allocation of buckets at the splitpoint into 4 equal phases.  If
+(2 ^ x) are the total buckets need to be allocated at a splitpoint (from now on
+we shall call this as a splitpoint group), then we allocate 1/4th (2 ^ (x - 2))
+of total buckets at each phase of splitpoint group.  Next quarter of allocation
+will only happen if buckets of the previous phase have been already consumed.
+For the initial splitpoint groups < 10 we will allocate all of their buckets in
+single phase only, as number of buckets allocated at initial groups are small
+in numbers.  And for the groups >= 10 the allocation process is distributed
+among four equal phases.  At group 10 we allocate (2 ^ 9) buckets in 4
+different phases {2 ^ 7, 2 ^ 7, 2 ^ 7, 2 ^ 7}, the numbers in curly braces
+indicate the number of buckets allocated within each phase of splitpoint group
+10.  And, for splitpoint group 11 and 12 allocation phases will be
+{2 ^ 8, 2 ^ 8, 2 ^ 8, 2 ^ 8} and {2 ^ 9, 2 ^ 9, 2 ^ 9, 2 ^ 9} respectively.  We
+can see that at each splitpoint group we double the total number of buckets
+from the previous group but in an incremental phase.  The bucket pages
+allocated within one phase of a splitpoint group will appear consecutively in
+the index.  This addressing scheme allows the physical location of a bucket
+page to be computed from the bucket number relatively easily, using only a
+small amount of control information.  If we look at the function
+_hash_spareindex for a given bucket number we first compute the
+splitpoint group it belongs to and then the phase to which the bucket belongs
+to.  Adding them we get the global splitpoint phase number S to which the
+bucket belongs and then simply add "hashm_spares[S] + 1" (where hashm_spares[]
+is an array stored in the metapage) with given bucket number to compute its
+physical address.  The hashm_spares[S] can be interpreted as the total number
+of overflow pages that have been allocated before the bucket pages of
+splitpoint phase S.  The hashm_spares[0] is always 0, so that buckets 0 and 1
+always appear at block numbers 1 and 2, just after the meta page.  We always
+have hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
+former.  The difference between the two represents the number of overflow pages
+appearing between the bucket page groups of splitpoints phase N and N+1.
 (Note: the above describes what happens when filling an initially minimally
-sized hash index.  In practice, we try to estimate the required index size
-and allocate a suitable number of splitpoints immediately, to avoid
+sized hash index.  In practice, we try to estimate the required index size and
+allocate a suitable number of splitpoints phases immediately, to avoid
 expensive re-splitting during initial index build.)
 
 When S splitpoints exist altogether, the array entries hashm_spares[0]
 through hashm_spares[S] are valid; hashm_spares[S] records the current
 total number of overflow pages.  New overflow pages are created as needed
 at the end of the index, and recorded by incrementing hashm_spares[S].
-When it is time to create a new splitpoint's worth of bucket pages, we
+When it is time to create a new splitpoint phase's worth of bucket pages, we
 copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
 stored in the hashm_ovflpoint field of the meta page).  This has the
 effect of reserving the correct number of bucket pages at the end of the
@@ -101,7 +117,7 @@ We have to allow the case "greater than" because it's possible that during
 an index extension we crash after allocating filesystem space and before
 updating the metapage.  Note that on filesystems that allow "holes" in
 files, it's entirely likely that pages before the logical EOF are not yet
-allocated: when we allocate a new splitpoint's worth of bucket pages, we
+allocated: when we allocate a new splitpoint phase's worth of bucket pages, we
 physically zero the last such page to force the EOF up, and the first such
 page will be used immediately, but the intervening pages are not written
 until needed.
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index a3cae21..41ef654 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -49,7 +49,7 @@ bitno_to_blkno(HashMetaPage metap, uint32 ovflbitnum)
 	 * Convert to absolute page number by adding the number of bucket pages
 	 * that exist before this split point.
 	 */
-	return (BlockNumber) ((1 << i) + ovflbitnum);
+	return (BlockNumber) (_hash_get_totalbuckets(i) + ovflbitnum);
 }
 
 /*
@@ -67,14 +67,15 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
 	/* Determine the split number containing this page */
 	for (i = 1; i <= splitnum; i++)
 	{
-		if (ovflblkno <= (BlockNumber) (1 << i))
+		if (ovflblkno <= (BlockNumber) _hash_get_totalbuckets(i))
 			break;				/* oops */
-		bitnum = ovflblkno - (1 << i);
+		bitnum = ovflblkno - _hash_get_totalbuckets(i);
 
 		/*
 		 * bitnum has to be greater than number of overflow page added in
 		 * previous split point. The overflow page at this splitnum (i) if any
-		 * should start from ((2 ^ i) + metap->hashm_spares[i - 1] + 1).
+		 * should start from (_hash_get_totalbuckets(i) +
+		 * metap->hashm_spares[i - 1] + 1).
 		 */
 		if (bitnum > metap->hashm_spares[i - 1] &&
 			bitnum <= metap->hashm_spares[i])
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 61ca2ec..b5a1c7e 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -502,14 +502,15 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	Page		page;
 	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
+	uint32		spare_index;
 	uint32		i;
 
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
-	 * next power of 2, however, and always force at least 2 bucket pages. The
-	 * upper limit is determined by considerations explained in
+	 * total number of buckets which has to be allocated before using its
+	 * _hashm_spare element. However always force at least 2 bucket pages.
+	 * The upper limit is determined by considerations explained in
 	 * _hash_expandtable().
 	 */
 	dnumbuckets = num_tuples / ffactor;
@@ -518,11 +519,10 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	else if (dnumbuckets >= (double) 0x40000000)
 		num_buckets = 0x40000000;
 	else
-		num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
+		num_buckets = _hash_get_totalbuckets(_hash_spareindex(dnumbuckets));
 
-	log2_num_buckets = _hash_log2(num_buckets);
-	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
-	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
+	spare_index = _hash_spareindex(num_buckets);
+	Assert(spare_index < HASH_MAX_SPLITPOINTS);
 
 	page = BufferGetPage(buf);
 	if (initpage)
@@ -563,18 +563,23 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
-	 * blocks 1 to N.  The first freespace bitmap page is in block N+1. Since
-	 * N is a power of 2, we can set the masks this way:
+	 * blocks 1 to N.  The first freespace bitmap page is in block N+1.
 	 */
-	metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
-	metap->hashm_highmask = (num_buckets << 1) - 1;
+	metap->hashm_maxbucket = num_buckets - 1;
+
+	/*
+	 * Set highmask as next immediate ((2 ^ x) - 1), which should be sufficient
+	 * to cover num_buckets.
+	 */
+	metap->hashm_highmask = (1 << (_hash_log2(num_buckets + 1))) - 1;
+	metap->hashm_lowmask = (metap->hashm_highmask >> 1);
 
 	MemSet(metap->hashm_spares, 0, sizeof(metap->hashm_spares));
 	MemSet(metap->hashm_mapp, 0, sizeof(metap->hashm_mapp));
 
 	/* Set up mapping for one spare page after the initial splitpoints */
-	metap->hashm_spares[log2_num_buckets] = 1;
-	metap->hashm_ovflpoint = log2_num_buckets;
+	metap->hashm_spares[spare_index] = 1;
+	metap->hashm_ovflpoint = spare_index;
 	metap->hashm_firstfree = 0;
 
 	/*
@@ -773,25 +778,25 @@ restart_expand:
 	start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to allocate a new batch of bucket pages.
+	 * If the split point is increasing we need to allocate a new batch of
+	 * bucket pages.
 	 */
-	spare_ndx = _hash_log2(new_bucket + 1);
+	spare_ndx = _hash_spareindex(new_bucket + 1);
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
+		uint32		buckets_to_add;
+
 		Assert(spare_ndx == metap->hashm_ovflpoint + 1);
 
 		/*
-		 * The number of buckets in the new splitpoint is equal to the total
-		 * number already in existence, i.e. new_bucket.  Currently this maps
-		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.  We treat allocation of buckets as a
-		 * separate WAL-logged action.  Even if we fail after this operation,
-		 * won't leak bucket pages; rather, the next split will consume this
-		 * space. In any case, even without failure we don't use all the space
-		 * in one split operation.
+		 * We treat allocation of buckets as a separate WAL-logged action.
+		 * Even if we fail after this operation, won't leak bucket pages;
+		 * rather, the next split will consume this space. In any case, even
+		 * without failure we don't use all the space in one split
+		 * operation.
 		 */
-		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
+		buckets_to_add = _hash_get_totalbuckets(spare_ndx) - new_bucket;
+		if (!_hash_alloc_buckets(rel, start_nblkno, buckets_to_add))
 		{
 			/* can't split due to BlockNumber overflow */
 			_hash_relbuf(rel, buf_oblkno);
@@ -836,10 +841,9 @@ restart_expand:
 	}
 
 	/*
-	 * If the split point is increasing (hashm_maxbucket's log base 2
-	 * increases), we need to adjust the hashm_spares[] array and
-	 * hashm_ovflpoint so that future overflow pages will be created beyond
-	 * this new batch of bucket pages.
+	 * If the split point is increasing we need to adjust the hashm_spares[]
+	 * array and hashm_ovflpoint so that future overflow pages will be created
+	 * beyond this new batch of bucket pages.
 	 */
 	if (spare_ndx > metap->hashm_ovflpoint)
 	{
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 2e99719..d679cf0 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -150,6 +150,71 @@ _hash_log2(uint32 num)
 }
 
 /*
+ * _hash_spareindex -- returns spare index / global splitpoint phase of the
+ *					   bucket
+ */
+uint32
+_hash_spareindex(uint32 num_bucket)
+{
+	uint32		splitpoint_group;
+	uint32		splitpoint_phases;
+
+	splitpoint_group = _hash_log2(num_bucket);
+
+	if (splitpoint_group < HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE)
+		return splitpoint_group;
+
+	/* account for single-phase groups */
+	splitpoint_phases = HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE;
+
+	/* account for multi-phase groups before splitpoint_group */
+	splitpoint_phases +=
+		((splitpoint_group - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) <<
+		 HASH_SPLITPOINT_PHASE_BITS);
+
+	/* account for phases within current group */
+	splitpoint_phases +=
+		(((num_bucket - 1) >> (HASH_SPLITPOINT_PHASE_BITS + 1)) &
+		 HASH_SPLITPOINT_PHASE_MASK);	/* to 0-based value. */
+
+	return splitpoint_phases;
+}
+
+/*
+ *	_hash_get_totalbuckets -- returns total number of buckets allocated till
+ *							the given splitpoint phase.
+ */
+uint32
+_hash_get_totalbuckets(uint32 splitpoint_phase)
+{
+	uint32		splitpoint_group;
+	uint32		total_buckets;
+	uint32		phases_within_splitpoint_group;
+
+	if (splitpoint_phase < HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE)
+		return (1 << splitpoint_phase);
+
+	/* get splitpoint's group */
+	splitpoint_group = HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE;
+	splitpoint_group +=
+		((splitpoint_phase - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) >>
+		 HASH_SPLITPOINT_PHASE_BITS);
+
+	/* account for buckets before splitpoint_group */
+	total_buckets = (1 << (splitpoint_group - 1));
+
+	/* account for buckets within splitpoint_group */
+	phases_within_splitpoint_group =
+		(((splitpoint_phase - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) &
+		  HASH_SPLITPOINT_PHASE_MASK) + 1);		/* from 0-based to 1-based */
+	total_buckets +=
+		(((1 << (splitpoint_group - 1)) >> HASH_SPLITPOINT_PHASE_BITS) *
+		 phases_within_splitpoint_group);
+
+	return total_buckets;
+}
+
+/*
  * _hash_checkpage -- sanity checks on the format of all hash pages
  *
  * If flags is not zero, it is a bitwise OR of the acceptable values of
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index eb1df57..fcc3957 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -36,7 +36,7 @@ typedef uint32 Bucket;
 #define InvalidBucket	((Bucket) 0xFFFFFFFF)
 
 #define BUCKET_TO_BLKNO(metap,B) \
-		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
+		((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_spareindex((B)+1)-1] : 0)) + 1)
 
 /*
  * Special space for hash index pages.
@@ -158,7 +158,8 @@ typedef HashScanOpaqueData *HashScanOpaque;
 #define HASH_METAPAGE	0		/* metapage is always block 0 */
 
 #define HASH_MAGIC		0x6440640
-#define HASH_VERSION	2		/* 2 signifies only hash key value is stored */
+#define HASH_VERSION	3		/* 3 signifies multi-phased bucket allocation
+								 * to reduce doubling */
 
 /*
  * spares[] holds the number of overflow pages currently allocated at or
@@ -176,13 +177,28 @@ typedef HashScanOpaqueData *HashScanOpaque;
  *
  * The limitation on the size of spares[] comes from the fact that there's
  * no point in having more than 2^32 buckets with only uint32 hashcodes.
+ * (Note: The value of HASH_MAX_SPLITPOINTS which is the size of spares[] is
+ * adjusted in such a way to accommodate multi phased allocation of buckets
+ * after HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE).
+ *
  * There is no particular upper limit on the size of mapp[], other than
  * needing to fit into the metapage.  (With 8K block size, 128 bitmaps
  * limit us to 64 GB of overflow space...)
  */
-#define HASH_MAX_SPLITPOINTS		32
 #define HASH_MAX_BITMAPS			128
 
+#define HASH_SPLITPOINT_PHASE_BITS	2
+#define HASH_SPLITPOINT_PHASES_PER_GRP	(1 << HASH_SPLITPOINT_PHASE_BITS)
+#define HASH_SPLITPOINT_PHASE_MASK		(HASH_SPLITPOINT_PHASES_PER_GRP - 1)
+#define HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE	10
+
+/* defines max number of splitpoint phases a hash index can have */
+#define HASH_MAX_SPLITPOINT_GROUP	32
+#define HASH_MAX_SPLITPOINTS \
+	(((HASH_MAX_SPLITPOINT_GROUP - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) * \
+	  HASH_SPLITPOINT_PHASES_PER_GRP) + \
+	 HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE)
+
 typedef struct HashMetaPageData
 {
 	uint32		hashm_magic;	/* magic no. for hash tables */
@@ -382,6 +398,8 @@ extern uint32 _hash_datum2hashkey_type(Relation rel, Datum key, Oid keytype);
 extern Bucket _hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
 					 uint32 highmask, uint32 lowmask);
 extern uint32 _hash_log2(uint32 num);
+extern uint32 _hash_spareindex(uint32 num_bucket);
+extern uint32 _hash_get_totalbuckets(uint32 splitpoint_phase);
 extern void _hash_checkpage(Relation rel, Buffer buf, int flags);
 extern uint32 _hash_get_indextuple_hashkey(IndexTuple itup);
 extern bool _hash_convert_tuple(Relation index,
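
For anyone who wants to check the allocation arithmetic outside the server,
below is a minimal standalone sketch of the _hash_get_totalbuckets()
calculation from the patch above. It is illustrative only: total_buckets(),
main() and the shortened macro names are local stand-ins, with the constant
values copied from the patch's hash.h defines (the first 10 splitpoints keep
one phase each, later splitpoints get 4 phases).

/*
 * Standalone sketch (not part of the patch): reimplements the
 * _hash_get_totalbuckets() arithmetic with local copies of the hash.h
 * defines, so the phase-wise growth can be verified outside the server.
 */
#include <stdio.h>
#include <stdint.h>

#define SPLITPOINT_PHASE_BITS            2
#define SPLITPOINT_PHASES_PER_GRP        (1 << SPLITPOINT_PHASE_BITS)
#define SPLITPOINT_PHASE_MASK            (SPLITPOINT_PHASES_PER_GRP - 1)
#define SPLITPOINT_GROUPS_WITH_ONE_PHASE 10

static uint32_t
total_buckets(uint32_t phase)
{
	uint32_t	group;
	uint32_t	buckets;
	uint32_t	phases_done;

	/* early splitpoints keep the old behaviour: one phase, plain doubling */
	if (phase < SPLITPOINT_GROUPS_WITH_ONE_PHASE)
		return (uint32_t) 1 << phase;

	/* group to which this phase belongs */
	group = SPLITPOINT_GROUPS_WITH_ONE_PHASE +
		((phase - SPLITPOINT_GROUPS_WITH_ONE_PHASE) >> SPLITPOINT_PHASE_BITS);

	/* buckets that already existed when the group started */
	buckets = (uint32_t) 1 << (group - 1);

	/* each completed phase adds a quarter of that count */
	phases_done = ((phase - SPLITPOINT_GROUPS_WITH_ONE_PHASE) &
				   SPLITPOINT_PHASE_MASK) + 1;
	buckets += (((uint32_t) 1 << (group - 1)) >> SPLITPOINT_PHASE_BITS) *
		phases_done;

	return buckets;
}

int
main(void)
{
	uint32_t	phase;

	/*
	 * Phases 8..14 straddle the switch from doubling to quarter-wise
	 * growth: 256, 512, 640, 768, 896, 1024, 1280.
	 */
	for (phase = 8; phase <= 14; phase++)
		printf("phase %2u -> %u buckets\n",
			   (unsigned) phase, (unsigned) total_buckets(phase));
	return 0;
}

With these values the index reaches 640, 768 and 896 buckets on the way from
512 to 1024, instead of doubling in a single step.
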
#36Robert Haas
robertmhaas@gmail.com
In reply to: Mithun Cy (#35)
Re: [POC] A better way to expand hash indexes.

On Sat, Apr 1, 2017 at 3:29 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

On Sat, Apr 1, 2017 at 12:31 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

Also adding a patch which implements the 2nd way.

Sorry, I forgot to add sortbuild_hash patch, which also needs similar
changes for the hash_mask.

Committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#37Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Robert Haas (#36)
1 attachment(s)
Re: [POC] A better way to expand hash indexes.

On Tue, Apr 4, 2017 at 9:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Committed.

Thanks Robert,

And also, sorry: an unfortunate thing happened in the last patch. While
fixing one of the review comments, a variable disappeared from the
equation in _hash_spareindex().

        splitpoint_phases +=
-               (((num_bucket - 1) >> (HASH_SPLITPOINT_PHASE_BITS + 1)) &
+               (((num_bucket - 1) >>
+                 (splitpoint_group - (HASH_SPLITPOINT_PHASE_BITS + 1))) &
                 HASH_SPLITPOINT_PHASE_MASK);   /* to 0-based value. */

I wanted the most significant 3 bits, but while fixing comments in
patch11 I unknowingly removed splitpoint_group from the equation.
Extremely sorry for the mistake, and thanks to Ashutosh Sharma for
pointing it out.

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Attachments:

_hash_spareindex_defect.patchapplication/octet-stream; name=_hash_spareindex_defect.patchDownload
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index d679cf0..037582b 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -174,7 +174,8 @@ _hash_spareindex(uint32 num_bucket)
 
 	/* account for phases within current group */
 	splitpoint_phases +=
-		(((num_bucket - 1) >> (HASH_SPLITPOINT_PHASE_BITS + 1)) &
+		(((num_bucket - 1) >>
+		  (splitpoint_group - (HASH_SPLITPOINT_PHASE_BITS + 1))) &
 		 HASH_SPLITPOINT_PHASE_MASK);	/* to 0-based value. */
 
 	return splitpoint_phases;
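
To make the corrected term easier to follow, here is a small, self-contained
sketch (again with shortened local stand-ins for the hash.h macros, so not the
committed code). For a bucket number in group g, i.e.
2^(g-1) < num_bucket <= 2^g, the value (num_bucket - 1) has g significant
bits; shifting it right by (g - (HASH_SPLITPOINT_PHASE_BITS + 1)) keeps its
three most significant bits, and the mask then selects the phase (0..3)
within the group:

/*
 * Illustrative sketch (not part of the fix): the corrected
 * phase-within-group computation, with local stand-ins for the hash.h
 * macros and _hash_log2().
 */
#include <stdio.h>
#include <stdint.h>

#define PHASE_BITS       2
#define PHASE_MASK       ((1 << PHASE_BITS) - 1)
#define ONE_PHASE_GROUPS 10

/* ceil(log2(num)), the same convention as _hash_log2() */
static uint32_t
ceil_log2(uint32_t num)
{
	uint32_t	i = 0;
	uint32_t	limit = 1;

	while (limit < num)
	{
		limit <<= 1;
		i++;
	}
	return i;
}

static uint32_t
spare_index(uint32_t num_bucket)
{
	uint32_t	group = ceil_log2(num_bucket);
	uint32_t	phases;

	if (group < ONE_PHASE_GROUPS)
		return group;

	/* one phase for each of the early, single-phase groups */
	phases = ONE_PHASE_GROUPS;

	/* four phases for every completed multi-phase group before this one */
	phases += (group - ONE_PHASE_GROUPS) << PHASE_BITS;

	/* corrected term: the shift depends on the group, not on a constant */
	phases += ((num_bucket - 1) >> (group - (PHASE_BITS + 1))) & PHASE_MASK;

	return phases;
}

int
main(void)
{
	/* group 10 runs from bucket 513 to 1024 in four phases of 128 buckets */
	uint32_t	samples[] = {513, 640, 641, 768, 769, 896, 897, 1024, 1025};
	size_t		i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		printf("bucket %4u -> spare index %u\n",
			   (unsigned) samples[i], (unsigned) spare_index(samples[i]));
	return 0;
}

With the constant shift from the defective version, bucket 641 would compute
(640 >> 3) & 3 = 0 and land back on spare index 10 instead of 11, which is the
mapping error the patch above corrects.
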
#38Robert Haas
robertmhaas@gmail.com
In reply to: Mithun Cy (#37)
Re: [POC] A better way to expand hash indexes.

On Tue, Apr 4, 2017 at 6:33 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

On Tue, Apr 4, 2017 at 9:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Committed.

Thanks Robert,

And also, sorry: an unfortunate thing happened in the last patch. While
fixing one of the review comments, a variable disappeared from the
equation in _hash_spareindex().

splitpoint_phases +=
-               (((num_bucket - 1) >> (HASH_SPLITPOINT_PHASE_BITS + 1)) &
+               (((num_bucket - 1) >>
+                 (splitpoint_group - (HASH_SPLITPOINT_PHASE_BITS + 1))) &
HASH_SPLITPOINT_PHASE_MASK);   /* to 0-based value. */

I wanted the most significant 3 bits, but while fixing comments in
patch11 I unknowingly removed splitpoint_group from the equation.
Extremely sorry for the mistake, and thanks to Ashutosh Sharma for
pointing it out.

Ugh, OK, committed that also. Please try to be more careful in the future.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
