[POC] A better way to expand hash indexes.
Hi all,
As of now, we expand the hash index by doubling the number of bucket
blocks. But unfortunately, those blocks will not be used immediately.
So I thought that if we can defer bucket block allocation by some
factor, the hash index size can grow much more efficiently. I have
written a POC patch which does the following:
Say at overflow point 'i' we need to add "x = 2^(i-1)" new buckets as
per the old code. I think we can do this addition of buckets in a
controlled way. Instead of adding all 'x' bucket blocks at a time,
the patch adds x/4 blocks at a time. And, once those blocks are
consumed, it adds the next installment of x/4 blocks. This continues
until all 'x' blocks of the overflow point 'i' are allocated. My test
results show that the index size grows more efficiently with the above
patch.
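As a rough sketch of the schedule described above (the function names here are illustrative, not from the patch): at overflow point i the old code allocates all x = 2^(i-1) bucket blocks up front, while the proposal hands them out in four quarter-sized installments.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch, not patch code: number of bucket blocks the old
 * scheme adds at overflow point i, and the size of each of the four
 * installments the proposed scheme would use instead. */
static uint32_t
buckets_at_overflow_point(uint32_t i)
{
    return ((uint32_t) 1) << (i - 1);   /* x = 2^(i-1) */
}

static uint32_t
blocks_per_installment(uint32_t i)
{
    uint32_t x = buckets_at_overflow_point(i);

    return (x >= 4) ? x / 4 : x;        /* a quarter of x, once x is large enough */
}
```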
Note: This patch is just a POC. It can have bugs and I have to change
comments in the code, and README which are related to new changes.
Test:
create table t1(t int);
create index i1 on t1 using hash(t);
And then, incrementally add rows as below.
insert into t1 select generate_series(1, 10000000);
records     base index   patch index
(million)   size (MB)    size (MB)
 10           384           384
 20           768           768
 30          1417          1161
 40          1531          1531
 50          2556          1788
 60          2785          2273
 70          2963          2709
 80          3060          3061
 90          5111          3575
100          5111          3575
To implement such an incremental addition of bucket blocks I have to
increase the size of the hashm_spares array in the meta page by four
times. Also, the mapping methods which map a total number of buckets
to a split-point position in the hashm_spares array need to be
changed. These changes create backward compatibility issues.
Implementation Details in brief:
=======================
Each element of hashm_spares (which we call an overflow point) is
expanded into 4 slots {0, 1, 2, 3}. If 'x' (a power-of-2 number) is
the total number of buckets to be added before the overflow point, we
add only a quarter of it (x/4) to each slot; once we have consumed the
previously added blocks, we add the next quarter of buckets, and so
on. As in the old code, the new hashm_spares[i] stores the total
number of overflow pages allocated between those bucket allocations.
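A sketch of the resulting layout, under the assumptions above (group 0 holds 4 buckets; each later group doubles the total in 4 equal installments); this helper is hypothetical, not from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): total buckets allocated
 * once slot 's' (0..3) of splitpoint group 'g' has been filled.
 * Group 0 adds one bucket per slot; group g >= 1 starts with 2^(g+1)
 * buckets already present and adds 2^(g-1) buckets per slot. */
static uint32_t
total_buckets_after_slot(uint32_t g, uint32_t s)
{
    if (g == 0)
        return s + 1;               /* one bucket per slot */
    return (1U << (g + 1)) + (s + 1) * (1U << (g - 1));
}
```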
--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
Attachments:
expand_hashbucket_efficiently_01 (application/octet-stream)
On Fri, Feb 17, 2017 at 7:21 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
To implement such an incremental addition of bucket blocks I have to
increase the size of array hashm_spares in meta page by four times.
Also, mapping methods which map a total number of buckets to a
split-point position of hashm_spares array, need to be changed. These
changes create backward compatibility issues.
How will the high and low mask calculations work in this new strategy?
Till now they have always worked on the doubling strategy, and I don't
see that you have changed anything related to that code. Check the
below places.
_hash_metapinit()
{
..
/*
* We initialize the index with N buckets, 0 .. N-1, occupying physical
* blocks 1 to N. The first freespace bitmap page is in block N+1. Since
* N is a power of 2, we can set the masks this way:
*/
metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
metap->hashm_highmask = (num_buckets << 1) - 1;
..
}
_hash_expandtable()
{
..
if (new_bucket > metap->hashm_highmask)
{
/* Starting a new doubling */
metap->hashm_lowmask = metap->hashm_highmask;
metap->hashm_highmask = new_bucket | metap->hashm_lowmask;
}
..
}
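As a side note, the doubling behaviour of the two quoted fragments can be exercised with a small standalone sketch (the struct and function names here are mine, not PostgreSQL's):

```c
#include <assert.h>
#include <stdint.h>

/* Standalone sketch of the mask maintenance in the quoted code paths;
 * struct and function names are illustrative. */
typedef struct
{
    uint32_t maxbucket;
    uint32_t highmask;
    uint32_t lowmask;
} MaskState;

static void
init_masks(MaskState *m, uint32_t num_buckets)  /* num_buckets: power of 2 */
{
    m->maxbucket = m->lowmask = num_buckets - 1;
    m->highmask = (num_buckets << 1) - 1;
}

static void
add_one_bucket(MaskState *m)
{
    uint32_t new_bucket = ++m->maxbucket;

    if (new_bucket > m->highmask)
    {
        /* starting a new doubling */
        m->lowmask = m->highmask;
        m->highmask = new_bucket | m->lowmask;
    }
}
```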
Till now, we have worked hard to avoid changing the page format in a
backward-incompatible way, so it will be better if we could find some
way of doing this without changing the meta page format in a
backward-incompatible way. Have you considered storing some
information in shared memory based on which we can decide what
percentage of buckets are allocated in the current table half? I think
we might be able to reconstruct this information after crash recovery
as well.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Thanks, Amit
On Mon, Feb 20, 2017 at 9:51 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
How will high and lowmask calculations work in this new strategy?
Till now they always work on doubling strategy and I don't see you
have changed anything related to that code. Check below places.
It is important that the mask has to be (2^x) - 1 if we are to retain
the same hash map function, so the mask variables will take the same
values as before. The only place I think we need a change is
_hash_metapinit(); unfortunately, I did not test the case where we
build the hash index on already existing tuples. I have now fixed that
in the latest patch.
Till now, we have worked hard for not changing the page format in a
backward incompatible way, so it will be better if we could find some
way of doing this without changing the meta page format in a backward
incompatible way.
We are not adding any new variables or deleting any; we increase the
size of hashm_spares, and hence the mapping functions have to be
adjusted. The problem is that the block allocation, and its
management, is based on the fact that all of the buckets (2^x in
number) belonging to a particular split-point are allocated at once
and together. hashm_spares is used to record those allocations, and
that is used further by the map functions to reach a particular block
in the file. If we want to change the way we allocate the buckets,
then hashm_spares will change, and hence the mapping functions. So I
do not think we can avoid the incompatibility issue.
One thing I can think of is to increase the hashm_version of the hash
index; then for old indexes we can continue to use the doubling method
and its mapping, and for new indexes we can use the new way as above.
Have you considered to store some information in
shared memory based on which we can decide how much percentage of
buckets are allocated in current table half? I think we might be able
to construct this information after crash recovery as well.
I think all of the above data has to be persistent. I am not able to
understand what should/can be stored in shared buffers. Can you please
correct me if I am wrong?
--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
Attachments:
expand_hashbucket_efficiently_02 (application/octet-stream)
On 2/21/17 4:58 AM, Mithun Cy wrote:
This patch does not apply at cccbdde:
$ patch -p1 < ../other/expand_hashbucket_efficiently_02
patching file src/backend/access/hash/hashovfl.c
Hunk #1 succeeded at 49 (offset 1 line).
Hunk #2 succeeded at 67 (offset 1 line).
patching file src/backend/access/hash/hashpage.c
Hunk #1 succeeded at 502 with fuzz 1 (offset 187 lines).
Hunk #2 succeeded at 518 with fuzz 2 (offset 168 lines).
Hunk #3 succeeded at 562 (offset 163 lines).
Hunk #4 succeeded at 744 (offset 124 lines).
Hunk #5 FAILED at 774.
Hunk #6 succeeded at 869 (offset 19 lines).
Hunk #7 succeeded at 1450 (offset 242 lines).
Hunk #8 succeeded at 1464 (offset 242 lines).
Hunk #9 succeeded at 1505 (offset 242 lines).
1 out of 9 hunks FAILED -- saving rejects to file
src/backend/access/hash/hashpage.c.rej
patching file src/backend/access/hash/hashutil.c
Hunk #1 succeeded at 150 (offset 1 line).
patching file src/include/access/hash.h
Hunk #2 succeeded at 180 (offset 12 lines).
Hunk #3 succeeded at 382 (offset 18 lines).
It does apply with fuzz on 2b32ac2, so it looks like c11453c and
subsequent commits are the cause. They represent a fairly substantial
change to hash indexes by introducing WAL logging so I think you should
reevaluate your patches to be sure they still function as expected.
Marked "Waiting on Author".
--
-David
david@pgmasters.net
On Thu, Mar 16, 2017 at 10:55 PM, David Steele <david@pgmasters.net> wrote:
It does apply with fuzz on 2b32ac2, so it looks like c11453c and
subsequent commits are the cause. They represent a fairly substantial
change to hash indexes by introducing WAL logging so I think you should
reevaluate your patches to be sure they still function as expected.
Thanks, David. Here is the new, improved patch. I have also corrected
pageinspect's test output, and added notes in the README regarding the
new way of adding bucket pages efficiently in the hash index. I also
did some more tests with pgbench read-only and read-write:
there is no performance impact due to the patch, and the growth of the
index size has become much more efficient, as in the numbers posted in
the initial proposal mail.
--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
Attachments:
expand_hashbucket_efficiently_03.patch (application/octet-stream)
On Sat, Mar 18, 2017 at 10:59 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
To make this work, I think the calculations you have introduced are
not so easy to understand. For example, it took me quite some time to
understand how the below function works to compute the location in
hash spares.
+uint32
+_hash_spareindex(uint32 num_bucket)
+{
..
+ /*
+ * The first 4 bucket belongs to corresponding first 4 splitpoint phases.
+ */
+ if (num_bucket <= 4)
+ return (num_bucket - 1); /* converted to base 0. */
+ splitpoint_group = _hash_log2(num_bucket) - 2; /* The are 4 buckets in
..
+ /*
+ * bucket's global splitpoint phase = total number of split point phases
+ * until its splitpoint group - splitpoint phase within this splitpoint
+ * group but after buckets own splitpoint phase.
+ */
+ tbuckets = (1 << (splitpoint_group + 2));
+ phases_beyond_bucket =
+ (tbuckets - num_bucket) / (1 << (splitpoint_group - 1));
+ return (((splitpoint_group + 1) << 2) - phases_beyond_bucket) - 1;
+}
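Filling in the elided parts from context, the quoted function can be sketched self-contained like this; the hash_log2 stand-in mirrors PostgreSQL's ceiling-log2 helper, and this is a reading aid, not the patch itself:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's _hash_log2(): smallest i with 2^i >= num. */
static uint32_t
hash_log2(uint32_t num)
{
    uint32_t i, limit;

    for (i = 0, limit = 1; limit < num; i++, limit <<= 1)
        ;
    return i;
}

/* Sketch of the quoted _hash_spareindex(): map a bucket count to its
 * 0-based splitpoint phase, with 4 phases per splitpoint group. */
static uint32_t
hash_spareindex(uint32_t num_bucket)
{
    uint32_t splitpoint_group, tbuckets, phases_beyond_bucket;

    /* the first 4 buckets map to the first 4 splitpoint phases */
    if (num_bucket <= 4)
        return num_bucket - 1;      /* converted to base 0 */

    splitpoint_group = hash_log2(num_bucket) - 2;
    tbuckets = 1 << (splitpoint_group + 2);
    phases_beyond_bucket =
        (tbuckets - num_bucket) / (1 << (splitpoint_group - 1));
    return (((splitpoint_group + 1) << 2) - phases_beyond_bucket) - 1;
}
```

For example, bucket counts 5 through 8 fall into global phases 4 through 7 (the four phases of group 1).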
I am not sure if it is just a matter of better comments to explain it
in a simple way or maybe we can try to find some simpler mechanism to
group the split into four (or more) equal parts. I think if someone
else can read and share their opinion it could be helpful. Another
idea could be to make hashm_spares a two-dimensional array
hashm_spares[32][4] where the first dimension will indicate the split
point and second will indicate the sub-split number. I am not sure
whether it will be simpler or complex than the method used in the
proposed patch, but I think we should think a bit more to see if we
can come up with some simple technique to solve this problem.
+ * allocate them at once. Each splitpoint group will have 4 slots, we
+ * distribute the buckets equally among them. So we allocate only one
+ * forth of total buckets in new splitpoint group at time to consume
+ * one phase after another.
spelling.
/forth/fourth
/at time/at a time
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi Amit, Thanks for the review,
On Mon, Mar 20, 2017 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
idea could be to make hashm_spares a two-dimensional array
hashm_spares[32][4] where the first dimension will indicate the split
point and second will indicate the sub-split number. I am not sure
whether it will be simpler or complex than the method used in the
proposed patch, but I think we should think a bit more to see if we
can come up with some simple technique to solve this problem.
I think making it a 2-dimensional array will not be very useful; in
fact, we already treat the elements of the given array as
2-dimensional now.
The main concern of yours, I think, is the calculation steps to find
the phase of the splitpoint group the bucket belongs to.
+ tbuckets = (1 << (splitpoint_group + 2));
+ phases_beyond_bucket =
+ (tbuckets - num_bucket) / (1 << (splitpoint_group - 1));
+ return (((splitpoint_group + 1) << 2) - phases_beyond_bucket) - 1;
Thinking about it further: we allocate 2^x buckets in each phase of
any splitpoint group, and we have 4 such phases in each group; at each
new splitpoint group we again double the number of buckets allocated.
So bucket numbers have a pattern in their bit representation which
tells which phase of allocation they belong to, as below:
===========
Group 0 -- bits 0, 1 define which phase of the group each bucket belongs to.
0 -- 00000000
1 -- 00000001
2 -- 00000010
3 -- 00000011
===========
Group 1 -- bits 0, 1 define which phase of the group each bucket belongs to.
4 -- 00000100
5 -- 00000101
6 -- 00000110
7 -- 00000111
===========
Group 2 -- bits 1, 2 define which phase of the group each bucket belongs to.
8 -- 00001000
9 -- 00001001
10 -- 00001010
11 -- 00001011
12 -- 00001100
13 -- 00001101
14 -- 00001110
15 -- 00001111
===========
Group 3 -- bits 2, 3 define which phase of the group each bucket belongs to.
16 -- 00010000
17 -- 00010001
18 -- 00010010
19 -- 00010011
20 -- 00010100
21 -- 00010101
22 -- 00010110
23 -- 00010111
24 -- 00011000
25 -- 00011001
26 -- 00011010
27 -- 00011011
28 -- 00011100
29 -- 00011101
30 -- 00011110
31 -- 00011111
============
So we can say that, given a bucket x of group n > 0, bits (n-1, n)
define which phase of the group it belongs to. I see an opportunity
here to completely simplify the above calculation into a simple
bitwise ANDing of the (shifted) bucket number with a two-bit mask to
get the phase of allocation of bucket x.
The formula can be: (x >> (splitpoint_group - 1)) & 0x3 = phase of bucket
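A quick sketch to check that this shortcut matches the bit patterns tabulated above (bucket numbers are 0-based and the group n >= 1; the function name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative check of the proposed shortcut: for a 0-based bucket
 * number x allocated in splitpoint group n >= 1, the phase within the
 * group is just bits (n-1, n) of x. */
static uint32_t
phase_within_group(uint32_t x, uint32_t n)
{
    return (x >> (n - 1)) & 0x3;
}
```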
Does this satisfy your concern?
--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
On Mon, Mar 20, 2017 at 8:58 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
Hi Amit, Thanks for the review,
On Mon, Mar 20, 2017 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
idea could be to make hashm_spares a two-dimensional array
hashm_spares[32][4] where the first dimension will indicate the split
point and second will indicate the sub-split number. I am not sure
whether it will be simpler or complex than the method used in the
proposed patch, but I think we should think a bit more to see if we
can come up with some simple technique to solve this problem.I think making it a 2-dimensional array will not be any useful in fact
we really treat the given array 2-dimensional elements now.
Sure, I was telling you based on that. If you are implicitly treating
it as 2-dimensional array, it might be easier to compute the array
offsets.
The main concern of yours I think is the calculation steps to find the
phase of the splitpoint group the bucket belongs to.
+ tbuckets = (1 << (splitpoint_group + 2));
+ phases_beyond_bucket =
+ (tbuckets - num_bucket) / (1 << (splitpoint_group - 1));
+ return (((splitpoint_group + 1) << 2) - phases_beyond_bucket) - 1;
It is not only about the above calculation, but also what the patch is
doing in the function _hash_get_tbuckets(). By the way, the function
name also seems unclear (mainly *tbuckets* in the name).
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Mar 21, 2017 at 8:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Sure, I was telling you based on that. If you are implicitly treating
it as 2-dimensional array, it might be easier to compute the array
offsets.
The above sentence looks incomplete.
If you are implicitly treating it as a 2-dimensional array, it might
be easier to compute the array offsets if you also explicitly treat it
as a 2-dimensional array.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi Amit, please find the new patch.
On Tue, Mar 21, 2017 at 8:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
It is not only about above calculation, but also what the patch is
doing in function _hash_get_tbuckets(). By the way function name also
seems unclear (mainly *tbuckets* in the name).
Fixed. I have introduced some macros for readability and added more
comments to explain why some calculations are made. Please let me know
if you think more improvements are needed.
spelling.
/forth/fourth
/at time/at a time
Thanks, fixed.
--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
Attachments:
expand_hashbucket_efficiently_04.patch (application/octet-stream)
On Fri, Mar 24, 2017 at 1:22 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
Hi Amit please find the new patch
The pageinspect.sgml has an example which shows the output of
hash_metapage_info(). Since we increase the spares array and
eventually ovflpoint, I have updated the example with the
corresponding values.
--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
Attachments:
expand_hashbucket_efficiently_05.patch (application/octet-stream)
On Tue, Mar 21, 2017 at 8:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Sure, I was telling you based on that. If you are implicitly treating
it as 2-dimensional array, it might be easier to compute the array
offsets.
I think calculating the spares offset will not become much simpler
anyway; we still need to calculate the split group and split phase
separately. I have attached a patch to show what a 2-dimensional
spares code looks like and where all we need changes.
--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
Attachments:
expand_hashbucket_efficiently_06_spares_2dimesion.patch (application/octet-stream)
expand_hashbucket_efficiently_06_spares_1dimension.patch (application/octet-stream)
On Sat, Mar 25, 2017 at 10:13 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
I think the one-dimensional patch has fewer places to touch, so that
looks better to me. However, I think there is still hard coding and
there are assumptions in the code which we should try to improve.
1.
+ /*
+ * The first 4 bucket belongs to first splitpoint group 0. And since group
+ * 0 have 4 = 2^2 buckets, we double them in group 1. So total buckets
+ * after group 1 is 8 = 2^3. Then again at group 2 we add another 2^3
+ * buckets to double the total number of buckets, which will become 2^4. I
+ * think by this time we can see a pattern which say if num_bucket > 4
+ * splitpoint group = log2(num_bucket) - 2
+ */
+ if (num_bucket <= 4)
+ splitpoint_group = 0; /* converted to base 0. */
+ else
+ splitpoint_group = _hash_log2(num_bucket) - 2;
This patch defines split point group zero as having four buckets, and
the above calculation is done based on that. I feel you can define it
like #define Buckets_First_Split_Group 4 and then use it in the above
code. Also, in the else part the number 2 looks awkward; can we define
it as log2_buckets_first_group = _hash_log2(Buckets_First_Split_Group)
or something like that? I think that way the code will look neat. I
don't like the way the above comment is worded, even though it is
helpful for understanding the calculation. If you want, you can add
such a comment in the file header; here it looks out of place.
2.
+power-of-2 groups, called "split points" in the code. That means on every new
+split points we double the existing number of buckets.
split points/split point
3.
+ * which phase of allocation the bucket_num belogs to with in the group.
/belogs/belongs
I have still not completely reviewed the patch, as I have run out of
time for today.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Thanks, Amit for the review.
On Sat, Mar 25, 2017 at 7:03 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I think one-dimensional patch has fewer places to touch, so that looks
better to me. However, I think there is still hard coding and
assumptions in code which we should try to improve.
Great! I will continue with the spares 1-dimensional improvement.
1.
+ /*
+ * The first 4 bucket belongs to first splitpoint group 0. And since group
+ * 0 have 4 = 2^2 buckets, we double them in group 1. So total buckets
+ * after group 1 is 8 = 2^3. Then again at group 2 we add another 2^3
+ * buckets to double the total number of buckets, which will become 2^4. I
+ * think by this time we can see a pattern which say if num_bucket > 4
+ * splitpoint group = log2(num_bucket) - 2
+ */
+ if (num_bucket <= 4)
+ splitpoint_group = 0; /* converted to base 0. */
+ else
+ splitpoint_group = _hash_log2(num_bucket) - 2;
This patch defines split point group zero has four buckets and based
on that above calculation is done. I feel you can define it like
#define Buckets_First_Split_Group 4 and then use it in above code.
Also, in else part number 2 looks awkward, can we define it as
log2_buckets_first_group = _hash_log2(Buckets_First_Split_Group) or
some thing like that. I think that way code will look neat. I don't
like the way above comment is worded even though it is helpful to
understand the calculation. If you want, then you can add such a
comment in file header, here it looks out of place.
I have removed those comments and defined a new macro which maps a
bucket to its split group:
#define BUCKET_TO_SPLITPOINT_GRP(num_bucket) \
((num_bucket <= Buckets_First_Split_Group) ? 0 : \
(_hash_log2(num_bucket) - 2))
I could not find a better way to explain the minus 2 than: "The
splitpoint group of a given bucket can be taken as
(_hash_log2(bucket) - 2); we subtract 2 because each group has
2 ^ (x + 2) buckets." I have now added that to the existing comments,
which I think should make it a little better.
Adding comments about the spares array in hashutils.c's file header
did not seem right to me; I think the README has some details about
the same.
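For illustration, the macro above can be exercised standalone with a ceiling-log2 helper (hash_log2 here is my stand-in for PostgreSQL's _hash_log2):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's _hash_log2(): smallest i with 2^i >= num. */
static uint32_t
hash_log2(uint32_t num)
{
    uint32_t i, limit;

    for (i = 0, limit = 1; limit < num; i++, limit <<= 1)
        ;
    return i;
}

/* The macro from the mail, using the stand-in helper: map a bucket
 * count to its splitpoint group, with group 0 holding 4 buckets. */
#define Buckets_First_Split_Group 4
#define BUCKET_TO_SPLITPOINT_GRP(num_bucket) \
    (((num_bucket) <= Buckets_First_Split_Group) ? 0 : \
     (hash_log2(num_bucket) - 2))
```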
2.
+power-of-2 groups, called "split points" in the code. That means on every new
+split points we double the existing number of buckets.
split points/split point
Fixed.
3.
3.
+ * which phase of allocation the bucket_num belogs to with in the group.
/belogs/belongs
Fixed
--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
Attachments:
expand_hashbucket_efficiently_07.patch (application/octet-stream)
On Sun, Mar 26, 2017 at 11:26 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
@@ -563,18 +563,20 @@ _hash_init_metabuffer(Buffer buf, double
num_tuples, RegProcedure procid,\
{
..
else
- num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
+ num_buckets = _hash_get_totalbuckets(_hash_spareindex(dnumbuckets));
..
..
- metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
- metap->hashm_highmask = (num_buckets << 1) - 1;
+ metap->hashm_maxbucket = num_buckets - 1;
+
+ /* set hishmask, which should be sufficient to cover num_buckets. */
+ metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;
+ metap->hashm_lowmask = (metap->hashm_highmask >> 1);
}
I think we can't change the number of buckets to be created, or the
lowmask and highmask calculation, here without modifying
_h_spoolinit(), because it sorts the data to be inserted based on the
hash key, which in turn depends on the number of buckets that we are
going to create during the create index operation. We either need to
allow the create index operation to still always create buckets in
power-of-two fashion, or we need to update _h_spoolinit() according to
the new computation. One minor drawback of using the power-of-two
scheme for creation of buckets during create index is that it can lead
to wastage of space and will be inconsistent with what the patch does
during the split operation.
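The dependency described here comes from how a hash value is mapped to a bucket through the masks; a sketch mirroring the logic of PostgreSQL's _hash_hashkey2bucket() (the function signature here is simplified):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of how a hash value maps to a bucket via the meta-page masks,
 * mirroring PostgreSQL's _hash_hashkey2bucket(): mask with highmask,
 * and fall back to lowmask for buckets that do not exist yet. This is
 * why the mask setup and the number of buckets created at build time
 * must agree. */
static uint32_t
hashkey2bucket(uint32_t hashkey, uint32_t maxbucket,
               uint32_t highmask, uint32_t lowmask)
{
    uint32_t bucket = hashkey & highmask;

    if (bucket > maxbucket)
        bucket = bucket & lowmask;
    return bucket;
}
```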
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Mar 27, 2017 at 11:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Few more comments:
1.
@@ -149,6 +149,84 @@ _hash_log2(uint32 num)
return i;
}
+#define SPLITPOINT_PHASES_PER_GRP 4
+#define SPLITPOINT_PHASE_MASK (SPLITPOINT_PHASES_PER_GRP - 1)
+#define Buckets_First_Split_Group 4
The defines should be at the beginning of the file.
2.
+/*
+ * At splitpoint group 0 we have 2 ^ (0 + 2) = 4 buckets, then at splitpoint
+ * group 1 we have 2 ^ (1 + 2) = 8 total buckets. As the doubling continues at
+ * splitpoint group "x" we will have 2 ^ (x + 2) total buckets. Total buckets
+ * before x splitpoint group will be 2 ^ (x + 1). At each phase of allocation
+ * within splitpoint group we add 2 ^ (x - 1) buckets, as we have to divide the
+ * task of allocation of 2 ^ (x + 1) buckets among 4 phases.
+ *
+ * Also, splitpoint group of a given bucket can be taken as
+ * (_hash_log2(bucket) - 2). Subtracted by 2 because each group have
+ * 2 ^ (x + 2) buckets.
+ */
..
+#define BUCKET_TO_SPLITPOINT_GRP(num_bucket) \
+ ((num_bucket <= Buckets_First_Split_Group) ? 0 : \
+ (_hash_log2(num_bucket) - 2))
In the above computation the +2 and -2 still bother me. I think you
need to do this because you have defined split group zero to have four
buckets; how about not forcing that, and instead defining split point
phases only from the split point which has four or more buckets?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi Mithun,
On 03/26/2017 01:56 AM, Mithun Cy wrote:
Thanks, Amit, for the review.
On Sat, Mar 25, 2017 at 7:03 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I think the one-dimensional patch has fewer places to touch, so that
looks better to me. However, I think there is still hard coding and
assumptions in code which we should try to improve.
Great! I will continue with the spares 1-dimensional improvement.
I ran some performance scenarios on the patch to see if the increased
'spares' allocation had an impact. I haven't found any regressions in
that regard.
Attached patch contains some small fixes, mainly to the documentation -
on top of v7.
Best regards,
Jesper
Attachments:
hashbucket_fixes.patch (text/x-patch)
On Mon, Mar 27, 2017 at 11:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I think we can't change the number of buckets to be created or lowmask
and highmask calculation here without modifying _h_spoolinit() because
it sorts the data to be inserted based on hashkey which in turn
depends on the number of buckets that we are going to create during
create index operation. We either need to allow create index
operation to still always create buckets in power-of-two fashion or we
need to update _h_spoolinit according to new computation. One minor
drawback of using power-of-two scheme for creation of buckets during
create index is that it can lead to wastage of space and will be
inconsistent with what the patch does during split operation.
Yes, this was a miss. Now the number of buckets allocated during
metap_init is not always a power-of-two number, so hashbuild, which
uses just the hash_mask to decide which bucket a hashkey belongs to,
is no longer sufficient. It can give buckets beyond max_buckets, and
sorting of index values based on those buckets will be out of order.
When we then actually insert them into the hash index, we lose the
advantage of the spatial locality which existed before, and hence
index build performance can degrade.
As you have said, we can solve it if we always allocate buckets in
power-of-2 numbers when we do hash index meta page init. But on other
occasions, when we try to double the existing buckets, we can do the
allocation in 4 equal phases.
But I think there are 2 more ways to solve the same:
A. Why not pass all 3 parameters (high_mask, low_mask, max_buckets) to
tuplesort and let it use _hash_hashkey2bucket to figure out which
bucket a key belongs to, and then sort them. I think this way we make
sorting and insertion into the hash index consistent with each other.
B. In tuplesort we can use the hash function bucket = hash_key %
num_buckets instead of the existing one, which does a bitwise "and" to
determine the bucket of a hash key. This way we will not wrongly
assign buckets beyond max_buckets, and the sorted hash keys will be in
sync with the actual insertion order of _hash_doinsert.
I am attaching both patches, Patch_A and Patch_B. My preference is
Patch_A, and I am open to suggestions.
+#define SPLITPOINT_PHASES_PER_GRP 4
+#define SPLITPOINT_PHASE_MASK (SPLITPOINT_PHASES_PER_GRP - 1)
+#define Buckets_First_Split_Group 4
Fixed.
In the above computation the +2 and -2 still bother me. I think you
need them because you have defined split group zero to have four
buckets; how about not forcing that, and instead defining split point
phases only from the split point which has four or more buckets?
Okay, as suggested, instead of group zero having 4 phases of 1 bucket
each, I have recalculated the spare mapping as below.

Allocating huge chunks of bucket pages all at once isn't optimal, and
we would take ages to consume them. To avoid this exponential growth
of index size, we use a trick to break up the allocation of buckets at
a splitpoint into 4 equal phases. If (2 ^ x) is the total number of
buckets to be allocated at a splitpoint (from now on we shall call
this a splitpoint group), then we allocate a quarter of them
(2 ^ (x - 2)) at each phase of the splitpoint group, and the next
quarter is allocated only once the buckets of the previous phase have
been consumed.

Since fewer than 4 buckets cannot be divided into multiple phases, the
first 3 groups have only one phase of allocation: groups 0, 1, 2
allocate 1, 1, 2 buckets respectively, at once, in a single phase. For
groups > 2, where we allocate 4 or more buckets, the allocation is
distributed among four equal phases. At group 3 we allocate 4 buckets
in 4 phases {1, 1, 1, 1}, where the numbers in curly braces indicate
the number of buckets allocated within each phase of the group.
Splitpoint groups 4 and 5 then have allocation phases {2, 2, 2, 2} and
{4, 4, 4, 4}, bringing the totals to 16 and 32 buckets respectively.
So at each splitpoint group we double the total number of buckets of
the previous group, but in incremental phases. The bucket pages
allocated within one phase of a splitpoint group appear consecutively
in the index.
The sortbuild_hash_*.patch files can be applied independently on top
of either expand_hashbucket_efficiently_08 patch.
--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
Attachments:
sortbuild_hash_B.patch
sortbuild_hash_A.patch
yet_another_expand_hashbucket_efficiently_08.patch
expand_hashbucket_efficiently_08.patch
On Tue, Mar 28, 2017 at 10:43 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
On Mon, Mar 27, 2017 at 11:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
As you have said, we can solve it if we always allocate buckets in
power-of-2 numbers when we do hash index meta page init. But on other
occasions, when we try to double the existing buckets, we can do the
allocation in 4 equal phases.
But I think there are 2 more ways to solve the same:
A. Why not pass all 3 parameters (high_mask, low_mask, max_buckets) to
tuplesort and let it use _hash_hashkey2bucket to figure out which
bucket a key belongs to, and then sort them. I think this way we make
sorting and insertion into the hash index consistent with each other.
B. In tuplesort we can use the hash function bucket = hash_key %
num_buckets instead of the existing one, which does a bitwise "and" to
determine the bucket of a hash key. This way we will not wrongly
assign buckets beyond max_buckets, and the sorted hash keys will be in
sync with the actual insertion order of _hash_doinsert.
I am attaching both patches, Patch_A and Patch_B. My preference is
Patch_A, and I am open to suggestions.
I think it can work both ways. I feel there is no hard-pressed need
that the computation in sorting be the same as what you do in
_hash_doinsert. In patch_B, I don't think the new naming for variables
is good.
Assert(!a->isnull1);
- hash1 = DatumGetUInt32(a->datum1) & state->hash_mask;
+ bucket1 = DatumGetUInt32(a->datum1) % state->num_buckets;
Can we use hash_mod instead of num_buckets and retain hash1 in the
above code and similar other places?
+#define SPLITPOINT_PHASES_PER_GRP 4
+#define SPLITPOINT_PHASE_MASK (SPLITPOINT_PHASES_PER_GRP - 1)
+#define Buckets_First_Split_Group 4
Fixed.
In the above computation the +2 and -2 still bother me. I think you
need them because you have defined split group zero to have four
buckets; how about not forcing that, and instead defining split point
phases only from the split point which has four or more buckets?
Okay, as suggested, instead of group zero having 4 phases of 1 bucket
each, I have recalculated the spare mapping as below.
A few comments:
1.
+#define BUCKETS_WITHIN_SP_GRP(sp_g, nphase) \
+ ((sp_g < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) ? \
+ (1 << (sp_g - 1)) : \
+ ((nphase) * ((1 << (sp_g - 1)) >> 2)))
This will go wrong for split point group zero. In general, I feel if
you handle the computation for split groups less than
SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE in the caller, then all your
macros will become much simpler and less error prone.
2.
+#define BUCKET_TO_SPLITPOINT_GRP(num_bucket) (_hash_log2(num_bucket))
What is the use of such a define, can't we directly use _hash_log2 in
the caller?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Mar 28, 2017 at 12:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I think it can work both ways. I feel there is no hard-pressed need
that the computation in sorting be the same as what you do in
_hash_doinsert. In patch_B, I don't think the new naming for variables
is good.
Assert(!a->isnull1);
- hash1 = DatumGetUInt32(a->datum1) & state->hash_mask;
+ bucket1 = DatumGetUInt32(a->datum1) % state->num_buckets;
Can we use hash_mod instead of num_buckets and retain hash1 in the
above code and similar other places?
Yes, done; I renamed it to hash_mod.
A few comments:
1.
+#define BUCKETS_WITHIN_SP_GRP(sp_g, nphase) \
+ ((sp_g < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) ? \
+ (1 << (sp_g - 1)) : \
+ ((nphase) * ((1 << (sp_g - 1)) >> 2)))
This will go wrong for split point group zero. In general, I feel if
you handle the computation for split groups less than
SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE in the caller, then all your
macros will become much simpler and less error prone.
Fixed; apart from SPLITPOINT_PHASE_TO_SPLITPOINT_GRP, all the macros
now handle only multi-phase groups. SPLITPOINT_PHASE_TO_SPLITPOINT_GRP
is used in one more place, at index expansion, so I thought keeping it
as it is was better.
2.
+#define BUCKET_TO_SPLITPOINT_GRP(num_bucket) (_hash_log2(num_bucket))
What is the use of such a define; can't we directly use _hash_log2 in
the caller?
No real reason, just that a named macro appeared more readable than
bare _hash_log2. I have now removed it.
--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com