Bloom Indexes - bit array length and the total number of bits (or hash functions ?? ) !

Started by Avinash Kumarover 6 years ago4 messages
#1Avinash Kumar
avinash.vallarapu@gmail.com

Hi All,

I was testing bloom indexes today. I understand bloom indexes uses bloom
filters.

As i understand, a bloom filter is a bit array of m bits and a constant "k"
number of hash functions are used to generate hashes for the data. And then
the appropriate bits are set to 1.

I was doing the following test where i was generating 10 million records
and testing bloom indexes.

CREATE TABLE foo.bar (id int, dept int, id2 int, id3 int, id4 int, id5
int,id6 int,id7 int,details text, zipcode int);

INSERT INTO foo.bar SELECT (random() * 1000000)::int, (random() *
1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int,(random()
* 1000000)::int,(random() * 1000000)::int, (random() *
1000000)::int,(random() * 1000000)::int,md5(g::text), floor(random()*
(20000-9999 + 1) + 9999) from generate_series(1,100*1e6) g;

As per the documentation, bloom indexes accepts 2 parameters. *length* and
the *number of bits for each column*.

Here is the problem or the question i have.

I have created the following 2 Indexes.

*Index 1*
CREATE INDEX idx_bloom_bar ON foo.bar
USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode)
WITH (length=48, col1=4, col2=4, col3=4, col4=4, col5=4, col6=4, col7=4,
col8=4);

*Index 2*
CREATE INDEX idx_bloom_bar ON foo.bar
USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode)
WITH (length=48, col1=2, col2=2, col3=2, col4=2, col5=2, col6=2, col7=2,
col8=2);

With change in length, we of course see a difference in the Index size
which is understandable. Here i have the same length for both indexes. But,
i reduced the number of bits per each index column from 4 to 2. Both the
above indexes are of the same size. But, there is a very slight difference
in the execution time between Index 1 and Index 2 but with the same cost.

So the question here is -
I assume - number of bits = k. Where k is the total number of hash
functions used on top of the data that needs to validated. Is that correct
? If yes, why do we see the Index 1 performing better than Index 2 ?
Because, the data has to go through more hash functions (4 vs 2) in Index 1
than Index 2. So, with Index 1 it should take more time.
Also, both the indexes have ZERO false positives.
Please let me know if there is anything simple that i am missing here.

*Query *

EXPLAIN (ANALYZE, BUFFERS, VERBOSE) select * from foo.bar where id = 736833
and dept = 89861 and id2 = 573221 and id3 = 675911 and id4 = 943394 and id5
= 326756 and id6 = 597560 and zipcode = 10545;

*With Index 1 *

QUERY PLAN
----------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo.bar (cost=2647060.00..2647061.03 rows=1 width=69)
(actual time=307.000..307.000 rows=0 loops=1)
Output: id, dept, id2, id3, id4, id5, id6, id7, details, zipcode
Recheck Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 =
573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 =
326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
Buffers: shared hit=147059
-> Bitmap Index Scan on idx_bloom_bar (cost=0.00..2647060.00 rows=1
width=0) (actual time=306.997..306.997 rows=0 loops=1)
Index Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2
= 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 =
326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
Buffers: shared hit=147059
Planning Time: 0.106 ms
* Execution Time: 307.030 ms*
(9 rows)

*With Index 2 *

QUERY PLAN
----------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo.bar (cost=2647060.00..2647061.03 rows=1 width=69)
(actual time=420.881..420.881 rows=0 loops=1)
Output: id, dept, id2, id3, id4, id5, id6, id7, details, zipcode
Recheck Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 =
573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 =
326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
Buffers: shared hit=147059
-> Bitmap Index Scan on idx_bloom_bar (cost=0.00..2647060.00 rows=1
width=0) (actual time=420.878..420.878 rows=0 loops=1)
Index Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2
= 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 =
326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
Buffers: shared hit=147059
Planning Time: 0.104 ms
* Execution Time: 420.913 ms*
(9 rows)

Thanks,
Avinash.

#2Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Avinash Kumar (#1)
Re: Bloom Indexes - bit array length and the total number of bits (or hash functions ?? ) !

Hello Avinash,

I was testing bloom indexes today. I understand bloom indexes uses bloom
filters. [...]

So the question here is -
I assume - number of bits = k. Where k is the total number of hash
functions used on top of the data that needs to validated. Is that correct
? If yes, why do we see the Index 1 performing better than Index 2 ?
Because, the data has to go through more hash functions (4 vs 2) in Index 1
than Index 2. So, with Index 1 it should take more time.
Also, both the indexes have ZERO false positives.
Please let me know if there is anything simple that i am missing here.

You may have a look at the blog entry about these parameters I redacted a
few year ago:

http://blog.coelho.net/database/2016/12/11/postgresql-bloom-index.html

--
Fabien.

#3Avinash Kumar
avinash.vallarapu@gmail.com
In reply to: Fabien COELHO (#2)
Re: Bloom Indexes - bit array length and the total number of bits (or hash functions ?? ) !

Thanks Fabien,

But the 2 direct questions i have are :

1. What is the structure of the Bloom Index ? Can you please let me know
what are the fields of a Bloom Index ? Is it just the Item Pointer
and BloomSignatureWord ?
When i describe my bloom index it looks like following.

postgres=# \d+ foo.idx_bloom_bar
Index "foo.idx_bloom_bar"
Column | Type | Key? | Definition | Storage | Stats target
---------+---------+------+------------+---------+--------------
id | integer | yes | id | plain |
dept | integer | yes | dept | plain |
id2 | integer | yes | id2 | plain |
id3 | integer | yes | id3 | plain |
id4 | integer | yes | id4 | plain |
id5 | integer | yes | id5 | plain |
id6 | integer | yes | id6 | plain |
zipcode | integer | yes | zipcode | plain |
bloom, for table "foo.bar"
Options: length=64, col1=4, col2=4, col3=4, col4=4, col5=4, col6=4, col7=4,
col8=4

2. If we are storing just one signature word per row, how is this
aggregated for all column values of that row into one signature in high
level ?
For example, if length = 64, does it mean that a bit array of 64 bits is
generated per each row ?
If col1=4, does it mean the value of col1 is passed to 4 hash functions
that generate 4 positions that can be set to 1 in the bit array ?

On Sat, Jun 8, 2019 at 3:11 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Hello Avinash,

I was testing bloom indexes today. I understand bloom indexes uses bloom
filters. [...]

So the question here is -
I assume - number of bits = k. Where k is the total number of hash
functions used on top of the data that needs to validated. Is that

correct

? If yes, why do we see the Index 1 performing better than Index 2 ?
Because, the data has to go through more hash functions (4 vs 2) in

Index 1

than Index 2. So, with Index 1 it should take more time.
Also, both the indexes have ZERO false positives.
Please let me know if there is anything simple that i am missing here.

You may have a look at the blog entry about these parameters I redacted a
few year ago:

http://blog.coelho.net/database/2016/12/11/postgresql-bloom-index.html

--
Fabien.

--
9000799060

#4Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Avinash Kumar (#3)
Re: Bloom Indexes - bit array length and the total number of bits (or hash functions ?? ) !

But the 2 direct questions i have are :

1. What is the structure of the Bloom Index ? Can you please let me know
what are the fields of a Bloom Index ? Is it just the Item Pointer
and BloomSignatureWord ?

I'm not sure of Postgres actual implementation, I have just looked at the
underlying hash functions implementation.

When i describe my bloom index it looks like following.

postgres=# \d+ foo.idx_bloom_bar
Index "foo.idx_bloom_bar"
Column | Type | Key? | Definition | Storage | Stats target
---------+---------+------+------------+---------+--------------
id | integer | yes | id | plain |
dept | integer | yes | dept | plain |
id2 | integer | yes | id2 | plain |
id3 | integer | yes | id3 | plain |
id4 | integer | yes | id4 | plain |
id5 | integer | yes | id5 | plain |
id6 | integer | yes | id6 | plain |
zipcode | integer | yes | zipcode | plain |
bloom, for table "foo.bar"

The bloom index associates a signature, i.e. a bitfield the size of which
is specified by the parameter "length", to the tuple location. The
bitfield is computed by hashing the value of columns which are listed
above in the index definition. The many values are somehow compressed into
the small signature.

Options: length=64, col1=4, col2=4, col3=4, col4=4, col5=4, col6=4, col7=4,
col8=4

I doubt that these parameters are good. The is no point to include a
unique column into a bloom index. If you look at my blog, the number of
bits associated to each field should depend on the expected selectivity,
which depends on the entropy of each field and the signature size. The
column entropy can be computed with a query.

2. If we are storing just one signature word per row, how is this
aggregated for all column values of that row into one signature in high
level ?

The aggregation, if performed, is not very useful in practice because it
can only be effective on a few (first) bits, which are randomly computed
anyway and the value of the query is not likely to hit them.

Fundamentally all bitfields are scanned to extract which tuples are
possibly of interest, i.e. are not excluded by the index. The "full scan"
of the bloom index is not a bad thing if it is much smaller than the table
itself.

For example, if length = 64, does it mean that a bit array of 64 bits is
generated per each row ?

Yes.

If col1=4, does it mean the value of col1 is passed to 4 hash functions
that generate 4 positions that can be set to 1 in the bit array ?

Yes.

Try to apply the formula in the blog to see what you get for your
parameters, but it is likely to be < 4.

--
Fabien.