WIP: BRIN bloom indexes

Started by Tomas Vondra about 8 years ago, 13 messages
#1 Tomas Vondra
tomas.vondra@2ndquadrant.com
1 attachment(s)

Hi,

The BRIN minmax opclasses work well only for data where the column is
somewhat correlated to physical location in a table. So it works great
for timestamps in append-only log tables, for example. When that is not
the case (non-correlated columns), the minmax ranges get very "wide" and
we end up scanning large portions of the table.

Typical examples are columns with types that are inherently random (or
rather non-correlated), like UUIDs, IP or MAC addresses, and so on. But
it can just as easily happen even with simple IDs generated from a
sequence.

This WIP patch implements another type of BRIN index, with the "summary"
represented by a bloom filter: instead of building a [min,max] range for
each page range, it builds a bloom filter of the values in that range.
The index is of course larger than the regular "minmax" BRIN index, but
it's still orders of magnitude smaller than a btree index.

Note: This is different from the index type implemented in the "bloom"
extension. That one is not related to BRIN at all, and it builds a bloom
filter on individual rows (combining values from different columns).

BTW kudos to everyone who worked on the BRIN infrastructure and API. I
found it very intuitive and well-documented, and adding the new opclass
was extremely straightforward.

To demonstrate this on a small example, consider this table:

CREATE TABLE bloom_test (id uuid, padding text);

INSERT INTO bloom_test
SELECT md5((mod(i,1000000)/100)::text)::uuid, md5(i::text)
FROM generate_series(1,200000000) s(i);

VACUUM ANALYZE bloom_test;

List of relations
Schema | Name | Type | Owner | Size | Description
--------+------------+-------+-------+-------+-------------
public | bloom_test | table | tomas | 16 GB |
(1 row)

The table was built so that each range contains a relatively small
number of distinct UUID values - this is typical e.g. for user activity
with "bursts" of actions from a particular user, separated by long
periods of inactivity (with actions from other users).

Now, let's build BRIN "minmax", BRIN "bloom" and BTREE indexes on the id
column.

create index test_brin_idx on bloom_test
using brin (id);

create index test_bloom_idx on bloom_test
using brin (id uuid_bloom_ops);

create index test_btree_idx on bloom_test (id);

which gives us this:

List of relations
Schema | Name | Type | Owner | Table | Size
--------+----------------+-------+-------+------------+---------
public | test_bloom_idx | index | tomas | bloom_test | 12 MB
public | test_brin_idx | index | tomas | bloom_test | 832 kB
public | test_btree_idx | index | tomas | bloom_test | 6016 MB
(3 rows)

So yeah - the "bloom" index is larger than the default "minmax" index,
but it's still orders of magnitude smaller than the regular btree one.

Now, what about query performance? Unlike the "minmax" indexes, the
"bloom" index can only handle equality conditions.

Let's see a query like this:

select * from bloom_test
where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772';

The minmax index produces this plan:

QUERY PLAN
------------------------------------------------------------------------
Bitmap Heap Scan on bloom_test
(cost=390.31..2130910.82 rows=20593 width=49)
(actual time=197.974..22707.311 rows=20000 loops=1)
Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
Rows Removed by Index Recheck: 199980000
Heap Blocks: lossy=2061856
-> Bitmap Index Scan on test_brin_idx
(cost=0.00..385.16 rows=5493161 width=0)
(actual time=133.463..133.463 rows=20619520 loops=1)
Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
Planning time: 0.165 ms
Execution time: 22707.891 ms
(8 rows)

Which is not that great, and the long duration is a direct consequence
of the "wide" ranges - the bitmap heap scan had to visit pretty much the
whole table (2061856 lossy heap blocks is roughly all 16 GB of it). For
comparison, a plain sequential scan takes only about 15 seconds.

Now, the bloom index:

QUERY PLAN
------------------------------------------------------------------------
Bitmap Heap Scan on bloom_test
(cost=5898.31..2136418.82 rows=20593 width=49)
(actual time=24.229..338.324 rows=20000 loops=1)
Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
Rows Removed by Index Recheck: 2500448
Heap Blocks: lossy=25984
-> Bitmap Index Scan on test_bloom_idx
(cost=0.00..5893.16 rows=5493161 width=0)
(actual time=22.811..22.811 rows=259840 loops=1)
Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
Planning time: 0.201 ms
Execution time: 338.946 ms
(8 rows)

So, way better.

For comparison, a simple index scan / bitmap index scan using the btree
index takes only about 10ms in this case, but that's mostly thanks to
the whole dataset being entirely in-memory.

There are some remaining open questions.

1) sizing the bloom filter

Firstly, the size of the filter is currently static, based on 1000
distinct values per range and a 5% false-positive rate. Those are mostly
arbitrary values, and I don't have any clear idea how to determine
optimal values.
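
For reference, bloom_init in the patch uses the standard sizing formulas
(the patch's log(1.0 / pow(2.0, log(2.0))) term is just -(ln 2)^2): for
n expected distinct values and false-positive rate p, the optimal number
of bits is m = -n * ln(p) / (ln 2)^2 and the optimal number of hash
functions is k = (m / n) * ln 2. A standalone sketch of the same
arithmetic:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    int     ndistinct = 1000;   /* BLOOM_NDISTINCT */
    double  p = 0.05;           /* BLOOM_ERROR_RATE */
    int     m, k;

    /* optimal number of bits: m = -n * ln(p) / (ln 2)^2 */
    m = (int) ceil(-ndistinct * log(p) / (log(2.0) * log(2.0)));
    m = ((m + 7) / 8) * 8;      /* round up to whole bytes */

    /* optimal number of hash functions: k = (m / n) * ln 2 */
    k = (int) round(log(2.0) * m / ndistinct);

    printf("m = %d bits (%d bytes), k = %d\n", m, m / 8, k);
    return 0;
}

With the defaults that comes out to m = 6240 bits (780 bytes) and k = 4
per page range. As a sanity check, with the default pages_per_range of
128 the 16 GB table above has roughly 16k page ranges, and 16k x 780 B
is about 12 MB - consistent with the observed index size.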

Maybe we could do the sizing based on the ndistinct estimate for the
indexed column? It also depends on the pages_per_range value, so perhaps
it should be a user-defined option too.

An interesting feature is that the bloom indexes "degrade locally". If a
page range has significantly more distinct values, the bloom filter may
be way too small and will suffer from a high false-positive rate. But
that only affects that one page range, and the rest may be perfectly fine.

I was thinking about disabling the bloom filter when it gets "too full"
(too many bits set, resulting in a high false-positive rate). But that
seems like a bad idea - the false-positive rate automatically jumps to
100% for that range, and we only save a comparatively small amount of
space in the index. So even if the false-positive rate is 50%, it still
seems efficient to keep the bloom filter.
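
To illustrate that (a rough back-of-the-envelope model, not code from
the patch): once a fraction f of the bits is set, a random non-matching
value passes all k bit tests with probability about f^k, so the
degradation is gradual rather than a cliff:

#include <math.h>
#include <stdio.h>

/*
 * Rough false-positive estimate for a bloom filter with k hash
 * functions and a fraction f of its bits set to 1: each of the k
 * probed bits is set with probability ~f, so an absent value is
 * reported as present with probability ~f^k.
 */
static double
bloom_fp_estimate(double f, int k)
{
    return pow(f, k);
}

int
main(void)
{
    int     k = 4;      /* as computed for the default sizing */

    printf("50%% bits set: fp ~ %.4f\n", bloom_fp_estimate(0.50, k));
    printf("80%% bits set: fp ~ %.4f\n", bloom_fp_estimate(0.80, k));
    printf("95%% bits set: fp ~ %.4f\n", bloom_fp_estimate(0.95, k));
    return 0;
}

Even at 50% of bits set, a filter with k = 4 still rejects about 94% of
non-matching ranges, which is why keeping it seems better than dropping
it.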

Another thing to consider is what happens when there are very few
distinct values in a given page range. The current code tries to be a
bit smart in this case - instead of building the bloom filter right
away, it initially keeps the exact hash values, and only switches to a
bloom filter once they would fill the same space.
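
For a concrete sense of the cutover point (mirroring the maxvalues
computation in bloom_add_value in the attached patch), the exact uint32
hashes live in the same space as the filter bitmap, so the switch
happens once one more hash would no longer fit:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    int     m = 6240;   /* filter bitmap size in bits (default sizing) */

    /* how many exact uint32 hashes fit into the same space */
    int     maxvalues = m / (8 * (int) sizeof(uint32_t));

    printf("sorted phase holds up to %d exact hashes\n", maxvalues);
    return 0;
}

So with the default sizing, a page range with fewer than ~195 distinct
hash values never pays any false-positive cost at all.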

2) costing

The other thing is the costing of BRIN indexes. At this point it's rather
simplistic and independent of the operator class, so the only thing that
matters is the size of the BRIN index. That means a "minmax" index is
always considered cheaper than a "bloom" index, because the optimizer has
no idea the "minmax" indexes are more vulnerable to "wide" ranges.

But perhaps this is a non-issue, as the bloom indexes are meant for cases
where minmax indexes don't work. And when minmax indexes work, they work
better than bloom, so people are unlikely to build both of them at the
same time.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

brin-bloom-v1.patch (text/x-patch)
diff --git a/doc/src/sgml/brin.sgml b/doc/src/sgml/brin.sgml
index 91c0170..26483dd 100644
--- a/doc/src/sgml/brin.sgml
+++ b/doc/src/sgml/brin.sgml
@@ -118,6 +118,13 @@
    </thead>
    <tbody>
     <row>
+     <entry><literal>abstime_bloom_ops</literal></entry>
+     <entry><type>abstime</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>abstime_minmax_ops</literal></entry>
      <entry><type>abstime</type></entry>
      <entry>
@@ -129,6 +136,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>int8_bloom_ops</literal></entry>
+     <entry><type>bigint</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>int8_minmax_ops</literal></entry>
      <entry><type>bigint</type></entry>
      <entry>
@@ -180,6 +194,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>bytea_bloom_ops</literal></entry>
+     <entry><type>bytea</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>bytea_minmax_ops</literal></entry>
      <entry><type>bytea</type></entry>
      <entry>
@@ -191,6 +212,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>bpchar_bloom_ops</literal></entry>
+     <entry><type>character</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>bpchar_minmax_ops</literal></entry>
      <entry><type>character</type></entry>
      <entry>
@@ -202,6 +230,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>char_bloom_ops</literal></entry>
+     <entry><type>"char"</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>char_minmax_ops</literal></entry>
      <entry><type>"char"</type></entry>
      <entry>
@@ -213,6 +248,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>date_bloom_ops</literal></entry>
+     <entry><type>date</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>date_minmax_ops</literal></entry>
      <entry><type>date</type></entry>
      <entry>
@@ -224,6 +266,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>float8_bloom_ops</literal></entry>
+     <entry><type>double precision</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>float8_minmax_ops</literal></entry>
      <entry><type>double precision</type></entry>
      <entry>
@@ -235,6 +284,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>inet_bloom_ops</literal></entry>
+     <entry><type>inet</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>inet_minmax_ops</literal></entry>
      <entry><type>inet</type></entry>
      <entry>
@@ -258,6 +314,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>int4_bloom_ops</literal></entry>
+     <entry><type>integer</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>int4_minmax_ops</literal></entry>
      <entry><type>integer</type></entry>
      <entry>
@@ -269,6 +332,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>interval_bloom_ops</literal></entry>
+     <entry><type>interval</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>interval_minmax_ops</literal></entry>
      <entry><type>interval</type></entry>
      <entry>
@@ -280,6 +350,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>macaddr_bloom_ops</literal></entry>
+     <entry><type>macaddr</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>macaddr_minmax_ops</literal></entry>
      <entry><type>macaddr</type></entry>
      <entry>
@@ -291,6 +368,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>macaddr8_bloom_ops</literal></entry>
+     <entry><type>macaddr8</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>macaddr8_minmax_ops</literal></entry>
      <entry><type>macaddr8</type></entry>
      <entry>
@@ -302,6 +386,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>name_bloom_ops</literal></entry>
+     <entry><type>name</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>name_minmax_ops</literal></entry>
      <entry><type>name</type></entry>
      <entry>
@@ -313,6 +404,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>numeric_bloom_ops</literal></entry>
+     <entry><type>numeric</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>numeric_minmax_ops</literal></entry>
      <entry><type>numeric</type></entry>
      <entry>
@@ -324,6 +422,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>pg_lsn_bloom_ops</literal></entry>
+     <entry><type>pg_lsn</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>pg_lsn_minmax_ops</literal></entry>
      <entry><type>pg_lsn</type></entry>
      <entry>
@@ -335,6 +440,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>oid_bloom_ops</literal></entry>
+     <entry><type>oid</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>oid_minmax_ops</literal></entry>
      <entry><type>oid</type></entry>
      <entry>
@@ -366,6 +478,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>float4_bloom_ops</literal></entry>
+     <entry><type>real</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>float4_minmax_ops</literal></entry>
      <entry><type>real</type></entry>
      <entry>
@@ -377,6 +496,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>reltime_bloom_ops</literal></entry>
+     <entry><type>reltime</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>reltime_minmax_ops</literal></entry>
      <entry><type>reltime</type></entry>
      <entry>
@@ -388,6 +514,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>int2_bloom_ops</literal></entry>
+     <entry><type>smallint</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>int2_minmax_ops</literal></entry>
      <entry><type>smallint</type></entry>
      <entry>
@@ -399,6 +532,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>text_bloom_ops</literal></entry>
+     <entry><type>text</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>text_minmax_ops</literal></entry>
      <entry><type>text</type></entry>
      <entry>
@@ -410,6 +550,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>tid_bloom_ops</literal></entry>
+     <entry><type>tid</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>tid_minmax_ops</literal></entry>
      <entry><type>tid</type></entry>
      <entry>
@@ -421,6 +568,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>timestamp_bloom_ops</literal></entry>
+     <entry><type>timestamp without time zone</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>timestamp_minmax_ops</literal></entry>
      <entry><type>timestamp without time zone</type></entry>
      <entry>
@@ -432,6 +586,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>timestamptz_bloom_ops</literal></entry>
+     <entry><type>timestamp with time zone</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>timestamptz_minmax_ops</literal></entry>
      <entry><type>timestamp with time zone</type></entry>
      <entry>
@@ -443,6 +604,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>time_bloom_ops</literal></entry>
+     <entry><type>time without time zone</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>time_minmax_ops</literal></entry>
      <entry><type>time without time zone</type></entry>
      <entry>
@@ -454,6 +622,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>timetz_bloom_ops</literal></entry>
+     <entry><type>time with time zone</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>timetz_minmax_ops</literal></entry>
      <entry><type>time with time zone</type></entry>
      <entry>
@@ -465,6 +640,13 @@
      </entry>
     </row>
     <row>
+     <entry><literal>uuid_bloom_ops</literal></entry>
+     <entry><type>uuid</type></entry>
+     <entry>
+      <literal>=</literal>
+     </entry>
+    </row>
+    <row>
      <entry><literal>uuid_minmax_ops</literal></entry>
      <entry><type>uuid</type></entry>
      <entry>
@@ -814,6 +996,60 @@ typedef struct BrinOpcInfo
  </para>
 
  <para>
+  To write an operator class for a data type that implements only an equality
+  operator and supports hashing, it is possible to use the bloom support procedures
+  alongside the corresponding operators, as shown in
+  <xref linkend="brin-extensibility-bloom-table">.
+  All operator class members (procedures and operators) are mandatory.
+ </para>
+
+ <table id="brin-extensibility-bloom-table">
+  <title>Procedure and Support Numbers for Bloom Operator Classes</title>
+  <tgroup cols="2">
+   <thead>
+    <row>
+     <entry>Operator class member</entry>
+     <entry>Object</entry>
+    </row>
+   </thead>
+   <tbody>
+    <row>
+     <entry>Support Procedure 1</entry>
+     <entry>internal function <function>brin_bloom_opcinfo()</function></entry>
+    </row>
+    <row>
+     <entry>Support Procedure 2</entry>
+     <entry>internal function <function>brin_bloom_add_value()</function></entry>
+    </row>
+    <row>
+     <entry>Support Procedure 3</entry>
+     <entry>internal function <function>brin_bloom_consistent()</function></entry>
+    </row>
+    <row>
+     <entry>Support Procedure 4</entry>
+     <entry>internal function <function>brin_bloom_union()</function></entry>
+    </row>
+    <row>
+     <entry>Support Procedure 11</entry>
+     <entry>function to compute hash of an element</entry>
+    </row>
+    <row>
+     <entry>Operator Strategy 1</entry>
+     <entry>operator equal-to</entry>
+    </row>
+   </tbody>
+  </tgroup>
+ </table>
+
+ <para>
+    Support procedure numbers 1-10 are reserved for the BRIN internal
+    functions, so the SQL level functions start with number 11.  Support
+    function number 11 is the main function required to build the index.
+    It should accept one argument with the same data type as the operator class,
+    and return a hash of the value.
+ </para>
+
+ <para>
     Both minmax and inclusion operator classes support cross-data-type
     operators, though with these the dependencies become more complicated.
     The minmax operator class requires a full set of operators to be
diff --git a/src/backend/access/brin/Makefile b/src/backend/access/brin/Makefile
index 5aef925..a76d927 100644
--- a/src/backend/access/brin/Makefile
+++ b/src/backend/access/brin/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = brin.o brin_pageops.o brin_revmap.o brin_tuple.o brin_xlog.o \
-       brin_minmax.o brin_inclusion.o brin_validate.o
+       brin_minmax.o brin_inclusion.o brin_validate.o brin_bloom.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/brin/brin_bloom.c b/src/backend/access/brin/brin_bloom.c
new file mode 100644
index 0000000..c08f686
--- /dev/null
+++ b/src/backend/access/brin/brin_bloom.c
@@ -0,0 +1,727 @@
+/*
+ * brin_bloom.c
+ *		Implementation of Bloom opclass for BRIN
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/brin/brin_bloom.c
+ */
+#include "postgres.h"
+
+#include "access/genam.h"
+#include "access/brin_internal.h"
+#include "access/brin_tuple.h"
+#include "access/hash.h"
+#include "access/stratnum.h"
+#include "catalog/pg_type.h"
+#include "catalog/pg_amop.h"
+#include "utils/builtins.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
+#include "utils/rel.h"
+#include "utils/syscache.h"
+
+#include <math.h>
+
+#define BloomEqualStrategyNumber	1
+
+/*
+ * Additional SQL level support functions
+ *
+ * Procedure numbers must not use values reserved for BRIN itself; see
+ * brin_internal.h.
+ */
+#define		BLOOM_MAX_PROCNUMS		4	/* maximum support procs we need */
+#define		PROCNUM_HASH			11	/* required */
+
+/*
+ * Subtract this from procnum to obtain index in BloomOpaque arrays
+ * (Must be equal to minimum of private procnums).
+ */
+#define		PROCNUM_BASE			11
+
+
+#define		BLOOM_PHASE_SORTED		1
+#define		BLOOM_PHASE_HASH		2
+
+/* how many unsorted hashes to accumulate before compacting */
+#define		BLOOM_MAX_UNSORTED		32
+#define		BLOOM_GROW_BYTES		32		/* grow increment (sorted phase) */
+#define		BLOOM_NDISTINCT			1000	/* expected distinct values */
+#define		BLOOM_ERROR_RATE		0.05	/* 5% false positive rate */
+
+/*
+ * Bloom Filter
+ *
+ * Represents a bloom filter, built on hashes of the indexed values. That is,
+ * we compute a uint32 hash of the value, and then store this hash into the
+ * bloom filter (and compute additional hashes on it).
+ *
+ * As an optimisation, we initially store the uint32 hash values directly,
+ * without the extra hashing step, and only once we start filling the bitmap
+ * space do we switch to the regular bloom filter mode.
+ *
+ * PHASE_SORTED
+ *
+ * Initially we copy the uint32 hash into the bitmap, regularly sorting the
+ * hash values for fast lookup (we keep at most BLOOM_MAX_UNSORTED unsorted
+ * values).
+ *
+ * The idea is that if we only see very few distinct values, we can store
+ * them in less space compared to the (sparse) bloom filter bitmap. It also
+ * stores them exactly, although that's not a big advantage as an almost-empty
+ * bloom filter has a false-positive rate close to zero anyway.
+ *
+ * PHASE_HASH
+ *
+ * Once we fill the bitmap space in the sorted phase, we switch to the hash
+ * phase, where we actually use the bloom filter. We treat the uint32 hashes
+ * as input values, and hash them again with different seeds (to get the k
+ * hash functions needed for bloom filter).
+ *
+ *
+ * XXX Perhaps we could save a few bytes by using different data types, but
+ * considering the size of the bitmap, the difference is negligible.
+ *
+ * XXX We could also implement "sparse" bloom filters, keeping only the
+ * bytes that are not entirely 0. That might make the "sorted" phase
+ * mostly unnecessary.
+ *
+ * XXX We can also watch the number of bits set in the bloom filter, and
+ * then stop using it (and not store the bitmap, to save space) when the
+ * false positive rate gets too high.
+ */
+typedef struct BloomFilter
+{
+	/* varlena header (do not touch directly!) */
+	int32	vl_len_;
+
+	/* global bloom filter parameters */
+	uint32	phase;		/* phase (initially SORTED, then HASH) */
+	uint32	nhashes;	/* number of hash functions */
+	uint32	nbits;		/* number of bits in the bitmap (optimal) */
+	uint32	nbits_set;	/* number of bits set to 1 */
+
+	/* fields used only in the SORTED phase */
+	uint32	nvalues;	/* number of hashes stored (sorted + extra) */
+	uint32	nsorted;	/* number of uint32 hashes in sorted part */
+
+	/* bitmap of the bloom filter */
+	char 	bitmap[FLEXIBLE_ARRAY_MEMBER];
+
+} BloomFilter;
+
+/*
+ * bloom_init
+ * 		Initialize the Bloom Filter, allocate all the memory.
+ *
+ * The filter is initialized with optimal size for ndistinct expected
+ * values, and requested false positive rate. The filter is stored as
+ * varlena.
+ */
+static BloomFilter *
+bloom_init(int ndistinct, double false_positive_rate)
+{
+	Size			len;
+	BloomFilter	   *filter;
+
+	/* https://en.wikipedia.org/wiki/Bloom_filter */
+	int m;	/* number of bits */
+	int k;	/* number of hash functions */
+
+	Assert((ndistinct > 0) && (ndistinct < 10000));	/* 10k is mostly arbitrary limit */
+	Assert((false_positive_rate > 0) && (false_positive_rate < 1.0));
+
+	m = ceil((ndistinct * log(false_positive_rate)) / log(1.0 / (pow(2.0, log(2.0)))));
+
+	/* round m to whole bytes */
+	m = ((m + 7) / 8) * 8;
+
+	k = round(log(2.0) * m / ndistinct);
+
+	elog(DEBUG1, "create bloom filter m=%d k=%d", m, k);
+
+	/*
+	 * Allocate the bloom filter with a minimum size 64B (about 40B in the
+	 * bitmap part). We require space at least for the header.
+	 */
+	len = Max(offsetof(BloomFilter, bitmap), 64);
+
+	filter = (BloomFilter *) palloc0(len);
+
+	filter->phase = BLOOM_PHASE_SORTED;
+	filter->nhashes = k;
+	filter->nbits = m;
+
+	SET_VARSIZE(filter, len);
+
+	return filter;
+}
+
+/* simple uint32 comparator, for pg_qsort and bsearch */
+static int
+cmp_uint32(const void *a, const void *b)
+{
+	uint32 *ia = (uint32 *) a;
+	uint32 *ib = (uint32 *) b;
+
+	if (*ia == *ib)
+		return 0;
+	else if (*ia < *ib)
+		return -1;
+	else
+		return 1;
+}
+
+/*
+ * bloom_compact
+ *		Compact the filter during the 'sorted' phase.
+ *
+ * We sort the uint32 hashes and remove duplicates, for two main reasons.
+ * Firstly, to keep most of the data sorted for bsearch lookups. Secondly,
+ * we try to save space by removing the duplicates, allowing us to stay
+ * in the sorted phase a bit longer.
+ *
+ * We currently don't repalloc the bitmap, i.e. we don't free the memory
+ * here - in the worst case we waste space for up to 32 unsorted hashes
+ * (if all of them are already in the sorted part), so about 128B. We can
+ * either reduce the number of unsorted items (e.g. to 8 hashes, which
+ * would mean 32B), or start doing the repalloc.
+ *
+ * XXX Actually, we don't need to do repalloc - we just need to set the
+ * varlena header length!
+ */
+static void
+bloom_compact(BloomFilter *filter)
+{
+	int		i,
+			nvalues;
+	uint32 *values;
+
+	/* never call compact on filters in HASH phase */
+	Assert(filter->phase == BLOOM_PHASE_SORTED);
+
+	/* no chance to compact anything */
+	if (filter->nvalues == filter->nsorted)
+		return;
+
+	values = (uint32 *) palloc(filter->nvalues * sizeof(uint32));
+
+	/* copy the data, then reset the bitmap */
+	memcpy(values, filter->bitmap, filter->nvalues * sizeof(uint32));
+	memset(filter->bitmap, 0, filter->nvalues * sizeof(uint32));
+
+	/* FIXME optimization: sort only the unsorted part, then merge */
+	pg_qsort(values, filter->nvalues, sizeof(uint32), cmp_uint32);
+
+	nvalues = 1;
+	for (i = 1; i < filter->nvalues; i++)
+	{
+		/* if a new value, keep it */
+		if (values[i] != values[i-1])
+		{
+			values[nvalues] = values[i];
+			nvalues++;
+		}
+	}
+
+	filter->nvalues = nvalues;
+	filter->nsorted = nvalues;
+
+	memcpy(filter->bitmap, values, nvalues * sizeof(uint32));
+
+	pfree(values);
+}
+
+static BloomFilter *
+bloom_switch_to_hashing(BloomFilter *filter);
+
+/*
+ * bloom_add_value
+ * 		Add value to the bloom filter.
+ */
+static BloomFilter *
+bloom_add_value(BloomFilter *filter, uint32 value, bool *updated)
+{
+	int		i;
+
+	/* assume 'not updated' by default */
+	if (updated)
+		*updated = false;
+
+	Assert(filter);
+
+	/* if we're in the sorted phase, we store the hashes directly */
+	if (filter->phase == BLOOM_PHASE_SORTED)
+	{
+		/* how many uint32 hashes can we fit into the bitmap */
+		int maxvalues = filter->nbits / (8 * sizeof(uint32));
+
+		/* do not overflow the bitmap space or number of unsorted items */
+		Assert(filter->nvalues <= maxvalues);
+		Assert(filter->nvalues - filter->nsorted <= BLOOM_MAX_UNSORTED);
+
+		/*
+		 * In this branch we always update the filter - we either add the
+		 * hash to the unsorted part, or switch the filter to hashing.
+		 */
+		if (updated)
+			*updated = true;
+
+		/*
+		 * If the array is full, or if we reached the limit on unsorted
+		 * items, try to compact the filter first, before attempting to
+		 * add the new value.
+		 */
+		if ((filter->nvalues == maxvalues) ||
+			(filter->nvalues - filter->nsorted == BLOOM_MAX_UNSORTED))
+				bloom_compact(filter);
+
+		/*
+		 * Can we squeeze one more uint32 hash into the bitmap? Also make
+		 * sure there's enough space in the bytea value first.
+		 */
+		if (filter->nvalues < maxvalues)
+		{
+			Size len = VARSIZE_ANY(filter);
+			Size need = offsetof(BloomFilter,bitmap) + (filter->nvalues+1) * sizeof(uint32);
+
+			if (len < need)
+			{
+				/*
+				 * We don't double the size here, as in the first place we care about
+				 * reducing storage requirements, and the doubling happens automatically
+				 * in memory contexts anyway.
+				 *
+				 * XXX Zero the newly allocated part. Maybe not really needed?
+				 */
+				filter = (BloomFilter *) repalloc(filter, len + BLOOM_GROW_BYTES);
+				memset((char *)filter + len, 0, BLOOM_GROW_BYTES);
+				SET_VARSIZE(filter, len + BLOOM_GROW_BYTES);
+			}
+
+			/* copy the data into the bitmap */
+			memcpy(&filter->bitmap[filter->nvalues * sizeof(uint32)],
+				   &value, sizeof(uint32));
+
+			filter->nvalues++;
+
+			/* we're done */
+			return filter;
+		}
+
+		/* can't add any more exact hashes, so switch to hashing */
+		filter = bloom_switch_to_hashing(filter);
+	}
+
+	/* we better be in the regular hashing phase */
+	Assert(filter->phase == BLOOM_PHASE_HASH);
+
+	/* we're in the hashing phase, so set all the k bits */
+	for (i = 0; i < filter->nhashes; i++)
+	{
+		/*
+		 * Our "hash functions" are effectively hashes of the original
+		 * hash, with different seeds.
+		 */
+		uint32 h = ((uint32)DatumGetInt64(hash_uint32_extended(value, i))) % filter->nbits;
+
+		int byte = (h / 8);
+		int bit  = (h % 8);
+
+		/* if the bit is not set, set it and remember we did that */
+		if (! (filter->bitmap[byte] & (0x01 << bit)))
+		{
+			filter->bitmap[byte] |= (0x01 << bit);
+			filter->nbits_set++;
+			if (updated)
+				*updated = true;
+		}
+	}
+
+	return filter;
+}
+
+/*
+ * bloom_switch_to_hashing
+ * 		Switch the bloom filter from sorted to hashing mode.
+ */
+static BloomFilter *
+bloom_switch_to_hashing(BloomFilter *filter)
+{
+	int		i;
+	uint32 *values;
+	Size			len;
+	BloomFilter	   *newfilter;
+
+	elog(WARNING, "switching %p to hashing (sorted %d, total %d)", filter, filter->nsorted, filter->nvalues);
+
+	/*
+	 * The new filter is allocated with all the memory, directly into
+	 * the HASH phase.
+	 */
+	len = offsetof(BloomFilter, bitmap) + (filter->nbits / 8);
+
+	newfilter = (BloomFilter *) palloc0(len);
+
+	newfilter->nhashes = filter->nhashes;
+	newfilter->nbits = filter->nbits;
+	newfilter->phase = BLOOM_PHASE_HASH;
+
+	SET_VARSIZE(newfilter, len);
+
+	values = (uint32 *) filter->bitmap;
+
+	for (i = 0; i < filter->nvalues; i++)
+		/* ignore the return value here, we don't repalloc in hashing mode */
+		bloom_add_value(newfilter, values[i], NULL);
+
+	/* free the original filter, return the newly allocated one */
+	pfree(filter);
+
+	return newfilter;
+}
+
+/*
+ * bloom_contains_value
+ * 		Check if the bloom filter contains a particular value.
+ */
+static bool
+bloom_contains_value(BloomFilter *filter, uint32 value)
+{
+	int		i;
+
+	Assert(filter);
+
+	/* in sorted mode we simply search the two arrays (sorted, unsorted) */
+	if (filter->phase == BLOOM_PHASE_SORTED)
+	{
+		int i;
+		uint32 *values = (uint32 *)filter->bitmap;
+
+		/* first search through the sorted part */
+		if ((filter->nsorted > 0) &&
+			(bsearch(&value, values, filter->nsorted, sizeof(uint32), cmp_uint32) != NULL))
+			return true;
+
+		/* now search through the unsorted part - linear search */
+		for (i = filter->nsorted; i < filter->nvalues; i++)
+			if (value == values[i])
+				return true;
+
+		/* nothing found */
+		return false;
+	}
+
+	/* now the regular hashing mode */
+	Assert(filter->phase == BLOOM_PHASE_HASH);
+
+	for (i = 0; i < filter->nhashes; i++)
+	{
+		/* compute the hashes with a seed, used for the bloom filter */
+		uint32 h = ((uint32)DatumGetInt64(hash_uint32_extended(value, i))) % filter->nbits;
+
+		int byte = (h / 8);
+		int bit  = (h % 8);
+
+		/* if the bit is not set, the value is not there */
+		if (! (filter->bitmap[byte] & (0x01 << bit)))
+			return false;
+	}
+
+	/* all hashes found in bloom filter */
+	return true;
+}
+
+/*
+ * bloom_filter_count
+ * 		Estimate the number of distinct values stored in the bloom filter.
+ */
+static int
+bloom_filter_count(BloomFilter *filter)
+{
+	if (filter->phase == BLOOM_PHASE_SORTED)
+		return filter->nvalues;	/* nvalues already includes the sorted part */
+	else
+		return ceil(- (filter->nbits * 1.0 / filter->nhashes * log(1 - filter->nbits_set * 1.0 / filter->nbits)));
+}
+
+typedef struct BloomOpaque
+{
+	/*
+	 * XXX At this point we only need a single proc (to compute the hash),
+	 * but let's keep the array just like the inclusion and minmax opclasses,
+	 * for consistency. We may need additional procs in the future.
+	 */
+	FmgrInfo	extra_procinfos[BLOOM_MAX_PROCNUMS];
+	bool		extra_proc_missing[BLOOM_MAX_PROCNUMS];
+} BloomOpaque;
+
+static FmgrInfo *bloom_get_procinfo(BrinDesc *bdesc, uint16 attno,
+					   uint16 procnum);
+
+
+Datum
+brin_bloom_opcinfo(PG_FUNCTION_ARGS)
+{
+	BrinOpcInfo *result;
+
+	/*
+	 * opaque->extra_procinfos is initialized lazily; the palloc0 below sets
+	 * it to all-zeroes, i.e. with fn_oid set to InvalidOid.
+	 *
+	 * Bloom indexes only store the filter as a single BYTEA column.
+	 */
+
+	result = palloc0(MAXALIGN(SizeofBrinOpcInfo(1)) +
+					 sizeof(BloomOpaque));
+	result->oi_nstored = 1;
+	result->oi_opaque = (BloomOpaque *)
+		MAXALIGN((char *) result + SizeofBrinOpcInfo(1));
+	result->oi_typcache[0] = lookup_type_cache(BYTEAOID, 0);
+
+	PG_RETURN_POINTER(result);
+}
+
+/*
+ * Examine the given index tuple (which contains partial status of a certain
+ * page range) by comparing it to the given value that comes from another heap
+ * tuple.  If the new value is not already represented by the bloom filter in
+ * the existing tuple, update the index tuple and return true.  Otherwise,
+ * return false and do not modify the tuple.
+ */
+Datum
+brin_bloom_add_value(PG_FUNCTION_ARGS)
+{
+	BrinDesc   *bdesc = (BrinDesc *) PG_GETARG_POINTER(0);
+	BrinValues *column = (BrinValues *) PG_GETARG_POINTER(1);
+	Datum		newval = PG_GETARG_DATUM(2);
+	bool		isnull = PG_GETARG_BOOL(3);
+	Oid			colloid = PG_GET_COLLATION();
+	FmgrInfo   *hashFn;
+	uint32		hashValue;
+	bool		updated = false;
+	AttrNumber	attno;
+	BloomFilter *filter;
+
+	/*
+	 * If the new value is null, we record that we saw it if it's the first
+	 * one; otherwise, there's nothing to do.
+	 */
+	if (isnull)
+	{
+		if (column->bv_hasnulls)
+			PG_RETURN_BOOL(false);
+
+		column->bv_hasnulls = true;
+		PG_RETURN_BOOL(true);
+	}
+
+	attno = column->bv_attno;
+
+	/*
+	 * If this is the first non-null value, we need to initialize the bloom
+	 * filter. Otherwise just extract the existing bloom filter from BrinValues.
+	 */
+	if (column->bv_allnulls)
+	{
+		filter = bloom_init(BLOOM_NDISTINCT, BLOOM_ERROR_RATE);
+		column->bv_values[0] = PointerGetDatum(filter);
+		column->bv_allnulls = false;
+		updated = true;
+	}
+	else
+		filter = (BloomFilter *) DatumGetPointer(column->bv_values[0]);
+
+	elog(DEBUG1, "brin_bloom_add_value: filter=%p bits=%d hashes=%d values=%d set=%d estimate=%d",
+				  filter, filter->nbits, filter->nhashes, filter->nvalues, filter->nbits_set,
+				  bloom_filter_count(filter));
+
+	/*
+	 * Compute the hash of the new value, using the supplied hash function,
+	 * and then add the hash value to the bloom filter.
+	 */
+	hashFn = bloom_get_procinfo(bdesc, attno, PROCNUM_HASH);
+
+	hashValue = DatumGetUInt32(FunctionCall1Coll(hashFn, colloid, newval));
+
+	column->bv_values[0] = PointerGetDatum(bloom_add_value(filter, hashValue, &updated));
+
+	PG_RETURN_BOOL(updated);
+}
+
+/*
+ * Given an index tuple corresponding to a certain page range and a scan key,
+ * return whether the scan key is consistent with the index tuple's bloom
+ * filter.  Return true if so, false otherwise.
+ */
+Datum
+brin_bloom_consistent(PG_FUNCTION_ARGS)
+{
+	BrinDesc   *bdesc = (BrinDesc *) PG_GETARG_POINTER(0);
+	BrinValues *column = (BrinValues *) PG_GETARG_POINTER(1);
+	ScanKey		key = (ScanKey) PG_GETARG_POINTER(2);
+	Oid			colloid = PG_GET_COLLATION();
+	AttrNumber	attno;
+	Datum		value;
+	Datum		matches;
+	FmgrInfo   *finfo;
+	uint32		hashValue;
+	BloomFilter *filter;
+
+	Assert(key->sk_attno == column->bv_attno);
+
+	/* handle IS NULL/IS NOT NULL tests */
+	if (key->sk_flags & SK_ISNULL)
+	{
+		if (key->sk_flags & SK_SEARCHNULL)
+		{
+			if (column->bv_allnulls || column->bv_hasnulls)
+				PG_RETURN_BOOL(true);
+			PG_RETURN_BOOL(false);
+		}
+
+		/*
+		 * For IS NOT NULL, we can only skip ranges that are known to have
+		 * only nulls.
+		 */
+		if (key->sk_flags & SK_SEARCHNOTNULL)
+			PG_RETURN_BOOL(!column->bv_allnulls);
+
+		/*
+		 * Neither IS NULL nor IS NOT NULL was used; assume all indexable
+		 * operators are strict and return false.
+		 */
+		PG_RETURN_BOOL(false);
+	}
+
+	/* if the range is all empty, it cannot possibly be consistent */
+	if (column->bv_allnulls)
+		PG_RETURN_BOOL(false);
+
+	filter = (BloomFilter *) DatumGetPointer(column->bv_values[0]);
+
+	Assert(filter);
+
+	attno = key->sk_attno;
+	value = key->sk_argument;
+	switch (key->sk_strategy)
+	{
+		case BloomEqualStrategyNumber:
+
+			/*
+			 * In the equality case (WHERE col = someval), we want to return
+			 * the current page range if the bloom filter may contain the
+			 * hash of the scan key value.
+			 */
+			finfo = bloom_get_procinfo(bdesc, attno, PROCNUM_HASH);
+
+			hashValue = DatumGetUInt32(FunctionCall1Coll(finfo, colloid, value));
+			matches = bloom_contains_value(filter, hashValue);
+
+			break;
+		default:
+			/* shouldn't happen */
+			elog(ERROR, "invalid strategy number %d", key->sk_strategy);
+			matches = 0;
+			break;
+	}
+
+	PG_RETURN_DATUM(matches);
+}
+
+/*
+ * Given two BrinValues, update the first of them as a union of the summary
+ * values contained in both.  The second one is untouched.
+ *
+ * XXX We assume the bloom filters have the same parameters for now. In the
+ * future we should have a 'can union' function, to decide if we can combine
+ * two particular bloom filters.
+ */
+Datum
+brin_bloom_union(PG_FUNCTION_ARGS)
+{
+	BrinDesc   *bdesc = (BrinDesc *) PG_GETARG_POINTER(0);
+	BrinValues *col_a = (BrinValues *) PG_GETARG_POINTER(1);
+	BrinValues *col_b = (BrinValues *) PG_GETARG_POINTER(2);
+	AttrNumber	attno;
+	Form_pg_attribute attr;
+
+	Assert(col_a->bv_attno == col_b->bv_attno);
+
+	/* Adjust "hasnulls" */
+	if (!col_a->bv_hasnulls && col_b->bv_hasnulls)
+		col_a->bv_hasnulls = true;
+
+	/* If there are no values in B, there's nothing left to do */
+	if (col_b->bv_allnulls)
+		PG_RETURN_VOID();
+
+	attno = col_a->bv_attno;
+	attr = TupleDescAttr(bdesc->bd_tupdesc, attno - 1);
+
+	/*
+	 * Adjust "allnulls".  If A doesn't have values, just copy the values from
+	 * B into A, and we're done.  We cannot run the operators in this case,
+	 * because values in A might contain garbage.  Note we already established
+	 * that B contains values.
+	 */
+	if (col_a->bv_allnulls)
+	{
+		col_a->bv_allnulls = false;
+		col_a->bv_values[0] = datumCopy(col_b->bv_values[0],
+										attr->attbyval, attr->attlen);
+		PG_RETURN_VOID();
+	}
+
+	/* FIXME merge the two bloom filters */
+	elog(WARNING, "FIXME: merge bloom filters");
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Cache and return bloom opclass support procedure
+ *
+ * Return the procedure corresponding to the given function support number,
+ * or NULL if it does not exist.
+ */
+static FmgrInfo *
+bloom_get_procinfo(BrinDesc *bdesc, uint16 attno, uint16 procnum)
+{
+	BloomOpaque *opaque;
+	uint16		basenum = procnum - PROCNUM_BASE;
+
+	/*
+	 * We cache these in the opaque struct, to avoid repetitive syscache
+	 * lookups.
+	 */
+	opaque = (BloomOpaque *) bdesc->bd_info[attno - 1]->oi_opaque;
+
+	/*
+	 * If we already searched for this proc and didn't find it, don't bother
+	 * searching again.
+	 */
+	if (opaque->extra_proc_missing[basenum])
+		return NULL;
+
+	if (opaque->extra_procinfos[basenum].fn_oid == InvalidOid)
+	{
+		if (RegProcedureIsValid(index_getprocid(bdesc->bd_index, attno,
+												procnum)))
+		{
+			fmgr_info_copy(&opaque->extra_procinfos[basenum],
+						   index_getprocinfo(bdesc->bd_index, attno, procnum),
+						   bdesc->bd_context);
+		}
+		else
+		{
+			opaque->extra_proc_missing[basenum] = true;
+			return NULL;
+		}
+	}
+
+	return &opaque->extra_procinfos[basenum];
+}
diff --git a/src/include/catalog/pg_amop.h b/src/include/catalog/pg_amop.h
index f850be4..ef5b692 100644
--- a/src/include/catalog/pg_amop.h
+++ b/src/include/catalog/pg_amop.h
@@ -894,18 +894,24 @@ DATA(insert (	4064	 17   17 2 s	  1958	  3580 0 ));
 DATA(insert (	4064	 17   17 3 s	  1955	  3580 0 ));
 DATA(insert (	4064	 17   17 4 s	  1960	  3580 0 ));
 DATA(insert (	4064	 17   17 5 s	  1959	  3580 0 ));
+/* bloom bytea */
+DATA(insert (	5021	 17   17 1 s	  1955	  3580 0 ));
 /* minmax "char" */
 DATA(insert (	4062	 18   18 1 s	   631	  3580 0 ));
 DATA(insert (	4062	 18   18 2 s	   632	  3580 0 ));
 DATA(insert (	4062	 18   18 3 s		92	  3580 0 ));
 DATA(insert (	4062	 18   18 4 s	   634	  3580 0 ));
 DATA(insert (	4062	 18   18 5 s	   633	  3580 0 ));
+/* bloom "char" */
+DATA(insert (	5022	 18   18 1 s		92	  3580 0 ));
 /* minmax name */
 DATA(insert (	4065	 19   19 1 s	   660	  3580 0 ));
 DATA(insert (	4065	 19   19 2 s	   661	  3580 0 ));
 DATA(insert (	4065	 19   19 3 s		93	  3580 0 ));
 DATA(insert (	4065	 19   19 4 s	   663	  3580 0 ));
 DATA(insert (	4065	 19   19 5 s	   662	  3580 0 ));
+/* bloom name */
+DATA(insert (	5023	 19   19 1 s		93	  3580 0 ));
 /* minmax integer */
 DATA(insert (	4054	 20   20 1 s	   412	  3580 0 ));
 DATA(insert (	4054	 20   20 2 s	   414	  3580 0 ));
@@ -952,6 +958,16 @@ DATA(insert (	4054	 23   20 2 s		80	  3580 0 ));
 DATA(insert (	4054	 23   20 3 s		15	  3580 0 ));
 DATA(insert (	4054	 23   20 4 s		82	  3580 0 ));
 DATA(insert (	4054	 23   20 5 s		76	  3580 0 ));
+/* bloom integer */
+DATA(insert (	5024	 20   20 1 s	   410	  3580 0 ));
+DATA(insert (	5024	 20   21 1 s	  1868	  3580 0 ));
+DATA(insert (	5024	 20   23 1 s	   416	  3580 0 ));
+DATA(insert (	5024	 21   21 1 s		94	  3580 0 ));
+DATA(insert (	5024	 21   20 1 s	  1862	  3580 0 ));
+DATA(insert (	5024	 21   23 1 s	   532	  3580 0 ));
+DATA(insert (	5024	 23   23 1 s		96	  3580 0 ));
+DATA(insert (	5024	 23   21 1 s	   533	  3580 0 ));
+DATA(insert (	5024	 23   20 1 s		15	  3580 0 ));
 
 /* minmax text */
 DATA(insert (	4056	 25   25 1 s	   664	  3580 0 ));
@@ -959,12 +975,16 @@ DATA(insert (	4056	 25   25 2 s	   665	  3580 0 ));
 DATA(insert (	4056	 25   25 3 s		98	  3580 0 ));
 DATA(insert (	4056	 25   25 4 s	   667	  3580 0 ));
 DATA(insert (	4056	 25   25 5 s	   666	  3580 0 ));
+/* bloom text */
+DATA(insert (	5027	 25   25 1 s		98	  3580 0 ));
 /* minmax oid */
 DATA(insert (	4068	 26   26 1 s	   609	  3580 0 ));
 DATA(insert (	4068	 26   26 2 s	   611	  3580 0 ));
 DATA(insert (	4068	 26   26 3 s	   607	  3580 0 ));
 DATA(insert (	4068	 26   26 4 s	   612	  3580 0 ));
 DATA(insert (	4068	 26   26 5 s	   610	  3580 0 ));
+/* bloom oid */
+DATA(insert (	5028	 26   26 1 s	   607	  3580 0 ));
 /* minmax tid */
 DATA(insert (	4069	 27   27 1 s	  2799	  3580 0 ));
 DATA(insert (	4069	 27   27 2 s	  2801	  3580 0 ));
@@ -992,6 +1012,11 @@ DATA(insert (	4070	701  701 2 s	   673	  3580 0 ));
 DATA(insert (	4070	701  701 3 s	   670	  3580 0 ));
 DATA(insert (	4070	701  701 4 s	   675	  3580 0 ));
 DATA(insert (	4070	701  701 5 s	   674	  3580 0 ));
+/* bloom float (float4, float8) */
+DATA(insert (	5030	700  700 1 s	   620	  3580 0 ));
+DATA(insert (	5030	700  701 1 s	  1120	  3580 0 ));
+DATA(insert (	5030	701  700 1 s	  1130	  3580 0 ));
+DATA(insert (	5030	701  701 1 s	   670	  3580 0 ));
 
 /* minmax abstime */
 DATA(insert (	4072	702  702 1 s	   562	  3580 0 ));
@@ -999,24 +1024,32 @@ DATA(insert (	4072	702  702 2 s	   564	  3580 0 ));
 DATA(insert (	4072	702  702 3 s	   560	  3580 0 ));
 DATA(insert (	4072	702  702 4 s	   565	  3580 0 ));
 DATA(insert (	4072	702  702 5 s	   563	  3580 0 ));
+/* bloom abstime */
+DATA(insert (	5031	702  702 1 s	   560	  3580 0 ));
 /* minmax reltime */
 DATA(insert (	4073	703  703 1 s	   568	  3580 0 ));
 DATA(insert (	4073	703  703 2 s	   570	  3580 0 ));
 DATA(insert (	4073	703  703 3 s	   566	  3580 0 ));
 DATA(insert (	4073	703  703 4 s	   571	  3580 0 ));
 DATA(insert (	4073	703  703 5 s	   569	  3580 0 ));
+/* bloom reltime */
+DATA(insert (	5032	703  703 1 s	   566	  3580 0 ));
 /* minmax macaddr */
 DATA(insert (	4074	829  829 1 s	  1222	  3580 0 ));
 DATA(insert (	4074	829  829 2 s	  1223	  3580 0 ));
 DATA(insert (	4074	829  829 3 s	  1220	  3580 0 ));
 DATA(insert (	4074	829  829 4 s	  1225	  3580 0 ));
 DATA(insert (	4074	829  829 5 s	  1224	  3580 0 ));
+/* bloom macaddr */
+DATA(insert (	5033	829  829 1 s	  1220	  3580 0 ));
 /* minmax macaddr8 */
 DATA(insert (	4109	774  774 1 s	  3364	  3580 0 ));
 DATA(insert (	4109	774  774 2 s	  3365	  3580 0 ));
 DATA(insert (	4109	774  774 3 s	  3362	  3580 0 ));
 DATA(insert (	4109	774  774 4 s	  3367	  3580 0 ));
 DATA(insert (	4109	774  774 5 s	  3366	  3580 0 ));
+/* bloom macaddr8 */
+DATA(insert (	5034	774  774 1 s	  3362	  3580 0 ));
 /* minmax inet */
 DATA(insert (	4075	869  869 1 s	  1203	  3580 0 ));
 DATA(insert (	4075	869  869 2 s	  1204	  3580 0 ));
@@ -1030,18 +1063,24 @@ DATA(insert (	4102	869  869 8 s	   932	  3580 0 ));
 DATA(insert (	4102	869  869 18 s	  1201	  3580 0 ));
 DATA(insert (	4102	869  869 24 s	   933	  3580 0 ));
 DATA(insert (	4102	869  869 26 s	   931	  3580 0 ));
+/* bloom inet */
+DATA(insert (	5035	869  869 1 s	  1201	  3580 0 ));
 /* minmax character */
 DATA(insert (	4076   1042 1042 1 s	  1058	  3580 0 ));
 DATA(insert (	4076   1042 1042 2 s	  1059	  3580 0 ));
 DATA(insert (	4076   1042 1042 3 s	  1054	  3580 0 ));
 DATA(insert (	4076   1042 1042 4 s	  1061	  3580 0 ));
 DATA(insert (	4076   1042 1042 5 s	  1060	  3580 0 ));
+/* bloom character */
+DATA(insert (	5036   1042 1042 1 s	  1054	  3580 0 ));
 /* minmax time without time zone */
 DATA(insert (	4077   1083 1083 1 s	  1110	  3580 0 ));
 DATA(insert (	4077   1083 1083 2 s	  1111	  3580 0 ));
 DATA(insert (	4077   1083 1083 3 s	  1108	  3580 0 ));
 DATA(insert (	4077   1083 1083 4 s	  1113	  3580 0 ));
 DATA(insert (	4077   1083 1083 5 s	  1112	  3580 0 ));
+/* bloom time without time zone */
+DATA(insert (	5037   1083 1083 1 s	  1108	  3580 0 ));
 /* minmax datetime (date, timestamp, timestamptz) */
 DATA(insert (	4059   1114 1114 1 s	  2062	  3580 0 ));
 DATA(insert (	4059   1114 1114 2 s	  2063	  3580 0 ));
@@ -1088,6 +1127,16 @@ DATA(insert (	4059   1184 1184 2 s	  1323	  3580 0 ));
 DATA(insert (	4059   1184 1184 3 s	  1320	  3580 0 ));
 DATA(insert (	4059   1184 1184 4 s	  1325	  3580 0 ));
 DATA(insert (	4059   1184 1184 5 s	  1324	  3580 0 ));
+/* bloom datetime (date, timestamp, timestamptz) */
+DATA(insert (	5038   1114 1114 1 s	  2060	  3580 0 ));
+DATA(insert (	5038   1114 1082 1 s	  2373	  3580 0 ));
+DATA(insert (	5038   1114 1184 1 s	  2536	  3580 0 ));
+DATA(insert (	5038   1082 1082 1 s	  1093	  3580 0 ));
+DATA(insert (	5038   1082 1114 1 s	  2347	  3580 0 ));
+DATA(insert (	5038   1082 1184 1 s	  2360	  3580 0 ));
+DATA(insert (	5038   1184 1082 1 s	  2386	  3580 0 ));
+DATA(insert (	5038   1184 1114 1 s	  2542	  3580 0 ));
+DATA(insert (	5038   1184 1184 1 s	  1320	  3580 0 ));
 
 /* minmax interval */
 DATA(insert (	4078   1186 1186 1 s	  1332	  3580 0 ));
@@ -1095,12 +1144,16 @@ DATA(insert (	4078   1186 1186 2 s	  1333	  3580 0 ));
 DATA(insert (	4078   1186 1186 3 s	  1330	  3580 0 ));
 DATA(insert (	4078   1186 1186 4 s	  1335	  3580 0 ));
 DATA(insert (	4078   1186 1186 5 s	  1334	  3580 0 ));
+/* bloom interval */
+DATA(insert (	5041   1186 1186 1 s	  1330	  3580 0 ));
 /* minmax time with time zone */
 DATA(insert (	4058   1266 1266 1 s	  1552	  3580 0 ));
 DATA(insert (	4058   1266 1266 2 s	  1553	  3580 0 ));
 DATA(insert (	4058   1266 1266 3 s	  1550	  3580 0 ));
 DATA(insert (	4058   1266 1266 4 s	  1555	  3580 0 ));
 DATA(insert (	4058   1266 1266 5 s	  1554	  3580 0 ));
+/* bloom time with time zone */
+DATA(insert (	5042   1266 1266 1 s	  1550	  3580 0 ));
 /* minmax bit */
 DATA(insert (	4079   1560 1560 1 s	  1786	  3580 0 ));
 DATA(insert (	4079   1560 1560 2 s	  1788	  3580 0 ));
@@ -1119,12 +1172,16 @@ DATA(insert (	4055   1700 1700 2 s	  1755	  3580 0 ));
 DATA(insert (	4055   1700 1700 3 s	  1752	  3580 0 ));
 DATA(insert (	4055   1700 1700 4 s	  1757	  3580 0 ));
 DATA(insert (	4055   1700 1700 5 s	  1756	  3580 0 ));
+/* bloom numeric */
+DATA(insert (	5045   1700 1700 1 s	  1752	  3580 0 ));
 /* minmax uuid */
 DATA(insert (	4081   2950 2950 1 s	  2974	  3580 0 ));
 DATA(insert (	4081   2950 2950 2 s	  2976	  3580 0 ));
 DATA(insert (	4081   2950 2950 3 s	  2972	  3580 0 ));
 DATA(insert (	4081   2950 2950 4 s	  2977	  3580 0 ));
 DATA(insert (	4081   2950 2950 5 s	  2975	  3580 0 ));
+/* bloom uuid */
+DATA(insert (	5046   2950 2950 1 s	  2972	  3580 0 ));
 /* inclusion range types */
 DATA(insert (	4103   3831 3831  1 s	  3893	  3580 0 ));
 DATA(insert (	4103   3831 3831  2 s	  3895	  3580 0 ));
@@ -1146,6 +1203,8 @@ DATA(insert (	4082   3220 3220 2 s	  3226	  3580 0 ));
 DATA(insert (	4082   3220 3220 3 s	  3222	  3580 0 ));
 DATA(insert (	4082   3220 3220 4 s	  3227	  3580 0 ));
 DATA(insert (	4082   3220 3220 5 s	  3225	  3580 0 ));
+/* bloom pg_lsn */
+DATA(insert (	5047   3220 3220 1 s	  3222	  3580 0 ));
 /* inclusion box */
 DATA(insert (	4104	603  603  1 s	   493	  3580 0 ));
 DATA(insert (	4104	603  603  2 s	   494	  3580 0 ));
diff --git a/src/include/catalog/pg_amproc.h b/src/include/catalog/pg_amproc.h
index 1c95846..cef4a7c 100644
--- a/src/include/catalog/pg_amproc.h
+++ b/src/include/catalog/pg_amproc.h
@@ -341,16 +341,34 @@ DATA(insert (	4064	17	  17  1  3383 ));
 DATA(insert (	4064	17	  17  2  3384 ));
 DATA(insert (	4064	17	  17  3  3385 ));
 DATA(insert (	4064	17	  17  4  3386 ));
+/* bloom bytea */
+DATA(insert (	5021	17	  17  1  5017 ));
+DATA(insert (	5021	17	  17  2  5018 ));
+DATA(insert (	5021	17	  17  3  5019 ));
+DATA(insert (	5021	17	  17  4  5020 ));
+DATA(insert (	5021	17	  17 11   456 ));
 /* minmax "char" */
 DATA(insert (	4062	18	  18  1  3383 ));
 DATA(insert (	4062	18	  18  2  3384 ));
 DATA(insert (	4062	18	  18  3  3385 ));
 DATA(insert (	4062	18	  18  4  3386 ));
+/* bloom "char" */
+DATA(insert (	5022	18	  18  1  5017 ));
+DATA(insert (	5022	18	  18  2  5018 ));
+DATA(insert (	5022	18	  18  3  5019 ));
+DATA(insert (	5022	18	  18  4  5020 ));
+DATA(insert (	5022	18	  18 11   454 ));
 /* minmax name */
 DATA(insert (	4065	19	  19  1  3383 ));
 DATA(insert (	4065	19	  19  2  3384 ));
 DATA(insert (	4065	19	  19  3  3385 ));
 DATA(insert (	4065	19	  19  4  3386 ));
+/* bloom name */
+DATA(insert (	5023	19	  19  1  5017 ));
+DATA(insert (	5023	19	  19  2  5018 ));
+DATA(insert (	5023	19	  19  3  5019 ));
+DATA(insert (	5023	19	  19  4  5020 ));
+DATA(insert (	5023	19	  19 11   455 ));
 /* minmax integer: int2, int4, int8 */
 DATA(insert (	4054	20	  20  1  3383 ));
 DATA(insert (	4054	20	  20  2  3384 ));
@@ -391,16 +409,47 @@ DATA(insert (	4054	23	  21  2  3384 ));
 DATA(insert (	4054	23	  21  3  3385 ));
 DATA(insert (	4054	23	  21  4  3386 ));
 
+/* bloom integer: int2, int4, int8 */
+DATA(insert (	5024	20	  20  1  5017 ));
+DATA(insert (	5024	20	  20  2  5018 ));
+DATA(insert (	5024	20	  20  3  5019 ));
+DATA(insert (	5024	20	  20  4  5020 ));
+DATA(insert (	5024	20	  20 11   949 ));
+
+DATA(insert (	5024	21	  21  1  5017 ));
+DATA(insert (	5024	21	  21  2  5018 ));
+DATA(insert (	5024	21	  21  3  5019 ));
+DATA(insert (	5024	21	  21  4  5020 ));
+DATA(insert (	5024	21	  21 11   449 ));
+
+DATA(insert (	5024	23	  23  1  5017 ));
+DATA(insert (	5024	23	  23  2  5018 ));
+DATA(insert (	5024	23	  23  3  5019 ));
+DATA(insert (	5024	23	  23  4  5020 ));
+DATA(insert (	5024	23	  23 11   450 ));
+
 /* minmax text */
 DATA(insert (	4056	25	  25  1  3383 ));
 DATA(insert (	4056	25	  25  2  3384 ));
 DATA(insert (	4056	25	  25  3  3385 ));
 DATA(insert (	4056	25	  25  4  3386 ));
+/* bloom text */
+DATA(insert (	5027	25	  25  1  5017 ));
+DATA(insert (	5027	25	  25  2  5018 ));
+DATA(insert (	5027	25	  25  3  5019 ));
+DATA(insert (	5027	25	  25  4  5020 ));
+DATA(insert (	5027	25	  25 11   400 ));
 /* minmax oid */
 DATA(insert (	4068	26	  26  1  3383 ));
 DATA(insert (	4068	26	  26  2  3384 ));
 DATA(insert (	4068	26	  26  3  3385 ));
 DATA(insert (	4068	26	  26  4  3386 ));
+/* bloom oid */
+DATA(insert (	5028	26	  26  1  5017 ));
+DATA(insert (	5028	26	  26  2  5018 ));
+DATA(insert (	5028	26	  26  3  5019 ));
+DATA(insert (	5028	26	  26  4  5020 ));
+DATA(insert (	5028	26	  26 11   453 ));
 /* minmax tid */
 DATA(insert (	4069	27	  27  1  3383 ));
 DATA(insert (	4069	27	  27  2  3384 ));
@@ -427,26 +476,63 @@ DATA(insert (	4070   701	 700  2  3384 ));
 DATA(insert (	4070   701	 700  3  3385 ));
 DATA(insert (	4070   701	 700  4  3386 ));
 
+/* bloom float */
+DATA(insert (	5030   700	 700  1  5017 ));
+DATA(insert (	5030   700	 700  2  5018 ));
+DATA(insert (	5030   700	 700  3  5019 ));
+DATA(insert (	5030   700	 700  4  5020 ));
+DATA(insert (	5030   700	 700 11   451 ));
+
+DATA(insert (	5030   701	 701  1  5017 ));
+DATA(insert (	5030   701	 701  2  5018 ));
+DATA(insert (	5030   701	 701  3  5019 ));
+DATA(insert (	5030   701	 701  4  5020 ));
+DATA(insert (	5030   701	 701 11   452 ));
+
 /* minmax abstime */
 DATA(insert (	4072   702	 702  1  3383 ));
 DATA(insert (	4072   702	 702  2  3384 ));
 DATA(insert (	4072   702	 702  3  3385 ));
 DATA(insert (	4072   702	 702  4  3386 ));
+/* bloom abstime */
+DATA(insert (	5031   702	 702  1  5017 ));
+DATA(insert (	5031   702	 702  2  5018 ));
+DATA(insert (	5031   702	 702  3  5019 ));
+DATA(insert (	5031   702	 702  4  5020 ));
+DATA(insert (	5031   702	 702 11   450 ));
 /* minmax reltime */
 DATA(insert (	4073   703	 703  1  3383 ));
 DATA(insert (	4073   703	 703  2  3384 ));
 DATA(insert (	4073   703	 703  3  3385 ));
 DATA(insert (	4073   703	 703  4  3386 ));
+/* bloom reltime */
+DATA(insert (	5032   703	 703  1  5017 ));
+DATA(insert (	5032   703	 703  2  5018 ));
+DATA(insert (	5032   703	 703  3  5019 ));
+DATA(insert (	5032   703	 703  4  5020 ));
+DATA(insert (	5032   703	 703 11   450 ));
 /* minmax macaddr */
 DATA(insert (	4074   829	 829  1  3383 ));
 DATA(insert (	4074   829	 829  2  3384 ));
 DATA(insert (	4074   829	 829  3  3385 ));
 DATA(insert (	4074   829	 829  4  3386 ));
+/* bloom macaddr */
+DATA(insert (	5033   829	 829  1  5017 ));
+DATA(insert (	5033   829	 829  2  5018 ));
+DATA(insert (	5033   829	 829  3  5019 ));
+DATA(insert (	5033   829	 829  4  5020 ));
+DATA(insert (	5033   829	 829 11   399 ));
 /* minmax macaddr8 */
 DATA(insert (	4109   774	 774  1  3383 ));
 DATA(insert (	4109   774	 774  2  3384 ));
 DATA(insert (	4109   774	 774  3  3385 ));
 DATA(insert (	4109   774	 774  4  3386 ));
+/* bloom macaddr8 */
+DATA(insert (	5034   774	 774  1  5017 ));
+DATA(insert (	5034   774	 774  2  5018 ));
+DATA(insert (	5034   774	 774  3  5019 ));
+DATA(insert (	5034   774	 774  4  5020 ));
+DATA(insert (	5034   774	 774 11   328 ));
 /* minmax inet */
 DATA(insert (	4075   869	 869  1  3383 ));
 DATA(insert (	4075   869	 869  2  3384 ));
@@ -460,16 +546,34 @@ DATA(insert (	4102   869	 869  4  4108 ));
 DATA(insert (	4102   869	 869 11  4063 ));
 DATA(insert (	4102   869	 869 12  4071 ));
 DATA(insert (	4102   869	 869 13   930 ));
+/* bloom inet */
+DATA(insert (	5035   869	 869  1  5017 ));
+DATA(insert (	5035   869	 869  2  5018 ));
+DATA(insert (	5035   869	 869  3  5019 ));
+DATA(insert (	5035   869	 869  4  5020 ));
+DATA(insert (	5035   869	 869 11   422 ));
 /* minmax character */
 DATA(insert (	4076  1042	1042  1  3383 ));
 DATA(insert (	4076  1042	1042  2  3384 ));
 DATA(insert (	4076  1042	1042  3  3385 ));
 DATA(insert (	4076  1042	1042  4  3386 ));
+/* bloom character */
+DATA(insert (	5036  1042	1042  1  5017 ));
+DATA(insert (	5036  1042	1042  2  5018 ));
+DATA(insert (	5036  1042	1042  3  5019 ));
+DATA(insert (	5036  1042	1042  4  5020 ));
+DATA(insert (	5036  1042	1042 11   454 ));
 /* minmax time without time zone */
 DATA(insert (	4077  1083	1083  1  3383 ));
 DATA(insert (	4077  1083	1083  2  3384 ));
 DATA(insert (	4077  1083	1083  3  3385 ));
 DATA(insert (	4077  1083	1083  4  3386 ));
+/* bloom time without time zone */
+DATA(insert (	5037  1083	1083  1  5017 ));
+DATA(insert (	5037  1083	1083  2  5018 ));
+DATA(insert (	5037  1083	1083  3  5019 ));
+DATA(insert (	5037  1083	1083  4  5020 ));
+DATA(insert (	5037  1083	1083 11  1688 ));
 /* minmax datetime (date, timestamp, timestamptz) */
 DATA(insert (	4059  1114	1114  1  3383 ));
 DATA(insert (	4059  1114	1114  2  3384 ));
@@ -510,16 +614,47 @@ DATA(insert (	4059  1082	1184  2  3384 ));
 DATA(insert (	4059  1082	1184  3  3385 ));
 DATA(insert (	4059  1082	1184  4  3386 ));
 
+/* bloom datetime (date, timestamp, timestamptz) */
+DATA(insert (	5038  1114	1114  1  5017 ));
+DATA(insert (	5038  1114	1114  2  5018 ));
+DATA(insert (	5038  1114	1114  3  5019 ));
+DATA(insert (	5038  1114	1114  4  5020 ));
+DATA(insert (	5038  1114	1114 11  2039 ));
+
+DATA(insert (	5038  1184	1184  1  5017 ));
+DATA(insert (	5038  1184	1184  2  5018 ));
+DATA(insert (	5038  1184	1184  3  5019 ));
+DATA(insert (	5038  1184	1184  4  5020 ));
+DATA(insert (	5038  1184	1184 11  2039 ));
+
+DATA(insert (	5038  1082	1082  1  5017 ));
+DATA(insert (	5038  1082	1082  2  5018 ));
+DATA(insert (	5038  1082	1082  3  5019 ));
+DATA(insert (	5038  1082	1082  4  5020 ));
+DATA(insert (	5038  1082	1082 11  450 ));
+
 /* minmax interval */
 DATA(insert (	4078  1186	1186  1  3383 ));
 DATA(insert (	4078  1186	1186  2  3384 ));
 DATA(insert (	4078  1186	1186  3  3385 ));
 DATA(insert (	4078  1186	1186  4  3386 ));
+/* bloom interval */
+DATA(insert (	5041  1186	1186  1  5017 ));
+DATA(insert (	5041  1186	1186  2  5018 ));
+DATA(insert (	5041  1186	1186  3  5019 ));
+DATA(insert (	5041  1186	1186  4  5020 ));
+DATA(insert (	5041  1186	1186 11  1697 ));
 /* minmax time with time zone */
 DATA(insert (	4058  1266	1266  1  3383 ));
 DATA(insert (	4058  1266	1266  2  3384 ));
 DATA(insert (	4058  1266	1266  3  3385 ));
 DATA(insert (	4058  1266	1266  4  3386 ));
+/* bloom time with time zone */
+DATA(insert (	5042  1266	1266  1  5017 ));
+DATA(insert (	5042  1266	1266  2  5018 ));
+DATA(insert (	5042  1266	1266  3  5019 ));
+DATA(insert (	5042  1266	1266  4  5020 ));
+DATA(insert (	5042  1266	1266 11  1696 ));
 /* minmax bit */
 DATA(insert (	4079  1560	1560  1  3383 ));
 DATA(insert (	4079  1560	1560  2  3384 ));
@@ -535,11 +670,23 @@ DATA(insert (	4055  1700	1700  1  3383 ));
 DATA(insert (	4055  1700	1700  2  3384 ));
 DATA(insert (	4055  1700	1700  3  3385 ));
 DATA(insert (	4055  1700	1700  4  3386 ));
+/* bloom numeric */
+DATA(insert (	5045  1700	1700  1  5017 ));
+DATA(insert (	5045  1700	1700  2  5018 ));
+DATA(insert (	5045  1700	1700  3  5019 ));
+DATA(insert (	5045  1700	1700  4  5020 ));
+DATA(insert (	5045  1700	1700 11   432 ));
 /* minmax uuid */
 DATA(insert (	4081  2950	2950  1  3383 ));
 DATA(insert (	4081  2950	2950  2  3384 ));
 DATA(insert (	4081  2950	2950  3  3385 ));
 DATA(insert (	4081  2950	2950  4  3386 ));
+/* bloom uuid */
+DATA(insert (	5046  2950	2950  1  5017 ));
+DATA(insert (	5046  2950	2950  2  5018 ));
+DATA(insert (	5046  2950	2950  3  5019 ));
+DATA(insert (	5046  2950	2950  4  5020 ));
+DATA(insert (	5046  2950	2950 11  2963 ));
 /* inclusion range types */
 DATA(insert (	4103  3831	3831  1  4105 ));
 DATA(insert (	4103  3831	3831  2  4106 ));
@@ -553,6 +700,12 @@ DATA(insert (	4082  3220	3220  1  3383 ));
 DATA(insert (	4082  3220	3220  2  3384 ));
 DATA(insert (	4082  3220	3220  3  3385 ));
 DATA(insert (	4082  3220	3220  4  3386 ));
+/* bloom pg_lsn */
+DATA(insert (	5047  3220	3220  1  5017 ));
+DATA(insert (	5047  3220	3220  2  5018 ));
+DATA(insert (	5047  3220	3220  3  5019 ));
+DATA(insert (	5047  3220	3220  4  5020 ));
+DATA(insert (	5047  3220	3220 11  3252 ));
 /* inclusion box */
 DATA(insert (	4104   603	 603  1  4105 ));
 DATA(insert (	4104   603	 603  2  4106 ));
diff --git a/src/include/catalog/pg_opclass.h b/src/include/catalog/pg_opclass.h
index 28dbc74..ce28469 100644
--- a/src/include/catalog/pg_opclass.h
+++ b/src/include/catalog/pg_opclass.h
@@ -213,36 +213,61 @@ DATA(insert (	2742	jsonb_path_ops		PGNSP PGUID 4037  3802 f 23 ));
 /* BRIN operator classes */
 /* no brin opclass for bool */
 DATA(insert (	3580	bytea_minmax_ops		PGNSP PGUID 4064	17 t 17 ));
+DATA(insert (	3580	bytea_bloom_ops			PGNSP PGUID 5021	17 f 17 ));
 DATA(insert (	3580	char_minmax_ops			PGNSP PGUID 4062	18 t 18 ));
+DATA(insert (	3580	char_bloom_ops			PGNSP PGUID 5022	18 f 18 ));
 DATA(insert (	3580	name_minmax_ops			PGNSP PGUID 4065	19 t 19 ));
+DATA(insert (	3580	name_bloom_ops			PGNSP PGUID 5023	19 f 19 ));
 DATA(insert (	3580	int8_minmax_ops			PGNSP PGUID 4054	20 t 20 ));
+DATA(insert (	3580	int8_bloom_ops			PGNSP PGUID 5024	20 f 20 ));
 DATA(insert (	3580	int2_minmax_ops			PGNSP PGUID 4054	21 t 21 ));
+DATA(insert (	3580	int2_bloom_ops			PGNSP PGUID 5024	21 f 21 ));
 DATA(insert (	3580	int4_minmax_ops			PGNSP PGUID 4054	23 t 23 ));
+DATA(insert (	3580	int4_bloom_ops			PGNSP PGUID 5024	23 f 23 ));
 DATA(insert (	3580	text_minmax_ops			PGNSP PGUID 4056	25 t 25 ));
+DATA(insert (	3580	text_bloom_ops			PGNSP PGUID 5027	25 f 25 ));
 DATA(insert (	3580	oid_minmax_ops			PGNSP PGUID 4068	26 t 26 ));
+DATA(insert (	3580	oid_bloom_ops			PGNSP PGUID 5028	26 f 26 ));
 DATA(insert (	3580	tid_minmax_ops			PGNSP PGUID 4069	27 t 27 ));
 DATA(insert (	3580	float4_minmax_ops		PGNSP PGUID 4070   700 t 700 ));
+DATA(insert (	3580	float4_bloom_ops		PGNSP PGUID 5030   700 f 700 ));
 DATA(insert (	3580	float8_minmax_ops		PGNSP PGUID 4070   701 t 701 ));
+DATA(insert (	3580	float8_bloom_ops		PGNSP PGUID 5030   701 f 701 ));
 DATA(insert (	3580	abstime_minmax_ops		PGNSP PGUID 4072   702 t 702 ));
+DATA(insert (	3580	abstime_bloom_ops		PGNSP PGUID 5031   702 f 702 ));
 DATA(insert (	3580	reltime_minmax_ops		PGNSP PGUID 4073   703 t 703 ));
+DATA(insert (	3580	reltime_bloom_ops		PGNSP PGUID 5032   703 f 703 ));
 DATA(insert (	3580	macaddr_minmax_ops		PGNSP PGUID 4074   829 t 829 ));
+DATA(insert (	3580	macaddr_bloom_ops		PGNSP PGUID 5033   829 f 829 ));
 DATA(insert (	3580	macaddr8_minmax_ops		PGNSP PGUID 4109   774 t 774 ));
+DATA(insert (	3580	macaddr8_bloom_ops		PGNSP PGUID 5034   774 f 774 ));
 DATA(insert (	3580	inet_minmax_ops			PGNSP PGUID 4075   869 f 869 ));
 DATA(insert (	3580	inet_inclusion_ops		PGNSP PGUID 4102   869 t 869 ));
+DATA(insert (	3580	inet_bloom_ops			PGNSP PGUID 5035   869 f 869 ));
 DATA(insert (	3580	bpchar_minmax_ops		PGNSP PGUID 4076  1042 t 1042 ));
+DATA(insert (	3580	bpchar_bloom_ops		PGNSP PGUID 5036  1042 f 1042 ));
 DATA(insert (	3580	time_minmax_ops			PGNSP PGUID 4077  1083 t 1083 ));
+DATA(insert (	3580	time_bloom_ops			PGNSP PGUID 5037  1083 f 1083 ));
 DATA(insert (	3580	date_minmax_ops			PGNSP PGUID 4059  1082 t 1082 ));
+DATA(insert (	3580	date_bloom_ops			PGNSP PGUID 5038  1082 f 1082 ));
 DATA(insert (	3580	timestamp_minmax_ops	PGNSP PGUID 4059  1114 t 1114 ));
+DATA(insert (	3580	timestamp_bloom_ops		PGNSP PGUID 5038  1114 f 1114 ));
 DATA(insert (	3580	timestamptz_minmax_ops	PGNSP PGUID 4059  1184 t 1184 ));
+DATA(insert (	3580	timestamptz_bloom_ops	PGNSP PGUID 5038  1184 f 1184 ));
 DATA(insert (	3580	interval_minmax_ops		PGNSP PGUID 4078  1186 t 1186 ));
+DATA(insert (	3580	interval_bloom_ops		PGNSP PGUID 5041  1186 f 1186 ));
 DATA(insert (	3580	timetz_minmax_ops		PGNSP PGUID 4058  1266 t 1266 ));
+DATA(insert (	3580	timetz_bloom_ops		PGNSP PGUID 5042  1266 f 1266 ));
 DATA(insert (	3580	bit_minmax_ops			PGNSP PGUID 4079  1560 t 1560 ));
 DATA(insert (	3580	varbit_minmax_ops		PGNSP PGUID 4080  1562 t 1562 ));
 DATA(insert (	3580	numeric_minmax_ops		PGNSP PGUID 4055  1700 t 1700 ));
+DATA(insert (	3580	numeric_bloom_ops		PGNSP PGUID 5045  1700 f 1700 ));
 /* no brin opclass for record, anyarray */
 DATA(insert (	3580	uuid_minmax_ops			PGNSP PGUID 4081  2950 t 2950 ));
+DATA(insert (	3580	uuid_bloom_ops			PGNSP PGUID 5046  2950 f 2950 ));
 DATA(insert (	3580	range_inclusion_ops		PGNSP PGUID 4103  3831 t 3831 ));
 DATA(insert (	3580	pg_lsn_minmax_ops		PGNSP PGUID 4082  3220 t 3220 ));
+DATA(insert (	3580	pg_lsn_bloom_ops		PGNSP PGUID 5047  3220 f 3220 ));
 /* no brin opclass for enum, tsvector, tsquery, jsonb */
 DATA(insert (	3580	box_inclusion_ops		PGNSP PGUID 4104   603 t 603 ));
 /* no brin opclass for the geometric types except box */
diff --git a/src/include/catalog/pg_opfamily.h b/src/include/catalog/pg_opfamily.h
index 0d0ba7c..bf9c578 100644
--- a/src/include/catalog/pg_opfamily.h
+++ b/src/include/catalog/pg_opfamily.h
@@ -160,30 +160,50 @@ DATA(insert OID = 4036 (	2742	jsonb_ops		PGNSP PGUID ));
 DATA(insert OID = 4037 (	2742	jsonb_path_ops	PGNSP PGUID ));
 
 DATA(insert OID = 4054 (	3580	integer_minmax_ops		PGNSP PGUID ));
+DATA(insert OID = 5024 (	3580	integer_bloom_ops		PGNSP PGUID ));
 DATA(insert OID = 4055 (	3580	numeric_minmax_ops		PGNSP PGUID ));
+DATA(insert OID = 5045 (	3580	numeric_bloom_ops		PGNSP PGUID ));
 DATA(insert OID = 4056 (	3580	text_minmax_ops			PGNSP PGUID ));
+DATA(insert OID = 5027 (	3580	text_bloom_ops			PGNSP PGUID ));
 DATA(insert OID = 4058 (	3580	timetz_minmax_ops		PGNSP PGUID ));
+DATA(insert OID = 5042 (	3580	timetz_bloom_ops		PGNSP PGUID ));
 DATA(insert OID = 4059 (	3580	datetime_minmax_ops		PGNSP PGUID ));
+DATA(insert OID = 5038 (	3580	datetime_bloom_ops		PGNSP PGUID ));
 DATA(insert OID = 4062 (	3580	char_minmax_ops			PGNSP PGUID ));
+DATA(insert OID = 5022 (	3580	char_bloom_ops			PGNSP PGUID ));
 DATA(insert OID = 4064 (	3580	bytea_minmax_ops		PGNSP PGUID ));
+DATA(insert OID = 5021 (	3580	bytea_bloom_ops			PGNSP PGUID ));
 DATA(insert OID = 4065 (	3580	name_minmax_ops			PGNSP PGUID ));
+DATA(insert OID = 5023 (	3580	name_bloom_ops			PGNSP PGUID ));
 DATA(insert OID = 4068 (	3580	oid_minmax_ops			PGNSP PGUID ));
+DATA(insert OID = 5028 (	3580	oid_bloom_ops			PGNSP PGUID ));
 DATA(insert OID = 4069 (	3580	tid_minmax_ops			PGNSP PGUID ));
 DATA(insert OID = 4070 (	3580	float_minmax_ops		PGNSP PGUID ));
+DATA(insert OID = 5030 (	3580	float_bloom_ops			PGNSP PGUID ));
 DATA(insert OID = 4072 (	3580	abstime_minmax_ops		PGNSP PGUID ));
+DATA(insert OID = 5031 (	3580	abstime_bloom_ops		PGNSP PGUID ));
 DATA(insert OID = 4073 (	3580	reltime_minmax_ops		PGNSP PGUID ));
+DATA(insert OID = 5032 (	3580	reltime_bloom_ops		PGNSP PGUID ));
 DATA(insert OID = 4074 (	3580	macaddr_minmax_ops		PGNSP PGUID ));
+DATA(insert OID = 5033 (	3580	macaddr_bloom_ops		PGNSP PGUID ));
 DATA(insert OID = 4109 (	3580	macaddr8_minmax_ops		PGNSP PGUID ));
+DATA(insert OID = 5034 (	3580	macaddr8_bloom_ops		PGNSP PGUID ));
 DATA(insert OID = 4075 (	3580	network_minmax_ops		PGNSP PGUID ));
 DATA(insert OID = 4102 (	3580	network_inclusion_ops	PGNSP PGUID ));
+DATA(insert OID = 5035 (	3580	network_bloom_ops		PGNSP PGUID ));
 DATA(insert OID = 4076 (	3580	bpchar_minmax_ops		PGNSP PGUID ));
+DATA(insert OID = 5036 (	3580	bpchar_bloom_ops		PGNSP PGUID ));
 DATA(insert OID = 4077 (	3580	time_minmax_ops			PGNSP PGUID ));
+DATA(insert OID = 5037 (	3580	time_bloom_ops			PGNSP PGUID ));
 DATA(insert OID = 4078 (	3580	interval_minmax_ops		PGNSP PGUID ));
+DATA(insert OID = 5041 (	3580	interval_bloom_ops		PGNSP PGUID ));
 DATA(insert OID = 4079 (	3580	bit_minmax_ops			PGNSP PGUID ));
 DATA(insert OID = 4080 (	3580	varbit_minmax_ops		PGNSP PGUID ));
 DATA(insert OID = 4081 (	3580	uuid_minmax_ops			PGNSP PGUID ));
+DATA(insert OID = 5046 (	3580	uuid_bloom_ops			PGNSP PGUID ));
 DATA(insert OID = 4103 (	3580	range_inclusion_ops		PGNSP PGUID ));
 DATA(insert OID = 4082 (	3580	pg_lsn_minmax_ops		PGNSP PGUID ));
+DATA(insert OID = 5047 (	3580	pg_lsn_bloom_ops		PGNSP PGUID ));
 DATA(insert OID = 4104 (	3580	box_inclusion_ops		PGNSP PGUID ));
 DATA(insert OID = 5000 (	4000	box_ops		PGNSP PGUID ));
 
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 93c031a..5852496 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4374,6 +4374,16 @@ DESCR("BRIN inclusion support");
 DATA(insert OID = 4108 ( brin_inclusion_union	PGNSP PGUID 12 1 0 0 0 f f f f t f i s 3 0 16 "2281 2281 2281" _null_ _null_ _null_ _null_ _null_ brin_inclusion_union _null_ _null_ _null_ ));
 DESCR("BRIN inclusion support");
 
+/* BRIN bloom */
+DATA(insert OID = 5017 ( brin_bloom_opcinfo		PGNSP PGUID 12 1 0 0 0 f f f f t f i s 1 0 2281 "2281" _null_ _null_ _null_ _null_ _null_ brin_bloom_opcinfo _null_ _null_ _null_ ));
+DESCR("BRIN bloom support");
+DATA(insert OID = 5018 ( brin_bloom_add_value	PGNSP PGUID 12 1 0 0 0 f f f f t f i s 4 0 16 "2281 2281 2281 2281" _null_ _null_ _null_ _null_ _null_ brin_bloom_add_value _null_ _null_ _null_ ));
+DESCR("BRIN bloom support");
+DATA(insert OID = 5019 ( brin_bloom_consistent	PGNSP PGUID 12 1 0 0 0 f f f f t f i s 3 0 16 "2281 2281 2281" _null_ _null_ _null_ _null_ _null_ brin_bloom_consistent _null_ _null_ _null_ ));
+DESCR("BRIN bloom support");
+DATA(insert OID = 5020 ( brin_bloom_union		PGNSP PGUID 12 1 0 0 0 f f f f t f i s 3 0 16 "2281 2281 2281" _null_ _null_ _null_ _null_ _null_ brin_bloom_union _null_ _null_ _null_ ));
+DESCR("BRIN bloom support");
+
 /* userlock replacements */
 DATA(insert OID = 2880 (  pg_advisory_lock				PGNSP PGUID 12 1 0 0 0 f f f f t f v u 1 0 2278 "20" _null_ _null_ _null_ _null_ _null_ pg_advisory_lock_int8 _null_ _null_ _null_ ));
 DESCR("obtain exclusive advisory lock");
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index 684f7f2..91b2ecc 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -1819,6 +1819,7 @@ ORDER BY 1, 2, 3;
        2742 |           11 | ?&
        3580 |            1 | <
        3580 |            1 | <<
+       3580 |            1 | =
        3580 |            2 | &<
        3580 |            2 | <=
        3580 |            3 | &&
@@ -1880,7 +1881,7 @@ ORDER BY 1, 2, 3;
        4000 |           25 | <<=
        4000 |           26 | >>
        4000 |           27 | >>=
-(121 rows)
+(122 rows)
 
 -- Check that all opclass search operators have selectivity estimators.
 -- This is not absolutely required, but it seems a reasonable thing
#2Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#1)
Re: WIP: BRIN bloom indexes

On Thu, Oct 19, 2017 at 10:15 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Let's see a query like this:

select * from bloom_test
where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772';

The minmax index produces this plan

Heap Blocks: lossy=2061856
Execution time: 22707.891 ms

Now, the bloom index:

Heap Blocks: lossy=25984
Execution time: 338.946 ms

It's neat to see BRIN being extended like this. Possibly we could
consider making it a contrib module rather than including it in core,
although I don't have strong feelings about it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#3Simon Riggs
simon@2ndquadrant.com
In reply to: Robert Haas (#2)
Re: WIP: BRIN bloom indexes

On 27 October 2017 at 07:20, Robert Haas <robertmhaas@gmail.com> wrote:

It's neat to see BRIN being extended like this. Possibly we could
consider making it a contrib module rather than including it in core,
although I don't have strong feelings about it.

I see that SP-GIST includes two operator classes in core, one default.

Makes sense to do the same thing with BRIN and add this new op class
as a non-default option in core.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#4Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Simon Riggs (#3)
Re: WIP: BRIN bloom indexes

Hi,

On 10/27/2017 09:34 AM, Simon Riggs wrote:

On 27 October 2017 at 07:20, Robert Haas <robertmhaas@gmail.com> wrote:

It's neat to see BRIN being extended like this. Possibly we could
consider making it a contrib module rather than including it in core,
although I don't have strong feelings about it.

I see that SP-GIST includes two operator classes in core, one default.

Makes sense to do the same thing with BRIN and add this new op class
as a non-default option in core.

Not sure "a number of in-core opclasses" is a good reason to (not) add
new ones. Also, we already have two built-in BRIN opclasses (minmax and
inclusion).

In general, "BRIN bloom" can be packed as a contrib module (at least I
believe so). That's not the case for the "BRIN multi-range" which also
requires some changes to some code in brin.c (but the rest can be moved
to contrib module, of course).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#5Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Tomas Vondra (#4)
Re: WIP: BRIN bloom indexes

Tomas Vondra wrote:

Not sure "a number of in-core opclasses" is a good reason to (not) add
new ones. Also, we already have two built-in BRIN opclasses (minmax and
inclusion).

In general, "BRIN bloom" can be packed as a contrib module (at least I
believe so). That's not the case for the "BRIN multi-range" which also
requires some changes to some code in brin.c (but the rest can be moved
to contrib module, of course).

I was rather thinking that if we can make this very robust against the
index growing out of proportion, we should consider ditching the
original minmax and replacing it with multirange minmax, which seems like
it'd have much better behavior.

I don't see any reason to put any of this in contrib.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#6Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#5)
Re: WIP: BRIN bloom indexes

On Fri, Oct 27, 2017 at 2:55 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

I was rather thinking that if we can make this very robust against the
index growing out of proportion, we should consider ditching the
original minmax and replace it with multirange minmax, which seems like
it'd have much better behavior.

If the multirange stuff can be done in such a way that it's just an
updated version of the same opclass, and backward-compatible on disk,
then I think this would be OK. But otherwise I don't think we ditch
what already exists. That would break upgrades via both pg_upgrade
and pg_dump, which seems like too high a price to pay to get rid of
some arguably-worse code. It's actually WORSE to drop an opclass
(which will make dumps not restore) than to do something like bump
HASH_VERSION (which doesn't affect pg_dump at all and for pg_upgrade
only requires post-upgrade steps rather than failing outright).

I don't see any reason to put any of this in contrib.

Well, for one thing, it makes it easier to drop stuff later if we
decide we don't really want it. I think that whichever BRIN opclasses
are thought to be high quality and of general utility can go into
core, just as we've done with other index AMs. However, if we think
that something is useful-ish but maybe not something to which we want
to make a permanent commitment, putting it into contrib is good for
that.

Upgrades are easier for things in contrib, too, because there's a
built-in mechanism for people to try updating the SQL extensions
(ALTER EXTENSION .. UPDATE) and if it fails they can adjust things and
try again. When you just make a hard change to SQL definitions in a
new release, any failures that result from those changes just turn
into upgrade failures, which is IMHO a lot more painful than a failure
to update an extension version while the database is still up and
usable the whole time.
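
For example, the upgrade path could be as simple as this (brin_bloom is
a hypothetical extension name, used here only for illustration):

    CREATE EXTENSION brin_bloom;
    -- after installing a new release, at the user's leisure:
    ALTER EXTENSION brin_bloom UPDATE;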

For instance, if pg_stat_activity were bundled in an extension and we
made the C code backward-compatible with old extension versions,
then some of the upgrade pain users have had with that over the years
could have been avoided. People could update to the new version
without failures and then at their leisure try to update. If the
update failed due to dependencies, then they would have time to figure
out what to do about it and try again later; in the meantime, they'd
be on the new version.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#7Sokolov Yura
y.sokolov@postgrespro.ru
In reply to: Tomas Vondra (#1)
Re: WIP: BRIN bloom indexes

On 2017-10-19 23:15, Tomas Vondra wrote:


select * from bloom_test
where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772';

The minmax index produces this plan

QUERY PLAN
------------------------------------------------------------------------
Bitmap Heap Scan on bloom_test
(cost=390.31..2130910.82 rows=20593 width=49)
(actual time=197.974..22707.311 rows=20000 loops=1)
Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
Rows Removed by Index Recheck: 199980000
Heap Blocks: lossy=2061856
-> Bitmap Index Scan on test_brin_idx
(cost=0.00..385.16 rows=5493161 width=0)
(actual time=133.463..133.463 rows=20619520 loops=1)
Index Cond: (id =
'8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
Planning time: 0.165 ms
Execution time: 22707.891 ms
(8 rows)

Which is not that great, and the long duration is a direct consequence
of "wide" ranges - the bitmap heap scan had to scan pretty much the
whole table. (A sequential scan takes only about 15 seconds.)

Now, the bloom index:

QUERY PLAN
------------------------------------------------------------------------
Bitmap Heap Scan on bloom_test
(cost=5898.31..2136418.82 rows=20593 width=49)
(actual time=24.229..338.324 rows=20000 loops=1)
Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
Rows Removed by Index Recheck: 2500448
Heap Blocks: lossy=25984
-> Bitmap Index Scan on test_bloom_idx
(cost=0.00..5893.16 rows=5493161 width=0)
(actual time=22.811..22.811 rows=259840 loops=1)
Index Cond: (id =
'8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
Planning time: 0.201 ms
Execution time: 338.946 ms
(8 rows)

So, way better.

For comparison, a simple index scan / bitmap index scan using the btree
takes only about ~10ms in this case, but that's mostly thanks to the
whole dataset being entirely in-memory.

There are some remaining open questions.

1) sizing the bloom filter

Firstly, the size of the filter is currently static, based on 1000
distinct values per range with a 5% false-positive rate. Those are
mostly arbitrary values, and I don't have any clear idea how to
determine the optimal values.

Maybe we could do the sizing based on the ndistinct value for the
indexed column? It also depends on the pages_per_range value, so perhaps
it should be a user-defined option too.
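
For reference, the textbook sizing of a plain bloom filter for n
expected distinct values and a target false-positive rate p is
m = -n * ln(p) / ln(2)^2 bits, with k = (m / n) * ln(2) hash functions.
A minimal sketch in C (purely illustrative, not code from the patch):

    #include <math.h>

    static void
    bloom_optimal_params(double n, double p, int *nbits, int *nhashes)
    {
        /* m = -n * ln(p) / ln(2)^2, k = (m / n) * ln(2) */
        *nbits = (int) ceil(-n * log(p) / pow(log(2.0), 2.0));
        *nhashes = (int) round((*nbits / n) * log(2.0));
    }

For n = 1000 and p = 0.05 this gives roughly 6236 bits (~780 bytes) and
4 hash functions per filter.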

An interesting feature is that the bloom indexes "degrade locally". If a
page range has significantly more distinct values, the bloom filter may
be way too small and will suffer from a high false-positive rate. But it
only affects that one page range, and the rest may be perfectly fine.

I was thinking about disabling the bloom filter when it gets "too full"
(too many bits set, resulting in a high false-positive rate). But that
seems like a bad idea - the false-positive rate automatically jumps to
100% for that range, and we only save a much smaller amount of space in
the index. So even if the false-positive rate is 50% it still seems
efficient to keep the bloom filter.

Another thing to consider is what happens when there are very few
distinct values in a given page range. The current code tries to be a
bit smart in this case - instead of building the bloom filter right
away, it initially keeps the exact values and only switches to a bloom
filter once the exact values would fill the same space.

2) costing

The other thing is the costing of BRIN indexes. At this point it's
rather simplistic and independent of the operator class, so the only
thing that matters is the size of the BRIN index. That means a "minmax"
index is always considered cheaper than a "bloom" index, because the
optimizer has no idea that "minmax" indexes are more vulnerable to "wide
ranges".

But perhaps this is a non-issue, as the bloom indexes are meant for
cases when minmax indexes don't work. And when minmax indexes work, they
work better than bloom. So people are unlikely to build both of them at
the same time.

regards

Hi, Tomas

The BRIN bloom index is a really cool feature that definitely should be
in the core distribution (either in contrib or builtin)!!!

A small suggestion for the algorithm:

It is a well-known practice not to calculate a whole hash function for
every probe, but to use a double hashing approach instead.
An example proving the double hashing approach works for bloom filters:
https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf

So, instead of:

    for (i = 0; i < filter->nhashes; i++)
    {
        /* compute the i-th hash with a seed, used for the bloom filter */
        uint32 h = ((uint32) DatumGetInt64(hash_uint32_extended(value, i)))
                        % filter->nbits;

        /* set or check bit h */
    }

such a construction could be used:

    /*
     * Compute a single 64-bit hash (with a fixed seed) and keep all of
     * its bits, so the two moduli below see independent values.
     */
    uint64 big_h = (uint64) DatumGetInt64(hash_uint32_extended(value, 0));
    uint32 h = big_h % filter->nbits;
    /* ensures d is never >= filter->nbits */
    uint32 d = big_h % (filter->nbits - 1);

    for (i = 0; i < filter->nhashes; i++)
    {
        /* set or check bit h */

        /* compute the next bit h */
        h += d++;
        if (h >= filter->nbits)
            h -= filter->nbits;
        if (d == filter->nbits)
            d = 0;
    }

Taking one 64-bit hash result modulo two coprime numbers (`nbits` and
`nbits-1` are consecutive integers, hence always coprime) gives two
good-quality functions usable for double hashing.

With regards,
--
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company


#8Nico Williams
nico@cryptonector.com
In reply to: Tomas Vondra (#1)
Re: WIP: BRIN bloom indexes

On Thu, Oct 19, 2017 at 10:15:32PM +0200, Tomas Vondra wrote:

A bloom filter index would, indeed, be wonderful.

Comments:

+ * We use an optimisation that initially we store the uint32 values directly,
+ * without the extra hashing step. And only later filling the bitmap space,
+ * we switch to the regular bloom filter mode.

I don't think that optimization is worthwhile. If I'm going to be using
a Bloom filter, it's because I expect to have a lot of rows.

(Granted, if I CREATE TABLE .. LIKE .. INCLUDING INDEXES, and the new
table just won't have lots of rows, then I might want this optimization,
but I can always just drop the Bloom filter index, or not include
indexes in the LIKE.)

+ * XXX Perhaps we could save a few bytes by using different data types, but
+ * considering the size of the bitmap, the difference is negligible.

A bytea is all that's needed. See below.

+ * XXX We could also implement "sparse" bloom filters, keeping only the
+ * bytes that are not entirely 0. That might make the "sorted" phase
+ * mostly unnecessary.

Filter compression is not worthwhile. You want to have a fairly uniform
hash distribution, and you want to end up with a Bloom filter that's not
much more than 50% full. That means that when near 50% full the filter
will not be very sparse at all. Optimizing for the not common case
doesn't seem worthwhile, and adds complexity.

+ * XXX We can also watch the number of bits set in the bloom filter, and
+ * then stop using it (and not store the bitmap, to save space) when the
+ * false positive rate gets too high.

Ah, right, what you want is a "Scalable Bloom Filter".

A Scalable Bloom filter is actually a series of Bloom filters where all
but the newest filter are closed to additions, and the newest filter is
where you do all the additions. You generally want to make each new
filter bigger than the preceding one because when searching a Scalable
Bloom filter you have to search *all* of them, so you want to minimize
the number of filters.

Eventually, when you have enough sub-filters, you may want to re-write
the whole thing so that you start with a single sub-filter that is large
enough.

The format of the bytea might then be something like:

<size><bitmap>[<size><bitmap>[...]]

where the last bitmap is the filter that is open to additions.
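
As a minimal sketch of that scheme in C (the struct layout, the growth
constants, and the helpers sbf_extend / bloom_set_bits / bloom_test_bits
are all hypothetical, only meant to make the idea concrete):

    #define SBF_GROWTH      2       /* size ratio between generations */
    #define SBF_FILL_LIMIT  0.5     /* open filter is "full" at 50% of bits */

    typedef struct SubFilter
    {
        uint32      nbits;          /* size of this sub-filter, in bits */
        uint32      nbits_set;      /* number of bits currently set */
        uint8      *bitmap;
    } SubFilter;

    typedef struct ScalableBloom
    {
        int         nfilters;
        SubFilter **filters;        /* filters[nfilters - 1] is open */
    } ScalableBloom;

    /* additions always go to the newest (open) sub-filter */
    static void
    sbf_add(ScalableBloom *sbf, uint32 value)
    {
        SubFilter  *open = sbf->filters[sbf->nfilters - 1];

        /* when the open filter gets too full, start a larger generation */
        if (open->nbits_set > SBF_FILL_LIMIT * open->nbits)
            open = sbf_extend(sbf, open->nbits * SBF_GROWTH);

        bloom_set_bits(open, value);    /* set k bits, as in a plain filter */
    }

    /* a value may live in any generation, so a lookup probes all of them */
    static bool
    sbf_maybe_contains(ScalableBloom *sbf, uint32 value)
    {
        int         i;

        for (i = 0; i < sbf->nfilters; i++)
        {
            if (bloom_test_bits(sbf->filters[i], value))
                return true;
        }
        return false;
    }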

I wonder if there are write concurrency performance considerations
here...

It might be better to have a bytea value per-sub-filter so that there is
no lock contention for the closed filters. The closed sub-filters are
constant, thus not even shared locks are needed for them, and especially
not exclusive locks.

Writing to the filter will necessitate locking the entire open filter,
or else byte-range locking on it. Something to think about.

Now, what about query performance? Unlike the "minmax" indexes, the
"bloom" filter can only handle equality conditions.

A Bloom filter has non-zero false positives for existence, but zero
false positives for non-existence.

This means that a Bloom filter index can only be used for:

a) non-existence tests (with equality predicates, as you point out),

b) as an optimization to avoid slower index checks (or heap scans) when
the Bloom filter indicates non-existence.

(b) is really just an internal application of (a).

There might be applications where false positives might be ok in a query
like:

SELECT a.* FROM a a JOIN b b USING (some_col);

but for most real-world queries like that, allowing false positives by
default would be very bad. An option in the index declaration could be
used to enable existence equality predicates, but I would not include
such an option initially -- let's see if anyone actually has a use case
for it first.

Of course, for something like:

SELECT a.*, b.* FROM a a JOIN b b USING (some_col);

a Bloom filter can only be used as an optimization to avoid using a
slower index (or heap scan) on the inner table source.

What I'm getting at is that the query planner really needs to understand
that a Bloom filter is a probabilistic data structure.

Nico
--


#9Sokolov Yura
y.sokolov@postgrespro.ru
In reply to: Nico Williams (#8)
Re: WIP: BRIN bloom indexes

On 2017-10-27 20:17, Nico Williams wrote:

What I'm getting at is that the query planner really needs to understand
that a Bloom filter is a probabilistic data structure.

PostgreSQL has a lot of probabilistic indexes.

Some GiST opclasses return false positives and tell the executor to
recheck their results.
A bitmap index scan, when there are a lot of result tuples, falls back
to storing only page numbers, without actual TIDs, and the executor then
scans those heap pages to find the matching tuples.

A BRIN index just "recommends heap pages for the executor to scan". Thus
a BRIN index is, as a whole, just an optimisation (regardless of whether
it is `minmax` or `bloom`).
So that is ok.

--
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company


#10Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Nico Williams (#8)
Re: WIP: BRIN bloom indexes

Hi,

On 10/27/2017 07:17 PM, Nico Williams wrote:

On Thu, Oct 19, 2017 at 10:15:32PM +0200, Tomas Vondra wrote:

A bloom filter index would, indeed, be wonderful.

Comments:

+ * We use an optimisation that initially we store the uint32 values directly,
+ * without the extra hashing step. And only later filling the bitmap space,
+ * we switch to the regular bloom filter mode.

I don't think that optimization is worthwhile. If I'm going to be using
a Bloom filter, it's because I expect to have a lot of rows.

(Granted, if I CREATE TABLE .. LIKE .. INCLUDING INDEXES, and the new
table just won't have lots of rows, then I might want this optimization,
but I can always just drop the Bloom filter index, or not include
indexes in the LIKE.)

I think you're confusing "rows" and "distinct values". Furthermore, it's
rather irrelevant how many rows are in the table, because BRIN indexes
split the table into "ranges" that are 1MB by default. And you can only
squash a certain number of rows into such a range.

The idea of the optimization is to efficiently support cases where each
range contains only a small number of distinct values - which is quite
common in the cases I described in my initial message (with bursts of
rows with the same IP / client identifier etc.).

+ * XXX Perhaps we could save a few bytes by using different data types, but
+ * considering the size of the bitmap, the difference is negligible.

A bytea is all that's needed. See below.

+ * XXX We could also implement "sparse" bloom filters, keeping only the
+ * bytes that are not entirely 0. That might make the "sorted" phase
+ * mostly unnecessary.

Filter compression is not worthwhile. You want to have a fairly uniform
hash distribution, and you want to end up with a Bloom filter that's not
much more than 50% full. That means that when near 50% full the filter
will not be very sparse at all. Optimizing for the not common case
doesn't seem worthwhile, and adds complexity.

Properly sizing the bloom filter requires knowledge of many variables,
in particular the number of distinct values expected to be added to the
filter. But we don't really know that number, and we also don't even
know many of the values useful for estimating it (how many rows will
fit into a range, the number of distinct values in the whole table, etc.)

So the idea was to oversize the bloom filter, and then use the sparse
representation to reduce the size.

+ * XXX We can also watch the number of bits set in the bloom filter, and
+ * then stop using it (and not store the bitmap, to save space) when the
+ * false positive rate gets too high.

Ah, right, what you want is a "Scalable Bloom Filter".

That's not what I had in mind. My idea was to size the bloom filter on
"best effort" basis, and then if one range gets a bit too inaccurate
then just get rid of the filter. If that happens for many ranges, we
should probably allow specifying parameters as relopts for the index.

A Scalable Bloom filter is actually a series of Bloom filters where all
but the newest filter are closed to additions, and the newest filter is
where you do all the additions. You generally want to make each new
filter bigger than the preceding one because when searching a Scalable
Bloom filter you have to search *all* of them, so you want to minimize
the number of filters.

Eventually, when you have enough sub-filters, you may want to re-write
the whole thing so that you start with a single sub-filter that is large
enough.

The format of the bytea might then be something like:

<size><bitmap>[[<size><bitmap>[...]]

where the last bitmap is the filter that is open to additions.

I think this is really over-engineering, and I certainly don't plan
to extend the patch in this direction.

I do not expect these parameters (particularly the number of distinct
values in a range) to significantly change over time, so the easiest
solution is to provide a reloption to specify that number in
CREATE/ALTER index.
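
Something along these lines, say (n_distinct_per_range is a hypothetical
reloption that does not exist in the posted patch; pages_per_range is
the existing BRIN one):

    CREATE INDEX test_bloom_idx ON bloom_test
        USING brin (id uuid_bloom_ops)
        WITH (pages_per_range = 128, n_distinct_per_range = 1000);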

Alternatively, we could allow the summarization process to gather some
statistics (either on the whole range, or perhaps the whole table), and
store them somewhere for later use. For example to compute the number of
distinct values per range (in the existing data), and use that later.

...

Of course, for something like:

SELECT a.*, b.* FROM a a JOIN b b USING (some_col);

a Bloom filter can only be used as an optimization to avoid using a
slower index (or heap scan) on the inner table source.

What I'm getting at is that the query planner really needs to understand
that a Bloom filter is a probabilistic data structure.

It does, and we never produce incorrect results. Which seems like the
right thing to do.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#11Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Sokolov Yura (#7)
Re: WIP: BRIN bloom indexes

Hi,

On 10/27/2017 05:22 PM, Sokolov Yura wrote:

Hi, Tomas

The BRIN bloom index is a really cool feature that definitely should be
in the core distribution (either in contrib or builtin)!!!

A small suggestion for the algorithm:

It is a well-known practice not to calculate a whole hash function for
every probe, but to use a double hashing approach instead.
An example proving the double hashing approach works for bloom filters:
https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf


OK, thanks for the suggestion.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#12Nico Williams
nico@cryptonector.com
In reply to: Tomas Vondra (#10)
Re: WIP: BRIN bloom indexes

On Fri, Oct 27, 2017 at 10:06:58PM +0200, Tomas Vondra wrote:

I think you're confusing "rows" and "distinct values". Furthermore, it's
rather irrelevant how many rows are in the table, because BRIN indexes
split the table into "ranges" that are 1MB by default. And you can only
squash a certain number of rows into such a range.

The idea of the optimization is to efficiently support cases where each
range contains only a small number of distinct values - which is quite
common in the cases I described in my initial message (with bursts of
rows with the same IP / client identifier etc.).

What range? It's just bits to set.

The point is that Bloom filters should ideally be about 50% full -- much
less than that and you are wasting space, much more than that and your
false positive rate becomes useless.
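
(With a fraction f of the bits set and k hash functions, the
false-positive rate is roughly f^k: at 50% full with k = 4 that is
0.5^4 ~ 6%, while at 80% full it already degrades to 0.8^4 ~ 41%.)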

Filter compression is not worthwhile. You want to have a fairly uniform
hash distribution, and you want to end up with a Bloom filter that's not
much more than 50% full. That means that when near 50% full the filter
will not be very sparse at all. Optimizing for the not common case
doesn't seem worthwhile, and adds complexity.

Properly sizing the bloom filter requires knowledge of many variables,
in particular the number of distinct values expected to be added to the
filter. But we don't really know that number, and we also don't even
know many of the values useful for estimating it (how many rows will
fit into a range, the number of distinct values in the whole table, etc.)

This is why Scalable Bloom filters exist: so you can adapt.

So the idea was to oversize the bloom filter, and then use the sparse
representation to reduce the size.

A space-efficient representation of sparse bitmaps is not as simple as a
Scalable Bloom filter.

And a filter that requires user intervention to size correctly, or which
requires rebuilding when it gets too full, is also not as convenient as
a Scalable Bloom filter.

+ * XXX We can also watch the number of bits set in the bloom filter, and
+ * then stop using it (and not store the bitmap, to save space) when the
+ * false positive rate gets too high.

Ah, right, what you want is a "Scalable Bloom Filter".

That's not what I had in mind. My idea was to size the bloom filter on
"best effort" basis, and then if one range gets a bit too inaccurate
then just get rid of the filter. If that happens for many ranges, we
should probably allow specifying parameters as relopts for the index.

Scalable Bloom filters are way more convenient than that. They're
always not-too-full, and only the open filter is less-than-full-enough.

And since you should grow them somewhat exponentially (but not quite as
much as a doubling in each generation), there should never be too many
filters to search. But you can always "vacuum" (rebuild) the filter
starting with the size of the last filter added prior to the vacuum.

I think this is really over-engineering, and I certainly don't plan
to extend the patch in this direction.

Whereas I think a sparse bitmap representation is overly complex and
"over-engineering". Scalable Bloom filters are very well understood in
the literature -- there's nothing terribly complex to them.

I do not expect these parameters (particularly the number of distinct
values in a range) to significantly change over time, so the easiest
solution is to provide a reloption to specify that number in
CREATE/ALTER index.

Doesn't this depend on the use-case though? I think a self-tuning
system is better than one that doesn't self-tune. Call that
over-engineering if you like, but it doesn't make it not desirable :)

Alternatively, we could allow the summarization process to gather some
statistics (either on the whole range, or perhaps the whole table), and
store them somewhere for later use. For example to compute the number of
distinct values per range (in the existing data), and use that later.

Again, Scalable Bloom filters automatically adapt without needing a
statistics gathering exercise. All you need is a good hash function
(that's another topic).

Scalable Bloom filters are a trivial extension of Bloom filters.

What I'm getting at is that the query planner really needs to understand
that a Bloom filter is a probabilistic data structure.

It does, and we never produce incorrect results. Which seems like the
right thing to do.

OK, I saw your other response about this. I didn't know that PG already
has support for probabilistic indexes.

Nico
--


#13Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Nico Williams (#12)
Re: WIP: BRIN bloom indexes

Hi,

On 10/28/2017 02:41 AM, Nico Williams wrote:

On Fri, Oct 27, 2017 at 10:06:58PM +0200, Tomas Vondra wrote:

I think you're confusing "rows" and "distinct values". Furthermore, it's
rather irrelevant how many rows are in the table, because BRIN indexes
split the table into "ranges" that are 1MB by default. And you can only
squash a certain number of rows into such a range.

The idea of the optimization is to efficiently support cases where each
range contains only a small number of distinct values - which is quite
common in the cases I described in my initial message (with bursts of
rows with the same IP / client identifier etc.).

What range? It's just bits to set.

The point is that Bloom filters should ideally be about 50% full -- much
less than that and you are wasting space, much more than that and your
false positive rate becomes useless.

The range defined by BRIN indexes. BRIN indexes split the table into a
sequence of page ranges (128 pages by default, i.e. 1MB), and the bloom
filters are built on those table "chunks". So it's irrelevant how many
rows are in the table in total, what matters is just the page range.
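
The range size is controlled by the existing pages_per_range reloption,
e.g.:

    -- 64 x 8kB pages, i.e. one bloom filter per 512kB of the table
    CREATE INDEX test_bloom_idx ON bloom_test
        USING brin (id uuid_bloom_ops) WITH (pages_per_range = 64);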

Filter compression is not worthwhile. You want to have a fairly uniform
hash distribution, and you want to end up with a Bloom filter that's not
much more than 50% full. That means that when near 50% full the filter
will not be very sparse at all. Optimizing for the not common case
doesn't seem worthwhile, and adds complexity.

Properly sizing the bloom filter requires knowledge of many variables,
in particular the number of distinct values expected to be added to the
filter. But we don't really know that number, and we also don't even
know many of the values useful for estimating it (how many rows will
fit into a range, the number of distinct values in the whole table, etc.)

This is why Scalable Bloom filters exist: so you can adapt.

So the idea was to oversize the bloom filter, and then use the sparse
representation to reduce the size.

A space-efficient representation of sparse bitmaps is not as simple as a
Scalable Bloom filter.

And a filter that requires user intervention to size correctly, or which
requires rebuilding when it gets too full, is also not as convenient as
a Scalable Bloom filter.

Maybe. For now I'm fine with the simple relopts-based approach and don't
plan to spend much time on these improvements. Which is not to say that
they may not be worthwhile, but it's not the only thing I'm working on.

I also suspect there are practical implications, as some things are only
possible with equally-sized bloom filters. I believe the "union"
(missing in the current patch) will rely on that when merging bloom filters.

+ * XXX We can also watch the number of bits set in the bloom filter, and
+ * then stop using it (and not store the bitmap, to save space) when the
+ * false positive rate gets too high.

Ah, right, what you want is a "Scalable Bloom Filter".

That's not what I had in mind. My idea was to size the bloom filter on
"best effort" basis, and then if one range gets a bit too inaccurate
then just get rid of the filter. If that happens for many ranges, we
should probably allow specifying parameters as relopts for the index.

Scalable Bloom filters are way more convenient than that. They're
always not-too-full, and only the open filter is less-than-full-enough.

And since you should grow them somewhat exponentially (but not quite as
much as a doubling in each generation), there should never be too many
filters to search. But you can always "vacuum" (rebuild) the filter
starting with the size of the last filter added prior to the vacuum.

I think this is really over-engineering, and I certainly don't plan
to extend the patch in this direction.

Whereas I think a sparse bitmap representation is overly complex and
"over-engineering". Scalable Bloom filters are very well understood in
the literature -- there's nothing terribly complex to them.

That "sparse bitmap" is mentioned merely in an XXX comment as a thing to
consider, not something I plan to work right now. Perhaps it's not
really a good idea in general, or maybe it's addressing a non-issue. In
which case we don't need scalable bloom filters either.

I do not expect these parameters (particularly the number of distinct
values in a range) to significantly change over time, so the easiest
solution is to provide a reloption to specify that number in
CREATE/ALTER index.

Doesn't this depend on the use-case though? I think a self-tuning
system is better than one that doesn't self-tune. Call that
over-engineering if you like, but it doesn't make it not desirable :)

I'm simply not convinced we need that additional complexity, which has
its cost too. Also, if "self-tuning" means creating multiple bloom
filters with different sizes, then it may have an impact on some
operations (like "union").

Alternatively, we could allow the summarization process to gather some
statistics (either on the whole range, or perhaps the whole table), and
store them somewhere for later use. For example to compute the number of
distinct values per range (in the existing data), and use that later.

Again, Scalable Bloom filters automatically adapt without needing a
statistics gathering exercise. All you need is a good hash function
(that's another topic).

Scalable Bloom filters are a trivial extension of Bloom filters.

Sure. But it's additional complexity and I'm not convinced we need it
(considering how BRIN splits the table into ranges of limited size).
Hence I'll adopt the simple approach in my patch, and if it turns out to
be necessary, it can be improved in this direction.

What I'm getting at is that the query planner really needs to understand
that a Bloom filter is a probabilistic data structure.

It does, and we never produce incorrect results. Which seems like the
right thing to do.

OK, I saw your other response about this. I didn't know that PG already
has support for probabilistic indexes.

It's not so much "probabilistic" as understanding that some indexes may
produce false positives, in which case a recheck is needed.

Nico

Thanks for the comments and feedback!

cheers

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
