WIP: bloom filter in Hash Joins with batches
Hi,
while working on the Hash Join improvements, I've been repeatedly running
into the idea of bloom filters - various papers on hash joins mention them
as a way to optimize access to the hash table by doing fewer lookups, etc.
Sadly, I've been unable to see any real benefit from using a bloom filter
that way, which I believe is mostly due to NTUP_PER_BUCKET=1 making the
lookups quite efficient already (so the room for bloom filter improvements
is narrow).
The one case where a bloom filter might still help is when the filter fits
into L3 cache (a few MBs) while the hash table (or more accurately the
buckets) does not - then there's a chance that the filter, despite needing
multiple lookups, comes out ahead.
But I think there's another case where a bloom filter might be way more
useful in Hash Join - when we do batching. What we currently do is:
1) build the batches for the hash table (inner relation)
2) read the outer relation (usually the larger one), and split it
into batches just like the hash table (the bucket/batch mapping is
sketched below)
3) while doing (2) we join the first batch, and write the remaining
batches to disk (temporary files)
4) we read the batches one by one (for both tables) and do the join
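Both sides use the same mapping from the hash value to a (bucket, batch)
pair, so matching rows always land in the same batch. A simplified sketch
of that mapping (mirroring ExecHashGetBucketAndBatch; nbuckets and nbatch
are assumed to be powers of two):

static void
get_bucket_and_batch(uint32 hashvalue, int log2_nbuckets,
                     int nbuckets, int nbatch,
                     int *bucketno, int *batchno)
{
    /* the low bits of the hash value select the bucket */
    *bucketno = hashvalue & (nbuckets - 1);

    /* the next bits (above the bucket bits) select the batch */
    if (nbatch > 1)
        *batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);
    else
        *batchno = 0;
}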
Now, imagine that only some of the rows in the outer table actually have a
match in the hash table. Currently we write even the non-matching rows into
the temporary files, but with a bloom filter built on the whole hash table
(all the batches at once) we can skip that for some join types.
For an inner join we can immediately discard the outer row; for a left join
we can immediately emit the row (with NULLs for the inner side). In both
cases we completely eliminate the overhead of writing the tuple to a
temporary file and then reading it back.
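To illustrate the mechanics outside the executor, here's a minimal
self-contained sketch (bloom_init/bloom_add/bloom_check are made-up names,
not the patch's API, and the k hashes are derived from the 32-bit hash
value using the murmur3 finalizer instead of full MurmurHash3 calls):

#include <math.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct
{
    int      nbits;     /* m = number of bits */
    int      nhashes;   /* k = number of hash functions */
    uint8_t *data;      /* the bit array */
} bloom_t;

/* murmur3 32-bit finalizer - derives the i-th hash from one hash value */
static uint32_t
mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6b;
    h ^= h >> 13;
    h *= 0xc2b2ae35;
    h ^= h >> 16;
    return h;
}

/* size the filter for nrows values and a target false positive rate */
static bloom_t *
bloom_init(double nrows, double error)
{
    bloom_t *f = malloc(sizeof(bloom_t));

    f->nbits = (int) ceil(-nrows * log(error) / (log(2.0) * log(2.0)));
    f->nhashes = (int) round(log(2.0) * f->nbits / nrows);
    f->data = calloc((f->nbits + 7) / 8, 1);

    return f;
}

/* while building the hash table, called for every inner tuple */
static void
bloom_add(bloom_t *f, uint32_t hashvalue)
{
    int i;

    for (i = 0; i < f->nhashes; i++)
    {
        uint32_t h = mix32(hashvalue ^ (uint32_t) i) % f->nbits;

        f->data[h / 8] |= (1 << (h % 8));
    }
}

/* called for every outer tuple - 'false' means there can't be a match */
static bool
bloom_check(bloom_t *f, uint32_t hashvalue)
{
    int i;

    for (i = 0; i < f->nhashes; i++)
    {
        uint32_t h = mix32(hashvalue ^ (uint32_t) i) % f->nbits;

        if (!(f->data[h / 8] & (1 << (h % 8))))
            return false;
    }

    return true;    /* maybe present - false positives are possible */
}

In the patch the check happens in ExecHashJoin's HJ_NEED_NEW_OUTER state,
right after the outer tuple's hash value is computed, so a filtered-out
tuple is never written to a batch file at all.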
The attached patch is a PoC of this approach - I'm pretty sure it's not
perfectly correct (e.g. I only tried it with inner join), but it's good
enough for demonstrating the benefits. It's rather incomplete (see the
end of this e-mail), and I'm mostly soliciting some early feedback at
this point.
The numbers presented here are for a test case like this:
CREATE TABLE dim (id INT, dval TEXT);
CREATE TABLE fact (id INT, fval TEXT);
INSERT INTO dim SELECT i, md5(i::text)
FROM generate_series(1,10000000) s(i);
-- repeat 10x
INSERT INTO fact SELECT * FROM dim;
and a query like this:
SELECT COUNT(fval) FROM fact JOIN dim USING (id) WHERE dval < 'a';
with different values in the WHERE condition to select a fraction of the
inner 'dim' table - this directly affects what portion of the 'fact'
table has a matching row, and also the size of the hash table (and
number of batches).
Now, some numbers from a machine with 8GB of RAM (the 'fact' table has
~6.5GB of data, so there's actually quite a bit of memory pressure,
forcing the temp files to disk etc.).
With work_mem=16MB, it looks like this:
 batches  filter  select.   bloom  master  bloom/master
 -------------------------------------------------------
       4       1    6.25%   23871   48631        49.09%
       8       2   12.50%   25752   56692        45.42%
       8       3   18.75%   31273   57455        54.43%
      16       4   25.01%   37430   62325        60.06%
      16       5   31.25%   39005   61143        63.79%
      16       6   37.50%   46157   63533        72.65%
      16       7   43.75%   53500   65483        81.70%
      32       8   49.99%   53952   65730        82.08%
      32       9   56.23%   55187   67521        81.73%
      32       a   62.49%   64454   69448        92.81%
      32       b   68.73%   66937   71692        93.37%
      32       c   74.97%   73323   72060       101.75%
      32       d   81.23%   76703   73513       104.34%
      32       e   87.48%   81970   74890       109.45%
      32       f   93.74%   86102   76257       112.91%
Here 'batches' is the number of batches used for the join, 'filter' is the
value used in the WHERE condition, and 'select.' is the fraction of the
'dim' table matching the condition (and thus also of the 'fact' table).
'bloom' and 'master' are timings of the query in milliseconds, and
bloom/master is the ratio of the two runtimes - so for example 49% means
the hash join with the bloom filter ran ~2x as fast.
Admittedly, work_mem=16MB is quite low, but that's just a way to force
batching. What really matters is the number of batches and selectivity
(how many tuples we can eliminate using the bloom filter).
For work_mem=64MB it looks like this:
 batches  filter  select.   bloom  master  bloom/master
 -------------------------------------------------------
       1       1    6.25%   24846   23854       104.16%
       2       2   12.50%   24369   45672        53.36%
       2       3   18.75%   30432   47454        64.13%
       4       4   25.01%   36175   59741        60.55%
       4       5   31.25%   43103   62631        68.82%
       4       6   37.50%   48179   64079        75.19%
So initially it's a bit slower (there's no batching in this case, so no
bloom filter gets built, but the code is a bit silly and still does some
checks). But once we start batching, it's ~2x as fast again, and then
slowly degrades as the selectivity increases.
Attached is a spreadsheet with results for various work_mem values, and
also for a smaller data set (just 30M rows in the 'fact' table) that
easily fits into memory. Yet it shows similar gains, shaving off ~40% in
the best case, which suggests the improvement is not just due to reduced
I/O when the temp files are forced to disk.
As I mentioned, the patch is incomplete in several ways:
1) It does not properly count the bloom filter (which may be quite big -
see the sizing estimate after this list) into work_mem.
2) It probably does not work for outer joins at this point.
3) Currently the bloom filter is used whenever we do batching, but it
should really be driven by selectivity too - it'd be good to (a)
estimate the fraction of 'fact' tuples having a match in the hash
table, and skip the bloom filter if it's over ~60% or so, and (b)
count the matches at runtime and disable the bloom filter if we
cross some threshold.
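For a rough idea of the sizes involved in (1), using the usual formulas

    m = -n * ln(p) / (ln 2)^2        k = (m / n) * ln 2

a filter sized for all 10M rows of 'dim' at the 5% false positive rate the
patch uses comes out at m ~ 62M bits (~7.8 MB) with k = 4 hashes - a
significant fraction of work_mem=16MB, so leaving it out of the accounting
is not negligible. (The patch actually sizes the filter for the estimated
number of rows entering the hash, so it's usually smaller than that.)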
But overall, this seems like a nice optimization opportunity for hash
joins on large data sets, where batching is necessary.
Ideas?
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
bloom-hashjoin-v1.patch (text/x-diff)
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 5e05ec3..aad5e98 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -36,6 +36,7 @@
#include "utils/memutils.h"
#include "utils/lsyscache.h"
#include "utils/syscache.h"
+#include "utils/murmur3.h"
static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
@@ -47,9 +48,16 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
uint32 hashvalue,
int bucketNumber);
static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
+static void ExecHashBloomAddValue(HashJoinTable hashtable, uint32 hashvalue);
static void *dense_alloc(HashJoinTable hashtable, Size size);
+static BloomFilter BloomFilterInit(double nrows, double error);
+
+/* let's shoot for a 5% false positive rate (arbitrary value) */
+#define BLOOM_ERROR_RATE 0.05
+
+
/* ----------------------------------------------------------------
* ExecHash
*
@@ -112,6 +120,8 @@ MultiExecHash(HashState *node)
{
int bucketNumber;
+ ExecHashBloomAddValue(hashtable, hashvalue);
+
bucketNumber = ExecHashGetSkewBucket(hashtable, hashvalue);
if (bucketNumber != INVALID_SKEW_BUCKET_NO)
{
@@ -310,6 +320,11 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
hashtable->spaceAllowedSkew =
hashtable->spaceAllowed * SKEW_WORK_MEM_PERCENT / 100;
hashtable->chunks = NULL;
+ hashtable->bloomFilter = NULL;
+
+ /* do bloom filter only when batching */
+ if (nbatch > 1)
+ hashtable->bloomFilter = BloomFilterInit(outerNode->plan_rows, BLOOM_ERROR_RATE);
/*
* Get info about the hash functions to be used for each hash key. Also
@@ -602,6 +617,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
long ninmemory;
long nfreed;
HashMemoryChunk oldchunks;
+ bool build_bloom_filter = false;
/* do nothing if we've decided to shut off growth */
if (!hashtable->growEnabled)
@@ -667,6 +683,36 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
sizeof(HashJoinTuple) * hashtable->nbuckets);
}
+ /* if we're switching to batched mode, we need to build the bloom filter */
+ if (oldnbatch == 1)
+ {
+ /*
+ * We can't use outerNode->plan_rows here, firstly because we don't
+ * have access to it, but most importantly because it's inaccurate
+ * anyway (we've expected to do no batching based on the value). So
+ * we'll just use double the number of entries in the hash table.
+ *
+ * We also need to make sure we added the hash values into the bloom
+ * filter in this case (that's what build_bloom_filter is for).
+ *
+ * XXX Maybe we should be more pessimistic and use a higher value,
+ * in case we need to further increment the number of batches.
+ *
+ * XXX We also need to set some maximum number of tuples when the
+ * false positive rate gets too bad, and stop using the bloom
+ * filter if we reach it (we can't resize the filter).
+ *
+ * XXX There was a paper about adding a larger bloom filter once
+ * we fill the existing one, and using them at the same time.
+ * Might be worth implementing if the whole bloom filter idea
+ * works in general.
+ */
+ hashtable->bloomFilter
+ = BloomFilterInit(2 * hashtable->totalTuples, BLOOM_ERROR_RATE);
+
+ build_bloom_filter = true;
+ }
+
/*
* We will scan through the chunks directly, so that we can reset the
* buckets now and not have to keep track which tuples in the buckets have
@@ -697,6 +743,9 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
&bucketno, &batchno);
+ if (build_bloom_filter)
+ ExecHashBloomAddValue(hashtable, hashTuple->hashvalue);
+
if (batchno == curbatch)
{
/* keep tuple in memory - copy it into the new chunk */
@@ -1705,3 +1754,78 @@ dense_alloc(HashJoinTable hashtable, Size size)
/* return pointer to the start of the tuple memory */
return ptr;
}
+
+static void
+ExecHashBloomAddValue(HashJoinTable hashtable, uint32 hashvalue)
+{
+ int i, byteIdx, bitIdx;
+ BloomFilter filter = hashtable->bloomFilter;
+
+ if (! filter)
+ return;
+
+ if (hashtable->curbatch > 0)
+ return;
+
+ Assert(hashtable->nbatch > 1); /* nbatch=1 implies bloomFilter=NULL */
+
+ for (i = 0; i < filter->nhashes; i++)
+ {
+ uint32_t seed = i;
+ uint32_t hash = 0;
+
+ MurmurHash3_x86_32(&hashvalue, sizeof(uint32), seed, &hash);
+
+ hash = hash % filter->nbits;
+
+ byteIdx = (hash / 8);
+ bitIdx = (hash % 8);
+
+ filter->data[byteIdx] |= (0x01 << bitIdx);
+ }
+}
+
+bool
+ExecHashBloomCheckValue(HashJoinTable hashtable, uint32 hashvalue)
+{
+ int i, byteIdx, bitIdx;
+ BloomFilter filter = hashtable->bloomFilter;
+
+ if (! filter)
+ return true;
+
+ for (i = 0; i < filter->nhashes; i++)
+ {
+ uint32_t seed = i;
+ uint32_t hash = 0;
+
+ MurmurHash3_x86_32(&hashvalue, sizeof(uint32), seed, &hash);
+
+ hash = hash % filter->nbits;
+
+ byteIdx = (hash / 8);
+ bitIdx = (hash % 8);
+
+ if (! (filter->data[byteIdx] & (0x01 << bitIdx)))
+ return false;
+ }
+
+ return true;
+}
+
+static BloomFilter
+BloomFilterInit(double nrows, double error)
+{
+ /* perhaps we should round nbits to multiples of 8 ? */
+ int nbits = ceil((nrows * log(error)) / log(1.0 / (pow(2.0, log(2.0)))));
+ int nhashes = round(log(2.0) * nbits / nrows);
+
+ BloomFilter filter = palloc0(offsetof(BloomFilterData, data) + ((nbits + 7) / 8));
+
+ filter->nbits = nbits;
+ filter->nhashes = nhashes;
+
+ elog(WARNING, "bloom filter: %d bits (%d bytes), %d hashes", nbits, (nbits + 7) / 8, nhashes);
+
+ return filter;
+}
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 1d78cdf..708877e 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -48,7 +48,6 @@ static TupleTableSlot *ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
TupleTableSlot *tupleSlot);
static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
-
/* ----------------------------------------------------------------
* ExecHashJoin
*
@@ -210,6 +209,7 @@ ExecHashJoin(HashJoinState *node)
outerTupleSlot = ExecHashJoinOuterGetTuple(outerNode,
node,
&hashvalue);
+
if (TupIsNull(outerTupleSlot))
{
/* end of batch, or maybe whole join */
@@ -224,6 +224,19 @@ ExecHashJoin(HashJoinState *node)
continue;
}
+ /* If still in the first batch, we check the bloom filter. */
+ if (hashtable->curbatch == 0)
+ {
+ node->hj_BloomLookups += 1;
+
+ if (! ExecHashBloomCheckValue(hashtable, hashvalue))
+ {
+ /* Loop around, staying in HJ_NEED_NEW_OUTER state */
+ node->hj_BloomEliminated += 1;
+ continue;
+ }
+ }
+
econtext->ecxt_outertuple = outerTupleSlot;
node->hj_MatchedOuter = false;
@@ -423,6 +436,9 @@ ExecHashJoin(HashJoinState *node)
(int) node->hj_JoinState);
}
}
+
+ elog(WARNING, "bloom filter lookups=%lu eliminated=%lu",
+ node->hj_BloomLookups, node->hj_BloomEliminated);
}
/* ----------------------------------------------------------------
@@ -591,6 +607,9 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
hjstate->hj_MatchedOuter = false;
hjstate->hj_OuterNotEmpty = false;
+ hjstate->hj_BloomLookups = 0;
+ hjstate->hj_BloomEliminated = 0;
+
return hjstate;
}
@@ -629,6 +648,9 @@ ExecEndHashJoin(HashJoinState *node)
*/
ExecEndNode(outerPlanState(node));
ExecEndNode(innerPlanState(node));
+
+ elog(WARNING, "bloom filter lookups=%lu eliminated=%lu",
+ node->hj_BloomLookups, node->hj_BloomEliminated);
}
/*
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 2cb7bab..365123b 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -16,7 +16,7 @@ OBJS = acl.o arrayfuncs.o array_expanded.o array_selfuncs.o \
float.o format_type.o formatting.o genfile.o \
geo_ops.o geo_selfuncs.o inet_cidr_ntop.o inet_net_pton.o int.o \
int8.o json.o jsonb.o jsonb_gin.o jsonb_op.o jsonb_util.o \
- jsonfuncs.o like.o lockfuncs.o mac.o misc.o nabstime.o name.o \
+ jsonfuncs.o like.o lockfuncs.o mac.o misc.o murmur3.o nabstime.o name.o \
network.o network_gist.o network_selfuncs.o \
numeric.o numutils.o oid.o oracle_compat.o \
orderedsetaggs.o pg_locale.o pg_lsn.o pg_upgrade_support.o \
diff --git a/src/backend/utils/adt/murmur3.c b/src/backend/utils/adt/murmur3.c
new file mode 100644
index 0000000..764aeab
--- /dev/null
+++ b/src/backend/utils/adt/murmur3.c
@@ -0,0 +1,315 @@
+//-----------------------------------------------------------------------------
+// MurmurHash3 was written by Austin Appleby, and is placed in the public
+// domain. The author hereby disclaims copyright to this source code.
+
+// Note - The x86 and x64 versions do _not_ produce the same results, as the
+// algorithms are optimized for their respective platforms. You can still
+// compile and run any of them on any platform, but your performance with the
+// non-native version will be less than optimal.
+
+#include "utils/murmur3.h"
+
+//-----------------------------------------------------------------------------
+// Platform-specific functions and macros
+
+#ifdef __GNUC__
+#define FORCE_INLINE __attribute__((always_inline)) inline
+#else
+#define FORCE_INLINE inline
+#endif
+
+static FORCE_INLINE uint32_t rotl32 ( uint32_t x, int8_t r )
+{
+ return (x << r) | (x >> (32 - r));
+}
+
+static FORCE_INLINE uint64_t rotl64 ( uint64_t x, int8_t r )
+{
+ return (x << r) | (x >> (64 - r));
+}
+
+#define ROTL32(x,y) rotl32(x,y)
+#define ROTL64(x,y) rotl64(x,y)
+
+#define BIG_CONSTANT(x) (x##LLU)
+
+//-----------------------------------------------------------------------------
+// Block read - if your platform needs to do endian-swapping or can only
+// handle aligned reads, do the conversion here
+
+#define getblock(p, i) (p[i])
+
+//-----------------------------------------------------------------------------
+// Finalization mix - force all bits of a hash block to avalanche
+
+static FORCE_INLINE uint32_t fmix32 ( uint32_t h )
+{
+ h ^= h >> 16;
+ h *= 0x85ebca6b;
+ h ^= h >> 13;
+ h *= 0xc2b2ae35;
+ h ^= h >> 16;
+
+ return h;
+}
+
+//----------
+
+static FORCE_INLINE uint64_t fmix64 ( uint64_t k )
+{
+ k ^= k >> 33;
+ k *= BIG_CONSTANT(0xff51afd7ed558ccd);
+ k ^= k >> 33;
+ k *= BIG_CONSTANT(0xc4ceb9fe1a85ec53);
+ k ^= k >> 33;
+
+ return k;
+}
+
+//-----------------------------------------------------------------------------
+
+void MurmurHash3_x86_32 ( const void * key, int len,
+ uint32_t seed, void * out )
+{
+ const uint8_t * data = (const uint8_t*)key;
+ const int nblocks = len / 4;
+ int i;
+
+ uint32_t h1 = seed;
+
+ uint32_t c1 = 0xcc9e2d51;
+ uint32_t c2 = 0x1b873593;
+
+ //----------
+ // body
+
+ const uint32_t * blocks = (const uint32_t *)(data + nblocks*4);
+
+ for(i = -nblocks; i; i++)
+ {
+ uint32_t k1 = getblock(blocks,i);
+
+ k1 *= c1;
+ k1 = ROTL32(k1,15);
+ k1 *= c2;
+
+ h1 ^= k1;
+ h1 = ROTL32(h1,13);
+ h1 = h1*5+0xe6546b64;
+ }
+
+ //----------
+ // tail
+
+ const uint8_t * tail = (const uint8_t*)(data + nblocks*4);
+
+ uint32_t k1 = 0;
+
+ switch(len & 3)
+ {
+ case 3: k1 ^= tail[2] << 16;
+ case 2: k1 ^= tail[1] << 8;
+ case 1: k1 ^= tail[0];
+ k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
+ };
+
+ //----------
+ // finalization
+
+ h1 ^= len;
+
+ h1 = fmix32(h1);
+
+ *(uint32_t*)out = h1;
+}
+
+//-----------------------------------------------------------------------------
+
+void MurmurHash3_x86_128 ( const void * key, const int len,
+ uint32_t seed, void * out )
+{
+ const uint8_t * data = (const uint8_t*)key;
+ const int nblocks = len / 16;
+ int i;
+
+ uint32_t h1 = seed;
+ uint32_t h2 = seed;
+ uint32_t h3 = seed;
+ uint32_t h4 = seed;
+
+ uint32_t c1 = 0x239b961b;
+ uint32_t c2 = 0xab0e9789;
+ uint32_t c3 = 0x38b34ae5;
+ uint32_t c4 = 0xa1e38b93;
+
+ //----------
+ // body
+
+ const uint32_t * blocks = (const uint32_t *)(data + nblocks*16);
+
+ for(i = -nblocks; i; i++)
+ {
+ uint32_t k1 = getblock(blocks,i*4+0);
+ uint32_t k2 = getblock(blocks,i*4+1);
+ uint32_t k3 = getblock(blocks,i*4+2);
+ uint32_t k4 = getblock(blocks,i*4+3);
+
+ k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
+
+ h1 = ROTL32(h1,19); h1 += h2; h1 = h1*5+0x561ccd1b;
+
+ k2 *= c2; k2 = ROTL32(k2,16); k2 *= c3; h2 ^= k2;
+
+ h2 = ROTL32(h2,17); h2 += h3; h2 = h2*5+0x0bcaa747;
+
+ k3 *= c3; k3 = ROTL32(k3,17); k3 *= c4; h3 ^= k3;
+
+ h3 = ROTL32(h3,15); h3 += h4; h3 = h3*5+0x96cd1c35;
+
+ k4 *= c4; k4 = ROTL32(k4,18); k4 *= c1; h4 ^= k4;
+
+ h4 = ROTL32(h4,13); h4 += h1; h4 = h4*5+0x32ac3b17;
+ }
+
+ //----------
+ // tail
+
+ const uint8_t * tail = (const uint8_t*)(data + nblocks*16);
+
+ uint32_t k1 = 0;
+ uint32_t k2 = 0;
+ uint32_t k3 = 0;
+ uint32_t k4 = 0;
+
+ switch(len & 15)
+ {
+ case 15: k4 ^= tail[14] << 16;
+ case 14: k4 ^= tail[13] << 8;
+ case 13: k4 ^= tail[12] << 0;
+ k4 *= c4; k4 = ROTL32(k4,18); k4 *= c1; h4 ^= k4;
+
+ case 12: k3 ^= tail[11] << 24;
+ case 11: k3 ^= tail[10] << 16;
+ case 10: k3 ^= tail[ 9] << 8;
+ case 9: k3 ^= tail[ 8] << 0;
+ k3 *= c3; k3 = ROTL32(k3,17); k3 *= c4; h3 ^= k3;
+
+ case 8: k2 ^= tail[ 7] << 24;
+ case 7: k2 ^= tail[ 6] << 16;
+ case 6: k2 ^= tail[ 5] << 8;
+ case 5: k2 ^= tail[ 4] << 0;
+ k2 *= c2; k2 = ROTL32(k2,16); k2 *= c3; h2 ^= k2;
+
+ case 4: k1 ^= tail[ 3] << 24;
+ case 3: k1 ^= tail[ 2] << 16;
+ case 2: k1 ^= tail[ 1] << 8;
+ case 1: k1 ^= tail[ 0] << 0;
+ k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
+ };
+
+ //----------
+ // finalization
+
+ h1 ^= len; h2 ^= len; h3 ^= len; h4 ^= len;
+
+ h1 += h2; h1 += h3; h1 += h4;
+ h2 += h1; h3 += h1; h4 += h1;
+
+ h1 = fmix32(h1);
+ h2 = fmix32(h2);
+ h3 = fmix32(h3);
+ h4 = fmix32(h4);
+
+ h1 += h2; h1 += h3; h1 += h4;
+ h2 += h1; h3 += h1; h4 += h1;
+
+ ((uint32_t*)out)[0] = h1;
+ ((uint32_t*)out)[1] = h2;
+ ((uint32_t*)out)[2] = h3;
+ ((uint32_t*)out)[3] = h4;
+}
+
+//-----------------------------------------------------------------------------
+
+void MurmurHash3_x64_128 ( const void * key, const int len,
+ const uint32_t seed, void * out )
+{
+ const uint8_t * data = (const uint8_t*)key;
+ const int nblocks = len / 16;
+ int i;
+
+ uint64_t h1 = seed;
+ uint64_t h2 = seed;
+
+ uint64_t c1 = BIG_CONSTANT(0x87c37b91114253d5);
+ uint64_t c2 = BIG_CONSTANT(0x4cf5ad432745937f);
+
+ //----------
+ // body
+
+ const uint64_t * blocks = (const uint64_t *)(data);
+
+ for(i = 0; i < nblocks; i++)
+ {
+ uint64_t k1 = getblock(blocks,i*2+0);
+ uint64_t k2 = getblock(blocks,i*2+1);
+
+ k1 *= c1; k1 = ROTL64(k1,31); k1 *= c2; h1 ^= k1;
+
+ h1 = ROTL64(h1,27); h1 += h2; h1 = h1*5+0x52dce729;
+
+ k2 *= c2; k2 = ROTL64(k2,33); k2 *= c1; h2 ^= k2;
+
+ h2 = ROTL64(h2,31); h2 += h1; h2 = h2*5+0x38495ab5;
+ }
+
+ //----------
+ // tail
+
+ const uint8_t * tail = (const uint8_t*)(data + nblocks*16);
+
+ uint64_t k1 = 0;
+ uint64_t k2 = 0;
+
+ switch(len & 15)
+ {
+ case 15: k2 ^= (uint64_t)(tail[14]) << 48;
+ case 14: k2 ^= (uint64_t)(tail[13]) << 40;
+ case 13: k2 ^= (uint64_t)(tail[12]) << 32;
+ case 12: k2 ^= (uint64_t)(tail[11]) << 24;
+ case 11: k2 ^= (uint64_t)(tail[10]) << 16;
+ case 10: k2 ^= (uint64_t)(tail[ 9]) << 8;
+ case 9: k2 ^= (uint64_t)(tail[ 8]) << 0;
+ k2 *= c2; k2 = ROTL64(k2,33); k2 *= c1; h2 ^= k2;
+
+ case 8: k1 ^= (uint64_t)(tail[ 7]) << 56;
+ case 7: k1 ^= (uint64_t)(tail[ 6]) << 48;
+ case 6: k1 ^= (uint64_t)(tail[ 5]) << 40;
+ case 5: k1 ^= (uint64_t)(tail[ 4]) << 32;
+ case 4: k1 ^= (uint64_t)(tail[ 3]) << 24;
+ case 3: k1 ^= (uint64_t)(tail[ 2]) << 16;
+ case 2: k1 ^= (uint64_t)(tail[ 1]) << 8;
+ case 1: k1 ^= (uint64_t)(tail[ 0]) << 0;
+ k1 *= c1; k1 = ROTL64(k1,31); k1 *= c2; h1 ^= k1;
+ };
+
+ //----------
+ // finalization
+
+ h1 ^= len; h2 ^= len;
+
+ h1 += h2;
+ h2 += h1;
+
+ h1 = fmix64(h1);
+ h2 = fmix64(h2);
+
+ h1 += h2;
+ h2 += h1;
+
+ ((uint64_t*)out)[0] = h1;
+ ((uint64_t*)out)[1] = h2;
+}
+
+//-----------------------------------------------------------------------------
+
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 7a51ea6..f2bee7f 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -72,6 +72,15 @@ typedef struct HashJoinTupleData
#define HJTUPLE_MINTUPLE(hjtup) \
((MinimalTuple) ((char *) (hjtup) + HJTUPLE_OVERHEAD))
+typedef struct BloomFilterData
+{
+ int nbits; /* m */
+ int nhashes; /* k */
+ char data[1]; /* bits */
+} BloomFilterData;
+
+typedef BloomFilterData *BloomFilter;
+
/*
* If the outer relation's distribution is sufficiently nonuniform, we attempt
* to optimize the join by treating the hash values corresponding to the outer
@@ -186,6 +195,11 @@ typedef struct HashJoinTableData
/* used for dense allocation of tuples (into linked chunks) */
HashMemoryChunk chunks; /* one list for the whole batch */
+
+ /* used only when the hash join has multiple batches */
+ BloomFilter bloomFilter; /* bloom filter on the hash values */
} HashJoinTableData;
+bool ExecHashBloomCheckValue(HashJoinTable hashtable, uint32 hashvalue);
+
#endif /* HASHJOIN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5ccf470..e4660bd 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1764,6 +1764,10 @@ typedef struct HashJoinState
int hj_JoinState;
bool hj_MatchedOuter;
bool hj_OuterNotEmpty;
+
+ /* debugging and profiling of bloom filters */
+ uint64 hj_BloomLookups;
+ uint64 hj_BloomEliminated;
} HashJoinState;
diff --git a/src/include/utils/murmur3.h b/src/include/utils/murmur3.h
new file mode 100644
index 0000000..e12bf08
--- /dev/null
+++ b/src/include/utils/murmur3.h
@@ -0,0 +1,29 @@
+//-----------------------------------------------------------------------------
+// MurmurHash3 was written by Austin Appleby, and is placed in the
+// public domain. The author hereby disclaims copyright to this source
+// code.
+
+#ifndef _MURMURHASH3_H_
+#define _MURMURHASH3_H_
+
+#include <stdint.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+//-----------------------------------------------------------------------------
+
+void MurmurHash3_x86_32 (const void *key, int len, uint32_t seed, void *out);
+
+void MurmurHash3_x86_128(const void *key, int len, uint32_t seed, void *out);
+
+void MurmurHash3_x64_128(const void *key, int len, uint32_t seed, void *out);
+
+//-----------------------------------------------------------------------------
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif // _MURMURHASH3_H_
hashjoin-bloom.png (image/png)