Hash Joins vs. Bloom Filters / take 2
Hi,
In 2015/2016 I was exploring whether we could improve hash joins by
leveraging bloom filters [1], and I was reminded about this idea in a
thread about amcheck [2]. I also see that bloom filters were briefly
mentioned in the thread about parallel hash [3].
So I've decided to revive the old patch, rebase it to current master,
and see if we can resolve the issues that killed it in 2016.
The issues are essentially about predicting when the bloom filter can
improve anything, which is more a question of the hash join selectivity
than the bloom filter parameters.
Consider a join on tables with a FK on the join condition. That is,
something like this:
CREATE TABLE dim (id serial primary key, ...);
CREATE TABLE fact (dim_id int references dim(id), ...);
Then if you do a join
SELECT * FROM fact JOIN dim ON (id = dim_id);
the bloom filter can't really help, because all the dim_id values are
guaranteed to be in the hash table. So it can only really help with
queries like this:
SELECT * FROM fact JOIN dim ON (id = dim_id) WHERE (dim.x = 1000);
where the additional conditions on dim make the hash table incomplete in
some sense (i.e. the bloom filter will return false for some outer rows,
saving quite a bit of expensive work - a lookup in a large hash table or
spilling the tuple to disk).
This is not the same as "false positive rate" which is a feature of the
bloom filter itself, and can be measured by simply counting bits set in
the filter. That is important too, of course, but simple to determine.
The bloom filter "selectivity" is more a feature of the join, i.e. what
fraction of the values looked up in the bloom filter are expected to be
rejected. If many are rejected, the bloom filter helps. If few, the
bloom filter is a waste of time.
After thinking about this a bit more, I think this is pretty much what
we do during join cardinality estimation - we estimate how many of the
rows in "fact" have a matching row in "dim". I do think we can reuse
that to decide if to use a bloom filter or not.
Of course, the decision may need to be more elaborate (and perhaps
formulated as part of costing), but the filter selectivity is the
crucial piece. The attached patch does not do that yet, though.
The attached patch:
1) Only works in the non-parallel case for now. Fixing this should not be
a big deal once the costing/planning gets addressed.
2) Adds details about the bloom filter to EXPLAIN ANALYZE output, with
various important metrics (size of the filter, number of hash functions,
lookups/matches, fill factor).
3) Removes the HLL counter. I came to the conclusion that using HLL to
size the bloom filter makes very little sense, because there are
essentially three cases:
a) no batching (hash table fits into work_mem)
We can easily walk the hash table and compute a "good enough" ndistinct
estimate by counting occupied buckets (or just use the number of tuples in
the hash table, assuming the values are distinct).
At this point, the patch does not build the bloom filter in single-batch
cases, unless forced to do that by setting force_hashjoin_bloom=true.
More about this later.
b) batching from the beginning
The HLL is useless, because we need to initialize the bloom filter
before actually seeing any values. Instead, the patch simply uses the
expected number of rows, assuming those are distinct (see the sizing
note after this list).
c) starting in single-batch mode, switching to batches later
We could use HLL to estimate the number of distinct values in the initial
batch (when we start batching), but it's unclear how that would be useful.
This case means our estimates are off, and we don't really know how many
batches there will be. We could use some arbitrary multiple, I guess.
What I decided to do in the patch instead is size the bloom filter to use
1/8 of work_mem. That seems like a viable compromise, as it's unlikely
to increase the number of batches.
4) Initially, the patch was aimed at hash joins with batches, on the
premise that it can save some of the serialize/deserialize costs. But as
mentioned in the discussion, bloom filters can be beneficial even in the
single-batch case, when three conditions are met:
a) the join is selective (i.e. some rows don't have a match in the hash table)
b) hash table > CPU cache
c) bloom filter < CPU cache
We don't have a good way to determine the size of the CPU cache, but I
think we can use some crude arbitrary limits, e.g. require that the hash
table is larger than 8MB and the bloom filter is smaller than 4MB, or
something like that.
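For the record, the sizing in BloomFilterInitRows() is just the usual
Bloom filter math - given n expected (distinct) entries and a target
false positive rate p (BLOOM_ERROR_RATE = 0.05 in the patch):

  m = \lceil -n \ln(p) / (\ln 2)^2 \rceil    (filter size in bits)
  k = \mathrm{round}((m/n) \ln 2)            (number of hash functions)

which at p = 0.05 works out to roughly 6.2 bits per entry and 4 hash
functions.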
Opinions?
regards
[1]: /messages/by-id/5670946E.8070705@2ndquadrant.com
[2]: /messages/by-id/9b9fd273-18e7-2b07-7aa1-4b00ab59b8d1@2ndquadrant.com
[3]: /messages/by-id/9b9fd273-18e7-2b07-7aa1-4b00ab59b8d1@2ndquadrant.com
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
0001-v2-of-hashjoin-bloom-patch.patch (text/x-patch)
From 610fed13c7e2f418c8d574e85e9fa6fef97dbef6 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 19 Feb 2018 18:56:10 +0100
Subject: [PATCH] v2 of hashjoin-bloom patch
---
src/backend/commands/explain.c | 24 +++
src/backend/executor/nodeHash.c | 276 ++++++++++++++++++++++++++++++++-
src/backend/executor/nodeHashjoin.c | 22 +++
src/backend/utils/misc/guc.c | 18 +++
src/include/executor/hashjoin.h | 16 ++
src/include/nodes/execnodes.h | 5 +
src/include/optimizer/cost.h | 2 +
src/include/utils/hashutils.h | 35 +++++
src/test/regress/expected/sysviews.out | 3 +-
9 files changed, 399 insertions(+), 2 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 900fa74..2ff7949 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2474,6 +2474,30 @@ show_hash_info(HashState *hashstate, ExplainState *es)
spacePeakKb);
}
}
+
+ if (hinstrument.bloom_nbytes > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainPropertyLong("Bloom Filter Bytes", hinstrument.bloom_nbytes, es);
+ ExplainPropertyLong("Bloom Filter Hashes", hinstrument.bloom_nhashes, es);
+ ExplainPropertyLong("Bloom Filter Lookups", hinstrument.bloom_nlookups, es);
+ ExplainPropertyLong("Bloom Filter Matches", hinstrument.bloom_nmatches, es);
+ ExplainPropertyFloat("Bloom Filter Bits Set",
+ hinstrument.bloom_nbits * 100.0 / (hinstrument.bloom_nbytes * 8), 2, es);
+ }
+ else
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str,
+ "Bloom Filter Size: %ldkB Hashes: %d Lookups: %d Matches: %d Bits Set: %.2f%%\n",
+ hinstrument.bloom_nbytes/1024,
+ hinstrument.bloom_nhashes,
+ hinstrument.bloom_nlookups,
+ hinstrument.bloom_nmatches,
+ hinstrument.bloom_nbits * 100.0 / (hinstrument.bloom_nbytes * 8));
+ }
+ }
}
/*
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 06bb44b..aef423a 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -38,11 +38,11 @@
#include "pgstat.h"
#include "port/atomics.h"
#include "utils/dynahash.h"
+#include "utils/hashutils.h"
#include "utils/memutils.h"
#include "utils/lsyscache.h"
#include "utils/syscache.h"
-
static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
static void ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable);
@@ -54,6 +54,10 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
uint32 hashvalue,
int bucketNumber);
static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
+static void ExecHashBloomAddValue(HashJoinTable hashtable, uint32 hashvalue);
+
+static BloomFilter BloomFilterInitRows(double nrows, double error);
+static BloomFilter BloomFilterInitBytes(Size nbytes, double error);
static void *dense_alloc(HashJoinTable hashtable, Size size);
static HashJoinTuple ExecParallelHashTupleAlloc(HashJoinTable hashtable,
@@ -80,6 +84,12 @@ static bool ExecParallelHashTuplePrealloc(HashJoinTable hashtable,
static void ExecParallelHashMergeCounters(HashJoinTable hashtable);
static void ExecParallelHashCloseBatchAccessors(HashJoinTable hashtable);
+/* should we build bloom filter? should we force building it? */
+bool enable_hashjoin_bloom = true;
+bool force_hashjoin_bloom = false;
+
+/* Shoot for 5% false positives error rate (arbitrary value). */
+#define BLOOM_ERROR_RATE 0.05
/* ----------------------------------------------------------------
* ExecHash
@@ -172,6 +182,9 @@ MultiExecPrivateHash(HashState *node)
{
int bucketNumber;
+ /* Add the hash value to the bloom filter (if needed). */
+ ExecHashBloomAddValue(hashtable, hashvalue);
+
bucketNumber = ExecHashGetSkewBucket(hashtable, hashvalue);
if (bucketNumber != INVALID_SKEW_BUCKET_NO)
{
@@ -198,6 +211,56 @@ MultiExecPrivateHash(HashState *node)
if (hashtable->spaceUsed > hashtable->spacePeak)
hashtable->spacePeak = hashtable->spaceUsed;
+ /*
+ * If forced to build a bloom filter, do that now, but only when
+ * there's a single batch. For batched mode the decision would have
+ * happened elsewhere earlier (we can't make it here, as the hash
+ * table contains only subset of the data).
+ */
+ if (enable_hashjoin_bloom &&
+ force_hashjoin_bloom &&
+ (hashtable->nbatch == 1))
+ {
+ HashMemoryChunk chunks;
+
+ /* no bloom filter allocated yet */
+ Assert(!hashtable->bloomFilter);
+
+ /* forced to build the bloom filter (assume all values are unique) */
+ hashtable->bloomFilter
+ = BloomFilterInitRows(hashtable->totalTuples, BLOOM_ERROR_RATE);
+
+ chunks = hashtable->chunks;
+
+ /* so, let's scan through the chunks, and all tuples in each chunk */
+ while (chunks != NULL)
+ {
+ HashMemoryChunk nextchunks = chunks->next.unshared;
+
+ /* position within the buffer (up to chunks->used) */
+ size_t idx = 0;
+
+ /* process all tuples stored in this chunk */
+ while (idx < chunks->used)
+ {
+ HashJoinTuple hashTuple = (HashJoinTuple) (HASH_CHUNK_DATA(chunks) + idx);
+ MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+ int hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+ ExecHashBloomAddValue(hashtable, hashTuple->hashvalue);
+
+ /* next tuple in this chunk */
+ idx += MAXALIGN(hashTupleSize);
+
+ /* allow this loop to be cancellable */
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ /* we're done with this chunk - proceed to the next one */
+ chunks = nextchunks;
+ }
+ }
+
hashtable->partialTuples = hashtable->totalTuples;
}
@@ -509,6 +572,56 @@ ExecHashTableCreate(HashState *state, List *hashOperators, bool keepNulls)
hashtable->area = state->ps.state->es_query_dsa;
hashtable->batches = NULL;
+ /*
+ * We don't quite know how many distinct values to expect, which is
+ * essential for proper sizing of the bloom filter. One option is to
+ * just use the estimated number of rows, and assume they are all
+ * distinct. That value may be inaccurate for a number of reasons.
+ * The actual number of rows may be very different, there may be
+ * duplicate rows, etc.
+ *
+ * If we start in multi-batch mode, we rely on the estimated total
+ * number of rows, and assume all hash values are unique (which is
+ * often the case, e.g. when joining on PK-FK pair), and use that
+ * to size the filter. We could also assume there is certain number
+ * of duplicate values (and use e.g. nrows/10 to size the filter),
+ * but that would simply hurt other cases.
+ *
+ * When starting in a single-batch mode, we do nothing initially.
+ * If the whole hash table fits into a single batch, we can get
+ * sufficiently accurate ndistinct estimate by simply counting
+ * occupied buckets (thanks to shooting for NTUP_PER_BUCKET=1),
+ * or perhaps we could use something more elaborate (e.g. HLL).
+ * But we only build the bloom filter if the hash table is large
+ * enough to exceed on-CPU caches (e.g. 4MB).
+ *
+ * If we have to start batching, we don't have much choice. We
+ * can't really use the original rows estimate, because if it was
+ * correct we would probably start batching right away.
+ *
+ * The best we can do is using some multiple of the ndistinct
+ * estimate derived from the current hash table. We could double
+ * it, but it's quite likely the misestimate is worse and we will
+ * end up adding more batches. So we don't really know what's the
+ * proper number.
+ *
+ * But we can approach this from another angle and consider memory
+ * consumption. We certainly don't want to end up with a huge bloom
+ * filter, so we can limit it to 1/8 of the work_mem, and size the
+ * bloom filter using that.
+ *
+ * XXX There are cases where we know the hash table contains all
+ * rows, e.g. FK-PK join with no restrictions on the PK side.
+ */
+ if (enable_hashjoin_bloom && (nbatch > 1) && (!node->plan.parallel_aware))
+ {
+ /* batching, so build bloom filter */
+ hashtable->bloomFilter
+ = BloomFilterInitRows(rows, BLOOM_ERROR_RATE);
+ }
+ else /* not batching, so no bloom filter */
+ hashtable->bloomFilter = NULL;
+
#ifdef HJDEBUG
printf("Hashjoin %p: initial nbatch = %d, nbuckets = %d\n",
hashtable, nbatch, nbuckets);
@@ -954,6 +1067,24 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
}
/*
+ * If switching to batched mode, we start building the bloom filter.
+ * But we don't know how to size it, because the estimates we have are
+ * clearly off (if they were correct, we'd start in batch mode right away).
+ *
+ * Instead, we size the bloom filter to use work_mem/8 (mostly arbitrary
+ * fraction).
+ *
+ * We also need to make sure we added the hash values into the bloom
+ * filter in this case (that's what build_bloom_filter is for).
+ */
+ if (enable_hashjoin_bloom && (oldnbatch == 1))
+ {
+ hashtable->bloomFilter
+ = BloomFilterInitBytes(work_mem * 1024L/8,
+ BLOOM_ERROR_RATE);
+ }
+
+ /*
* We will scan through the chunks directly, so that we can reset the
* buckets now and not have to keep track which tuples in the buckets have
* already been processed. We will free the old chunks as we go.
@@ -984,6 +1115,13 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
&bucketno, &batchno);
+ /*
+ * Only bother to add the values to bloom filter during the initial
+ * doubling. For subsequent doublings the bloom filter is up to date.
+ */
+ if (oldnbatch == 1)
+ ExecHashBloomAddValue(hashtable, hashTuple->hashvalue);
+
if (batchno == curbatch)
{
/* keep tuple in memory - copy it into the new chunk */
@@ -2640,6 +2778,26 @@ ExecHashGetInstrumentation(HashInstrumentation *instrument,
instrument->nbatch = hashtable->nbatch;
instrument->nbatch_original = hashtable->nbatch_original;
instrument->space_peak = hashtable->spacePeak;
+
+ if (hashtable->bloomFilter)
+ {
+ int i, j;
+
+ instrument->bloom_nhashes = hashtable->bloomFilter->nhashes;
+ instrument->bloom_nbytes = hashtable->bloomFilter->nbits/8;
+ instrument->bloom_nlookups = hashtable->bloomFilter->nlookups;
+ instrument->bloom_nmatches = hashtable->bloomFilter->nmatches;
+ instrument->bloom_nbits = 0;
+
+ for (i = 0; i < hashtable->bloomFilter->nbits/8; i++)
+ {
+ for (j = 0; j < 8; j++)
+ {
+ if (hashtable->bloomFilter->data[i] & (0x01 << j))
+ instrument->bloom_nbits++;
+ }
+ }
+ }
}
/*
@@ -3306,3 +3464,119 @@ ExecParallelHashTuplePrealloc(HashJoinTable hashtable, int batchno, size_t size)
return true;
}
+
+static void
+ExecHashBloomAddValue(HashJoinTable hashtable, uint32 hashvalue)
+{
+ int i;
+ BloomFilter filter;
+
+ /* If bloom filter not initialized/requested, we're done. */
+ if (!hashtable->bloomFilter)
+ return;
+
+ /*
+ * We only build the bloom filter during the initial pass, so do
+ * nothing if we're processing the other batches.
+ */
+ if (hashtable->curbatch > 0)
+ return;
+
+ /* for convenience */
+ filter = hashtable->bloomFilter;
+
+ /*
+ * To get multiple independent hash functions, we simply use
+ * 32-bit murmur3 with seeds. Seems to be working fine.
+ */
+ for (i = 0; i < filter->nhashes; i++)
+ {
+ int byteIdx, bitIdx;
+ uint32 hash;
+
+ hash = murmurhash32_seed(i, hashvalue) % filter->nbits;
+
+ byteIdx = (hash / 8);
+ bitIdx = (hash % 8);
+
+ filter->data[byteIdx] |= (0x01 << bitIdx);
+ }
+}
+
+bool
+ExecHashBloomCheckValue(HashJoinTable hashtable, uint32 hashvalue)
+{
+ int i;
+ BloomFilter filter = hashtable->bloomFilter;
+
+ /* if there's no filter, we simply assume match */
+ if (!filter)
+ return true;
+
+ ++filter->nlookups;
+
+ for (i = 0; i < filter->nhashes; i++)
+ {
+ int byteIdx, bitIdx;
+ uint32 hash;
+
+ hash = murmurhash32_seed(i, hashvalue) % filter->nbits;
+
+ byteIdx = (hash / 8);
+ bitIdx = (hash % 8);
+
+ /* found missing bit -> mismatch */
+ if (! (filter->data[byteIdx] & (0x01 << bitIdx)))
+ return false;
+ }
+
+ /* if we got here, we know it's a match */
+ ++filter->nmatches;
+
+ return true;
+}
+
+/*
+ * Size the bloom filter starting with expected number of entries and
+ * requested error rate.
+ */
+static BloomFilter
+BloomFilterInitRows(double nrows, double error)
+{
+ BloomFilter filter;
+ Size nbits = ceil((nrows * log(error)) / log(1.0 / (pow(2.0, log(2.0)))));
+ int nhashes = round(log(2.0) * nbits / nrows);
+
+ /* round the size to whole bytes */
+ nbits = 8 * ((nbits + 7) / 8);
+
+ filter = palloc0(offsetof(BloomFilterData, data) + (nbits/8));
+
+ filter->nbits = nbits;
+ filter->nhashes = nhashes;
+
+ return filter;
+}
+
+/*
+ * Size the bloom filter starting with available space and error
+ * rate. We do compute the number of entries it can handle with the
+ * requested error rate, but that is mostly useless as we only use
+ * this version when switching from non-batched to batched mode, i.e.
+ * when the original estimate turned out to be incorrect.
+ */
+static BloomFilter
+BloomFilterInitBytes(Size nbytes, double error)
+{
+ BloomFilter filter;
+
+ double nrows = 8 * nbytes * log(1.0 / (pow(2.0, log(2.0)))) / log(error);
+ int nhashes = round(log(2.0) * (8 * nbytes) / nrows);
+
+ filter = palloc0(offsetof(BloomFilterData, data) + nbytes);
+
+ filter->nbits = 8 * nbytes;
+ filter->nhashes = nhashes;
+
+ return filter;
+}
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index ab91eb2..cd3d975 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -382,6 +382,15 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
hashvalue);
node->hj_CurTuple = NULL;
+ /* If still in the first batch, we check the bloom filter. */
+ if ((hashtable->curbatch == 0) &&
+ (! ExecHashBloomCheckValue(hashtable, hashvalue)))
+ {
+ /* no matches; check for possible outer-join fill */
+ node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
+ continue;
+ }
+
/*
* The tuple might not belong to the current batch (where
* "current batch" includes the skew buckets if any).
@@ -763,6 +772,19 @@ ExecEndHashJoin(HashJoinState *node)
*/
if (node->hj_HashTable)
{
+ HashJoinTable hashtable = node->hj_HashTable;
+
+ /*
+ * If there's a bloom filter, print some debug info before destroying the
+ * hash table.
+ */
+ if (hashtable->bloomFilter)
+ {
+ BloomFilter filter = hashtable->bloomFilter;
+ elog(LOG, "bloom filter lookups=%lu matches=%lu eliminated=%lu%%",
+ filter->nlookups, filter->nmatches, 100 - (100 * filter->nmatches) / Max(1,filter->nlookups));
+ }
+
ExecHashTableDestroy(node->hj_HashTable);
node->hj_HashTable = NULL;
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1db7845..9de2840 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -941,6 +941,24 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
+ {"enable_hashjoin_bloom", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the use of bloom filters in hash joins."),
+ NULL
+ },
+ &enable_hashjoin_bloom,
+ true,
+ NULL, NULL, NULL
+ },
+ {
+ {"force_hashjoin_bloom", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Forces the use of bloom filters in hash joins."),
+ NULL
+ },
+ &force_hashjoin_bloom,
+ false,
+ NULL, NULL, NULL
+ },
+ {
{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
gettext_noop("Enables genetic query optimization."),
gettext_noop("This algorithm attempts to do planning without "
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 6fb2dc0..64978e5 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -80,6 +80,17 @@ typedef struct HashJoinTupleData
#define HJTUPLE_MINTUPLE(hjtup) \
((MinimalTuple) ((char *) (hjtup) + HJTUPLE_OVERHEAD))
+typedef struct BloomFilterData
+{
+ uint64 nlookups; /* number of lookups */
+ uint64 nmatches; /* number of matches */
+ int nbits; /* m */
+ int nhashes; /* k */
+ char data[1]; /* bits */
+} BloomFilterData;
+
+typedef BloomFilterData *BloomFilter;
+
/*
* If the outer relation's distribution is sufficiently nonuniform, we attempt
* to optimize the join by treating the hash values corresponding to the outer
@@ -350,6 +361,9 @@ typedef struct HashJoinTableData
/* used for dense allocation of tuples (into linked chunks) */
HashMemoryChunk chunks; /* one list for the whole batch */
+ /* bloom filter on Hash (used with batched hash joins) */
+ BloomFilter bloomFilter; /* bloom filter on the hash values */
+
/* Shared and private state for Parallel Hash. */
HashMemoryChunk current_chunk; /* this backend's current chunk */
dsa_area *area; /* DSA area to allocate memory from */
@@ -358,4 +372,6 @@ typedef struct HashJoinTableData
dsa_pointer current_chunk_shared;
} HashJoinTableData;
+bool ExecHashBloomCheckValue(HashJoinTable hashtable, uint32 hashvalue);
+
#endif /* HASHJOIN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a953820..249a7a0 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2037,6 +2037,11 @@ typedef struct HashInstrumentation
int nbatch; /* number of batches at end of execution */
int nbatch_original; /* planned number of batches */
size_t space_peak; /* speak memory usage in bytes */
+ int bloom_nhashes; /* number of hash functions */
+ size_t bloom_nbytes; /* number of bytes */
+ int bloom_nlookups; /* number of lookups */
+ int bloom_nmatches; /* number of matches */
+ int bloom_nbits; /* number of bits set */
} HashInstrumentation;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 132e355..5db07ba 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -66,6 +66,8 @@ extern PGDLLIMPORT bool enable_nestloop;
extern PGDLLIMPORT bool enable_material;
extern PGDLLIMPORT bool enable_mergejoin;
extern PGDLLIMPORT bool enable_hashjoin;
+extern PGDLLIMPORT bool enable_hashjoin_bloom;
+extern PGDLLIMPORT bool force_hashjoin_bloom;
extern PGDLLIMPORT bool enable_gathermerge;
extern PGDLLIMPORT bool enable_partitionwise_join;
extern PGDLLIMPORT bool enable_parallel_append;
diff --git a/src/include/utils/hashutils.h b/src/include/utils/hashutils.h
index 5e9fe65..3582c9f 100644
--- a/src/include/utils/hashutils.h
+++ b/src/include/utils/hashutils.h
@@ -50,4 +50,39 @@ murmurhash32(uint32 data)
return h;
}
+#define ROTL32(x,r) ((x << r) | (x >> (32 - r)))
+
+/*
+ * Simple inline murmur hash implementation hashing a 32 bit integer and
+ * 32 bit seed, for performance.
+ *
+ * XXX Check this actually produces same results as MurmurHash3_x86_32.
+ */
+static inline uint32
+murmurhash32_seed(uint32 seed, uint32 data)
+{
+ uint32 h = seed;
+ uint32 k = data;
+ uint32 c1 = 0xcc9e2d51;
+ uint32 c2 = 0x1b873593;
+
+ k *= c1;
+ k = ROTL32(k,15);
+ k *= c2;
+
+ h ^= k;
+ h = ROTL32(h,13);
+ h = h * 5 + 0xe6546b64;
+
+ h ^= sizeof(uint32);
+
+ h ^= h >> 16;
+ h *= 0x85ebca6b;
+ h ^= h >> 13;
+ h *= 0xc2b2ae35;
+ h ^= h >> 16;
+
+ return h;
+}
+
#endif /* HASHUTILS_H */
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 759f7d9..de2e716 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_gathermerge | on
enable_hashagg | on
enable_hashjoin | on
+ enable_hashjoin_bloom | on
enable_indexonlyscan | on
enable_indexscan | on
enable_material | on
@@ -87,7 +88,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(15 rows)
+(16 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
--
2.9.5
On Tue, Feb 20, 2018 at 1:23 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
In 2015/2016 I've been exploring if we could improve hash joins by
leveraging bloom filters [1], and I was reminded about this idea in a
thread about amcheck [2]. I also see that bloom filters were briefly
mentioned in the thread about parallel hash [3].
So I've decided to revive the old patch, rebase it to current master,
and see if we can resolve the issues that killed it in 2016.
Cool.
The issues are essentially about predicting when the bloom filter can
improve anything, which is more a question of the hash join selectivity
than the bloom filter parameters.
Absolutely. A Bloom filter could be considered an opportunistic thing.
The false positive rate that you estimate is going to be based on the
planner's rowcount estimate for the inner side of the join, which is
liable to be wrong anyway. It's important to be resilient to
misestimations, which Bloom filters are good at.
This is not the same as "false positive rate" which is a feature of the
bloom filter itself, and can be measured by simply counting bits set in
the filter. That is important too, of course, but simple to determine.
Perhaps I'm being pedantic, but it's not exactly true that you can
measure the false positive rate by counting the bits. I do agree with
what I think you meant here, though, which is that the precise false
positive rate is not necessarily that important, while the selectivity
of the join is very important.
In general, hash joins work best when the join qual is highly
selective. This is not the same thing as having a small hash table, of
course.
After thinking about this a bit more, I think this is pretty much what
we do during join cardinality estimation - we estimate how many of the
rows in "fact" have a matching row in "dim". I do think we can reuse
that to decide if to use a bloom filter or not.Of course, the decision may need to be more elaborate (and perhaps
formulated as part of costing), but the filter selectivity is the
crucial piece. The attached patch does not do that yet, though.
I suspect that it could make sense to use a Bloom filter to summarize
the entire inner side of the join all at once, even when there are
multiple batches. I also suspect that this is particularly beneficial
with parallel hash joins, where IPC/synchronization overhead can be a
big problem.
4) Initially, the patch was aimed at hashjoins with batches, on the
premise that it can save some of the serialize/deserialize costs. But as
mentioned in the discussion, bloom filters can be beneficial even in the
single-batch case, when three conditions are met:
a) join is selective (i.e. some rows don't have match in hash table)
b) hash table > CPU cache
c) bloom filter < CPU cache
Seems plausible. CPU cache efficiency is also affected by how many
hash functions you end up using -- use too many, and it slows things
down.
We don't have a good way to determine size of the CPU cache, but I think
we can use some crude arbitrary limits. E.g. require that hash hash
table is larger than 8MB and bloom filter is smaller than 4MB, or
something like that.
FWIW, the logical way to size the Bloom filter is based on the number
of (distinct) values, not based on the projected size of the hash
table. The lossy nature of Bloom filters makes the average size of the
original, hashed elements irrelevant to a sizing calculation that
targets a certain final false positive rate. Making Bloom filter size
any fixed fraction of hash table size seems wrong to me for that
reason alone.
You should try to exploit the fact that a Bloom filter can summarize a
large set reasonably well with a very compact, simple representation.
A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
for cases where Bloom probes will save a lot of work, it probably
doesn't make all that much difference -- the hash join is still much
faster. If you're willing to accept a 10% false positive rate, then
you can summarize a set of 40 million distinct values with only
22.85MB of memory and 3 hash functions. I think that the smallest
possible amount of memory that any hash table of 40 million elements
would require is much greater than 22.85MB --
certainly closer to 20x than to 8x. Even that is pretty conservative.
I bet there are plenty of hash joins where the ratio could be 30x or
more. Not to mention cases with multiple batches.
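(For what it's worth, those numbers fall straight out of the usual sizing
formula: with n = 40 million and p = 0.10,

  m = -n \ln(p) / (\ln 2)^2 \approx 1.92 \times 10^8 \ \mathrm{bits} \approx 22.85\ \mathrm{MB}
  k = (m/n) \ln 2 \approx 3.3

i.e. about 3 hash functions.)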
--
Peter Geoghegan
On Tue, Feb 20, 2018 at 8:06 PM, Peter Geoghegan <pg@bowt.ie> wrote:
You should try to exploit the fact that a Bloom filter can summarize a
large set reasonably well with a very compact, simple representation.
A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
for cases where Bloom probes will save a lot of work, it probably
doesn't make all that much difference -- the hash join is still much
faster. If you're willing to accept a 10% false positive rate, then
you can summarize a set of 40 million distinct values with only
22.85MB of memory and 3 hash functions. I think that the smallest
possible amount of memory that any hash table of 40 million elements
would require a much greater amount of memory than 22.85MB --
certainly closer to 20x than to 8x. Even that is pretty conservative.
I bet there are plenty of hash joins were the ratio could be 30x or
more. Not to mention cases with multiple batches.
I've worked a lot with bloom filters, and for large false positive
rates and large sets (multi-million entries), you get bloom filter
sizes of about 10 bits per distinct item.
That's efficient, but it's not magic. It can still happen that the
whole set can't fit in work_mem with an acceptable false positive
rate.
On Tue, Feb 20, 2018 at 3:17 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
I've worked a lot with bloom filters, and for large false positive
rates and large sets (multi-million entries), you get bloom filter
sizes of about 10 bits per distinct item.
It's generally true that you need 9.6 bits per element to get a 1%
false positive rate. 1% could be considered much too low here.
Do we need to eliminate 99% of all hash join probes (that find nothing
to join on) to make this Bloom filter optimization worthwhile?
Personally, I doubt it.
That's efficient, but it's not magic. It can still happen that the
whole set can't fit in work_mem with an acceptable false positive
rate.
A merge join is always going to be the better choice when extremely
memory constrained.
--
Peter Geoghegan
On Tue, Feb 20, 2018 at 8:23 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Feb 20, 2018 at 3:17 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
I've worked a lot with bloom filters, and for large false positive
rates and large sets (multi-million entries), you get bloom filter
sizes of about 10 bits per distinct item.
It's generally true that you need 9.6 bits per element to get a 1%
false positive rate. 1% could be considered much too low here.
Do we need to eliminate 99% of all hash join probes (that find nothing
to join on) to make this Bloom filter optimization worthwhile?
Personally, I doubt it.
Even for 90% it's about 4.6 bits per element.
What I'm saying is that you can't assume you can fit the whole thing
in work_mem. If it could be said that any set worth a hash join will
fit, you can build a work_mem-sized bloom filter and just accept that
if you exceed its capacity its filtering efficiency will drop
benignly. But IMO that can't be taken for granted, so at some point
you should just give up, the overhead of querying a low-quality BF
won't be worth it.
The HLL is good for that. You can keep adding to the BF until the HLL
tells you you're way past the capacity of a work_mem-sized BF, then
you free that memory and go on without it.
You might also want to consider scalable BFs. The smaller you can keep
the BF, the better chance you'll have of fitting it into the L3 cache
(you can fit quite a lot of entries into a decent-sized L3 cache).
On 02/21/2018 12:06 AM, Peter Geoghegan wrote:
On Tue, Feb 20, 2018 at 1:23 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
In 2015/2016 I've been exploring if we could improve hash joins by
leveraging bloom filters [1], and I was reminded about this idea in a
thread about amcheck [2]. I also see that bloom filters were briefly
mentioned in the thread about parallel hash [3].
So I've decided to revive the old patch, rebase it to current master,
and see if we can resolve the issues that killed it in 2016.
Cool.
The issues are essentially about predicting when the bloom filter can
improve anything, which is more a question of the hash join selectivity
than the bloom filter parameters.
Absolutely. A Bloom filter could be considered an opportunistic thing.
The false positive rate that you estimate is going to be based on the
planner's rowcount estimate for the inner side of the join, which is
liable to be wrong anyway. It's important to be resilient to
misestimations, which Bloom filters are good at.
It's a bit more difficult than that, because rowcount is the total
number of rows, but it may not be the same as the number of distinct
values in the join key. But it's an estimate of the upper boundary.
But yeah, that's how the bloom filter is sized in the patch.
This is not the same as "false positive rate" which is a feature
of the bloom filter itself, and can be measured by simply counting
bits set in the filter. That is important too, of course, but
simple to determine.
Perhaps I'm being pedantic, but it's not exactly true that you can
measure the false positive rate by counting the bits. I do agree with
what I think you meant here, though, which is that the precise false
positive rate is not necessarily that important, while the
selectivity of the join is very important.
My reasoning is that given a bloom filter with K out of M bits set, the
probability of the bloom filter saying "true" to a random value is
(K/M)^H
where H is the number of hash functions. I believe that's pretty much
the definition of false positive rate, but I have to admit I haven't
really checked the exact math.
The important takeaway here is that many bits set -> bad.
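(The textbook approximation seems to agree: after inserting n values with
H hash functions, the expected fraction of set bits is
K/M \approx 1 - e^{-Hn/M}, and the false positive rate is

  p \approx (1 - e^{-Hn/M})^H \approx (K/M)^H

so counting the set bits does give a usable estimate of the false
positive rate.)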
In general, hash joins work best when the join qual is highly
selective. This is not the same thing as having a small hash table,
of course.
After thinking about this a bit more, I think this is pretty much
what we do during join cardinality estimation - we estimate how
many of the rows in "fact" have a matching row in "dim". I do think
we can reuse that to decide if to use a bloom filter or not.
Of course, the decision may need to be more elaborate (and perhaps
formulated as part of costing), but the filter selectivity is the
crucial piece. The attached patch does not do that yet, though.
I suspect that it could make sense to use a Bloom filter to
summarize the entire inner side of the join all at once, even when
there are multiple batches. I also suspect that this is particularly
beneficial with parallel hash joins, where IPC/synchronization
overhead can be a big problem.
But that's what the patch does, currently - the filter is built during
the initial pass through the data, and then used for all batches. This
comes from the original idea to throw away rows from the outer relation
(usually the larger one) that have no match in the hash table. Now we
stash them into a batch file, possibly shuffling them between the
batches if we happen to need to add more batches, only to throw them
away much later when we get to that batch.
Actually, now that I think about it - I think the patch should throw
the filter away after the initial pass over the outer relation,
because at that point we've used all the information in the filter.
I'm not sure it would make sense to then build smaller bloom filters for
individual batches, but maybe it would?
4) Initially, the patch was aimed at hashjoins with batches, on the
premise that it can save some of the serialize/deserialize costs. But as
mentioned in the discussion, bloom filters can be beneficial even in the
single-batch case, when three conditions are met:
a) join is selective (i.e. some rows don't have match in hash table)
b) hash table > CPU cache
c) bloom filter < CPU cache
Seems plausible. CPU cache efficiency is also affected by how many
hash functions you end up using -- use too many, and it slows things
down.
Yeah, I admit those are rather crude rules.
We don't have a good way to determine size of the CPU cache, but I
think we can use some crude arbitrary limits. E.g. require that
hash hash table is larger than 8MB and bloom filter is smaller than
4MB, or something like that.
FWIW, the logical way to size the Bloom filter is based on the number
of (distinct) values, not based on the projected size of the hash
table. The lossy nature of Bloom filters makes the average size of the
original, hashed elements irrelevant to a sizing calculation that
targets a certain final false positive rate. Making Bloom filter size
any fixed fraction of hash table size seems wrong to me for that
reason alone.
Sure, and the patch does exactly that when we actually have a good
estimate (although it's using the expected number of rows of the inner
relation, not ndistinct).
The trouble is that when we start with a single batch and then find out
the estimate was wrong and we need to start batching, all bets are off.
At that point it seems reasonable to just say "Here is X MBs of RAM, do
what you can".
You should try to exploit the fact that a Bloom filter can summarize a
large set reasonably well with a very compact, simple representation.
A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
for cases where Bloom probes will save a lot of work, it probably
doesn't make all that much difference -- the hash join is still much
faster. If you're willing to accept a 10% false positive rate, then
you can summarize a set of 40 million distinct values with only
22.85MB of memory and 3 hash functions. I think that the smallest
possible amount of memory that any hash table of 40 million elements
would require a much greater amount of memory than 22.85MB --
certainly closer to 20x than to 8x. Even that is pretty conservative.
I bet there are plenty of hash joins were the ratio could be 30x or
more. Not to mention cases with multiple batches.
But the problem is that I don't know what the total size of the hash
table is, because we're building the bloom filter for all the batches at
once. And we don't know how many batches there will be - if we knew
that, we could estimate the number of distinct values and we could use
that to size the filter instead of doing this. (All of this only applies
to the state where we start with a single batch and then find out we
need to start batching, of course.)
In other words, I'm not claiming 1/8 of a hash table is a reasonable
size for bloom filter, but that 1/8 of work_mem seems like a reasonable
compromise for sizing the bloom filter. Perhaps there should be some
upper boundary for very large work_mem values, though.
Regarding false positive rate - I agree it may be more efficient to use
a smaller bloom filter with worse false positive rate, particularly if
the smaller one fits into a CPU cache and the larger one does not.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Feb 20, 2018 at 3:48 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
Do we need to eliminate 99% of all hash join probes (that find nothing
to join on) to make this Bloom filter optimization worthwhile?
Personally, I doubt it.
Even for 90% it's about 4.6 bits per element.
4.6 bits is vastly less than typical hash join hash table entry sizes,
which are often comprised of several attributes. Even the skinniest
possible hash table elements have HJTUPLE_OVERHEAD() overhead, as well
as MinimalTuple overhead. To say nothing of all the serialization and
synchronization overhead that you could also have.
Whatever precise per-element overhead you come up with for your Bloom
filter, the *ratio* of those two things is what makes using a Bloom
filter potentially very effective. It could end up being something
like 1:100, which is a huge win for a highly selective hash join that
still has a fairly large hash table (that cannot fit in L3
cache/work_mem).
What I'm saying is that you can't assume you can fit the whole thing
in work_mem. If it could be said that any set worth a hash join will
fit, you can build a work_mem-sized bloom filter and just accept that
if you exceed its capacity its filtering efficiency will drop
benignly. But IMO that can't be taken for granted, so at some point
you should just give up, the overhead of querying a low-quality BF
won't be worth it.
It's probably true that we will need to bail out of using a Bloom
filter, and it is certainly true that the optimization won't always
work out. Still, once your Bloom filter proves useless, chances are
good that the plan isn't a very good one, with or without that extra
optimization. Even if the plan is truly the fastest possible plan, and
even if you avoid wasting extra effort on the Bloom filter, it will
still be very slow.
Join selectivity is what really matters with this optimization. Of
course the optimization might not work out, but you can say that about
almost anything that uses hashing -- that's why safety critical
realtime systems (e.g. avionics systems) often completely avoid using
hash tables. Bloom filters are fairly resilient to somewhat inaccurate
estimates, but nothing will save you from terrible estimates. I think
that it's worth risking a small, predictable regression when the join
turns out to not be selective, in order to get a potentially very
large improvement in performance when the planner estimates join
selectivity reasonably accurately (and it's a selective hash join that
won't simply have a small hash table to begin with).
--
Peter Geoghegan
On Tue, Feb 20, 2018 at 3:54 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
I suspect that it could make sense to use a Bloom filter to
summarize the entire inner side of the join all at once, even when
there are multiple batches. I also suspect that this is particularly
beneficial with parallel hash joins, where IPC/synchronization
overhead can be a big problem.
But that's what the patch does, currently - the filter is built during
the initial pass through the data, and then used for all batches.
I misunderstood. I would probably do something like double or triple
the original rows estimate instead, though. The estimate must be at
least slightly inaccurate when we get to this point, but I don't think
that that's a good enough reason to give up on the estimate
completely.
Actually, now that I think about it - I think the patch should throw
away the filter away after the initial pass over the outer relation,
because at that point we've used all the information in the filter.
Makes sense.
I'm not sure it would make sense to then build smaller bloom filters for
individual batches, but maybe it would?
I doubt it.
Yeah, I admit those are rather crude rules.
You have to start somewhere.
The trouble is that when we start with a single batch and then find out
the estimate was wrong and we need to start batching, all bets are off.
At that point it seems reasonable to just say "Here is X MBs of RAM, do
what you can".
As I said above, I wouldn't say all bets are off when this happens --
not at all. Estimates are likely to often be somewhat wrong. If
they're completely wrong, we can probably swallow the cost of giving
up on a Bloom filter relatively late.
As I said, X should not be a portion of work_mem, because that has
only a weak relationship to what really matters.
You should try to exploit the fact that a Bloom filter can summarize a
large set reasonably well with a very compact, simple representation.
A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
for cases where Bloom probes will save a lot of work, it probably
doesn't make all that much difference -- the hash join is still much
faster.
But the problem is that I don't know what is the total size of the hash
table, because we're building the bloom filter for all the batches at
once. And we don't know how many batches will be there - if we knew
that, we could estimate the number of distinct values and we could use
that to size the filter instead of doing this. (All of this only applies
to the state where we start with a single batch and then find out we
need to start batching, of course.)
I don't think that the number of batches should matter that much to
the costing/sizing/estimation logic, even if it's an interesting thing
to talk about when considering the upsides of a Bloom filter. My sense
is that we should focus on making sure that using a Bloom filter is
going to win, and not worry so much about whether that's going to be a
huge win or a modest win.
Suppose that you thought you were going to have a 10% false positive
rate with a 22.85MB Bloom filter for 40 million elements (my earlier
example). Further suppose that it turns out to be 80 million elements.
This means that you're going to get a false positive rate of 30%. This
could still be a big win, though, so it's not really a bad situation.
With 120 million elements, it's about 45%, but even then it could
still work out, especially because the hash join will be very slow
anyway. You also have to bear in mind that the 40 million estimate is
much more likely to be too high than too low, because you're assuming
distinct key values for the hash table.
You're taking a risk, in a sense, but a lot of things have to go wrong
for you to lose, and even then you can cut your losses before the
extra cost is too high.
Do you have a test query for this patch, that gives us some idea of the upside?
--
Peter Geoghegan
On Wed, Feb 21, 2018 at 10:23 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
In 2015/2016 I've been exploring if we could improve hash joins by
leveraging bloom filters [1], and I was reminded about this idea in a
thread about amcheck [2]. I also see that bloom filters were briefly
mentioned in the thread about parallel hash [3].
So I've decided to revive the old patch, rebase it to current master,
and see if we can resolve the issues that killed it in 2016.
Nice!
Opinions?
I'm definitely following this and interested in helping in some way if
I can. I have wondered about this subject and discussed it a bit with
Peter Geoghegan off-list.
Some assorted thoughts:
In the old thread, Peter pointed at a curious undergrad student
project from 2008 [1] evaluating Bloom filters for hash joins in
PostgreSQL 8.3, inspired by a couple of older papers [2][3]. While
your patch uses a Bloom filter to short-circuit the regular bucket
probe in ExecHashJoinImpl(), these approaches push the Bloom filter down
into the outer relation scan. I suspect you're right about the fixed
sizing being a problem, but the general idea seems pretty interesting
to me and there seems to be no reason you couldn't make the filter
size dynamic as you have it and then share it via a parameter or
something. But is there any point?
On the one hand, pushing down Bloom filters requires the hash value to
be computed by the lower scan, and then computed again if the tuple
survives the filter and makes it into the Hash Join node (unless there
is some way to attach it to the tuple...). On the other hand,
throwing away tuples sooner can avoid more work, particularly in the
case of multi-joins.
Based on a hint from Jim Finnerty at PGCon 2017 and also some light
reading of ancient stuff on join pipelining, I've been wondering how
the potential effectiveness of Bloom filters is affected by the fact
that we prefer left-deep nested hash joins and for us build relation =
right/inner relation. That policy means that we build hash tables for
all the joins up front, which means we could potentially push all
their Bloom filters down to the ultimate outer scan (or as far down as
possible depending on the qual) as I now find described in [2]. So:
          H1
         /  \
        H2   u
       /  \
      H3   t
     /  \
    r    s
You might be able to push filters B1, B2, B3 constructed from H1, H2,
H3 all the way down to the scan of r and then probe them in order of
selectivity (a bit like order_qual_clauses does with any other kind of
qual). (Because of the empty-outer optimisation's preliminary attempt
to pull a tuple on the left I guess you might need to push down a
placeholder filter first and then later replace it with the real
filter if/when you eventually build it.)
In contrast to that situation, suppose we introduced whole-query
memory budgets as Peter Geoghegan and several others have proposed.
Then we'd suddenly have a potential reason to prefer right-deep plans:
   H1
  /  \
 r    H2
     /  \
    s    H3
        /  \
       t    u
Many other RDBMSs would consider this (modulo some confusion about
hash join polarity: other systems and literature have build = inner =
left, probe = outer = right so they'd draw that the other way around
and call it left deep, somewhat distractingly...) because it only
requires two hash tables to be loaded into memory at a time, a useful
property if you want to stay inside a whole-query memory budget.
That's because the output of each hash join is the input to the hash
table above it, so in theory we only need the one we're building and
the one we're probing at each moment. Whatever other pros and cons
might exist with this plan shape compared to the left-deep variant, my
point is that a right-deep join could only push each filter down to
the immediate scan of r, s, t, which wouldn't be much of a potential
improvement over the way your current patch does it (ie entirely
inside the hash join, no push down), because it's not able to move the
Bloom filter very far away from the hash join anyway. At best it
could perhaps move it before some more expensive/less selective other
quals in the scan.
In other words, I wonder if our left-deep plans stand to gain more
from Bloom filter push down.
[1]: http://www.nus.edu.sg/nurop/2010/Proceedings/SoC/NUROP_Congress_Cheng%20Bin.pdf
[2]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.73.8069&rep=rep1&type=pdf
[3]: https://pdfs.semanticscholar.org/ec99/6093805012b9e0ff06bb2050436091671091.pdf
--
Thomas Munro
http://www.enterprisedb.com
On 02/21/2018 02:10 AM, Peter Geoghegan wrote:
On Tue, Feb 20, 2018 at 3:54 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
I suspect that it could make sense to use a Bloom filter to
summarize the entire inner side of the join all at once, even
when there are multiple batches. I also suspect that this is
particularly beneficial with parallel hash joins, where
IPC/synchronization overhead can be a big problem.
But that's what the patch does, currently - the filter is built
during the initial pass through the data, and then used for all
batches.
I misunderstood. I would probably do something like double or triple
the original rows estimate instead, though. The estimate must be at
least slightly inaccurate when we get to this point, but I don't
think that that's a good enough reason to give up on the estimate
completely.
That's a problem only for the multi-batch case, though.
With a single batch we can walk the hash table and count non-empty
buckets, to get a good ndistinct estimate cheaply. And then size the
filter considering both memory requirements (fits into CPU cache) and
false positive rate. There are other things we may need to consider
(memory usage vs. work_mem) but that's a separate issue.
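Roughly something like this - just a sketch of a hypothetical helper,
written against the private (non-parallel) hash table, so the field
names may need adjusting:

/*
 * Sketch: estimate ndistinct as the number of non-empty buckets. With
 * NTUP_PER_BUCKET = 1 this should be a reasonably tight lower bound on
 * the number of distinct hash values in the hash table.
 */
static double
ExecHashEstimateNdistinct(HashJoinTable hashtable)
{
	int			i;
	double		ndistinct = 0;

	for (i = 0; i < hashtable->nbuckets; i++)
	{
		if (hashtable->buckets.unshared[i] != NULL)
			ndistinct += 1;
	}

	return ndistinct;
}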
With multiple batches I think we could use the "size the bloom filter
for a fraction of work_mem" which the current patch uses when switching
to multiple batches halfway-through. That pretty much entirely ignores
the estimate and essentially replaces it with a "fictional" estimate.
I think that's a better approach than using some arbitrary multiple of
the estimate. When we have to start batching halfway through, the
estimate is proven to be rather bogus anyway, but we may treat it as a
lower boundary for the bloom filter size.
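(To put a number on it: a filter of m bits sized this way can absorb about

  n \approx m (\ln 2)^2 / \ln(1/p)

distinct values before the false positive rate starts degrading - e.g.
with work_mem = 32MB (an arbitrary example) the 4MB filter would handle
roughly 5.4 million distinct values at p = 0.05.)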
Actually, now that I think about it - I think the patch should
throw away the filter away after the initial pass over the outer
relation, because at that point we've used all the information in
the filter.
Makes sense.
Actually, the patch already does that - it stops using the filter if
(curbatch != 0). We don't throw it away, though, because it also
includes some additional instrumentation that is shown by EXPLAIN ANALYZE.
I'm not sure it would make sense to then build smaller bloom
filters for individual batches, but maybe it would?
I doubt it.
I think it might help if the global bloom filter ended up having a high
false positive rate, but only if the per-batch filters fit into CPU
cache (i.e. it's the same reasoning as for the single-batch case).
But those "per-batch" filters are rather incompatible with pushing the
filter to scan nodes, I think.
Yeah, I admit those are rather crude rules.
You have to start somewhere.
The trouble is that when we start with a single batch and then find
out the estimate was wrong and we need to start batching, all bets
are off. At that point it seems reasonable to just say "Here is X
MBs of RAM, do what you can".As I said above, I wouldn't say all bets are off when this happens
-- not at all. Estimates are likely to often be somewhat wrong. If
they're completely wrong, we can probably swallow the cost of giving
up on a Bloom filter relatively late.
As I said, X should not be a portion of work_mem, because that has
only a weak relationship to what really matters.
I agree a fixed fraction of work_mem may not be the right thing, but the
goal was to make the bloom filter part of the Hash memory budget, i.e.
bloom filter + hash table <= work_mem
(which I think we agree should be the case), without increasing the
number of batches too much. For example, if you size the filter ignoring
this, and it ends up being 90% of work_mem, you may need to do the hash
join in 128 batches instead of just 16. Or something like that.
Maybe that would still be a win, though. Firstly, the higher number of
batches may not have a huge impact - in one case we need to serialize
15/16 and in the other one 127/128. That's 93% vs. 99%. And if the more
accurate filter allows us to discard much more data from the outer
relation ...
You should try to exploit the fact that a Bloom filter can summarize a
large set reasonably well with a very compact, simple representation.
A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
for cases where Bloom probes will save a lot of work, it probably
doesn't make all that much difference -- the hash join is still much
faster.
But the problem is that I don't know what is the total size of the
hash table, because we're building the bloom filter for all the
batches at once. And we don't know how many batches will be there -
if we knew that, we could estimate the number of distinct values
and we could use that to size the filter instead of doing this.
(All of this only applies to the state where we start with a single
batch and then find out we need to start batching, of course.)
I don't think that the number of batches should matter that much to
the costing/sizing/estimation logic, even if it's an interesting
thing to talk about when considering the upsides of a Bloom filter.
My sense is that we should focus on making sure that using a Bloom
filter is going to win, and not worry so much about whether that's
going to be a huge win or a modest win.
Suppose that you thought you were going to have a 10% false positive
rate with a 22.85MB Bloom filter for 40 million elements (my earlier
example). Further suppose that it turns out to be 80 million
elements. This means that you're going to get a false positive rate
of 30%. This could still be a big win, though, so it's not really a
bad situation. With 120 million elements, it's about 45%, but even
then it could still work out, especially because the hash join will
be very slow anyway. You also have to bear in mind that the 40
million estimate is much more likely to be too high than too low,
because you're assuming distinct key values for the hash table.

You're taking a risk, in a sense, but a lot of things have to go
wrong for you to lose, and even then you can cut your losses before
the extra cost is too high.
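
To make the arithmetic concrete, here is a minimal standalone sketch of the
textbook Bloom filter formulas behind the example above (illustrative only,
not code from the patch; the exact percentages depend on how the number of
hash functions is rounded, so they only roughly match the figures quoted
here):

#include <math.h>
#include <stdio.h>

/* optimal number of bits: m = -n * ln(p) / (ln 2)^2 */
static double
bloom_nbits(double ndistinct, double fpr)
{
    return -ndistinct * log(fpr) / (log(2.0) * log(2.0));
}

/* optimal number of hash functions: k = (m / n) * ln(2) */
static int
bloom_nhashes(double nbits, double ndistinct)
{
    return (int) round((nbits / ndistinct) * log(2.0));
}

/* realized false positive rate: p = (1 - e^(-k*n/m))^k */
static double
bloom_fpr(double nbits, int nhashes, double ndistinct)
{
    return pow(1.0 - exp(-nhashes * ndistinct / nbits), nhashes);
}

int
main(void)
{
    double  nbits = bloom_nbits(40e6, 0.10);    /* sized for 40M keys, 10% FPR */
    int     nhashes = bloom_nhashes(nbits, 40e6);

    printf("filter: %.2f MB, %d hash functions\n",
           nbits / 8 / (1024 * 1024), nhashes);

    /* how the rate degrades when the estimate was 2x or 3x too low */
    printf("fpr at  40M keys: %.1f%%\n", 100 * bloom_fpr(nbits, nhashes, 40e6));
    printf("fpr at  80M keys: %.1f%%\n", 100 * bloom_fpr(nbits, nhashes, 80e6));
    printf("fpr at 120M keys: %.1f%%\n", 100 * bloom_fpr(nbits, nhashes, 120e6));

    return 0;
}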
Do you have a test query for this patch, that gives us some idea of
the upside?
I have to admit I've been using only some simplistic ad-hoc queries.
There was a more detailed analysis in the old thread, but I'm not sure
how relevant it still is.
So I did some measurements on a simple join, with different work_mem
values and join selectivity.
select count(*)
from fact join dim on (dim.a = fact.a and dim.b < :sel)
Attached are results for "small" data set 20M:1M, the full results and
scripts are available here:
https://github.com/tvondra/hashjoin-bloom-tests
The first list shows a summary of the results, with timings for
a) master
b) patched master (with bloom filters disabled)
c) patched master (with bloom filters used always)
The last two tables compare b/a and c/a. The first one shows that
there's 0-3% overhead when bloom filters are not used (but it might
easily be just noise or differences in the layout of the binary).
The second one is the more interesting one. It shows a couple of things:
a) For tiny hash tables there's about 11% overhead. 1% selectivity means
the hash table has only 10000 entries, which fits into ~483kB. This is
why I think we need a rule that for small hash tables we don't build bloom
filters.
b) For low selectivity (70% or more rows get into the hash table), the
bloom filter is a net loss, costing up to ~11%. This is why we should
consider selectivity of the join, I think.
c) For selectivity between 10% and 70% it's a net win, with speedups
between ~10% (single batch) and ~20% (multi-batch).
Those are relatively modest improvements; I expect more significant
gains on the large data set.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 02/21/2018 08:17 AM, Thomas Munro wrote:
On Wed, Feb 21, 2018 at 10:23 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

In 2015/2016 I've been exploring if we could improve hash joins by
leveraging bloom filters [1], and I was reminded about this idea in a
thread about amcheck [2]. I also see that bloom filters were briefly
mentioned in the thread about parallel hash [3].

So I've decided to revive the old patch, rebase it to current
master, and see if we can resolve the issues that killed it in
2016.

Nice!
Opinions?
I'm definitely following this and interested in helping in some way
if I can. I have wondered about this subject and discussed it a bit
with Peter Geoghegan off-list.
Good ;-)
I think one important thing we need to figure out is the costing, or
some other way that would allow us to decide when to build the Bloom
filters (and perhaps whether to prefer a larger and more accurate
one, or a smaller one).
But if you want to look into adding support for parallel hash, or
pushing the bloom filter down to the scans, feel free to do so.
Some assorted thoughts:
In the old thread, Peter pointed at a curious undergrad student
project from 2008[1] evaluating Bloom filters for hash joins in
PostgreSQL 8.3, inspired by a couple of older papers[2][3]. While
your patch uses a Bloom filter to short-circuit the regular bucket
probe in ExecHashJoinImpl(), this approach pushes the Bloom filter down
into the outer relation scan. I suspect you're right about the fixed
sizing being a problem, but the general idea seems pretty interesting
to me and there seems to be no reason you couldn't make the filter
size dynamic as you have it and then share it via a parameter or
something. But is there any point?

On the one hand, pushing down Bloom filters requires the hash value
to be computed by the lower scan, and then computed again if the
tuple survives the filter and makes it into the Hash Join node
(unless there is some way to attach it to the tuple...). On the
other hand, throwing away tuples sooner can avoid more work,
particularly in the case of multi-joins.
I do agree it's an interesting idea, and being able to push the filter
down would be great, particularly in case of very selective joins (i.e.
when many outer rows have no match in the hash table). I have no idea
how much infrastructure it would require, though, or how widely it could
be used.
Judging by your thoughts on the impact of left-deep vs. right-deep joins
etc. you've already given this far more thought than I did ;-)
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Feb 21, 2018 at 11:21 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
On 02/21/2018 02:10 AM, Peter Geoghegan wrote:
On Tue, Feb 20, 2018 at 3:54 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I suspect that it could make sense to use a Bloom filter to
summarize the entire inner side of the join all at once, even
when there are multiple batches. I also suspect that this is
particularly beneficial with parallel hash joins, where
IPC/synchronization overhead can be a big problem.

But that's what the patch does, currently - the filter is built
during the initial pass through the data, and then used for all
batches.

I misunderstood. I would probably do something like double or triple
the original rows estimate instead, though. The estimate must be at
least slightly inaccurate when we get to this point, but I don't
think that that's a good enough reason to give up on the estimate
completely.

That's a problem only for the multi-batch case, though.
With a single batch we can walk the hash table and count non-empty
buckets, to get a good ndistinct estimate cheaply. And then size the
filter considering both memory requirements (fits into CPU cache) and
false positive rate. There are other things we may need to consider
(memory usage vs. work_mem) but that's a separate issue.

With multiple batches I think we could use the "size the bloom filter
for a fraction of work_mem" which the current patch uses when switching
to multiple batches halfway-through. That pretty much entirely ignores
the estimate and essentially replaces it with a "fictional" estimate.

I think that's a better approach than using some arbitrary multiple of
the estimate. When we have to start batching halfway through, the
estimate is proven to be rather bogus anyway, but we may treat it as a
lower boundary for the bloom filter size.
...
As I said, X should not be a portion of work_mem, because that has
only a weak relationship to what really matters.

I agree a fixed fraction of work_mem may not be the right thing, but the
goal was to make the bloom filter part of the Hash memory budget, i.e.
bloom filter + hash table <= work_mem
(which I think we agree should be the case), without increasing the
number of batches too much. For example, if you size the filter ignoring
this, and it ends up being 90% of work_mem, you may need to do the hash
join in 128 batches instead of just 16. Or something like that.

Maybe that would still be a win, though. Firstly, the higher number of
batches may not have a huge impact - in one case we need to serialize
15/16 and in the other one 127/128. That's 93% vs. 99%. And if the more
accurate filter allows us to discard much more data from the outer
relation ...
Let me reiterate, you can avoid both issues with scalable bloom filters
[1]: http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf
An HLL can be used to estimate set size, the paper makes no mention of
it, probably assuming only distinct items are added to the set.
On 02/22/2018 12:44 PM, Claudio Freire wrote:
On Wed, Feb 21, 2018 at 11:21 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 02/21/2018 02:10 AM, Peter Geoghegan wrote:
...
I misunderstood. I would probably do something like double or triple
the original rows estimate instead, though. The estimate must be at
least slightly inaccurate when we get to this point, but I don't
think that that's a good enough reason to give up on the estimate
completely.

That's a problem only for the multi-batch case, though.
With a single batch we can walk the hash table and count non-empty
buckets, to get a good ndistinct estimate cheaply. And then size the
filter considering both memory requirements (fits into CPU cache) and
false positive rate. There are other things we may need to consider
(memory usage vs. work_mem) but that's a separate issue.

With multiple batches I think we could use the "size the bloom filter
for a fraction of work_mem" which the current patch uses when switching
to multiple batches halfway-through. That pretty much entirely ignores
the estimate and essentially replaces it with a "fictional" estimate.

I think that's a better approach than using some arbitrary multiple of
the estimate. When we have to start batching halfway through, the
estimate is proven to be rather bogus anyway, but we may treat it as a
lower boundary for the bloom filter size.

...
As I said, X should not be a portion of work_mem, because that has
only a weak relationship to what really matters.

I agree a fixed fraction of work_mem may not be the right thing, but the
goal was to make the bloom filter part of the Hash memory budget, i.e.
bloom filter + hash table <= work_mem
(which I think we agree should be the case), without increasing the
number of batches too much. For example, if you size the filter ignoring
this, and it ends up being 90% of work_mem, you may need to do the hash
join in 128 batches instead of just 16. Or something like that.

Maybe that would still be a win, though. Firstly, the higher number of
batches may not have a huge impact - in one case we need to serialize
15/16 and in the other one 127/128. That's 93% vs. 99%. And if the more
accurate filter allows us to discard much more data from the outer
relation ...

Let me reiterate, you can avoid both issues with scalable bloom filters[1].
I'm afraid it's not as straight-forward as "Use scalable bloom filters!"
This is not merely a question of unreliable estimates of the number of
entries. That could have been solved by scalable bloom filters, which
are essentially a sequence of larger and larger bloom filters, added
when the smaller bloom filter "fills up" (1/2 full).
The problem is twofold:
(a) we need to decide what false positive rate to use (i.e. what is a
reasonable trade-off between filter size and how much work it saves)
(b) we also need to consider work_mem (which I assume we all agree we
must respect)
So for example we can't size the first bloom filter to just perfectly
fit into work_mem, only to add larger bloom filters later (each 2x the
size of the previous one). Not only will that increase the memory usage
to 7x the initial estimate (1x + 2x + 4x after just two doublings), but
it will also make the bloom filter less
efficient (having to probe larger and larger filters, likely not fitting
into CPU cache).
An HLL can be used to estimate set size, the paper makes no mention of
it, probably assuming only distinct items are added to the set.
The problem with HLL is, it's only an estimate of how many entries you
saw so far. It only tells you that after observing the items, and it
only tells you how many items you saw so far. What we need for sizing a
bloom filter is an estimate of number of distinct values in advance.
In other words, HLL is entirely useless for sizing Bloom Filters.
Furthermore, we could estimate the number of observed distinct values
from the number of 1s in the bloom filter - we essentially ask "How many
items did we observe, if each item sets k random bits and we see K bits
set?" HLL does the same thing, but it throws away the ability to answer
which elements are in the set.
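
To make that concrete, here is a tiny sketch of such an estimator
(illustrative only, not code from the patch): with m bits and k hash
functions the expected number of set bits after n distinct insertions is
X ~= m * (1 - e^(-k*n/m)), which we can invert to estimate n from the
observed fill.

#include <math.h>
#include <stdint.h>

/* n ~= -(m / k) * ln(1 - X / m) */
static double
bloom_estimate_ndistinct(uint64_t nbits, int nhashes, uint64_t nbits_set)
{
    return -((double) nbits / nhashes) *
        log(1.0 - (double) nbits_set / (double) nbits);
}

This is essentially linear counting on bits we already have to maintain
anyway, so the estimate comes for free once the filter is built.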
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 22, 2018 at 12:45 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
On 02/22/2018 12:44 PM, Claudio Freire wrote:
Let me reiterate, you can avoid both issues with scalable bloom filters[1].
I'm afraid it's not as straight-forward as "Use scalable bloom filters!"
This is not merely a question of unreliable estimates of number of
entries. That could have been solved by scalable bloom filters, which
are essentially a sequence of larger and larger bloom filters, added
when the smaller bloom filter "fills up" (1/2 full).

The problem is twofold:
(a) we need to decide what false positive rate to use (i.e. what is a
reasonable trade-off between filter size and how much work it saves)

(b) we also need to consider work_mem (which I assume we all agree we
must respect)
So for example we can't size the first bloom filter to just perfectly
fit into work_mem, only to add larger bloom filters later (each 2x the
size of the previous one). Not only will that increase the memory usage
to 7x the initial estimate
Aside from (a), (b) is exactly the problem SBFs solve.
You decide how much of work_mem you're willing to devote to BFs, and
then keep creating bigger and bigger BFs until you've allocated that
much.
Then you either keep updating the biggest filter, or give up entirely.
You can use a rather conservative initial bloom filter size for that,
and let scalability expand until you hit the limit.
but it will also make the bloom filter less
efficient (having to probe larger and larger filters, likely not fitting
into CPU cache).
Again, you factor that into account when choosing the limit.
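
For reference, here is a rough standalone sketch of the scalable Bloom
filter scheme under discussion (purely illustrative, not code from the
patch; it assumes pre-computed 64-bit hash values, and the hash count and
filter limits are made-up constants). Note that probes have to check every
filter built so far, which is the CPU-cache concern raised elsewhere in
this thread.

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define SBF_NHASHES     4       /* hash functions per filter */
#define SBF_MAX_FILTERS 16

typedef struct Filter
{
    uint64_t    nbits;
    uint64_t    nset;           /* set bits, tracked as we insert */
    uint8_t    *bits;
} Filter;

typedef struct ScalableBloom
{
    int         nfilters;
    uint64_t    bits_used;
    uint64_t    bits_limit;     /* memory budget, in bits */
    Filter      filters[SBF_MAX_FILTERS];
} ScalableBloom;

/* derive the i-th probe position from one 64-bit hash (double hashing) */
static uint64_t
nth_bit(uint64_t hash, int i, uint64_t nbits)
{
    return ((hash & 0xffffffff) + (uint64_t) i * (hash >> 32)) % nbits;
}

static void
filter_init(Filter *f, uint64_t nbits)
{
    f->nbits = nbits;
    f->nset = 0;
    f->bits = calloc((nbits + 7) / 8, 1);
}

static void
sbf_init(ScalableBloom *sbf, uint64_t initial_nbits, uint64_t bits_limit)
{
    sbf->nfilters = 1;
    sbf->bits_used = initial_nbits;
    sbf->bits_limit = bits_limit;
    filter_init(&sbf->filters[0], initial_nbits);
}

static void
sbf_add(ScalableBloom *sbf, uint64_t hash)
{
    Filter     *f = &sbf->filters[sbf->nfilters - 1];

    /* current filter half full? add one twice the size, budget permitting */
    if (f->nset >= f->nbits / 2 &&
        sbf->nfilters < SBF_MAX_FILTERS &&
        sbf->bits_used + 2 * f->nbits <= sbf->bits_limit)
    {
        filter_init(&sbf->filters[sbf->nfilters], 2 * f->nbits);
        sbf->bits_used += 2 * f->nbits;
        f = &sbf->filters[sbf->nfilters++];
    }

    /* otherwise we simply keep filling the largest filter */
    for (int i = 0; i < SBF_NHASHES; i++)
    {
        uint64_t    bit = nth_bit(hash, i, f->nbits);

        if (!(f->bits[bit / 8] & (1 << (bit % 8))))
        {
            f->bits[bit / 8] |= (1 << (bit % 8));
            f->nset++;
        }
    }
}

/* a value may be present if any of the filters reports a match */
static bool
sbf_may_contain(ScalableBloom *sbf, uint64_t hash)
{
    for (int j = 0; j < sbf->nfilters; j++)
    {
        Filter     *f = &sbf->filters[j];
        int         i;

        for (i = 0; i < SBF_NHASHES; i++)
        {
            uint64_t    bit = nth_bit(hash, i, f->nbits);

            if (!(f->bits[bit / 8] & (1 << (bit % 8))))
                break;
        }
        if (i == SBF_NHASHES)
            return true;
    }
    return false;
}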
An HLL can be used to estimate set size, the paper makes no mention of
it, probably assuming only distinct items are added to the set.

The problem with HLL is, it's only an estimate of how many entries you
saw so far. It only tells you that after observing the items, and it
only tells you how many items you saw so far. What we need for sizing a
bloom filter is an estimate of number of distinct values in advance.

In other words, HLL is entirely useless for sizing Bloom Filters.
Normal BFs, yes. But that's exactly what you need for scalable BFs.
You need an estimate of the amount of distinct entries you've added to
your current filter, not the total set size.
Furthermore, we could estimate number of observed distinct values from
the number of 1s in the bloom filter
That's kinda slow to do per-item. I tried to "count" distinct items by
checking the BF before adding (don't add redundantly), but that's less
precise than a HLL in my experience.
On 02/22/2018 08:33 PM, Claudio Freire wrote:
On Thu, Feb 22, 2018 at 12:45 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 02/22/2018 12:44 PM, Claudio Freire wrote:
...
An HLL can be used to estimate set size, the paper makes no
mention of it, probably assuming only distinct items are added to
the set.

The problem with HLL is, it's only an estimate of how many entries
you saw so far. It only tells you that after observing the items,
and it only tells you how many items you saw so far. What we need
for sizing a bloom filter is an estimate of number of distinct
values in advance.

In other words, HLL is entirely useless for sizing Bloom Filters.
Normal BFs, yes. But that's exactly what you need for scalable BFs.
You need an estimate of the amount of distinct entries you've added
to your current filter, not the total set size.
No, you don't need that. For the SBF you need to know when the BF gets
full, which can be determined solely based on the number of bits set to 1.
Essentially, the BF reaches the target false positive rate when the fill
factor reaches 50%. Which is exactly what the SBF paper says on page 4:
... when filters get full due to the limit on fill ratio, a new one is
added.
Furthermore, we could estimate number of observed distinct values from
the number of 1s in the bloom filter

That's kinda slow to do per-item. I tried to "count" distinct items by
checking the BF before adding (don't add redundantly), but that's less
precise than a HLL in my experience.
But you don't need to do that for each item you try to add - you know
that with M bits and K hash functions you can't reach 50% before at
least M/(2*K) additions. And you don't really need to track the number
of bits exactly - if it gets 55% full, it's not going to explode.
So just wait until M/(2*K) additions, see how many bits remain until the
threshold - rinse and repeat. And use some reasonable minimum distance
(say, 5% of the capacity), not to do the check too often.
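
To make that concrete, a small sketch of the check (illustrative only, not
code from the patch; the struct and the 5% step are made up, and next_check
is assumed to start at m/(2*k)):

#include <stdbool.h>
#include <stdint.h>

typedef struct BloomFilter
{
    uint64_t    nbits;          /* m */
    int         nhashes;        /* k */
    uint64_t    nadded;         /* insertions so far */
    uint64_t    next_check;     /* don't scan the bitmap before this many */
    uint8_t    *bits;
} BloomFilter;

/* count set bits - the expensive part we want to do rarely */
static uint64_t
bloom_count_set_bits(BloomFilter *f)
{
    uint64_t    nset = 0;

    for (uint64_t i = 0; i < (f->nbits + 7) / 8; i++)
        for (uint8_t b = f->bits[i]; b != 0; b &= (b - 1))
            nset++;

    return nset;
}

/* call after each insertion; returns true once the filter is ~50% full */
static bool
bloom_is_full(BloomFilter *f)
{
    uint64_t    threshold = f->nbits / 2;
    uint64_t    capacity = threshold / f->nhashes;  /* m / (2*k) */
    uint64_t    nset;
    uint64_t    step;

    if (++f->nadded < f->next_check)    /* can't have crossed 50% yet */
        return false;

    nset = bloom_count_set_bits(f);
    if (nset >= threshold)
        return true;

    /* each insertion sets at most k bits, so this many more are needed */
    step = (threshold - nset) / f->nhashes;
    if (step < capacity / 20)           /* but at least ~5% of capacity */
        step = capacity / 20;

    f->next_check = f->nadded + step;
    return false;
}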
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 22, 2018 at 5:11 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
On 02/22/2018 08:33 PM, Claudio Freire wrote:
That's kinda slow to do per-item. I tried to "count" distinct items by
checking the BF before adding (don't add redundantly), but that's less
precise than a HLL in my experience.

But you don't need to do that for each item you try to add - you know
that with M bits and K hash functions you can't reach 50% before at
least M/(2*K) additions. And you don't really need to track the number
of bits exactly - if it gets 55% full, it's not going to explode.

So just wait until M/(2*K) additions, see how many bits remain until the
threshold - rinse and repeat. And use some reasonable minimum distance
(say, 5% of the capacity), not to do the check too often.
Ok, that works too.
Point is, SBFs help to keep the BF size as small as possible, keep it
below work_mem, and to avoid caring about the total number of distinct
items.
You may want the planner to try and estimate that to figure out
whether it's worth trying BFs or not, but once you decide to try, SBFs
remove any dependency on the accuracy of that estimate.
On 02/22/2018 09:52 PM, Claudio Freire wrote:
On Thu, Feb 22, 2018 at 5:11 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 02/22/2018 08:33 PM, Claudio Freire wrote:
That's kinda slow to do per-item. I tried to "count" distinct items by
checking the BF before adding (don't add redundantly), but that's less
precise than a HLL in my experience.

But you don't need to do that for each item you try to add - you know
that with M bits and K hash functions you can't reach 50% before at
least M/(2*K) additions. And you don't really need to track the number
of bits exactly - if it gets 55% full, it's not going to explode.

So just wait until M/(2*K) additions, see how many bits remain until the
threshold - rinse and repeat. And use some reasonable minimum distance
(say, 5% of the capacity), not to do the check too often.

Ok, that works too.
Point is, SBFs help to keep the BF size as small as possible, keep it
below work_mem, and to avoid caring about the total number of distinct
items.

You may want the planner to try and estimate that to figure out
whether it's worth trying BFs or not, but once you decide to try,
SBFs remove any dependency on the accuracy of that estimate.
OK, thanks for reminding me about SBF and for the discussion.
At this point I'll probably focus on the other parts though -
determining selectivity of the join, etc. Which I think is crucial, and
we need to get that right even for accurate estimates. It's good to know
that we have a solution for that part, though.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 22, 2018 at 1:14 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
OK, thanks for reminding me about SBF and for the discussion.
At this point I'll probably focus on the other parts though -
determining selectivity of the join, etc. Which I think is crucial, and
we need to get that right even for accurate estimates. It's good to know
that we have a solution for that part, though.
+1
There are probably significant opportunities to improve the Bloom
filter. That isn't that interesting right now, though. Figuring out
how scalable Bloom filters might save hash join from being reliant on
the accuracy of the initial estimate of set cardinality seems
premature at best, since we haven't established how sensitive this
use-case is to misestimations. My sense is that it's actually
naturally very insensitive, but there is no need to spend too much
time on it just yet.
It just makes sense, as a matter of procedure, to focus on the hash
join code, and drill down from there. Personally, I'm tired of talking
about the nitty-gritty details of Bloom filters rather than the actual
problem at hand.
--
Peter Geoghegan
Hi,
On 2018-02-20 22:23:54 +0100, Tomas Vondra wrote:
So I've decided to revive the old patch, rebase it to current master,
and see if we can resolve the issues that killed it in 2016.
There seems to be some good discussion in the thread. But the patch
arrived just before the last commitfest and certainly isn't a trivial
cleanup patch. Therefore I think it should be moved to the next CF?
Greetings,
Andres Freund
On 03/01/2018 11:01 PM, Andres Freund wrote:
Hi,
On 2018-02-20 22:23:54 +0100, Tomas Vondra wrote:
So I've decided to revive the old patch, rebase it to current master,
and see if we can resolve the issues that killed it in 2016.

There seems to be some good discussion in the thread. But the patch
arrived just before the last commitfest and certainly isn't a trivial
cleanup patch. Therefore I think it should be moved to the next CF?
It isn't a massively invasive patch either, though, so I object to moving
it to 2018-09 right away.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On March 1, 2018 3:22:44 PM PST, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 03/01/2018 11:01 PM, Andres Freund wrote:
Hi,
On 2018-02-20 22:23:54 +0100, Tomas Vondra wrote:
So I've decided to revive the old patch, rebase it to current master,
and see if we can resolve the issues that killed it in 2016.
There seems to be some good discussion in the thread. But the patch
arrived just before the last commitfest and certainly isn't a trivial
cleanup patch. Therefore I think it should be moved to the next CF?

It isn't a massively invasive patch either, though, so I object to moving
it to 2018-09 right away.
Why do we have rules around not submitting large stuff to the last cf, if we just don't follow through? We're neck deep in patches that are older. And you've already gotten a fair bit of feedback..
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On 03/02/2018 12:31 AM, Andres Freund wrote:
On March 1, 2018 3:22:44 PM PST, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 03/01/2018 11:01 PM, Andres Freund wrote:
Hi,
On 2018-02-20 22:23:54 +0100, Tomas Vondra wrote:
So I've decided to revive the old patch, rebase it to current master,
and see if we can resolve the issues that killed it in 2016.
There seems to be some good discussion in the thread. But the patch
arrived just before the last commitfest and certainly isn't a trivial
cleanup patch. Therefore I think it should be moved to the next CF?

It isn't a massively invasive patch either, though, so I object to moving
it to 2018-09 right away.

Why do we have rules around not submitting large stuff to the last
cf, if we just don't follow through? We're neck deep in patches that
are older. And you've already gotten a fair bit of feedback..
It was not my intention to break (or even bend) the CF rules, of course.
I haven't considered the patch to be "large stuff", while you do. I see
Peter Geoghegan agrees with your conclusion on another thread, so go
ahead and move it to 2018-09.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 3/1/18 6:52 PM, Tomas Vondra wrote:
On 03/02/2018 12:31 AM, Andres Freund wrote:
On March 1, 2018 3:22:44 PM PST, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 03/01/2018 11:01 PM, Andres Freund wrote:
Hi,
On 2018-02-20 22:23:54 +0100, Tomas Vondra wrote:
So I've decided to revive the old patch, rebase it to current master,
and see if we can resolve the issues that killed it in 2016.
There seems to be some good discussion in the thread. But the patch
arrived just before the last commitfest and certainly isn't a trivial
cleanup patch. Therefore I think it should be moved to the next CF?

It isn't a massively invasive patch either, though, so I object to moving
it to 2018-09 right away.

Why do we have rules around not submitting large stuff to the last
cf, if we just don't follow through? We're neck deep in patches that
are older. And you've already gotten a fair bit of feedback..

It was not my intention to break (or even bend) the CF rules, of course.
I haven't considered the patch to be "large stuff", while you do. I see
Peter Geoghegan agrees with your conclusion on another thread, so go
ahead and move it to 2018-09.
After reviewing the thread I also agree that this should be pushed to
2018-09, so I have done so.
I'm very excited by this patch, though. In general I agree with Peter
that a higher rate of false positives is acceptable to save memory. I
also don't see any reason why this can't be tuned with a parameter.
Start with a conservative default and allow the user to adjust as desired.
--
-David
david@pgmasters.net
On Thu, Mar 1, 2018 at 4:04 PM, David Steele <david@pgmasters.net> wrote:
On 3/1/18 6:52 PM, Tomas Vondra wrote:
On 03/02/2018 12:31 AM, Andres Freund wrote:
On March 1, 2018 3:22:44 PM PST, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 03/01/2018 11:01 PM, Andres Freund wrote:
Hi,
On 2018-02-20 22:23:54 +0100, Tomas Vondra wrote:
So I've decided to revive the old patch, rebase it to current master,
and see if we can resolve the issues that killed it in 2016.
There seems to be some good discussion in the thread. But the patch
arrived just before the last commitfest and certainly isn't a trivial
cleanup patch. Therefore I think it should be moved to the next CF?

It isn't a massively invasive patch either, though, so I object to moving
it to 2018-09 right away.

Why do we have rules around not submitting large stuff to the last
cf, if we just don't follow through? We're neck deep in patches that
are older. And you've already gotten a fair bit of feedback..

It was not my intention to break (or even bend) the CF rules, of course.
I haven't considered the patch to be "large stuff", while you do. I see
Peter Geoghegan agrees with your conclusion on another thread, so go
ahead and move it to 2018-09.

After reviewing the thread I also agree that this should be pushed to
2018-09, so I have done so.

I'm very excited by this patch, though. In general I agree with Peter that
a higher rate of false positives is acceptable to save memory. I also don't
see any reason why this can't be tuned with a parameter. Start with a
conservative default and allow the user to adjust as desired.

--
-David
david@pgmasters.net
Hi All --
I'm curious what has to be done to move this patch along. I looked
through the patch and I noticed that the work for deciding whether to
instantiate the bloom filter in single-batch mode is not complete yet
(or if it's in this change, I can't find it), contrary to this
comment:
+ * When starting in a single-batch mode, we do nothing initially.
+ * If the whole hash table fits into a single batch, we can get
+ * sufficiently accurate ndistinct estimate by simply counting
+ * occupied buckets (thanks to shooting for NTUP_PER_BUCKET=1),
+ * or perhaps we could use something more elaborate (e.g. HLL).
+ * But we only build the bloom filter if the hash table is large
+ * enough to exceed on-CPU caches (e.g. 4MB).
I did some basic benchmarking with two tables and a simple where
clause filtering 90% of rows and it does yield about a 1.2x speedup in
my tests. In the pathological case where two tables are joined on a
pk/fk relationship with no filtering, the penalty seems to be about
1-2%.
Patrick
On Thu, Mar 01, 2018 at 07:04:41PM -0500, David Steele wrote:
After reviewing the thread I also agree that this should be pushed to
2018-09, so I have done so.

I'm very excited by this patch, though. In general I agree with Peter that
a higher rate of false positives is acceptable to save memory. I also don't
see any reason why this can't be tuned with a parameter. Start with a
conservative default and allow the user to adjust as desired.
Not much has happened since last March. The patch has conflicts in
regression tests. Thomas, you are registered as a reviewer of this
patch. Are you planning to look at it?
This is moved to next CF, waiting on author per the rotten bits.
--
Michael
On 10/01/2018 09:15 AM, Michael Paquier wrote:
On Thu, Mar 01, 2018 at 07:04:41PM -0500, David Steele wrote:
After reviewing the thread I also agree that this should be pushed to
2018-09, so I have done so.

I'm very excited by this patch, though. In general I agree with Peter that
a higher rate of false positives is acceptable to save memory. I also don't
see any reason why this can't be tuned with a parameter. Start with a
conservative default and allow the user to adjust as desired.

Not much has happened since last March. The patch has conflicts in
regression tests. Thomas, you are registered as a reviewer of this
patch. Are you planning to look at it?

This is moved to next CF, waiting on author per the rotten bits.
--
I don't expect to work on this anytime soon, so maybe mark it as
returned with feedback instead of moving it to the next CF.
I think the idea to push down the bloom filter to the other side of the
hash join is what would make it much more interesting than the simple
approach in this patch - but it's also much more work to make it work.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Tomas,
I'm very interested in this patch, and particularly in possible
extensions to push the Bloom filter down on the probe side of the join. I
made a few small edits to the patch to enable it to compile on PG11, and can
send it to you if you're interested.
It is currently in the list of patches for the current commitfest, but
based on your previous post I'm not sure if you're planning to get back to
this patch just now. If you plan to resume work on it, I'll sign up as a
reviewer.
thank you,
/Jim Finnerty
-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL
On Fri, Nov 2, 2018 at 9:23 AM Jim Finnerty <jfinnert@amazon.com> wrote:
I'm very interested in this patch, and particularly in possible
extensions to push the Bloom filter down on the probe side of the join. I
made a few small edits to the patch to enable it to compile on PG11, and can
send it to you if you're interested.
Hi Jim,
Would you compute the hash for the outer tuples in the scan, and then
again in the Hash Join when probing, or would you want to (somehow)
attach the hash to emitted tuples for later reuse by the higher node?
Someone pointed out to me off-list that a popular RDBMS emanating from
the bicycle capital of the North-West pushes down Bloom filters to
scans, but only when the key is a non-nullable integer; I wonder if
that is because they hash in both places, but consider that OK only
when it's really cheap to do so. (Along the same lines, if we could
attach extra data to tuples, I wonder if it would make sense to
transmit sort support information to higher nodes, so that (for
example) GatherMerge could use it to avoid full key comparison when
dealing with subplans that already did a sort and computed integers
for fast inequality checks.)
It is currently in the list of patches for the current commitfest, but
based on your previous post I'm not sure if you're planning to get back to
this patch just now. If you plan to resume work on it, I'll sign up as a
reviewer.
I'm also signed up to review.
--
Thomas Munro
http://www.enterprisedb.com
On 11/01/2018 10:06 PM, Thomas Munro wrote:
On Fri, Nov 2, 2018 at 9:23 AM Jim Finnerty <jfinnert@amazon.com> wrote:
I'm very interested in this patch, and particularly in possible
extensions to push the Bloom filter down on the probe side of the join. I
made a few small edits to the patch to enable it to compile on PG11, and can
send it to you if you're interested.

Hi Jim,
Would you compute the hash for the outer tuples in the scan, and then
again in the Hash Join when probing, or would you want to (somehow)
attach the hash to emitted tuples for later reuse by the higher node?
Someone pointed out to me off-list that a popular RDBMS emanating from
the bicycle capital of the North-West pushes down Bloom filters to
scans, but only when the key is a non-nullable integer; I wonder if
that is because they hash in both places, but consider that OK only
when it's really cheap to do so. (Along the same lines, if we could
attach extra data to tuples, I wonder if it would make sense to
transmit sort support information to higher nodes, so that (for
example) GatherMerge could use it to avoid full key comparison when
dealing with subplans that already did a sort and computed integers
for fast inequality checks.)

It is currently in the list of patches for the current commitfest, but
based on your previous post I'm not sure if you're planning to get back to
this patch just now. If you plan to resume work on it, I'll sign up as a
reviewer.

I'm also signed up to review.
I haven't really planned to work on this anytime soon, unfortunately,
which is why I proposed to mark it as RwF at the end of the last CF. I
already have a couple other patches there, and (more importantly) I
don't have a very clear idea how to implement the filter pushdown.
That being said I still think it'd be an interesting improvement, and if
someone wants to take over I'm available as a reviewer / tester ...
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Nov 1, 2018 at 5:07 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Would you compute the hash for the outer tuples in the scan, and then
again in the Hash Join when probing, or would you want to (somehow)
attach the hash to emitted tuples for later reuse by the higher node?
I'm interested in what Jim has to say, but to me it seems like we
should try to find a way to add a tlist entry for the hash value to
avoid recomputing it. That's likely to require some tricky planner
surgery, but it's probably doable.
What really seems finicky to me about this whole project is the
costing. In the best case it's a huge win; in the worst case it's a
significant loss; and whether it's a gain or a loss is not easy to
figure out from the information that we have available. We generally
do not have an accurate count of the number of distinct values we're
likely to see (which is important).
Worse, when you start to consider pushdown, you realize that the cost
of the scan depends on the bloom filter we push down to it. So
consider something like A IJ B IJ C. It seems like it could be the
case that once we decide to do the A-B join as a hash join with a
bloom filter, it makes sense to also do the join to C as a hash join
and push down the bloom filter, because we'll be able to combine the
two filters and the extra probes will be basically free. But if we
weren't already doing the A-B join with a bloom filter, then maybe the
filter wouldn't make sense for C either.
Maybe I'm worrying over nothing here, or the wrong things, but costing
this well enough to avoid regressions really looks hard.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Nov 1, 2018 at 5:07 PM Thomas Munro
<[hidden email]> wrote:
Would you compute the hash for the outer tuples in the scan, and then
again in the Hash Join when probing, or would you want to (somehow)
attach the hash to emitted tuples for later reuse by the higher node?
I'm still considering your question (it sounds reasonable, but will add
complexity), but I want to mention a couple of things:
One benefit of a pushing a filter to the Scan operation is the ability to
apply its selectivity before certain other expensive operations. Some of
these benefits are storage-engine specific, so for example a column store
can greatly reduce row materialization costs by applying the runtime join
filter before row materialization. A row store that supports batch mode
operators has different benefits of pushing down the filter than a row store
that does not support batching. So we should probably think about pushing
runtime filters down to the Scan operator in the broader context of an API
to a pluggable storage engine [1] that may accept one or more runtime
filters. I have a lot of catching-up to do on the pluggable storage thread
before I can comment on that, but maybe others can offer their opinion.
The runtime join filter might be represented in different ways depending on
the data type and distinct cardinality of the join key(s) of the inner
relation. As Thomas Munro hinted at, there are optimizations for the
integer key case, and in particular if the range of the integer key is
sufficiently small, then you can avoid hashing entirely and just do the
membership test in a bit array using the key itself. In that case the
membership test would be exact, and you don't need the hash to be computed
or passed up to the join operator.
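
A minimal sketch of what such an exact integer filter could look like
(purely illustrative; the names are made up and error handling is omitted):

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct ExactIntFilter
{
    int64_t     min;
    int64_t     max;
    uint8_t    *bitmap;     /* one bit per possible key value */
} ExactIntFilter;

static ExactIntFilter *
exact_filter_create(int64_t min, int64_t max)
{
    ExactIntFilter *filter = malloc(sizeof(ExactIntFilter));
    size_t      nbytes = (size_t) (max - min) / 8 + 1;

    filter->min = min;
    filter->max = max;
    filter->bitmap = calloc(nbytes, 1);
    return filter;
}

/* called for every inner (build side) key */
static void
exact_filter_add(ExactIntFilter *filter, int64_t key)
{
    uint64_t    bit = (uint64_t) (key - filter->min);

    filter->bitmap[bit / 8] |= (1 << (bit % 8));
}

/* exact membership test - "false" means the key cannot possibly match */
static bool
exact_filter_contains(ExactIntFilter *filter, int64_t key)
{
    uint64_t    bit;

    if (key < filter->min || key > filter->max)
        return false;

    bit = (uint64_t) (key - filter->min);
    return (filter->bitmap[bit / 8] & (1 << (bit % 8))) != 0;
}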
re: "What really seems finicky to me about this whole project is the
costing. In the best case it's a huge win; in the worst case it's a
significant loss"
An example of when you'd get pure overhead would be a hash join from the FK
side to the PK side of a RI constraint, with no local predicate on the PK
side. In that case, every row would pass (assuming RI constraints were
properly enforced), and so the filter wouldn't reject any rows from the FK
side.
For this specific case we could anticipate that creating a runtime join
filter wouldn't be helpful, but a robust way to handle this problem in
general is to collect runtime statistics and have the operator shut itself
off if it's not filtering enough rows to justify its overhead. For example,
maintain a count of rows processed and rows filtered, and when the number of
rows processed is above some minimum threshold and the selectivity is higher
than some threshold, then disable it either permanently or temporarily.
Accurate estimation of the benefits of applying the join selectivity during
Scan is going to be storage-engine dependent, but as long as the operation
can turn itself off if it's not filtering enough rows to justify the per-row
overhead on the outer, the risk of the approach can be made very small.
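
A sketch of what that self-disabling check might look like (illustrative
only; the counters and thresholds are made-up assumptions, not anything
from the patch or an existing engine):

#include <stdbool.h>
#include <stdint.h>

typedef struct RuntimeFilterStats
{
    uint64_t    nprocessed;     /* rows checked against the filter */
    uint64_t    npassed;        /* rows that passed (were not rejected) */
    bool        enabled;
} RuntimeFilterStats;

#define FILTER_MIN_ROWS         10000   /* don't judge on a tiny sample */
#define FILTER_MAX_PASS_RATIO   0.90    /* disable if >90% of rows pass */

static bool
runtime_filter_keep_using(RuntimeFilterStats *stats)
{
    if (!stats->enabled)
        return false;

    if (stats->nprocessed >= FILTER_MIN_ROWS &&
        (double) stats->npassed / stats->nprocessed > FILTER_MAX_PASS_RATIO)
        stats->enabled = false; /* filter rejects too little - give up */

    return stats->enabled;
}

The counters could just as well be reset periodically to make the decision
temporary rather than permanent, as suggested above.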
/Jim
[1]: https://www.postgresql-archive.org/Pluggable-storage-tc5916322.html
-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL
On Thu, Nov 1, 2018 at 10:17 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
I haven't really planned to work on this anytime soon, unfortunately,
which is why I proposed to mark it as RwF at the end of the last CF. I
already have a couple other patches there, and (more importantly) I
don't have a very clear idea how to implement the filter pushdown.

That being said I still think it'd be an interesting improvement, and if
someone wants to take over I'm available as a reviewer / tester ...
Since no one expressed any particular desire to take over the patch, I'm
marking it as "Returned with feedback". Although I really like the discussion
that happened here, maybe it's worth moving a part of it to the wiki under the
"possible improvements" flag?