WIP: bloom filter in Hash Joins with batches
Hi,
while working on the Hash Join improvements, I've been repeatedly
running into the idea of bloom filters - various papers on hash joins
mention them as a way to optimize access to the hash table by doing
fewer lookups, etc.
Sadly, I've been unable to actually see any real benefit of using a
bloom filter, which I believe is mostly due to NTUP_PER_BUCKET=1, which
makes the lookups much more efficient (so the room for bloom filter
improvements is narrow).
The one case where a bloom filter might still help is when the bloom
filter fits into L3 cache (a few MBs) while the hash table (or more
accurately the buckets) does not. Then there's a chance that the
bloom filter (which needs to do multiple lookups) might win.
But I think there's another case where bloom filter might be way more
useful in Hash Join - when we do batching. What we do currently is that
we simply
1) build the batches for the hash table (inner relation)
2) read the outer relation (usually the larger one), and split it
into batches just like the hash table
3) while doing (2) we join the first batch, and write the remaining
batches to disk (temporary files)
4) we read the batches one by one (for both tables) and do the join
Now, imagine that only some of the rows in the outer table actually
match a row in the hash table. Currently, we do write those rows into
the temporary file, but with a bloom filter on the whole hash table (all
the batches at once) we can skip that for some types of joins.
For inner join we can immediately discard the outer row, for left join
we can immediately output the row. In both cases we can completely
eliminate the overhead of writing the tuple to the temporary file and
then reading it again.
The attached patch is a PoC of this approach - I'm pretty sure it's not
perfectly correct (e.g. I only tried it with inner join), but it's good
enough for demonstrating the benefits. It's rather incomplete (see the
end of this e-mail), and I'm mostly soliciting some early feedback at
this point.
The numbers presented here are for a test case like this:
CREATE TABLE dim (id INT, dval TEXT);
CREATE TABLE fact (id INT, fval TEXT);
INSERT INTO dim SELECT i, md5(i::text)
FROM generate_series(1,10000000) s(i);
-- repeat 10x
INSERT INTO fact SELECT * FROM dim;
and a query like this
SELECT COUNT(fval) FROM fact JOIN dim USING (id) WHERE dval < 'a';
with different values in the WHERE condition to select a fraction of the
inner 'dim' table - this directly affects what portion of the 'fact'
table has a matching row, and also the size of the hash table (and
number of batches).
Now, some numbers from a machine with 8GB of RAM (the 'fact' table has
~6.5GB of data, so there's actually quite a bit of memory pressure,
forcing the temp files to disk etc.).
With work_mem=16MB, it looks like this:
batches filter select. bloom master bloom/master
-----------------------------------------------------------
4 1 6.25% 23871 48631 49.09%
8 2 12.50% 25752 56692 45.42%
8 3 18.75% 31273 57455 54.43%
16 4 25.01% 37430 62325 60.06%
16 5 31.25% 39005 61143 63.79%
16 6 37.50% 46157 63533 72.65%
16 7 43.75% 53500 65483 81.70%
32 8 49.99% 53952 65730 82.08%
32 9 56.23% 55187 67521 81.73%
32 a 62.49% 64454 69448 92.81%
32 b 68.73% 66937 71692 93.37%
32 c 74.97% 73323 72060 101.75%
32 d 81.23% 76703 73513 104.34%
32 e 87.48% 81970 74890 109.45%
32 f 93.74% 86102 76257 112.91%
'batches' is the number of batches used for the join, 'filter' is the
value used in the WHERE condition, and 'select.' is the fraction of the
'dim' table that matches the condition (and thus also of the 'fact'
table). 'bloom' and 'master' are timings of the query in milliseconds,
and bloom/master compares the runtimes - so for example 49% means the
hash join with a bloom filter was running ~2x as fast.
Admittedly, work_mem=16MB is quite low, but that's just a way to force
batching. What really matters is the number of batches and selectivity
(how many tuples we can eliminate using the bloom filter).
For work_mem=64MB it looks like this:
batches filter select. bloom master bloom/master
-----------------------------------------------------------
1 1 6.25% 24846 23854 104.16%
2 2 12.50% 24369 45672 53.36%
2 3 18.75% 30432 47454 64.13%
4 4 25.01% 36175 59741 60.55%
4 5 31.25% 43103 62631 68.82%
4 6 37.50% 48179 64079 75.19%
So initially it's a bit slower (it's not doing any batching in this
case, but the code is a bit silly and still does some checks even when
not building the bloom filter). But once we start batching, it gets 2x
as fast again, and then slowly degrades as the selectivity increases.
Attached is a spreadsheet with results for various work_mem values, and
also with a smaller data set (just 30M rows in the fact table), which
easily fits into memory. Yet it shows similar gains, shaving off ~40% in
the best case, suggesting that this is not just thanks to reduction of
I/O when forcing the temp files to disk.
As I mentioned, the patch is incomplete in several ways:
1) It does not count the bloom filter (which may be quite big) into
work_mem properly.
2) It probably does not work for outer joins at this point.
3) Currently the bloom filter is used whenever we do batching, but it
should really be driven by selectivity too - it'd be good to (a)
estimate the fraction of 'fact' tuples having a match in the hash
table, and skip the bloom filter if it's over ~60% or so, and
(b) have the code count the matches at runtime, and disable the
bloom filter if we reach some threshold.
But overall, this seems like a nice optimization opportunity for hash
joins on large data sets, where batching is necessary.
Ideas?
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Dec 15, 2015 at 11:30 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
Attached is a spreadsheet with results for various work_mem values, and
also with a smaller data set (just 30M rows in the fact table), which
easily fits into memory. Yet it shows similar gains, shaving off ~40% in
the best case, suggesting that this is not just thanks to reduction of I/O
when forcing the temp files to disk.
A neat idea! Have you possibly tried to also collect statistics about
actual false-positive rates and filter allocation sizes at each of the
collected data points?
--
Alex
On 15 December 2015 at 22:30, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
3) Currently the bloom filter is used whenever we do batching, but it
should really be driven by selectivity too - it'd be good to (a)
estimate the fraction of 'fact' tuples having a match in the hash
table, and not to do bloom if it's over ~60% or so. Also, maybe
the code should count the matches at runtime, and disable the
bloom filter if we reach some threshold.
Cool results.
It seems like a good idea to always build the bloom filter, then
discard it if it would be ineffective.
My understanding is that the bloom filter would be ineffective in any of
these cases
* Hash table is too small
* Bloom filter too large
* Bloom selectivity > 50% - perhaps that can be applied dynamically, so
stop using it if it becomes ineffective
--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 12/17/2015 10:50 AM, Shulgin, Oleksandr wrote:
On Tue, Dec 15, 2015 at 11:30 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Attached is a spreadsheet with results for various work_mem values,
and also with a smaller data set (just 30M rows in the fact table),
which easily fits into memory. Yet it shows similar gains, shaving off
~40% in the best case, suggesting that this is not just thanks to
reduction of I/O when forcing the temp files to disk.

A neat idea! Have you possibly tried to also collect statistics about
actual false-positive rates and filter allocation sizes at each of the
collected data points?
The patch counts and prints the total number of lookups, and the number
of eliminated rows. The bloom filter is currently sized for a 5% false
positive rate, and the numbers I've seen match that.
I think ultimately we'll need to measure the false positive rate, so
that we can use it to dynamically disable the bloom filter if it gets
inefficient. Also maybe put some of that into EXPLAIN ANALYZE.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 12/17/2015 11:44 AM, Simon Riggs wrote:
My understanding is that the bloom filter would be ineffective in any of
these cases
* Hash table is too small
Yes, although it depends what you mean by "too small".
Essentially if we can do with a single batch, then it's cheaper to do a
single lookup in the hash table instead of multiple lookups in the bloom
filter. The bloom filter might still win if it fits into L3 cache, but
that seems rather unlikely.
* Bloom filter too large
Too large with respect to what?
One obvious problem is that the bloom filter is built for all batches at
once, i.e. for all tuples, so it may be so big it won't fit into
work_mem (or takes a significant part of it). Currently it's not
accounted for, but that'll need to change.
* Bloom selectivity > 50% - perhaps that can be applied dynamically,
so stop using it if it becomes ineffective
Yes. I think doing some preliminary selectivity estimation should not be
difficult - that's pretty much what calc_joinrel_size_estimate() already
does.
Doing that dynamically is also possible, but quite tricky. Imagine for
example that the outer relation is sorted - in that case we may get
long sequences of the same value (hash), and all of them will either
have a match in the inner relation, or not. That may easily skew the
counters used for disabling the bloom filter dynamically.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas,
have you seen
/messages/by-id/4B4DD67F.9010506@sigaev.ru
I have a very limited internet connection (no graphics), so I may have
missed something
Oleg
On Wed, Dec 16, 2015 at 4:15 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
Hi,
On 12/20/2015 05:46 AM, Oleg Bartunov wrote:
Tomas,
have you seen
/messages/by-id/4B4DD67F.9010506@sigaev.ru
I have very limited internet connection (no graphics) , so I may miss
something
I haven't seen that, but I don't really see how that's related - your
post is about indexes, mine is about building temporary bloom filters
when executing hash joins.
FWIW, I think bloom filters should be easy to add to BRIN indexes, as
another type of 'summary'. That should address most of the missing
pieces in your implementation (e.g. WAL).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello, Tomas.
Great idea!
Did you consider caching the bloom filter, or at least part(s) of it,
somehow? I think this way we could gain some more TPS. This of course
assumes that creating a bloom filter is really a bottleneck here, which
would be nice to investigate first.
Best regards,
Aleksander
On 17 December 2015 at 16:00, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
On 12/17/2015 11:44 AM, Simon Riggs wrote:
My understanding is that the bloom filter would be ineffective in any of
these cases
* Hash table is too small

Yes, although it depends what you mean by "too small".

Essentially if we can do with a single batch, then it's cheaper to do a
single lookup in the hash table instead of multiple lookups in the
bloom filter. The bloom filter might still win if it fits into L3
cache, but that seems rather unlikely.

* Bloom filter too large

Too large with respect to what?

One obvious problem is that the bloom filter is built for all batches
at once, i.e. for all tuples, so it may be so big it won't fit into
work_mem (or takes a significant part of it). Currently it's not
accounted for, but that'll need to change.
The benefit seems to be related to caching, or at least that memory
speed is critical. If the hash table is too small, or the bloom filter
too large, then there would be no benefit from performing the action
(Lookup Bloom then maybe Lookup Hash) compared with just doing
(Lookup Hash).
So the objective must be to get a Bloom Filter that is small enough that it
lives in a higher/faster level of cache than the main Hash table. Or
possibly that we separate that into a two stage process so that the first
level can be applied by a GPU and then later checked against hash outside
of a GPU.
I think you also need to consider whether we use a hash bloom filter or
just simply apply an additional range predicate. The latter idea is similar
to my earlier thoughts here
/messages/by-id/CA+U5nMLYf2cgbq+YUw-ArLBTcPrqanBf5QiFEC-PBRJCFzOngg@mail.gmail.com
--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 12/24/2015 02:51 PM, Simon Riggs wrote:

On 17 December 2015 at 16:00, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 12/17/2015 11:44 AM, Simon Riggs wrote:

My understanding is that the bloom filter would be ineffective in any
of these cases

* Hash table is too small

Yes, although it depends what you mean by "too small".

Essentially if we can do with a single batch, then it's cheaper to
do a single lookup in the hash table instead of multiple lookups in
the bloom filter. The bloom filter might still win if it fits into
L3 cache, but that seems rather unlikely.

* Bloom filter too large

Too large with respect to what?

One obvious problem is that the bloom filter is built for all
batches at once, i.e. for all tuples, so it may be so big it won't
fit into work_mem (or takes a significant part of it). Currently
it's not accounted for, but that'll need to change.

The benefit seems to be related to caching, or at least that memory
speed is critical. If the hash table is too small, or the bloom
filter too large, then there would be no benefit from performing the
action (Lookup Bloom then maybe Lookup Hash) compared with just
doing (Lookup Hash).

So the objective must be to get a Bloom Filter that is small enough
that it lives in a higher/faster level of cache than the main Hash
table. Or possibly that we separate that into a two stage process so
that the first level can be applied by a GPU and then later checked
against hash outside of a GPU.
I don't think that's quite true.
Had the primary goal been to replace expensive hashtable lookups with
cheap bloom filter checks, then sure - the size of the bloom filter
would be crucial (namely getting the whole bloom filter into L3, which
is an order of magnitude faster compared to RAM).
But that's not what the patch does - it aims to reduce the amount of
data that needs to be serialized/deserialized when batching is
necessary, trading those costs for the bloom overhead. The size of the
bloom filter may still matter, but it's way less important because the
(de)serialization overhead is much more expensive than the hash lookup.
FWIW I've been trying to use the bloom filters in the "traditional" way
(doing "cheap" bloom checks before hashtable lookups), but I've been
unable to get any measurable benefit. Either I was doing it wrong
somehow, or perhaps NTUP_PER_BUCKET=1 made the hashtable too fast (it
essentially means we need a single lookup to see if the entry exists,
while bloom filter needs several lookups, depending on the desired false
positive ratio).
So I concluded that the attempt to use the bloom filter that way is
futile, and that the batching case seems like the only scenario where
the bloom filters may still be useful.
There are a number of additional problems with sizing the bloom filter
to fit into L3:

(1) We have no idea how much L3 there is, although this might be fixed
by adding a GUC, reading it from /proc/cpuinfo, or just using some
reasonable default (say, most Xeon CPUs today have ~20 MB)
(2) The L3 is not per-process, but shared by all processes, so we don't
really have all the L3, but perhaps just L3/N where N is the number
of backends.
(3) The size of the bloom filter depends on the number of items it
should handle and the false positive rate. I've used 5% in the
patch, but maybe it'd make sense to increase this in order to get
a smaller filter?
Also, the bloom filter does not compete with the whole hash table, but
"just" with the "buckets" as that's sufficient to determine if there's a
tuple for a key or not.
Let's do some math for a hash table with 100k rows (so a fairly small
hash table):
* nbuckets = 131072 (2^17), so 1MB
* bloom filter (5% false positive rate): 76kB, k=4 (see [1])
So both the buckets and bloom fit into L3, and the bloom filter is
unlikely to speed things up.
The difference between buckets and bloom filter is about 10x, and both
sizes scale about linearly (with 1M items it'll be ~10MB vs. 760kB), so
there clearly is an area where bloom filter fits into L3 while buckets
don't.
Sadly I haven't been able to exploit that, but I'll try again.
In any case, none of this can help unless the bloom filter really
eliminates many lookups.
[1]: http://hur.st/bloomfilter?n=100000&p=0.05
I think you also need to consider whether we use a hash bloom filter or
just simply apply an additional range predicate. The latter idea is
similar to my earlier thoughts here
/messages/by-id/CA+U5nMLYf2cgbq+YUw-ArLBTcPrqanBf5QiFEC-PBRJCFzOngg@mail.gmail.com
Interesting, but I don't really see how this is related to bloom filters?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 18 December 2015 at 04:34, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
I think ultimately we'll need to measure the false positive rate, so that
we can use it to dynamically disable the bloom filter if it gets
inefficient. Also maybe put some of that into EXPLAIN ANALYZE.
I'm not so convinced that will be a good idea. What if the filter does
not help much to start with, so we disable it because of that, but
then we get some different probe values later in the scan which the
bloom filter would have helped to eliminate earlier?
Maybe it would be better to, once the filter is built, simply count the
number of 1 bits and only use the filter if there's less than
<threshold> 1 bits compared to the size of the filter in bits. There's
functionality in bms_num_members() to do this, and there's also
__builtin_popcount() in newer versions of GCC, which we could have some
wrapper around, perhaps.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 12/28/2015 03:15 AM, David Rowley wrote:
On 18 December 2015 at 04:34, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I think ultimately we'll need to measure the false positive rate, so
that we can use it to dynamically disable the bloom filter if it gets
inefficient. Also maybe put some of that into EXPLAIN ANALYZE.

I'm not so convinced that will be a good idea. What if the filter does
not help much to start with, so we disable it because of that, but
then we get some different probe values later in the scan which the
bloom filter would have helped to eliminate earlier?
Yes, that's true. This might happen quite easily, for example when the
outer table is sorted - in that case what we'll see are (possibly
long) sequences of the same value, which might bias the decision quite
easily.
I don't know how to fix this perfectly (if at all possible). But maybe
we don't really need to. The overall goal is to improve performance,
and this check (disabling the bloom filter) is meant to limit the loss
when the filter returns "true" for most values (making it somewhat
pointless).
The costs however are asymmetric - once we build the hash table (and
the filter), the cost of dropping the bloom filter is constant, as we
simply lose the time we spent building it. But the cost of continuing
to use an inefficient filter is O(N), where N is the number of lookups
(i.e. rows of the outer table). So it's probably better to drop the
filter a bit too early and sacrifice the small amount of time we spent
building it, so that we have a chance of not being much slower than
the current implementation.
That is not to say we can't come up with better solutions - for example
we can count (or estimate) the number of distinct hash values we've
seen, and only make the decision once we see enough of them. For
example we could use HyperLogLog to do that, or simply count the number
of times a value changed (hash value not equal to the previous value).
That should be better than simply waiting for X lookups, but I'm sure
it's still possible to come up with counter-examples.
Maybe it would be better to, once the filter is built, simply count the
number of 1 bits and only use the filter if there's less than
<threshold> 1 bits compared to the size of the filter in bits. There's
functionality in bms_num_members() to do this, and there's also
__builtin_popcount() in newer versions of GCC, which we could have
some wrapper around, perhaps.
I don't really see how this is relevant to the previous point. The
number of 1s in the bitmap can tell you the false positive rate of the
bloom filter, not what fraction of lookups will be answered with
"true".
So while this needs to be watched, so that we stop using the bloom
filter when it gets too full (e.g. due to under-estimate), it's
completely unrelated to the previous issue.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 28 December 2015 at 23:23, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
On 12/28/2015 03:15 AM, David Rowley wrote:
Maybe it would be better to, once the filter is built, simply count the
number of 1 bits and only use the filter if there's less than
<threshold> 1 bits compared to the size of the filter in bits. There's
functionality in bms_num_members() to do this, and there's also
__builtin_popcount() in newer versions of GCC, which we could have some
wrapper around, perhaps.

I don't really see how this is relevant to the previous point. The
number of 1s in the bitmap can tell you the false positive rate of the
bloom filter, not what fraction of lookups will be answered with
"true".

So while this needs to be watched, so that we stop using the bloom
filter when it gets too full (e.g. due to under-estimate), it's
completely unrelated to the previous issue.

Why is it not related? This has got me confused. If a bloom filter has
all of the bits set to 1, then it will never filter any tuples, right?
If so, then a filter with all 1 bits set should be thrown away, as
it'll never help us, and the filter should generally become more
worthwhile as it contains a higher ratio of 0 bits vs 1 bits. Of course
we don't have a count of how many tuples matched each bit, so this is
based on the assumption that each bit matches an equal number of
tuples. Are you saying this is not an assumption that we should make?
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 12/28/2015 11:38 AM, David Rowley wrote:
On 28 December 2015 at 23:23, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 12/28/2015 03:15 AM, David Rowley wrote:

Maybe it would be better to, once the filter is built, simply count
the number of 1 bits and only use the filter if there's less than
<threshold> 1 bits compared to the size of the filter in bits. There's
functionality in bms_num_members() to do this, and there's also
__builtin_popcount() in newer versions of GCC, which we could have
some wrapper around, perhaps.

I don't really see how this is relevant to the previous point. The
number of 1s in the bitmap can tell you the false positive rate of the
bloom filter, not what fraction of lookups will be answered with
"true".

So while this needs to be watched, so that we stop using the bloom
filter when it gets too full (e.g. due to under-estimate), it's
completely unrelated to the previous issue.

Why is it not related? This has got me confused. If a bloom filter has
all of the bits set to 1, then it will never filter any tuples, right?
Because the false positive rate can be computed without having to look
at the lookups. So it's inherently independent of the ordering of the
outer relation, and so on.
If so, then a filter with all 1 bits set should be thrown away, as
it'll never help us, and the filter should generally become more
worthwhile as it contains a higher ratio of 0 bits vs 1 bits. Of
course we don't have a count of how many Tuples matched each bit, so
this is based on the assumption that each bit matches an equal number
of Tuples. Are you saying this is not an assumption that we should
make?
Sure we should check that. All I'm saying is it has nothing to do with
the first problem described in the first part of the e-mail.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 28 December 2015 at 23:44, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
On 12/28/2015 11:38 AM, David Rowley wrote:
If so, then a filter with all 1 bits set should be thrown away, as
it'll never help us, and the filter should generally become more
worthwhile as it contains a higher ratio of 0 bits vs 1 bits. Of
course we don't have a count of how many Tuples matched each bit, so
this is based on the assumption that each bit matches an equal number
of Tuples. Are you saying this is not an assumption that we should
make?

Sure we should check that. All I'm saying is it has nothing to do with the
first problem described in the first part of the e-mail.
Okay. I was merely suggesting this method as an alternative to
tracking and checking the usefulness of the filter during the hash probe. I
assumed that tracking and checking the usefulness during the hash probe
won't be free, and that it may be better if we can estimate or determine
the expected usefulness of the filter before the probe stage, and throw it
away before we waste cycles using it.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 12/28/2015 11:52 AM, David Rowley wrote:
On 28 December 2015 at 23:44, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On 12/28/2015 11:38 AM, David Rowley wrote:
If so, then a filter with all 1 bits set should be thrown away, as
it'll never help us, and the filter should generally become more
worthwhile as it contains a higher ratio of 0 bits vs 1 bits. Of
course we don't have a count of how many Tuples matched each bit, so
this is based on the assumption that each bit matches an equal
number
of Tuples. Are you saying this is not an assumption that we should
make?

Sure we should check that. All I'm saying is it has nothing to do
with the first problem described in the first part of the e-mail.

Okay. I was merely suggesting this method as an alternative to
tracking and checking the usefulness of the filter during the hash
probe. I assumed that tracking and checking the usefulness during the
hash probe won't be free, and that it may be better if we can estimate
or determine the expected usefulness of the filter before the probe
stage, and throw it away before we waste cycles using it.
Consider this example:
CREATE TABLE t (id INT);
INSERT INTO t SELECT i FROM generate_series(1,1000000) s(i);
ANALYZE;
SELECT * FROM t AS t1 JOIN t AS t2 ON (t1.id = t2.id)
WHERE t1.id < 10000 AND t2.id < 10000;
This will be executed like this:
QUERY PLAN
-----------------------------------------------------------------------
Hash Join (cost=17046.26..34008.58 rows=94 width=8)
Hash Cond: (t1.id = t2.id)
-> Seq Scan on t t1 (cost=0.00..16925.00 rows=9701 width=4)
Filter: (id < 10000)
-> Hash (cost=16925.00..16925.00 rows=9701 width=4)
-> Seq Scan on t t2 (cost=0.00..16925.00 rows=9701 width=4)
Filter: (id < 10000)
(7 rows)
But of course the problem is that the two relations are (trivially)
correlated, which means that in reality it works like this:
QUERY PLAN
---------------------------------------------------------
Hash Join (actual rows=9999 loops=1)
Hash Cond: (t1.id = t2.id)
-> Seq Scan on t t1 (actual rows=9999 loops=1)
Filter: (id < 10000)
Rows Removed by Filter: 990001
-> Hash (actual rows=9999 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 480kB
-> Seq Scan on t t2 (actual rows=9999 loops=1)
Filter: (id < 10000)
Rows Removed by Filter: 990001
Planning time: 0.316 ms
Execution time: 176.283 ms
(12 rows)
So while we have very good estimates on the scan nodes, the final
estimate is off - we expect the bloom filter to eliminate ~99% of
rows, while in reality 100% of the rows match (due to the correlation).
And that's even if the bloom filter is "perfect" in the sense that it
has a very low false positive rate, etc.
This example illustrates that such cases can't be really solved before
actually doing the lookups. Which does not make the checks you propose
pointless, but they simply address different cases (and I indeed planned
to implement them).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
attached is v2 of the patch, with a number of improvements:
0) This relies on the other hashjoin patches (delayed build of
buckets and batching), as they allow sizing the bloom filter.
1) enable_hashjoin_bloom GUC
This is mostly meant for debugging and testing, not for committing.
2) Outer joins should be working fine now. That is, the results should
be correct and faster as the outer rows without matches should not
be batched at all.
3) The bloom filter is now built for all hash joins, not just when
batching is happening. I've been a bit skeptical about the
non-batched cases, but it seems that I can get a sizable speedup
(~20-30%, depending on the selectivity of the join).
4) The code is refactored quite a bit, adding BloomFilterData instead
of just sprinkling the fields on HashJoinState or HashJoinTableData.
5) To size the bloom filter, we now use a HyperLogLog counter, which we
now have in core thanks to the sorting improvements done by Peter
Geoghegan. This allows making the bloom filter much smaller when
possible.
The patch also extends the HyperLogLog API a bit (which I'll submit
to the CF independently).
There's a bunch of comments in the code, mostly with ideas about more
possible improvements.
The main piece missing in the patch (IMHO) is optimizer code making
decisions whether to enable bloom filters for the hash join, based on
cardinality estimates. And also the executor code disabling the bloom
filter if they turn inefficient. I don't think that's a major issue at
this point, though, and I think it'll be easier to do based on testing
the current patch.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
attached are results for some basic performance evaluation of the patch
(and also scripts used for the evaluation). I've used a simple join of
two tables ("fact" and "dim"), with three different dataset sizes.
CREATE TABLE dim (id INT, r INT, val TEXT);
CREATE TABLE fact (id INT, val TEXT);
-- 1M rows into "dim"
INSERT INTO dim
SELECT i, mod(i,100), md5(i::text) FROM generate_series(1,1000000) s(i);
-- 10M rows into "fact"
INSERT INTO fact
SELECT mod(i,1000000)+1, md5((mod(i,1000000)+1)::text)
FROM generate_series(1,10000000) s(i);
Which means the "dim.r" column has 100 different values (0-99) with
uniform distribution. So e.g. "WHERE r < 15" matches 15%.
There are three dataset sizes:
1) small: dim 1M rows (73MB), fact 10M rows (650MB)
2) medium: dim: 10M rows (730MB), fact 100M rows (6.5GB)
3) large: dim: 5M rows (365MB), fact 250M rows (16GB)
The machine has 8GB of RAM, so "small" easily fits into RAM, "medium" is
just at the border, and "large" is clearly over (especially when batching).
For each dataset size there are two queries - one with inner and one
with outer join, with a filter on the smaller one ("dim") determining
the "selectivity".
-- inner join
SELECT COUNT(dim.val), COUNT(fact.val)
FROM fact JOIN dim ON (fact.id = dim.id) WHERE r < $1
-- outer join
SELECT COUNT(dim.val), COUNT(fact.val)
FROM fact LEFT JOIN (SELECT * FROM dim WHERE r < $1) dim
ON (fact.id = dim.id)
Those queries were executed with various work_mem sizes (to either use
no batching or different number of batches), and also with different
filter values (to filter to 10%, 20%, 30%, ...., 100% of data).
After collecting data both on master and with all the patches applied,
I've computed the speedup factor as
(time with patches) / (time on master)
and plotted that as a pivot tables with heat map. Values <1.0 (red) mean
"bloom filter made the query faster" while values > 1.0 (blue) means it
slowed the query down.
Let's quickly skim through the results for each dataset size. The
attached charts only show results for the "inner" queries, but the
results for "outer" queries are pretty much exactly the same.
1) small (hash-joins-bloom-1-10.png)
------------------------------------
There's a clear speedup as long as the hash join requires batching and
the selectivity is below 50-60%. In the best case (10% selectivity,
small work_mem and thus multiple batches) we save up to ~30% of time.
Once we get to non-batching mode, the bloom filter is ineffective no
matter the selectivity. I assume this is because for the small
selectivities (e.g. 10%) it's simply cheaper to check the hash table,
which is small and likely fits into L3 on the CPU (which is ~6MB on the
i5-2500 CPU in the system).
2) medium (hash-joins-bloom-10-100.png)
---------------------------------------
The results are similar to the "small" results, except that the speedup
is somewhat larger (up to 40%), and there's no sudden jump when the
queries stop batching (work_mem=1GB is enough to run the query in a
single batch even with 100% selectivity). The "break-even" selectivity
is again somewhere around 50% (a bit lower in non-batching mode).
3) large (hash-joins-bloom-5-250.png)
-------------------------------------
Pretty much the same as "medium".
So, this seems to bring reasonable speedup, as long as the selectivity
is below 50%, and the data set is sufficiently large.
It's however clear that the patch won't work without some sort of
selectivity estimation - either at planning time or at execution time
(or both). I've speculated about how to do that, but the current patches
don't implement any of that yet and any ideas are welcome.
Another open question is sizing the bloom filter in the multi-batch
case, which turns out to be quite tricky. What the patch does right now
is estimating the number of distinct values to store in the filter based
on the first batch (the other hashjoin patch in this CF makes this
possible). That is, there's a hyperloglog counter that gives us (very
accurate) ndistinct estimate for the first batch, and then we do (nbatch
* ndistinct), to estimate the ndistinct values in the whole data set.
That estimate is of course rather naive, and often comes out too high -
for example we might have already seen all the distinct values in the
first batch. So the result is a bloom filter much larger than necessary.
Not sure how to fix this :-(
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
hash-joins-bloom-5-250.png (image/png)
hash-joins-bloom-10-100.png (image/png)
hash-joins-bloom-1-10.png (image/png)
hash-joins-bloom.ods (application/vnd.oasis.opendocument.spreadsheet)
On Sat, Jan 9, 2016 at 11:02 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
So, this seems to bring reasonable speedup, as long as the selectivity is
below 50%, and the data set is sufficiently large.
What about semijoins? Apparently they can use bloom filters
particularly effectively. Have you considered them as a special case?
Also, have you considered Hash join conditions with multiple
attributes as a special case? I'm thinking of cases like this:
regression=# set enable_mergejoin = off;
SET
regression=# explain analyze select * from tenk1 o join tenk2 t on
o.twenty = t.twenty and t.hundred = o.hundred;
QUERY PLAN
──────────────────────────────────────────────────────────────────────
Hash Join (cost=595.00..4103.00 rows=50000 width=488) (actual
time=12.086..1026.194 rows=1000000 loops=1)
Hash Cond: ((o.twenty = t.twenty) AND (o.hundred = t.hundred))
-> Seq Scan on tenk1 o (cost=0.00..458.00 rows=10000 width=244)
(actual time=0.017..4.212 rows=10000 loops=1)
-> Hash (cost=445.00..445.00 rows=10000 width=244) (actual
time=12.023..12.023 rows=10000 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 2824kB
-> Seq Scan on tenk2 t (cost=0.00..445.00 rows=10000
width=244) (actual time=0.006..3.453 rows=10000 loops=1)
Planning time: 0.567 ms
Execution time: 1116.094 ms
(8 rows)
(Note that while the optimizer has a slight preference for a merge
join in this case, the plan I show here is a bit faster on my
machine).
--
Peter Geoghegan
On Sat, Jan 9, 2016 at 4:08 PM, Peter Geoghegan <pg@heroku.com> wrote:
Also, have you considered Hash join conditions with multiple
attributes as a special case? I'm thinking of cases like this:
Sorry, accidentally fat-fingered my enter key before I was finished
drafting that mail. That example isn't useful, because there is no
early/cheap elimination of tuples while scanning the outer relation.
My point about bloom filters on only one attribute (or perhaps
multiple bloom filters on multiple attributes) is that you might be
able to make the bloom filter more effective, particularly relative to
the amount of memory available, by only building it for one attribute
where that distinction (one attribute vs multiple) happens to exist.
Simon said: "So the objective must be to get a Bloom Filter that is
small enough that it lives in a higher/faster level of cache than the
main Hash table". I agree that that's really important here, although
I did note that Tomas wasn't so sure, emphasizing the importance of
avoiding serialization and deserialization overhead -- and certainly,
that's why the patch helps some cases a lot. But Simon's point applies
more to the worst case than the best or average cases, and evidently
we have plenty of worrying about the worst case left to do here.
So, one attribute could have relatively low cardinality, and thus
fewer distinct items in the bloom filter's set, making it far slower
to degrade (i.e. have increased probability of false positives) as
compared to a composite of two or more attributes, and yet perhaps not
significantly less useful in terms of its ability to eliminate
(de)serialization overhead. Or, with two bloom filters (one per
attribute), one bloom filter could degrade far faster than the other
(due to having more distinct values), often very unpredictably, and
yet it wouldn't matter because we'd discard the bad one (maybe really
early, during the scan of the inner relation, using HLL).
I notice that all these example queries involve less-than operators
(inner relation tuples are thereby prevented from being entered into
the hash table). It might not be a terrible idea to hash abbreviated
keys (or even truncated abbreviated keys) for the bloom filter in
certain cases, in particular when there is a correlation in the inner
relation attributes (equi-joined attribute, and attribute that is
other part of qual). There might not be a correlation, of course, but
with some care hashing abbreviated keys may have virtually no
additional overhead, making it worth doing despite the general
uncertainty about any correlation existing. If you're going to accept
that the bloom filter can't be very large, which I think you must,
then hashing an entire value may be no better than hashing an
abbreviated key (those 8 bytes tend to have a lot of entropy). This
point is rather hand-wavy, though.
I suspect that BLOOM_ERROR_RATE isn't the most useful constraint here.
Is the degree to which it helps in sympathetic cases at all
commensurate with the degree to which it hurts in unsympathetic cases?
In your "low selectivity" cases (the sympathetic cases), there are
relatively few values represented in the main hash table, and
therefore in the bloom filter, and so a smaller bloom filter results
automatically anyway. But in the bad cases, the BLOOM_ERROR_RATE
constraint just wastes memory bandwidth for no gain, I think. If the
number of elements is so large that a reasonable fixed sized bloom
filter has a poor false positive rate anyway, we may have already
lost.
My sense is that we should try to make the bloom filter as cheap as
possible more so than trying to make sure it's useful before much work
has been done.
Is it okay that you don't treat MCV/skew values as special? Actually,
looking at this code, I think I notice something:
@@ -238,6 +238,15 @@ ExecHashJoin(HashJoinState *node)
hashvalue);
node->hj_CurTuple = NULL;
+ /* If still in the first batch, we check the bloom filter. */
+ if ((hashtable->curbatch == 0) &&
+ (! ExecHashBloomCheckValue(hashtable, hashvalue)))
+ {
+ /* no matches; check for possible outer-join fill */
+ node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
+ continue;
+ }
Haven't we already established by now that the value of
"node->hj_CurSkewBucketNo" may not be INVALID_SKEW_BUCKET_NO, and so
the value certainly was found in the inner relation scan? Why bother
doing anything with the bloom filter in that common case? (Note that
I'm not suggesting that we don't establish "node->hj_CurSkewBucketNo"
first).
Also, are you aware of this?
http://www.nus.edu.sg/nurop/2010/Proceedings/SoC/NUROP_Congress_Cheng%20Bin.pdf
It talks about bloom filters for hash joins in PostgreSQL
specifically. Interestingly, they talk about specific TPC-H queries.
BTW, I think code like this needs better comments:
+static BloomFilter
+BloomFilterInit(double nrows, double error)
+{
+ /* perhaps we should round nbits to multiples of 8 ? */
+ int nbits = ceil((nrows * log(error)) / log(1.0 / (pow(2.0, log(2.0)))));
+ int nhashes = round(log(2.0) * nbits / nrows);
+
+ BloomFilter filter = palloc0(offsetof(BloomFilterData, data) +
+                              ((nbits + 7) / 8));
+
+ filter->nbits = nbits;
+ filter->nhashes = nhashes;
Have you experimentally verified that you get the expected probability
of false positives in practice?
--
Peter Geoghegan