hashjoins vs. Bloom filters (yet again)

Started by Tomas Vondra2 days ago4 messageshackers
Jump to latest
#1Tomas Vondra
tomas.vondra@2ndquadrant.com

Hi,

A random discussion at pgconf.dev made me revisit one of my ancient
patches, attempting to use Bloom filters to hash joins. I did work on
that twice in the past - first in 2015/6 [1]/messages/by-id/5670946E.8070705@2ndquadrant.com, then in 2018 [2]/messages/by-id/c902844d-837f-5f63-ced3-9f7fd222f175@2ndquadrant.com. So let
me briefly revisit that, before I get to the new patch.

old patches
-----------

Those old patches tried to do a fairly small thing during a hash join,
and that's building a Bloom filter on the inner relation (the one that
gets hashed), and then use that filter before probing the hash table.

The benefits come from Bloom filters being (fairly) cheap, and a
negative answer (hash is not in the filter) may allows us to skip a much
more expensive operation.

The old threads patches focused especially at two hash join cases:

(a) A very selective join, i.e. a significant fraction of outer tuples
does not have a match in the hash table.

(b) A selective hash join forced to do batching because the hash table
is too large, and thus forced to spill outer tuples to temporary files.

For (a), the benefit comes from Bloom filters being much cheaper to
probe than a hash table. The exact cost depends on the implementation,
sizes, etc. We're in the ballpark of 50 vs. 500 cycles, maybe. But if
the filter discards 90% of tuples, it can be a big win.

For (b), the filter (for all the batches at once) allows us to discard
some of the outer tuples without writing them to temporary files. Which
is way more expensive than probing a hash table.

The patches got stuck mostly because deciding if it makes sense to
build/use the Bloom filter is somewhat hard. For cases where 100% of the
tuples have a match it's pointless - it's just pure cost, no benefit.
The regressions are relatively small, though (<10%).

For (b) it's much less sensitive to this kind of issues, of course. The
cost of writing outer tuples to temporary files is much higher than
building/probing a Bloom filter.

Clearly, a filter that discards 99% of tuples is great. And a filter
that keeps 99% of tuples is not great. But where exactly are the
thresholds is not quite clear.

There's also a related question of sizing the filter. Bloom filters are
usually sized by specifying the number of distinct values and the
desired false positive rate. And we could try doing that - pick a
standard false positive rate (e.g. the built-in bloom_filter aims for
1-2%), estimate the ndistinct, and get the size of the Bloom filter.

However, chances are the filter is too big. We can't get work_mem, the
join is already using that for the hash table etc. We can maybe use a
fraction of it, and that may not be enough to fit the "perfect" filter.
We could bail out and not use any Bloom filter at all, but that seems a
bit silly. Maybe we can't fit the 2% filter, but 5% of 10% would be OK?

Surely if the join selectivity is 1% (i.e. it discards 99% tuples), then
using a "worse" Bloom filter with 10% false positives would be a win?
It'd still discard ~89% of tuples.

Yet another angle leading to this kind of questions is inaccurate
ndistinct estimates (and we all know those estimates can be quite
unreliable). Let's say we size the filter for 1M distinct values (and it
just about fits into the memory budget), but then during execution we
find there are 2M distinct values. Well, now we may have ~10% false
positive rate. Or maybe we got 5M, and it's 30%. Or 10M / 50%.

At some point the filter stops being worth it, and we should either not
build it, or we should stop probing it. But when is that?

I think we'd need some sort of cost model to make judgments about this.

Anyway, this was just me summarizing the old threads, and what I think
got them stuck. Most of these questions are still open, although I think
we may be able to solve them better than we could ~10 years ago. We have
extended stats, we know about FK constraints during planning, ...

new patch
---------

Now let's talk about the new experimental/PoC patch that came from the
pgconf.dev discussions. It doesn't really solve the issues I just went
through, it's more of an attempt to take it one step further.

One of the things mentioned in the 2018 thread was the possibility to
push the filter much deeper, instead of using it just in the hash join
node itself. It was merely discussed, but there was no code written, or
anything like that. But it's the thing I decided to take a stab at after
getting back from Vancouver.

Consider a starjoin query

SELECT + FROM f JOIN d1 (f.id1 = d1.id)
JOIN d2 (f.id2 = d2.id)
JOIN d2 (f.id3 = d3.id)
WHERE d1.x = 1
AND d2.y = 2
AND d3.z = 3;

which will be planned using a left-deep plan like this one:

HJ
/ \
D3 HJ
/ \
D2 HJ
/ \
D1 F

With hashes on "D" tables, and a scan on "F". With the "old" patches,
each HJ node would use a Bloom filter internally. But there's an
interesting opportunity to "push down" the filters to the scan on "F",
and evaluate them right there, a bit as if the scan had a local qual.

The attached patch implements a PoC of this, and it's pretty effective.

Of course, it depends on the selectivity of the joins (and thus how many
tuples get discarded by the filters). But because it moves all the
"cheap" filter probes *before* probing any of the hash tables, it has a
multiplication effect for the benefits.

Yes, it still has most of the open issues discussed earlier, and those
will need to be addressed. But this "multiplication" may also make it
somewhat less sensitive to the regressions.

In the example above, if each of the 3 joins has 20% selectivity (i.e.
20% tuples go through), then the total selectivity is ~1%. So the "F"
scan produces only 1/100 of tuples. Maybe we got one of the joins wrong,
and it does not eliminate any tuples? That still means the overall
selectivity is only ~4%.

Of course, this only works for larger joins, and maybe the joins are
correlated in some weird way, etc. Also, what does 4% selectivity mean
for the overall query duration?

Attached is a PDF with results from a simple benchmark using joins like
the one above - fact + 1-3 dimensions. The scripts (in the .tgz) set a
couple GUCs to eliminate variations in the plan. The dimension joins are
independent and match a variable fraction of the fact (1% - 100%).

The columns are for three branches - master, and "patched" with the
push-down disabled and enabled, for joins with 1-3 dimensions.

The last two column groups are comparing the "patched" results to
master. With "off" there's no difference (other than random noise), just
as expected. But with the push-down enabled, there are fairly
significant speedups (up to ~3x). Of course, this is just a benchmark,
practical queries may do other stuff, making the gains smaller. OTOH, it
may also be much better, if there are expensive nodes in between.

The PoC patch is not very big or complex. 280KB seems like a lot, but
like 99% of that is changes in test output, because the patch adds some
info about the Bloom filters to EXPLAIN. The actual .c changes are only
~1000 lines, and a half of that is comments.

The most interesting stuff happens in create_hashjoin_plan(), where we
attempt to push-down the filter to a scan in the outer subtree. If that
succeeds, then ExecInitHashJoin initializes the filter so that the scan
can find it, and Hash builds the filter along with the hash table. And
then the scan nodes probe the pushed-down filter in ExecScanExtended().

There's bunch of boilerplate so that setrefs does the right thing with
expressions, etc. But it's a couple lines here and there. I'm actually
surprised how little code this is.

There's one detail I haven't mentioned yet - there's a simple adaptive
behavior, to deal with filters that are not selective enough. Per some
initial tests there's little benefit when the filter keeps >75% tuples,
and for >90% there were measurable regressions (~50%). This was very
consistent for different data types, etc.

So the patch tracks number of matching tuples per 1000 probes, when it
exceeds 90% it switches to sampling. Only 1% of tuples gets probed in
the filter, and if the fraction drops <80%, all the tuples get probed
again. This is very simple, needs more thought. But for the purpose of
the testing it worked quite well. There still is a small regression
(~3%), which I assume is due to building the filter.

Aside from the issues with deciding if to use a filter at all, sizing
it, etc. - which are still valid (even with the adaptive thing), and
need to be solved, there's one more annoying issue specific to this new
push-down stuff.

Earlier, I mentioned the push-down happens in create_hashjoin_plan().
Which means it happens *after* planning and costing. There are reasons
for that, but it has some unfortunate & annoying consequences.

Ideally, we'd know about the filters when constructing the scan nodes,
so we'd have a chance to estimate how many tuples will be eliminated by
probing the filters (which is about the same thing as estimating the
join sizes). But we can't do that, because our planner works bottom-up.
When constructing the scan nodes we know which tables we'll join with,
but we have no idea which of the join algorithms we'll pick.

We'll consider all three join types, and the scan node has no say which
of those will win. But the Bloom filter push-down is specific to hash
joins. So what should the scan node do? Either it can assume it's under
hash join (and set rows/cost as if there's a Bloom filter), or it can
set costs in a join-agnostic way (like now).

The only "correct" way I can think of dealing with this in the bottom-up
world is having two sets of paths - one set for a hash join, one set for
other joins. But that's not just for scans. We'd need that for all
paths, and for different combinations of joins. For the query with 3
joins, we'd end up with 2^3 combinations. That seems not great.

So I tend to see this as an opportunistic optimization. We do the
planning assuming there's no Bloom filter push-down, and then after the
fact we see if there's an opportunity after all. Which means we may not
pick a plan with hash joins, not realizing it might be made faster.

But in my mind that's somewhat acceptable / defensible.

The bigger issue for me is that it may make the EXPLAIN ANALYZE output
way harder to understand. The estimated "rows" are calculated before the
filter push-down happens, while the actual "rows" are with the filter
probing, of course. But it seems pretty easy to get confused by this,
and think it's just an incorrect estimate.

summary
-------

I like the idea of pushing filters down to the scan nodes (or perhaps
even to some other intermediate nodes). But maybe it's too incompatible
with our bottom-up planning, and the issues with costing and/or EXPLAIN
output may be impossible to solve. I wonder what others think.

Now that I revisited the older threads, I think it probably makes sense
with using Bloom filters in the hash join, at least in the two cases
mentioned in the first section. It doesn't have the issues with
bottom-up planning/costing, because it happens in the hash join. And the
issues with that (deciding what fractions are OK, sizing the filter,
...) apply to both that simpler case, and to the push-down.

regards

[1]: /messages/by-id/5670946E.8070705@2ndquadrant.com

[2]: /messages/by-id/c902844d-837f-5f63-ced3-9f7fd222f175@2ndquadrant.com
/messages/by-id/c902844d-837f-5f63-ced3-9f7fd222f175@2ndquadrant.com

--
Tomas Vondra

Attachments:

hashjoin-bloom-filter.pdfapplication/pdf; name=hashjoin-bloom-filter.pdfDownload
v1-0001-PoC-hashjoin-bloom-filter-pushdown.patchtext/x-patch; charset=UTF-8; name=v1-0001-PoC-hashjoin-bloom-filter-pushdown.patchDownload+2313-228
hash-bloom-test.tgzapplication/x-compressed-tar; name=hash-bloom-test.tgzDownload
#2Andrew Dunstan
andrew@dunslane.net
In reply to: Tomas Vondra (#1)
Re: hashjoins vs. Bloom filters (yet again)

On 2026-05-29 Fr 8:55 PM, Tomas Vondra wrote:

Hi,

A random discussion at pgconf.dev made me revisit one of my ancient
patches, attempting to use Bloom filters to hash joins. I did work on
that twice in the past - first in 2015/6 [1], then in 2018 [2]. So let
me briefly revisit that, before I get to the new patch.

old patches
-----------

Those old patches tried to do a fairly small thing during a hash join,
and that's building a Bloom filter on the inner relation (the one that
gets hashed), and then use that filter before probing the hash table.

The benefits come from Bloom filters being (fairly) cheap, and a
negative answer (hash is not in the filter) may allows us to skip a much
more expensive operation.

The old threads patches focused especially at two hash join cases:

(a) A very selective join, i.e. a significant fraction of outer tuples
does not have a match in the hash table.

(b) A selective hash join forced to do batching because the hash table
is too large, and thus forced to spill outer tuples to temporary files.

For (a), the benefit comes from Bloom filters being much cheaper to
probe than a hash table. The exact cost depends on the implementation,
sizes, etc. We're in the ballpark of 50 vs. 500 cycles, maybe. But if
the filter discards 90% of tuples, it can be a big win.

For (b), the filter (for all the batches at once) allows us to discard
some of the outer tuples without writing them to temporary files. Which
is way more expensive than probing a hash table.

The patches got stuck mostly because deciding if it makes sense to
build/use the Bloom filter is somewhat hard. For cases where 100% of the
tuples have a match it's pointless - it's just pure cost, no benefit.
The regressions are relatively small, though (<10%).

For (b) it's much less sensitive to this kind of issues, of course. The
cost of writing outer tuples to temporary files is much higher than
building/probing a Bloom filter.

Clearly, a filter that discards 99% of tuples is great. And a filter
that keeps 99% of tuples is not great. But where exactly are the
thresholds is not quite clear.

There's also a related question of sizing the filter. Bloom filters are
usually sized by specifying the number of distinct values and the
desired false positive rate. And we could try doing that - pick a
standard false positive rate (e.g. the built-in bloom_filter aims for
1-2%), estimate the ndistinct, and get the size of the Bloom filter.

However, chances are the filter is too big. We can't get work_mem, the
join is already using that for the hash table etc. We can maybe use a
fraction of it, and that may not be enough to fit the "perfect" filter.
We could bail out and not use any Bloom filter at all, but that seems a
bit silly. Maybe we can't fit the 2% filter, but 5% of 10% would be OK?

Surely if the join selectivity is 1% (i.e. it discards 99% tuples), then
using a "worse" Bloom filter with 10% false positives would be a win?
It'd still discard ~89% of tuples.

Yet another angle leading to this kind of questions is inaccurate
ndistinct estimates (and we all know those estimates can be quite
unreliable). Let's say we size the filter for 1M distinct values (and it
just about fits into the memory budget), but then during execution we
find there are 2M distinct values. Well, now we may have ~10% false
positive rate. Or maybe we got 5M, and it's 30%. Or 10M / 50%.

At some point the filter stops being worth it, and we should either not
build it, or we should stop probing it. But when is that?

I think we'd need some sort of cost model to make judgments about this.

Anyway, this was just me summarizing the old threads, and what I think
got them stuck. Most of these questions are still open, although I think
we may be able to solve them better than we could ~10 years ago. We have
extended stats, we know about FK constraints during planning, ...

new patch
---------

Now let's talk about the new experimental/PoC patch that came from the
pgconf.dev discussions. It doesn't really solve the issues I just went
through, it's more of an attempt to take it one step further.

One of the things mentioned in the 2018 thread was the possibility to
push the filter much deeper, instead of using it just in the hash join
node itself. It was merely discussed, but there was no code written, or
anything like that. But it's the thing I decided to take a stab at after
getting back from Vancouver.

Consider a starjoin query

SELECT + FROM f JOIN d1 (f.id1 = d1.id)
JOIN d2 (f.id2 = d2.id)
JOIN d2 (f.id3 = d3.id)
WHERE d1.x = 1
AND d2.y = 2
AND d3.z = 3;

which will be planned using a left-deep plan like this one:

HJ
/ \
D3 HJ
/ \
D2 HJ
/ \
D1 F

With hashes on "D" tables, and a scan on "F". With the "old" patches,
each HJ node would use a Bloom filter internally. But there's an
interesting opportunity to "push down" the filters to the scan on "F",
and evaluate them right there, a bit as if the scan had a local qual.

The attached patch implements a PoC of this, and it's pretty effective.

Of course, it depends on the selectivity of the joins (and thus how many
tuples get discarded by the filters). But because it moves all the
"cheap" filter probes *before* probing any of the hash tables, it has a
multiplication effect for the benefits.

Yes, it still has most of the open issues discussed earlier, and those
will need to be addressed. But this "multiplication" may also make it
somewhat less sensitive to the regressions.

In the example above, if each of the 3 joins has 20% selectivity (i.e.
20% tuples go through), then the total selectivity is ~1%. So the "F"
scan produces only 1/100 of tuples. Maybe we got one of the joins wrong,
and it does not eliminate any tuples? That still means the overall
selectivity is only ~4%.

Of course, this only works for larger joins, and maybe the joins are
correlated in some weird way, etc. Also, what does 4% selectivity mean
for the overall query duration?

Attached is a PDF with results from a simple benchmark using joins like
the one above - fact + 1-3 dimensions. The scripts (in the .tgz) set a
couple GUCs to eliminate variations in the plan. The dimension joins are
independent and match a variable fraction of the fact (1% - 100%).

The columns are for three branches - master, and "patched" with the
push-down disabled and enabled, for joins with 1-3 dimensions.

The last two column groups are comparing the "patched" results to
master. With "off" there's no difference (other than random noise), just
as expected. But with the push-down enabled, there are fairly
significant speedups (up to ~3x). Of course, this is just a benchmark,
practical queries may do other stuff, making the gains smaller. OTOH, it
may also be much better, if there are expensive nodes in between.

The PoC patch is not very big or complex. 280KB seems like a lot, but
like 99% of that is changes in test output, because the patch adds some
info about the Bloom filters to EXPLAIN. The actual .c changes are only
~1000 lines, and a half of that is comments.

The most interesting stuff happens in create_hashjoin_plan(), where we
attempt to push-down the filter to a scan in the outer subtree. If that
succeeds, then ExecInitHashJoin initializes the filter so that the scan
can find it, and Hash builds the filter along with the hash table. And
then the scan nodes probe the pushed-down filter in ExecScanExtended().

There's bunch of boilerplate so that setrefs does the right thing with
expressions, etc. But it's a couple lines here and there. I'm actually
surprised how little code this is.

There's one detail I haven't mentioned yet - there's a simple adaptive
behavior, to deal with filters that are not selective enough. Per some
initial tests there's little benefit when the filter keeps >75% tuples,
and for >90% there were measurable regressions (~50%). This was very
consistent for different data types, etc.

So the patch tracks number of matching tuples per 1000 probes, when it
exceeds 90% it switches to sampling. Only 1% of tuples gets probed in
the filter, and if the fraction drops <80%, all the tuples get probed
again. This is very simple, needs more thought. But for the purpose of
the testing it worked quite well. There still is a small regression
(~3%), which I assume is due to building the filter.

Aside from the issues with deciding if to use a filter at all, sizing
it, etc. - which are still valid (even with the adaptive thing), and
need to be solved, there's one more annoying issue specific to this new
push-down stuff.

Earlier, I mentioned the push-down happens in create_hashjoin_plan().
Which means it happens *after* planning and costing. There are reasons
for that, but it has some unfortunate & annoying consequences.

Ideally, we'd know about the filters when constructing the scan nodes,
so we'd have a chance to estimate how many tuples will be eliminated by
probing the filters (which is about the same thing as estimating the
join sizes). But we can't do that, because our planner works bottom-up.
When constructing the scan nodes we know which tables we'll join with,
but we have no idea which of the join algorithms we'll pick.

We'll consider all three join types, and the scan node has no say which
of those will win. But the Bloom filter push-down is specific to hash
joins. So what should the scan node do? Either it can assume it's under
hash join (and set rows/cost as if there's a Bloom filter), or it can
set costs in a join-agnostic way (like now).

The only "correct" way I can think of dealing with this in the bottom-up
world is having two sets of paths - one set for a hash join, one set for
other joins. But that's not just for scans. We'd need that for all
paths, and for different combinations of joins. For the query with 3
joins, we'd end up with 2^3 combinations. That seems not great.

So I tend to see this as an opportunistic optimization. We do the
planning assuming there's no Bloom filter push-down, and then after the
fact we see if there's an opportunity after all. Which means we may not
pick a plan with hash joins, not realizing it might be made faster.

But in my mind that's somewhat acceptable / defensible.

The bigger issue for me is that it may make the EXPLAIN ANALYZE output
way harder to understand. The estimated "rows" are calculated before the
filter push-down happens, while the actual "rows" are with the filter
probing, of course. But it seems pretty easy to get confused by this,
and think it's just an incorrect estimate.

summary
-------

I like the idea of pushing filters down to the scan nodes (or perhaps
even to some other intermediate nodes). But maybe it's too incompatible
with our bottom-up planning, and the issues with costing and/or EXPLAIN
output may be impossible to solve. I wonder what others think.

Now that I revisited the older threads, I think it probably makes sense
with using Bloom filters in the hash join, at least in the two cases
mentioned in the first section. It doesn't have the issues with
bottom-up planning/costing, because it happens in the hash join. And the
issues with that (deciding what fractions are OK, sizing the filter,
...) apply to both that simpler case, and to the push-down.

Hi, Tomas

This is terrific and very timely from my POV.

I've been experimenting with a table AM (implemented as a
CustomScan scan provider), and bloom-filter pushdown from a hashjoin is one
of the bigger wins available to it: a fact-table scan joined to a filtered
dimension can use the filter to skip whole row groups and avoid
decompressing columns entirely, rather than just rejecting a tuple after
it's been produced. I'd hacked up a private version of this via a new
table-AM callback (the hashjoin walks the outer subtree, builds a filter
from the build side, and hands it to the AM's scan descriptor). Having now
read your PoC, I think your framework is the better foundation, and I'd
rather build on it than carry a parallel mechanism. But two things stand in
the way of a storage-level consumer using it, and I think both are
relatively
small.

1) A CustomScan can't currently be a recipient.

find_bloom_filter_recipient() only recognizes the stock scan tags, and the
probe itself lives in ExecScanExtended(), which a CustomScan never calls
(it dispatches to the provider's ExecCustomScan). The second part is
actually a feature, not a bug: if a CustomScan provider does its own
probing, it can choose the granularity -- per dictionary entry, per row
group, or per row -- instead of being locked into the per-row,
post-materialization probe that the stock nodes get. So all that's needed
on your side is to let the planner attach a filter to a base-relation
CustomScan; the provider takes care of consuming it.

Concretely, that's adding T_CustomScan to the scan-leaf case in
find_bloom_filter_recipient() (CustomScan embeds Scan first, so the
scanrelid test is identical; non-leaf custom nodes have scanrelid == 0 and
fall through to NULL), plus the matching fix_scan_bloom_filters() call in
set_customscan_references(). The provider then calls ExecInitBloomFilters()
in BeginCustomScan and ExecBloomFilters() (or a coarser-grained variant)
inside its scan loop. Everything else -- producer registration, the
es_bloom_producers lookup, the adaptive sampling, EXPLAIN -- is reused
unchanged.

2) The combined-hash filter can't be tested against a single column.

You build one filter keyed on hash32() of all the join keys combined. For a
single-key join that's ideal, and a column store can use it directly: hash
each distinct dictionary value once per row group and skip groups whose
values are all absent. For a multi-column join, though, the combined hash
mixes the keys, so it can only ever be tested per-row (with all key columns
in hand) -- it can't be checked against any one column's dictionary. The
per-row probe is still useful, but the row-group/dictionary skipping, which
is where most of the storage win comes from, isn't available.

The obvious thought is to key a filter per column instead. But I don't
think that should *replace* the combined filter, because per-column filters
are strictly less selective on multi-column joins: they only test whether
each column's value appears *somewhere* in the build side, not whether the
combination does. With build pairs {(1,10),(2,20)}, an outer (1,20) passes
both per-column filters even though it matches no build row, whereas the
combined filter rejects it. So for the row-level probe -- and especially
for plain heap -- the combined filter is the better one, and replacing it
would be a regression.

What I think would actually help is to let the framework *optionally* emit
per-column filters in addition to the combined one, when a recipient
signals it can use them. The combined filter stays the default and does the
precise per-row rejection (unchanged for heap, and usable per-row by a
column store too); the per-column filters are extra, built only on demand,
and let a storage consumer cheaply eliminate whole row groups before the
combined filter does the exact work. The cost is the build CPU and memory
for the extra filters -- but only for consumers that ask, so your design is
untouched when nobody does. For a single-key join the two filters
coincide, so
there'd be no reason to build both.

I'd be happy to work on patches for these.

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com

#3Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andrew Dunstan (#2)
Re: hashjoins vs. Bloom filters (yet again)

On 5/30/26 19:12, Andrew Dunstan wrote:

On 2026-05-29 Fr 8:55 PM, Tomas Vondra wrote:

Hi,

A random discussion at pgconf.dev made me revisit one of my ancient
patches, attempting to use Bloom filters to hash joins. I did work on
that twice in the past - first in 2015/6 [1], then in 2018 [2]. So let
me briefly revisit that, before I get to the new patch.

old patches
-----------

Those old patches tried to do a fairly small thing during a hash join,
and that's building a Bloom filter on the inner relation (the one that
gets hashed), and then use that filter before probing the hash table.

The benefits come from Bloom filters being (fairly) cheap, and a
negative answer (hash is not in the filter) may allows us to skip a much
more expensive operation.

The old threads patches focused especially at two hash join cases:

(a) A very selective join, i.e. a significant fraction of outer tuples
does not have a match in the hash table.

(b) A selective hash join forced to do batching because the hash table
is too large, and thus forced to spill outer tuples to temporary files.

For (a), the benefit comes from Bloom filters being much cheaper to
probe than a hash table. The exact cost depends on the implementation,
sizes, etc. We're in the ballpark of 50 vs. 500 cycles, maybe. But if
the filter discards 90% of tuples, it can be a big win.

For (b), the filter (for all the batches at once) allows us to discard
some of the outer tuples without writing them to temporary files. Which
is way more expensive than probing a hash table.

The patches got stuck mostly because deciding if it makes sense to
build/use the Bloom filter is somewhat hard. For cases where 100% of the
tuples have a match it's pointless - it's just pure cost, no benefit.
The regressions are relatively small, though (<10%).

For (b) it's much less sensitive to this kind of issues, of course. The
cost of writing outer tuples to temporary files is much higher than
building/probing a Bloom filter.

Clearly, a filter that discards 99% of tuples is great. And a filter
that keeps 99% of tuples is not great. But where exactly are the
thresholds is not quite clear.

There's also a related question of sizing the filter. Bloom filters are
usually sized by specifying the number of distinct values and the
desired false positive rate. And we could try doing that - pick a
standard false positive rate (e.g. the built-in bloom_filter aims for
1-2%), estimate the ndistinct, and get the size of the Bloom filter.

However, chances are the filter is too big. We can't get work_mem, the
join is already using that for the hash table etc. We can maybe use a
fraction of it, and that may not be enough to fit the "perfect" filter.
We could bail out and not use any Bloom filter at all, but that seems a
bit silly. Maybe we can't fit the 2% filter, but 5% of 10% would be OK?

Surely if the join selectivity is 1% (i.e. it discards 99% tuples), then
using a "worse" Bloom filter with 10% false positives would be a win?
It'd still discard ~89% of tuples.

Yet another angle leading to this kind of questions is inaccurate
ndistinct estimates (and we all know those estimates can be quite
unreliable). Let's say we size the filter for 1M distinct values (and it
just about fits into the memory budget), but then during execution we
find there are 2M distinct values. Well, now we may have ~10% false
positive rate. Or maybe we got 5M, and it's 30%. Or 10M / 50%.

At some point the filter stops being worth it, and we should either not
build it, or we should stop probing it. But when is that?

I think we'd need some sort of cost model to make judgments about this.

Anyway, this was just me summarizing the old threads, and what I think
got them stuck. Most of these questions are still open, although I think
we may be able to solve them better than we could ~10 years ago. We have
extended stats, we know about FK constraints during planning, ...

new patch
---------

Now let's talk about the new experimental/PoC patch that came from the
pgconf.dev discussions. It doesn't really solve the issues I just went
through, it's more of an attempt to take it one step further.

One of the things mentioned in the 2018 thread was the possibility to
push the filter much deeper, instead of using it just in the hash join
node itself. It was merely discussed, but there was no code written, or
anything like that. But it's the thing I decided to take a stab at after
getting back from Vancouver.

Consider a starjoin query

SELECT + FROM f JOIN d1 (f.id1 = d1.id)
JOIN d2 (f.id2 = d2.id)
JOIN d2 (f.id3 = d3.id)
WHERE d1.x = 1
AND d2.y = 2
AND d3.z = 3;

which will be planned using a left-deep plan like this one:

HJ
/ \
D3 HJ
/ \
D2 HJ
/ \
D1 F

With hashes on "D" tables, and a scan on "F". With the "old" patches,
each HJ node would use a Bloom filter internally. But there's an
interesting opportunity to "push down" the filters to the scan on "F",
and evaluate them right there, a bit as if the scan had a local qual.

The attached patch implements a PoC of this, and it's pretty effective.

Of course, it depends on the selectivity of the joins (and thus how many
tuples get discarded by the filters). But because it moves all the
"cheap" filter probes *before* probing any of the hash tables, it has a
multiplication effect for the benefits.

Yes, it still has most of the open issues discussed earlier, and those
will need to be addressed. But this "multiplication" may also make it
somewhat less sensitive to the regressions.

In the example above, if each of the 3 joins has 20% selectivity (i.e.
20% tuples go through), then the total selectivity is ~1%. So the "F"
scan produces only 1/100 of tuples. Maybe we got one of the joins wrong,
and it does not eliminate any tuples? That still means the overall
selectivity is only ~4%.

Of course, this only works for larger joins, and maybe the joins are
correlated in some weird way, etc. Also, what does 4% selectivity mean
for the overall query duration?

Attached is a PDF with results from a simple benchmark using joins like
the one above - fact + 1-3 dimensions. The scripts (in the .tgz) set a
couple GUCs to eliminate variations in the plan. The dimension joins are
independent and match a variable fraction of the fact (1% - 100%).

The columns are for three branches - master, and "patched" with the
push-down disabled and enabled, for joins with 1-3 dimensions.

The last two column groups are comparing the "patched" results to
master. With "off" there's no difference (other than random noise), just
as expected. But with the push-down enabled, there are fairly
significant speedups (up to ~3x). Of course, this is just a benchmark,
practical queries may do other stuff, making the gains smaller. OTOH, it
may also be much better, if there are expensive nodes in between.

The PoC patch is not very big or complex. 280KB seems like a lot, but
like 99% of that is changes in test output, because the patch adds some
info about the Bloom filters to EXPLAIN. The actual .c changes are only
~1000 lines, and a half of that is comments.

The most interesting stuff happens in create_hashjoin_plan(), where we
attempt to push-down the filter to a scan in the outer subtree. If that
succeeds, then ExecInitHashJoin initializes the filter so that the scan
can find it, and Hash builds the filter along with the hash table. And
then the scan nodes probe the pushed-down filter in ExecScanExtended().

There's bunch of boilerplate so that setrefs does the right thing with
expressions, etc. But it's a couple lines here and there. I'm actually
surprised how little code this is.

There's one detail I haven't mentioned yet - there's a simple adaptive
behavior, to deal with filters that are not selective enough. Per some
initial tests there's little benefit when the filter keeps >75% tuples,
and for >90% there were measurable regressions (~50%). This was very
consistent for different data types, etc.

So the patch tracks number of matching tuples per 1000 probes, when it
exceeds 90% it switches to sampling. Only 1% of tuples gets probed in
the filter, and if the fraction drops <80%, all the tuples get probed
again. This is very simple, needs more thought. But for the purpose of
the testing it worked quite well. There still is a small regression
(~3%), which I assume is due to building the filter.

Aside from the issues with deciding if to use a filter at all, sizing
it, etc. - which are still valid (even with the adaptive thing), and
need to be solved, there's one more annoying issue specific to this new
push-down stuff.

Earlier, I mentioned the push-down happens in create_hashjoin_plan().
Which means it happens *after* planning and costing. There are reasons
for that, but it has some unfortunate & annoying consequences.

Ideally, we'd know about the filters when constructing the scan nodes,
so we'd have a chance to estimate how many tuples will be eliminated by
probing the filters (which is about the same thing as estimating the
join sizes). But we can't do that, because our planner works bottom-up.
When constructing the scan nodes we know which tables we'll join with,
but we have no idea which of the join algorithms we'll pick.

We'll consider all three join types, and the scan node has no say which
of those will win. But the Bloom filter push-down is specific to hash
joins. So what should the scan node do? Either it can assume it's under
hash join (and set rows/cost as if there's a Bloom filter), or it can
set costs in a join-agnostic way (like now).

The only "correct" way I can think of dealing with this in the bottom-up
world is having two sets of paths - one set for a hash join, one set for
other joins. But that's not just for scans. We'd need that for all
paths, and for different combinations of joins. For the query with 3
joins, we'd end up with 2^3 combinations. That seems not great.

So I tend to see this as an opportunistic optimization. We do the
planning assuming there's no Bloom filter push-down, and then after the
fact we see if there's an opportunity after all. Which means we may not
pick a plan with hash joins, not realizing it might be made faster.

But in my mind that's somewhat acceptable / defensible.

The bigger issue for me is that it may make the EXPLAIN ANALYZE output
way harder to understand. The estimated "rows" are calculated before the
filter push-down happens, while the actual "rows" are with the filter
probing, of course. But it seems pretty easy to get confused by this,
and think it's just an incorrect estimate.

summary
-------

I like the idea of pushing filters down to the scan nodes (or perhaps
even to some other intermediate nodes). But maybe it's too incompatible
with our bottom-up planning, and the issues with costing and/or EXPLAIN
output may be impossible to solve. I wonder what others think.

Now that I revisited the older threads, I think it probably makes sense
with using Bloom filters in the hash join, at least in the two cases
mentioned in the first section. It doesn't have the issues with
bottom-up planning/costing, because it happens in the hash join. And the
issues with that (deciding what fractions are OK, sizing the filter,
...) apply to both that simpler case, and to the push-down.

Hi, Tomas

This is terrific and very timely from my POV.

I've been experimenting with a table AM (implemented as a
CustomScan scan provider), and bloom-filter pushdown from a hashjoin is one
of the bigger wins available to it: a fact-table scan joined to a filtered
dimension can use the filter to skip whole row groups and avoid
decompressing columns entirely, rather than just rejecting a tuple after
it's been produced. I'd hacked up a private version of this via a new
table-AM callback (the hashjoin walks the outer subtree, builds a filter
from the build side, and hands it to the AM's scan descriptor). Having now
read your PoC, I think your framework is the better foundation, and I'd
rather build on it than carry a parallel mechanism. But two things stand in
the way of a storage-level consumer using it, and I think both are
relatively
small.

OK, good to hear. I was actually thinking about that use case too, i.e.
making it possible for the scan to do something smart with the filter
(like even pushing it even further down, to "storage"). Or maybe the
ForeignScan could push it to the remote side, so that it's actually
filtered there.

I didn't mention that my message, and there are some difficulties:

1) We only build the hash (and bloom) with a delay, after the scan
already produces some tuples. That complicates the pushdown, whiich may
need to happen when starting the scan. Presumably, we'd need to allow
disabling this optimization, optionally.

2) We'd need some sort of "portable" Bloom filter, with serialization
and deserialization, etc.

Both of these seem rather solvable.

1) A CustomScan can't currently be a recipient.

find_bloom_filter_recipient() only recognizes the stock scan tags, and the
probe itself lives in ExecScanExtended(), which a CustomScan never calls
(it dispatches to the provider's ExecCustomScan). The second part is
actually a feature, not a bug: if a CustomScan provider does its own
probing, it can choose the granularity -- per dictionary entry, per row
group, or per row -- instead of being locked into the per-row,
post-materialization probe that the stock nodes get. So all that's needed
on your side is to let the planner attach a filter to a base-relation
CustomScan; the provider takes care of consuming it.

Concretely, that's adding T_CustomScan to the scan-leaf case in
find_bloom_filter_recipient() (CustomScan embeds Scan first, so the
scanrelid test is identical; non-leaf custom nodes have scanrelid == 0 and
fall through to NULL), plus the matching fix_scan_bloom_filters() call in
set_customscan_references(). The provider then calls ExecInitBloomFilters()
in BeginCustomScan and ExecBloomFilters() (or a coarser-grained variant)
inside its scan loop. Everything else -- producer registration, the
es_bloom_producers lookup, the adaptive sampling, EXPLAIN -- is reused
unchanged.

Yes, that should work and it's a mostly mechanical change.

Maybe we'd want some sort of opt-in, so that the CustomScan can indicate
it can handle Bloom filters. Like, setting
CUSTOMPATH_SUPPORT_BLOOM_FILTERS to flags.

2) The combined-hash filter can't be tested against a single column.

You build one filter keyed on hash32() of all the join keys combined. For a
single-key join that's ideal, and a column store can use it directly: hash
each distinct dictionary value once per row group and skip groups whose
values are all absent. For a multi-column join, though, the combined hash
mixes the keys, so it can only ever be tested per-row (with all key columns
in hand) -- it can't be checked against any one column's dictionary. The
per-row probe is still useful, but the row-group/dictionary skipping, which
is where most of the storage win comes from, isn't available.

The obvious thought is to key a filter per column instead. But I don't
think that should *replace* the combined filter, because per-column filters
are strictly less selective on multi-column joins: they only test whether
each column's value appears *somewhere* in the build side, not whether the
combination does. With build pairs {(1,10),(2,20)}, an outer (1,20) passes
both per-column filters even though it matches no build row, whereas the
combined filter rejects it. So for the row-level probe -- and especially
for plain heap -- the combined filter is the better one, and replacing it
would be a regression.

What I think would actually help is to let the framework *optionally* emit
per-column filters in addition to the combined one, when a recipient
signals it can use them. The combined filter stays the default and does the
precise per-row rejection (unchanged for heap, and usable per-row by a
column store too); the per-column filters are extra, built only on demand,
and let a storage consumer cheaply eliminate whole row groups before the
combined filter does the exact work. The cost is the build CPU and memory
for the extra filters -- but only for consumers that ask, so your design is
untouched when nobody does. For a single-key join the two filters
coincide, so
there'd be no reason to build both.

I think I speculated about this (having per-key filters) in some of the
comments in the patch, although the use case was different. I haven't
thought about TAM, but about different joins where the join keys come
from both sides. Consider a join like

HJ
/ \
A HJ
/ \
B C

where A-(BC) is on (A.x = B.x AND A.y = C.y), so the complete filter
can't be pushed to either side. But we could:

(1) Push the filter on top of the BC join (which in this example is not
really a push-down).

(2) Build filters on (x) and (y) separately, and push-down these.

Or we could do both, really.

I suppose a variation of (2) would work for your use case too, except
we'd push all three filters (x,y), (x) and (y) to the same scan.

I guess this could also be opt-in, enabled by some CUSTOMPATH_ flag.

The question is how efficient can the smaller filters be. The complete
filter can be very selective, while the per-key filters are terrible.

I'd be happy to work on patches for these.

Great. It's and interesting experiment / area to explore.

FWIW I think the main difficulty for this PoC is going to be the
planning/costing stuff, and the impact on EXPLAIN.

regards

--
Tomas Vondra

#4Andrew Dunstan
andrew@dunslane.net
In reply to: Tomas Vondra (#3)
Re: hashjoins vs. Bloom filters (yet again)

On 2026-05-30 Sa 2:14 PM, Tomas Vondra wrote:

Hi, Tomas

This is terrific and very timely from my POV.

I've been experimenting with a table AM (implemented as a
CustomScan scan provider), and bloom-filter pushdown from a hashjoin is one
of the bigger wins available to it: a fact-table scan joined to a filtered
dimension can use the filter to skip whole row groups and avoid
decompressing columns entirely, rather than just rejecting a tuple after
it's been produced. I'd hacked up a private version of this via a new
table-AM callback (the hashjoin walks the outer subtree, builds a filter
from the build side, and hands it to the AM's scan descriptor). Having now
read your PoC, I think your framework is the better foundation, and I'd
rather build on it than carry a parallel mechanism. But two things stand in
the way of a storage-level consumer using it, and I think both are
relatively
small.

OK, good to hear. I was actually thinking about that use case too, i.e.
making it possible for the scan to do something smart with the filter
(like even pushing it even further down, to "storage"). Or maybe the
ForeignScan could push it to the remote side, so that it's actually
filtered there.

I didn't mention that my message, and there are some difficulties:

1) We only build the hash (and bloom) with a delay, after the scan
already produces some tuples. That complicates the pushdown, whiich may
need to happen when starting the scan. Presumably, we'd need to allow
disabling this optimization, optionally.

2) We'd need some sort of "portable" Bloom filter, with serialization
and deserialization, etc.

Both of these seem rather solvable.

1) A CustomScan can't currently be a recipient.

find_bloom_filter_recipient() only recognizes the stock scan tags, and the
probe itself lives in ExecScanExtended(), which a CustomScan never calls
(it dispatches to the provider's ExecCustomScan). The second part is
actually a feature, not a bug: if a CustomScan provider does its own
probing, it can choose the granularity -- per dictionary entry, per row
group, or per row -- instead of being locked into the per-row,
post-materialization probe that the stock nodes get. So all that's needed
on your side is to let the planner attach a filter to a base-relation
CustomScan; the provider takes care of consuming it.

Concretely, that's adding T_CustomScan to the scan-leaf case in
find_bloom_filter_recipient() (CustomScan embeds Scan first, so the
scanrelid test is identical; non-leaf custom nodes have scanrelid == 0 and
fall through to NULL), plus the matching fix_scan_bloom_filters() call in
set_customscan_references(). The provider then calls ExecInitBloomFilters()
in BeginCustomScan and ExecBloomFilters() (or a coarser-grained variant)
inside its scan loop. Everything else -- producer registration, the
es_bloom_producers lookup, the adaptive sampling, EXPLAIN -- is reused
unchanged.

Yes, that should work and it's a mostly mechanical change.

Maybe we'd want some sort of opt-in, so that the CustomScan can indicate
it can handle Bloom filters. Like, setting
CUSTOMPATH_SUPPORT_BLOOM_FILTERS to flags.

2) The combined-hash filter can't be tested against a single column.

You build one filter keyed on hash32() of all the join keys combined. For a
single-key join that's ideal, and a column store can use it directly: hash
each distinct dictionary value once per row group and skip groups whose
values are all absent. For a multi-column join, though, the combined hash
mixes the keys, so it can only ever be tested per-row (with all key columns
in hand) -- it can't be checked against any one column's dictionary. The
per-row probe is still useful, but the row-group/dictionary skipping, which
is where most of the storage win comes from, isn't available.

The obvious thought is to key a filter per column instead. But I don't
think that should *replace* the combined filter, because per-column filters
are strictly less selective on multi-column joins: they only test whether
each column's value appears *somewhere* in the build side, not whether the
combination does. With build pairs {(1,10),(2,20)}, an outer (1,20) passes
both per-column filters even though it matches no build row, whereas the
combined filter rejects it. So for the row-level probe -- and especially
for plain heap -- the combined filter is the better one, and replacing it
would be a regression.

What I think would actually help is to let the framework *optionally* emit
per-column filters in addition to the combined one, when a recipient
signals it can use them. The combined filter stays the default and does the
precise per-row rejection (unchanged for heap, and usable per-row by a
column store too); the per-column filters are extra, built only on demand,
and let a storage consumer cheaply eliminate whole row groups before the
combined filter does the exact work. The cost is the build CPU and memory
for the extra filters -- but only for consumers that ask, so your design is
untouched when nobody does. For a single-key join the two filters
coincide, so
there'd be no reason to build both.

I think I speculated about this (having per-key filters) in some of the
comments in the patch, although the use case was different. I haven't
thought about TAM, but about different joins where the join keys come
from both sides. Consider a join like

HJ
/ \
A HJ
/ \
B C

where A-(BC) is on (A.x = B.x AND A.y = C.y), so the complete filter
can't be pushed to either side. But we could:

(1) Push the filter on top of the BC join (which in this example is not
really a push-down).

(2) Build filters on (x) and (y) separately, and push-down these.

Or we could do both, really.

I suppose a variation of (2) would work for your use case too, except
we'd push all three filters (x,y), (x) and (y) to the same scan.

I guess this could also be opt-in, enabled by some CUSTOMPATH_ flag.

The question is how efficient can the smaller filters be. The complete
filter can be very selective, while the per-key filters are terrible.

I'd be happy to work on patches for these.

Great. It's and interesting experiment / area to explore.

Here are 3 patches (developed using Claude) that sit on top of your POC.

Patch 1 enables the pushdown filters for custom scans. As you say it's
fairly mechanical and is enabled by a CUSTOMPATH_SUPPORT_BLOOM_FILTERS
path flag.

Patch 2 provides for building per-key filters in addition to the
multi-key filter if that flag is set. There may be other cases that
would want it, but this would suit my immediate use case.

Patch 3 provides for eager creation of the filter(s) in such cases,
disabling the optimization you mentioned in point 1 above.

FWIW I think the main difficulty for this PoC is going to be the
planning/costing stuff, and the impact on EXPLAIN.

I haven't dealt with that or other issues you raise, but I think this is
enough for me to begin testing. I have adapted my TAM to it and verified
that it acts as expected. I will start doing some benchmarks.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Attachments:

0001-Allow-a-CustomScan-to-receive-a-pushed-down-hashjoin.patch.texttext/plain; charset=UTF-8; name=0001-Allow-a-CustomScan-to-receive-a-pushed-down-hashjoin.patch.textDownload+41-1
0002-Optionally-build-per-key-hashjoin-bloom-filters-for-.patch.texttext/plain; charset=UTF-8; name=0002-Optionally-build-per-key-hashjoin-bloom-filters-for-.patch.textDownload+97-1
0003-Build-the-hashjoin-bloom-filter-eagerly-for-a-Custom.patc.texttext/plain; charset=UTF-8; name=0003-Build-the-hashjoin-bloom-filter-eagerly-for-a-Custom.patc.textDownload+43-9