hashjoins vs. Bloom filters (yet again)

Started by Tomas Vondra | about 2 months ago | 43 messages | hackers

tomas.vondra@2ndquadrant.com

about 2 months ago

Hi,

A random discussion at pgconf.dev made me revisit one of my ancient
patches, attempting to use Bloom filters in hash joins. I did work on
that twice in the past - first in 2015/6 [1], then in 2018 [2]. So let
me briefly revisit that, before I get to the new patch.

old patches
-----------

Those old patches tried to do a fairly small thing during a hash join,
and that's building a Bloom filter on the inner relation (the one that
gets hashed), and then using that filter before probing the hash table.

The benefits come from Bloom filters being (fairly) cheap, and a
negative answer (hash is not in the filter) may allow us to skip a much
more expensive operation.

The old patches focused especially on two hash join cases:

(a) A very selective join, i.e. a significant fraction of outer tuples
do not have a match in the hash table.

(b) A selective hash join forced to do batching because the hash table
is too large, and thus forced to spill outer tuples to temporary files.

For (a), the benefit comes from Bloom filters being much cheaper to
probe than a hash table. The exact cost depends on the implementation,
sizes, etc. We're in the ballpark of 50 vs. 500 cycles, maybe. But if
the filter discards 90% of tuples, it can be a big win.
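
To put rough numbers on that, here is a minimal back-of-the-envelope
sketch. It takes the ballpark 50 vs. 500 cycles from above as defaults,
ignores the cost of building the filter, and the function names are
made up for illustration.

```python
def cost_per_tuple(discard_fraction, filter_cost=50.0, probe_cost=500.0):
    """Expected cycles per outer tuple with a Bloom filter in front.

    Every tuple pays filter_cost; only the tuples the filter lets
    through (1 - discard_fraction) also pay probe_cost.
    """
    return filter_cost + (1.0 - discard_fraction) * probe_cost


def break_even_discard(filter_cost=50.0, probe_cost=500.0):
    """Discard fraction at which the filter costs as much as it saves."""
    return filter_cost / probe_cost


# Filter discarding 90% of tuples vs. no filter at all (500 cycles).
with_filter = cost_per_tuple(0.9)
# Below this discard fraction the filter is a net loss in this model.
threshold = break_even_discard()
```

With these assumed costs the filter pays for itself once it discards
more than about 10% of the tuples, and at 90% discarded the per-tuple
cost drops to roughly a fifth.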

For (b), the filter (for all the batches at once) allows us to discard
some of the outer tuples without writing them to temporary files. Which
is way more expensive than probing a hash table.

The patches got stuck mostly because deciding if it makes sense to
build/use the Bloom filter is somewhat hard. For cases where 100% of the
tuples have a match it's pointless - it's just pure cost, no benefit.
The regressions are relatively small, though (<10%).

For (b) it's much less sensitive to these kinds of issues, of course. The
cost of writing outer tuples to temporary files is much higher than
building/probing a Bloom filter.

Clearly, a filter that discards 99% of tuples is great. And a filter
that keeps 99% of tuples is not great. But where exactly the thresholds
are is not quite clear.

There's also a related question of sizing the filter. Bloom filters are
usually sized by specifying the number of distinct values and the
desired false positive rate. And we could try doing that - pick a
standard false positive rate (e.g. the built-in bloom_filter aims for
1-2%), estimate the ndistinct, and get the size of the Bloom filter.
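
For reference, a sketch of the textbook sizing formulas (bits
m = -n ln p / (ln 2)^2, hash functions k = (m/n) ln 2). This is the
standard math, not necessarily what the built-in bloom_filter does, and
bloom_size is a hypothetical name.

```python
import math


def bloom_size(ndistinct, fpr):
    """Return (number of bits, number of hash functions) for a filter
    expected to hold ndistinct values at false positive rate fpr."""
    nbits = math.ceil(-ndistinct * math.log(fpr) / (math.log(2) ** 2))
    nhashes = max(1, round(nbits / ndistinct * math.log(2)))
    return nbits, nhashes


# 1M distinct values at a 2% target: a bit over 8 bits per value,
# i.e. roughly 1MB for the whole filter.
nbits, nhashes = bloom_size(1_000_000, 0.02)
```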

However, chances are the filter is too big. We can't take all of
work_mem, the join is already using that for the hash table etc. We can
maybe use a fraction of it, and that may not be enough to fit the
"perfect" filter. We could bail out and not use any Bloom filter at all,
but that seems a bit silly. Maybe we can't fit the 2% filter, but 5% or
10% would be OK?

Surely if the join selectivity is 1% (i.e. it discards 99% of tuples),
then using a "worse" Bloom filter with 10% false positives would be a
win? It'd still discard ~89% of tuples.
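
The ~89% comes from simple arithmetic: matching tuples always pass the
filter, non-matching tuples pass only as false positives. A small
sketch, with a hypothetical function name:

```python
def discard_fraction(join_selectivity, fpr):
    """Fraction of outer tuples rejected by a Bloom filter.

    join_selectivity: fraction of outer tuples that do have a match.
    fpr:              false positive rate of the filter.
    """
    passed = join_selectivity + (1.0 - join_selectivity) * fpr
    return 1.0 - passed


# 1% of tuples match, filter has 10% false positives.
discarded = discard_fraction(0.01, 0.10)
```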

Yet another angle leading to these kinds of questions is inaccurate
ndistinct estimates (and we all know those estimates can be quite
unreliable). Let's say we size the filter for 1M distinct values (and it
just about fits into the memory budget), but then during execution we
find there are 2M distinct values. Well, now we may have ~10% false
positive rate. Or maybe we got 5M, and it's 30%. Or 10M / 50%.
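
How fast the false positive rate degrades can be estimated with the
usual approximation (1 - e^(-k*n/m))^k. The sketch below is only that
textbook formula. The rates quoted above are rough guesses, and what
the formula gives depends on the bits per value and the number of hash
functions, so its output need not match them. actual_fpr is a
hypothetical name.

```python
import math


def actual_fpr(nbits, nhashes, ninserted):
    """Approximate false positive rate of a Bloom filter with nbits
    bits and nhashes hash functions after ninserted distinct values."""
    return (1.0 - math.exp(-nhashes * ninserted / nbits)) ** nhashes


# A filter sized for 1M values at about 2% (assumed: ~8.14 bits per
# value, 6 hash functions), then loaded with more values than planned.
NBITS, NHASHES = 8_142_364, 6
rates = {n: actual_fpr(NBITS, NHASHES, n)
         for n in (1_000_000, 2_000_000, 5_000_000, 10_000_000)}
```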

At some point the filter stops being worth it, and we should either not
build it, or we should stop probing it. But when is that?

I think we'd need some sort of cost model to make judgments about this.

Anyway, this was just me summarizing the old threads, and what I think
got them stuck. Most of these questions are still open, although I think
we may be able to solve them better than we could ~10 years ago. We have
extended stats, we know about FK constraints during planning, ...

new patch
---------

Now let's talk about the new experimental/PoC patch that came from the
pgconf.dev discussions. It doesn't really solve the issues I just went
through, it's more of an attempt to take it one step further.

One of the things mentioned in the 2018 thread was the possibility to
push the filter much deeper, instead of using it just in the hash join
node itself. It was merely discussed, but there was no code written, or
anything like that. But it's the thing I decided to take a stab at after
getting back from Vancouver.

Consider a starjoin query

SELECT * FROM f JOIN d1 ON (f.id1 = d1.id)
                JOIN d2 ON (f.id2 = d2.id)
                JOIN d3 ON (f.id3 = d3.id)
 WHERE d1.x = 1
   AND d2.y = 2
   AND d3.z = 3;

which will be planned using a left-deep plan like this one:

      HJ
     /  \
   D3    HJ
        /  \
      D2    HJ
           /  \
         D1    F

With hashes on "D" tables, and a scan on "F". With the "old" patches,
each HJ node would use a Bloom filter internally. But there's an
interesting opportunity to "push down" the filters to the scan on "F",
and evaluate them right there, a bit as if the scan had a local qual.

The attached patch implements a PoC of this, and it's pretty effective.

Of course, it depends on the selectivity of the joins (and thus how many
tuples get discarded by the filters). But because it moves all the
"cheap" filter probes *before* probing any of the hash tables, it has a
multiplication effect for the benefits.
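
A toy simulation of that effect, counting hash table probes with and
without the push-down. This is only an illustration, not the patch: the
Python sets act as exact stand-ins for both the hash tables and the
Bloom filters (so no false positives), and all names are made up.

```python
def count_hash_probes(fact_rows, dim_keys, pushdown):
    """Count hash table probes for a chain of joins over fact_rows.

    fact_rows: list of tuples, one join key per dimension.
    dim_keys:  list of sets, the keys left in each dimension after
               its WHERE clause.
    pushdown:  if True, check all filters at the scan, before any
               hash table is probed.
    """
    probes = 0
    for row in fact_rows:
        if pushdown and not all(k in keys for k, keys in zip(row, dim_keys)):
            continue  # rejected at the scan, never reaches a join
        for key, keys in zip(row, dim_keys):
            probes += 1  # one hash table probe in this join
            if key not in keys:
                break  # no match, the tuple goes no further
    return probes


# 125 fact rows, three independent joins, each keeping 20% of rows.
rows = [(i % 5, (i // 5) % 5, (i // 25) % 5) for i in range(125)]
dims = [{0}, {0}, {0}]

without_pushdown = count_hash_probes(rows, dims, pushdown=False)  # 155
with_pushdown = count_hash_probes(rows, dims, pushdown=True)      # 3
```

The filter probes at the scan are not free, of course, but they are the
cheap part. The expensive hash table probes drop from 155 to 3 here.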

Yes, it still has most of the open issues discussed earlier, and those
will need to be addressed. But this "multiplication" may also make it
somewhat less sensitive to the regressions.

In the example above, if each of the 3 joins has 20% selectivity (i.e.
20% of tuples go through), then the total selectivity is ~1%. So the "F"
scan produces only 1/100 of tuples. Maybe we got one of the joins wrong,
and it does not eliminate any tuples? That still means the overall
selectivity is only ~4%.
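
The arithmetic behind those two numbers, as a tiny sketch that assumes
the joins are independent (combined_selectivity is a hypothetical
name):

```python
import math


def combined_selectivity(selectivities):
    """Fraction of fact tuples surviving all joins, assuming the
    joins filter independently of each other."""
    return math.prod(selectivities)


all_correct = combined_selectivity([0.2, 0.2, 0.2])  # 0.8%, i.e. ~1%
one_wrong = combined_selectivity([1.0, 0.2, 0.2])    # 4%
```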

Of course, this only works for larger joins, and maybe the joins are
correlated in some weird way, etc. Also, what does 4% selectivity mean
for the overall query duration?

Attached is a PDF with results from a simple benchmark using joins like
the one above - fact + 1-3 dimensions. The scripts (in the .tgz) set a
couple GUCs to eliminate variations in the plan. The dimension joins are
independent and match a variable fraction of the fact (1% - 100%).

The columns are for three branches - master, and "patched" with the
push-down disabled and enabled, for joins with 1-3 dimensions.

The last two column groups are comparing the "patched" results to
master. With "off" there's no difference (other than random noise), just
as expected. But with the push-down enabled, there are fairly
significant speedups (up to ~3x). Of course, this is just a benchmark,
practical queries may do other stuff, making the gains smaller. OTOH, it
may also be much better, if there are expensive nodes in between.

The PoC patch is not very big or complex. 280KB seems like a lot, but
about 99% of that is changes in test output, because the patch adds some
info about the Bloom filters to EXPLAIN. The actual .c changes are only
~1000 lines, and half of that is comments.

The most interesting stuff happens in create_hashjoin_plan(), where we
attempt to push down the filter to a scan in the outer subtree. If that
succeeds, then ExecInitHashJoin initializes the filter so that the scan
can find it, and Hash builds the filter along with the hash table. And
then the scan nodes probe the pushed-down filter in ExecScanExtended().

There's a bunch of boilerplate so that setrefs does the right thing with
expressions, etc. But it's a couple lines here and there. I'm actually
surprised how little code this is.

There's one detail I haven't mentioned yet - there's a simple adaptive
behavior, to deal with filters that are not selective enough. Per some
initial tests there's little benefit when the filter keeps >75% of tuples,
and for >90% there were measurable regressions (~50%). This was very
consistent for different data types, etc.

So the patch tracks the number of matching tuples per 1000 probes, and
when the fraction exceeds 90% it switches to sampling. Then only 1% of
tuples get probed in the filter, and if the fraction drops <80%, all the
tuples get probed again. This is very simple, and needs more thought.
But for the purpose of the testing it worked quite well. There still is
a small regression (~3%), which I assume is due to building the filter.
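
A minimal sketch of that adaptive logic, to make the state machine
explicit. It is not the code from the patch. The thresholds come from
the description above, while the class name, the random sampling and
the choice to count only sampled probes towards the 1000-probe window
are assumptions.

```python
import random


class AdaptiveProbe:
    WINDOW = 1000         # probes per measurement window
    DISABLE_ABOVE = 0.90  # switch to sampling above this pass fraction
    ENABLE_BELOW = 0.80   # probe everything again below this fraction
    SAMPLE_RATE = 0.01    # fraction of tuples probed while sampling

    def __init__(self, might_contain, rng=None):
        self.might_contain = might_contain  # filter probe, returns bool
        self.rng = rng or random.Random(0)
        self.sampling = False
        self.probes = 0
        self.passed = 0

    def keep(self, key):
        """Return False only if the filter was probed and said the
        key is definitely absent."""
        if self.sampling and self.rng.random() >= self.SAMPLE_RATE:
            return True  # not sampled: pass the tuple on unchecked
        hit = self.might_contain(key)
        self.probes += 1
        self.passed += hit
        if self.probes == self.WINDOW:
            fraction = self.passed / self.WINDOW
            if not self.sampling and fraction > self.DISABLE_ABOVE:
                self.sampling = True
            elif self.sampling and fraction < self.ENABLE_BELOW:
                self.sampling = False
            self.probes = self.passed = 0
        return hit
```

Tuples that skip the probe are simply passed on, which is safe because
the hash join still does the exact check. The gap between the 90% and
80% thresholds keeps the filter from flapping between the two modes.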

Aside from the issues with deciding whether to use a filter at all, sizing
it, etc. - which are still valid (even with the adaptive thing), and
need to be solved, there's one more annoying issue specific to this new
push-down stuff.

Earlier, I mentioned the push-down happens in create_hashjoin_plan().
Which means it happens *after* planning and costing. There are reasons
for that, but it has some unfortunate & annoying consequences.

Ideally, we'd know about the filters when constructing the scan nodes,
so we'd have a chance to estimate how many tuples will be eliminated by
probing the filters (which is about the same thing as estimating the
join sizes). But we can't do that, because our planner works bottom-up.
When constructing the scan nodes we know which tables we'll join with,
but we have no idea which of the join algorithms we'll pick.

We'll consider all three join algorithms (nested loop, merge join, hash
join), and the scan node has no say which
of those will win. But the Bloom filter push-down is specific to hash
joins. So what should the scan node do? Either it can assume it's under
hash join (and set rows/cost as if there's a Bloom filter), or it can
set costs in a join-agnostic way (like now).

The only "correct" way I can think of dealing with this in the bottom-up
world is having two sets of paths - one set for a hash join, one set for
other joins. But that's not just for scans. We'd need that for all
paths, and for different combinations of joins. For the query with 3
joins, we'd end up with 2^3 combinations. That seems not great.

So I tend to see this as an opportunistic optimization. We do the
planning assuming there's no Bloom filter push-down, and then after the
fact we see if there's an opportunity after all. Which means we may not
pick a plan with hash joins, not realizing it might be made faster.

But in my mind that's somewhat acceptable / defensible.

The bigger issue for me is that it may make the EXPLAIN ANALYZE output
way harder to understand. The estimated "rows" are calculated before the
filter push-down happens, while the actual "rows" are with the filter
probing, of course. But it seems pretty easy to get confused by this,
and think it's just an incorrect estimate.

summary
-------

I like the idea of pushing filters down to the scan nodes (or perhaps
even to some other intermediate nodes). But maybe it's too incompatible
with our bottom-up planning, and the issues with costing and/or EXPLAIN
output may be impossible to solve. I wonder what others think.

Now that I revisited the older threads, I think it probably makes sense
to use Bloom filters in the hash join itself, at least in the two cases
mentioned in the first section. It doesn't have the issues with
bottom-up planning/costing, because it happens in the hash join. And the
issues with that (deciding what fractions are OK, sizing the filter,
...) apply to both that simpler case, and to the push-down.

regards

[1]: /messages/by-id/5670946E.8070705@2ndquadrant.com

[2]: /messages/by-id/c902844d-837f-5f63-ced3-9f7fd222f175@2ndquadrant.com

--
Tomas Vondra

Andrew Dunstan

andrew@dunslane.net

about 2 months ago

In reply to: Tomas Vondra (#1)

Re: hashjoins vs. Bloom filters (yet again)

On 2026-05-29 Fr 8:55 PM, Tomas Vondra wrote:

[...]

Hi, Tomas

This is terrific and very timely from my POV.

I've been experimenting with a table AM (implemented as a
CustomScan scan provider), and bloom-filter pushdown from a hashjoin is one
of the bigger wins available to it: a fact-table scan joined to a filtered
dimension can use the filter to skip whole row groups and avoid
decompressing columns entirely, rather than just rejecting a tuple after
it's been produced. I'd hacked up a private version of this via a new
table-AM callback (the hashjoin walks the outer subtree, builds a filter
from the build side, and hands it to the AM's scan descriptor). Having now
read your PoC, I think your framework is the better foundation, and I'd
rather build on it than carry a parallel mechanism. But two things stand in
the way of a storage-level consumer using it, and I think both are
relatively small.

1) A CustomScan can't currently be a recipient.

find_bloom_filter_recipient() only recognizes the stock scan tags, and the
probe itself lives in ExecScanExtended(), which a CustomScan never calls
(it dispatches to the provider's ExecCustomScan). The second part is
actually a feature, not a bug: if a CustomScan provider does its own
probing, it can choose the granularity -- per dictionary entry, per row
group, or per row -- instead of being locked into the per-row,
post-materialization probe that the stock nodes get. So all that's needed
on your side is to let the planner attach a filter to a base-relation
CustomScan; the provider takes care of consuming it.

Concretely, that's adding T_CustomScan to the scan-leaf case in
find_bloom_filter_recipient() (CustomScan embeds Scan first, so the
scanrelid test is identical; non-leaf custom nodes have scanrelid == 0 and
fall through to NULL), plus the matching fix_scan_bloom_filters() call in
set_customscan_references(). The provider then calls ExecInitBloomFilters()
in BeginCustomScan and ExecBloomFilters() (or a coarser-grained variant)
inside its scan loop. Everything else -- producer registration, the
es_bloom_producers lookup, the adaptive sampling, EXPLAIN -- is reused
unchanged.

2) The combined-hash filter can't be tested against a single column.

You build one filter keyed on hash32() of all the join keys combined. For a
single-key join that's ideal, and a column store can use it directly: hash
each distinct dictionary value once per row group and skip groups whose
values are all absent. For a multi-column join, though, the combined hash
mixes the keys, so it can only ever be tested per-row (with all key columns
in hand) -- it can't be checked against any one column's dictionary. The
per-row probe is still useful, but the row-group/dictionary skipping, which
is where most of the storage win comes from, isn't available.
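
To illustrate the single-key case, here is a rough sketch of
dictionary-based row-group skipping. It is not code from either patch:
the row-group layout, the function name and the use of a plain Python
set in place of the Bloom filter are all assumptions.

```python
def scan_row_groups(row_groups, might_contain):
    """Yield rows from row groups that may hold a join match.

    row_groups:    list of (dictionary, rows) pairs, where dictionary
                   is the set of distinct join-key values in the group
                   and rows is a list of (key, payload) tuples.
    might_contain: single-key filter probe, False means the key is
                   definitely not on the build side.
    """
    for dictionary, rows in row_groups:
        # Probe each distinct value once per group, not once per row.
        present = {v for v in dictionary if might_contain(v)}
        if not present:
            continue  # whole group skipped, its rows are never touched
        for key, payload in rows:
            if key in present:
                yield key, payload


groups = [
    ({1, 2}, [(1, "a"), (2, "b"), (1, "c")]),
    ({3}, [(3, "d")]),
    ({2, 4}, [(4, "e"), (2, "f")]),
]
build_keys = {2}  # exact stand-in for the Bloom filter
matches = list(scan_row_groups(groups, build_keys.__contains__))
```

The second group is skipped after a single probe of its dictionary. In
a real column store that is the point where decompression is avoided.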

The obvious thought is to key a filter per column instead. But I don't
think that should *replace* the combined filter, because per-column filters
are strictly less selective on multi-column joins: they only test whether
each column's value appears *somewhere* in the build side, not whether the
combination does. With build pairs {(1,10),(2,20)}, an outer (1,20) passes
both per-column filters even though it matches no build row, whereas the
combined filter rejects it. So for the row-level probe -- and especially
for plain heap -- the combined filter is the better one, and replacing it
would be a regression.
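
The same example as a small sketch, with plain sets standing in for the
filters (so there are no false positives, which only makes the
per-column case look better than it would be). The names are made up
for illustration.

```python
build = [(1, 10), (2, 20)]

# One filter keyed on both join columns together.
combined = set(build)

# One filter per join column, each knowing only its own values.
per_column = [{a for a, _ in build}, {b for _, b in build}]


def passes_combined(row):
    """True if the key combination appears on the build side."""
    return row in combined


def passes_per_column(row):
    """True if every column value appears somewhere on the build side."""
    return all(value in seen for value, seen in zip(row, per_column))


outer = (1, 20)
per_column_result = passes_per_column(outer)  # True: 1 and 20 both exist
combined_result = passes_combined(outer)      # False: no such build row
```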

What I think would actually help is to let the framework *optionally* emit
per-column filters in addition to the combined one, when a recipient
signals it can use them. The combined filter stays the default and does the
precise per-row rejection (unchanged for heap, and usable per-row by a
column store too); the per-column filters are extra, built only on demand,
and let a storage consumer cheaply eliminate whole row groups before the
combined filter does the exact work. The cost is the build CPU and memory
for the extra filters -- but only for consumers that ask, so your design is
untouched when nobody does. For a single-key join the two filters
coincide, so there'd be no reason to build both.

I'd be happy to work on patches for these.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 2 months ago

In reply to: Andrew Dunstan (#2)

Re: hashjoins vs. Bloom filters (yet again)

On 5/30/26 19:12, Andrew Dunstan wrote:

On 2026-05-29 Fr 8:55 PM, Tomas Vondra wrote:

Hi,

A random discussion at pgconf.dev made me revisit one of my ancient
patches, attempting to use Bloom filters to hash joins. I did work on
that twice in the past - first in 2015/6 [1], then in 2018 [2]. So let
me briefly revisit that, before I get to the new patch.

old patches
-----------

Those old patches tried to do a fairly small thing during a hash join,
and that's building a Bloom filter on the inner relation (the one that
gets hashed), and then use that filter before probing the hash table.

The benefits come from Bloom filters being (fairly) cheap, and a
negative answer (hash is not in the filter) may allows us to skip a much
more expensive operation.

The old threads patches focused especially at two hash join cases:

(a) A very selective join, i.e. a significant fraction of outer tuples
does not have a match in the hash table.

(b) A selective hash join forced to do batching because the hash table
is too large, and thus forced to spill outer tuples to temporary files.

For (a), the benefit comes from Bloom filters being much cheaper to
probe than a hash table. The exact cost depends on the implementation,
sizes, etc. We're in the ballpark of 50 vs. 500 cycles, maybe. But if
the filter discards 90% of tuples, it can be a big win.

For (b), the filter (for all the batches at once) allows us to discard
some of the outer tuples without writing them to temporary files. Which
is way more expensive than probing a hash table.

The patches got stuck mostly because deciding if it makes sense to
build/use the Bloom filter is somewhat hard. For cases where 100% of the
tuples have a match it's pointless - it's just pure cost, no benefit.
The regressions are relatively small, though (<10%).

For (b) it's much less sensitive to this kind of issues, of course. The
cost of writing outer tuples to temporary files is much higher than
building/probing a Bloom filter.

Clearly, a filter that discards 99% of tuples is great. And a filter
that keeps 99% of tuples is not great. But where exactly are the
thresholds is not quite clear.

There's also a related question of sizing the filter. Bloom filters are
usually sized by specifying the number of distinct values and the
desired false positive rate. And we could try doing that - pick a
standard false positive rate (e.g. the built-in bloom_filter aims for
1-2%), estimate the ndistinct, and get the size of the Bloom filter.

However, chances are the filter is too big. We can't get work_mem, the
join is already using that for the hash table etc. We can maybe use a
fraction of it, and that may not be enough to fit the "perfect" filter.
We could bail out and not use any Bloom filter at all, but that seems a
bit silly. Maybe we can't fit the 2% filter, but 5% of 10% would be OK?

Surely if the join selectivity is 1% (i.e. it discards 99% tuples), then
using a "worse" Bloom filter with 10% false positives would be a win?
It'd still discard ~89% of tuples.

Yet another angle leading to this kind of questions is inaccurate
ndistinct estimates (and we all know those estimates can be quite
unreliable). Let's say we size the filter for 1M distinct values (and it
just about fits into the memory budget), but then during execution we
find there are 2M distinct values. Well, now we may have ~10% false
positive rate. Or maybe we got 5M, and it's 30%. Or 10M / 50%.

At some point the filter stops being worth it, and we should either not
build it, or we should stop probing it. But when is that?

I think we'd need some sort of cost model to make judgments about this.

Anyway, this was just me summarizing the old threads, and what I think
got them stuck. Most of these questions are still open, although I think
we may be able to solve them better than we could ~10 years ago. We have
extended stats, we know about FK constraints during planning, ...

new patch
---------

Now let's talk about the new experimental/PoC patch that came from the
pgconf.dev discussions. It doesn't really solve the issues I just went
through, it's more of an attempt to take it one step further.

One of the things mentioned in the 2018 thread was the possibility to
push the filter much deeper, instead of using it just in the hash join
node itself. It was merely discussed, but there was no code written, or
anything like that. But it's the thing I decided to take a stab at after
getting back from Vancouver.

Consider a starjoin query

SELECT + FROM f JOIN d1 (f.id1 = d1.id)
JOIN d2 (f.id2 = d2.id)
JOIN d2 (f.id3 = d3.id)
WHERE d1.x = 1
AND d2.y = 2
AND d3.z = 3;

which will be planned using a left-deep plan like this one:

            HJ
           /  \
         D3    HJ
              /  \
            D2    HJ
                 /  \
               D1    F

With hashes on "D" tables, and a scan on "F". With the "old" patches,
each HJ node would use a Bloom filter internally. But there's an
interesting opportunity to "push down" the filters to the scan on "F",
and evaluate them right there, a bit as if the scan had a local qual.

The attached patch implements a PoC of this, and it's pretty effective.

Of course, it depends on the selectivity of the joins (and thus how many
tuples get discarded by the filters). But because it moves all the
"cheap" filter probes *before* probing any of the hash tables, it has a
multiplication effect for the benefits.

Yes, it still has most of the open issues discussed earlier, and those
will need to be addressed. But this "multiplication" may also make it
somewhat less sensitive to the regressions.

In the example above, if each of the 3 joins has 20% selectivity (i.e.
20% tuples go through), then the total selectivity is ~1%. So the "F"
scan produces only 1/100 of tuples. Maybe we got one of the joins wrong,
and it does not eliminate any tuples? That still means the overall
selectivity is only ~4%.
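
The arithmetic behind those two numbers, as a small sketch that assumes the joins are independent (the function name is made up for the example):

```python
from math import prod

def combined_selectivity(selectivities):
    """Fraction of fact tuples that survive all pushed-down filters,
    assuming the joins are independent of each other."""
    return prod(selectivities)

all_selective = combined_selectivity([0.2, 0.2, 0.2])  # about 0.008, i.e. ~1%
one_useless = combined_selectivity([0.2, 0.2, 1.0])    # about 0.04, i.e. ~4%
```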

Of course, this only works for larger joins, and maybe the joins are
correlated in some weird way, etc. Also, what does 4% selectivity mean
for the overall query duration?

Attached is a PDF with results from a simple benchmark using joins like
the one above - fact + 1-3 dimensions. The scripts (in the .tgz) set a
couple GUCs to eliminate variations in the plan. The dimension joins are
independent and match a variable fraction of the fact (1% - 100%).

The columns are for three branches - master, and "patched" with the
push-down disabled and enabled, for joins with 1-3 dimensions.

The last two column groups are comparing the "patched" results to
master. With "off" there's no difference (other than random noise), just
as expected. But with the push-down enabled, there are fairly
significant speedups (up to ~3x). Of course, this is just a benchmark,
practical queries may do other stuff, making the gains smaller. OTOH, it
may also be much better, if there are expensive nodes in between.

The PoC patch is not very big or complex. 280KB seems like a lot, but
like 99% of that is changes in test output, because the patch adds some
info about the Bloom filters to EXPLAIN. The actual .c changes are only
~1000 lines, and a half of that is comments.

The most interesting stuff happens in create_hashjoin_plan(), where we
attempt to push-down the filter to a scan in the outer subtree. If that
succeeds, then ExecInitHashJoin initializes the filter so that the scan
can find it, and Hash builds the filter along with the hash table. And
then the scan nodes probe the pushed-down filter in ExecScanExtended().

There's a bunch of boilerplate so that setrefs does the right thing with
expressions, etc. But it's a couple lines here and there. I'm actually
surprised how little code this is.

There's one detail I haven't mentioned yet - there's a simple adaptive
behavior, to deal with filters that are not selective enough. Per some
initial tests there's little benefit when the filter keeps >75% tuples,
and for >90% there were measurable regressions (~50%). This was very
consistent for different data types, etc.

So the patch tracks the number of matching tuples per 1000 probes, and when
it exceeds 90% it switches to sampling. Only 1% of tuples get probed in
the filter, and if the fraction drops <80%, all the tuples get probed
again. This is very simple, needs more thought. But for the purpose of
the testing it worked quite well. There still is a small regression
(~3%), which I assume is due to building the filter.
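
The switching logic described above can be sketched roughly like this. It is not the code from the patch: the class and attribute names are invented, and the details (stats evaluated in windows of 1000 probes, sampling every 100th tuple) are assumptions made to get a runnable example.

```python
class AdaptiveProbe:
    """Sketch of the adaptive on/off switch for a pushed-down filter."""

    WINDOW = 1000          # probes per evaluation window (assumption)
    DISABLE_ABOVE = 0.90   # pass fraction above which we switch to sampling
    ENABLE_BELOW = 0.80    # pass fraction below which we probe everything again
    SAMPLE_EVERY = 100     # while sampling, probe 1 tuple in 100 (i.e. 1%)

    def __init__(self, filter_contains):
        self.filter_contains = filter_contains  # callable: key -> bool
        self.sampling = False
        self.seen = 0      # tuples seen so far
        self.probes = 0    # probes in the current window
        self.passed = 0    # probes that passed the filter in the current window

    def keep(self, key):
        """Return False only if the filter was probed and rejected the key."""
        self.seen += 1
        if self.sampling and self.seen % self.SAMPLE_EVERY != 0:
            return True    # not sampled: let the tuple through unchecked
        hit = self.filter_contains(key)
        self.probes += 1
        self.passed += hit
        if self.probes == self.WINDOW:
            fraction = self.passed / self.probes
            if not self.sampling and fraction > self.DISABLE_ABOVE:
                self.sampling = True
            elif self.sampling and fraction < self.ENABLE_BELOW:
                self.sampling = False
            self.probes = self.passed = 0
        return hit
```

Letting unsampled tuples through is always safe, because the hash join above the scan still does the exact check; the filter only ever removes tuples early.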

Aside from the issues with deciding whether to use a filter at all, sizing
it, etc. - which are still valid (even with the adaptive thing), and
need to be solved, there's one more annoying issue specific to this new
push-down stuff.

Earlier, I mentioned the push-down happens in create_hashjoin_plan().
Which means it happens *after* planning and costing. There are reasons
for that, but it has some unfortunate & annoying consequences.

Ideally, we'd know about the filters when constructing the scan nodes,
so we'd have a chance to estimate how many tuples will be eliminated by
probing the filters (which is about the same thing as estimating the
join sizes). But we can't do that, because our planner works bottom-up.
When constructing the scan nodes we know which tables we'll join with,
but we have no idea which of the join algorithms we'll pick.

We'll consider all three join types, and the scan node has no say which
of those will win. But the Bloom filter push-down is specific to hash
joins. So what should the scan node do? Either it can assume it's under
hash join (and set rows/cost as if there's a Bloom filter), or it can
set costs in a join-agnostic way (like now).

The only "correct" way I can think of dealing with this in the bottom-up
world is having two sets of paths - one set for a hash join, one set for
other joins. But that's not just for scans. We'd need that for all
paths, and for different combinations of joins. For the query with 3
joins, we'd end up with 2^3 combinations. That seems not great.

So I tend to see this as an opportunistic optimization. We do the
planning assuming there's no Bloom filter push-down, and then after the
fact we see if there's an opportunity after all. Which means we may not
pick a plan with hash joins, not realizing it might be made faster.

But in my mind that's somewhat acceptable / defensible.

The bigger issue for me is that it may make the EXPLAIN ANALYZE output
way harder to understand. The estimated "rows" are calculated before the
filter push-down happens, while the actual "rows" are with the filter
probing, of course. But it seems pretty easy to get confused by this,
and think it's just an incorrect estimate.

summary
-------

I like the idea of pushing filters down to the scan nodes (or perhaps
even to some other intermediate nodes). But maybe it's too incompatible
with our bottom-up planning, and the issues with costing and/or EXPLAIN
output may be impossible to solve. I wonder what others think.

Now that I revisited the older threads, I think it probably makes sense
to use Bloom filters in the hash join, at least in the two cases
mentioned in the first section. It doesn't have the issues with
bottom-up planning/costing, because it happens in the hash join. And the
issues with that (deciding what fractions are OK, sizing the filter,
...) apply to both that simpler case, and to the push-down.

Hi, Tomas

This is terrific and very timely from my POV.

I've been experimenting with a table AM (implemented as a
CustomScan scan provider), and bloom-filter pushdown from a hashjoin is one
of the bigger wins available to it: a fact-table scan joined to a filtered
dimension can use the filter to skip whole row groups and avoid
decompressing columns entirely, rather than just rejecting a tuple after
it's been produced. I'd hacked up a private version of this via a new
table-AM callback (the hashjoin walks the outer subtree, builds a filter
from the build side, and hands it to the AM's scan descriptor). Having now
read your PoC, I think your framework is the better foundation, and I'd
rather build on it than carry a parallel mechanism. But two things stand in
the way of a storage-level consumer using it, and I think both are
relatively small.

OK, good to hear. I was actually thinking about that use case too, i.e.
making it possible for the scan to do something smart with the filter
(like pushing it even further down, to "storage"). Or maybe the
ForeignScan could push it to the remote side, so that it's actually
filtered there.

I didn't mention that in my message, and there are some difficulties:

1) We only build the hash (and bloom) with a delay, after the scan
already produces some tuples. That complicates the pushdown, which may
need to happen when starting the scan. Presumably, we'd need to allow
disabling this optimization, optionally.

2) We'd need some sort of "portable" Bloom filter, with serialization
and deserialization, etc.

Both of these seem rather solvable.

1) A CustomScan can't currently be a recipient.

find_bloom_filter_recipient() only recognizes the stock scan tags, and the
probe itself lives in ExecScanExtended(), which a CustomScan never calls
(it dispatches to the provider's ExecCustomScan). The second part is
actually a feature, not a bug: if a CustomScan provider does its own
probing, it can choose the granularity -- per dictionary entry, per row
group, or per row -- instead of being locked into the per-row,
post-materialization probe that the stock nodes get. So all that's needed
on your side is to let the planner attach a filter to a base-relation
CustomScan; the provider takes care of consuming it.

Concretely, that's adding T_CustomScan to the scan-leaf case in
find_bloom_filter_recipient() (CustomScan embeds Scan first, so the
scanrelid test is identical; non-leaf custom nodes have scanrelid == 0 and
fall through to NULL), plus the matching fix_scan_bloom_filters() call in
set_customscan_references(). The provider then calls ExecInitBloomFilters()
in BeginCustomScan and ExecBloomFilters() (or a coarser-grained variant)
inside its scan loop. Everything else -- producer registration, the
es_bloom_producers lookup, the adaptive sampling, EXPLAIN -- is reused
unchanged.

Yes, that should work and it's a mostly mechanical change.

Maybe we'd want some sort of opt-in, so that the CustomScan can indicate
it can handle Bloom filters. Like, setting
CUSTOMPATH_SUPPORT_BLOOM_FILTERS in flags.

2) The combined-hash filter can't be tested against a single column.

You build one filter keyed on hash32() of all the join keys combined. For a
single-key join that's ideal, and a column store can use it directly: hash
each distinct dictionary value once per row group and skip groups whose
values are all absent. For a multi-column join, though, the combined hash
mixes the keys, so it can only ever be tested per-row (with all key columns
in hand) -- it can't be checked against any one column's dictionary. The
per-row probe is still useful, but the row-group/dictionary skipping, which
is where most of the storage win comes from, isn't available.
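
For the single-key case, the row-group skipping could look roughly like the sketch below. The storage layout (one set of distinct values per row group) and the function name are hypothetical, and a plain set stands in for the Bloom filter.

```python
def row_groups_to_scan(row_group_dicts, might_contain):
    """Yield indexes of row groups that may hold a matching join key.

    row_group_dicts: for each row group, the distinct values of the join-key
                     column (its dictionary).
    might_contain:   Bloom-filter style membership test; it may return false
                     positives but never false negatives.
    """
    for idx, dictionary in enumerate(row_group_dicts):
        # One probe per distinct value, not per row.
        if any(might_contain(value) for value in dictionary):
            yield idx

# A set stands in for the filter built from the hashed (build) side.
build_keys = {7, 42}
groups = [{1, 2, 3}, {40, 41, 42}, {100}]
to_scan = list(row_groups_to_scan(groups, lambda v: v in build_keys))
```

Here only the second row group has a dictionary value that passes the filter, so the other two would never need to be decompressed.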

The obvious thought is to key a filter per column instead. But I don't
think that should *replace* the combined filter, because per-column filters
are strictly less selective on multi-column joins: they only test whether
each column's value appears *somewhere* in the build side, not whether the
combination does. With build pairs {(1,10),(2,20)}, an outer (1,20) passes
both per-column filters even though it matches no build row, whereas the
combined filter rejects it. So for the row-level probe -- and especially
for plain heap -- the combined filter is the better one, and replacing it
would be a regression.
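
The example with the two build pairs can be written out as a short sketch. Exact sets stand in for the Bloom filters here (so there are no false positives), and the helper names are invented for the illustration.

```python
build_rows = [(1, 10), (2, 20)]

# Exact sets stand in for Bloom filters, i.e. no false positives.
combined = set(build_rows)
per_col_a = {a for a, _ in build_rows}
per_col_b = {b for _, b in build_rows}

def passes_per_column(row):
    """Each column value only has to appear somewhere in the build side."""
    a, b = row
    return a in per_col_a and b in per_col_b

def passes_combined(row):
    """The whole key combination has to appear in the build side."""
    return row in combined

outer = (1, 20)
per_column_result = passes_per_column(outer)  # True: 1 and 20 each appear somewhere
combined_result = passes_combined(outer)      # False: the pair (1, 20) does not
```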

What I think would actually help is to let the framework *optionally* emit
per-column filters in addition to the combined one, when a recipient
signals it can use them. The combined filter stays the default and does the
precise per-row rejection (unchanged for heap, and usable per-row by a
column store too); the per-column filters are extra, built only on demand,
and let a storage consumer cheaply eliminate whole row groups before the
combined filter does the exact work. The cost is the build CPU and memory
for the extra filters -- but only for consumers that ask, so your design is
untouched when nobody does. For a single-key join the two filters
coincide, so there'd be no reason to build both.

I think I speculated about this (having per-key filters) in some of the
comments in the patch, although the use case was different. I haven't
thought about TAM, but about different joins where the join keys come
from both sides. Consider a join like

HJ
/ \
A HJ
/ \
B C

where A-(BC) is on (A.x = B.x AND A.y = C.y), so the complete filter
can't be pushed to either side. But we could:

(1) Push the filter on top of the BC join (which in this example is not
really a push-down).

(2) Build filters on (x) and (y) separately, and push-down these.

Or we could do both, really.

I suppose a variation of (2) would work for your use case too, except
we'd push all three filters (x,y), (x) and (y) to the same scan.

I guess this could also be opt-in, enabled by some CUSTOMPATH_ flag.

The question is how efficient the smaller filters can be. The complete
filter can be very selective, while the per-key filters are terrible.

I'd be happy to work on patches for these.

Great. It's an interesting experiment / area to explore.

FWIW I think the main difficulty for this PoC is going to be the
planning/costing stuff, and the impact on EXPLAIN.

regards

--
Tomas Vondra

Andrew Dunstan

andrew@dunslane.net

about 2 months ago

In reply to: Tomas Vondra (#3)

Re: hashjoins vs. Bloom filters (yet again)

On 2026-05-30 Sa 2:14 PM, Tomas Vondra wrote:

I'd be happy to work on patches for these.

Great. It's an interesting experiment / area to explore.

Here are 3 patches (developed using Claude) that sit on top of your POC.

Patch 1 enables the pushdown filters for custom scans. As you say it's
fairly mechanical and is enabled by a CUSTOMPATH_SUPPORT_BLOOM_FILTERS
path flag.

Patch 2 provides for building per-key filters in addition to the
multi-key filter if that flag is set. There may be other cases that
would want it, but this would suit my immediate use case.

Patch 3 provides for eager creation of the filter(s) in such cases,
disabling the optimization you mentioned in point 1 above.

FWIW I think the main difficulty for this PoC is going to be the
planning/costing stuff, and the impact on EXPLAIN.

I haven't dealt with that or other issues you raise, but I think this is
enough for me to begin testing. I have adapted my TAM to it and verified
that it acts as expected. I will start doing some benchmarks.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Andrei Lepikhov

lepihov@gmail.com

about 2 months ago

In reply to: Tomas Vondra (#1)

Re: hashjoins vs. Bloom filters (yet again)

Postgres is still gaining ground in this area. It’s helpful to see how other
databases handle these challenges.

On 30/05/2026 02:55, Tomas Vondra wrote:

The patches got stuck mostly because deciding if it makes sense to
build/use the Bloom filter is somewhat hard. For cases where 100% of the
tuples have a match it's pointless - it's just pure cost, no benefit.
The regressions are relatively small, though (<10%).

We ran into the same problem when trying to estimate the number of 'generated'
NULLs on the nullable side. So, it makes sense to focus on the estimation method
for 'unmatched' tuples as a separate task.

However, chances are the filter is too big. We can't get work_mem, the
join is already using that for the hash table etc. We can maybe use a
fraction of it, and that may not be enough to fit the "perfect" filter.
We could bail out and not use any Bloom filter at all, but that seems a
bit silly. Maybe we can't fit the 2% filter, but 5% or 10% would be OK?

Looking at DuckDB’s code, using bloom filters during hash table construction
solves this issue.
From what I can tell, Apache Impala [1] and Spark [2] use the same approach.

Of course, it depends on the selectivity of the joins (and thus how many
tuples get discarded by the filters). But because it moves all the
"cheap" filter probes *before* probing any of the hash tables, it has a
multiplication effect for the benefits.

In my experience, the outer side often has a complex subtree and is sometimes
capped by a GROUP BY statement, or even a HAVING clause, which can break all
estimations. A bloom filter might help if there is an accidental misestimate.

So I tend to see this as an opportunistic optimization. We do the
planning assuming there's no Bloom filter push-down, and then after the
fact we see if there's an opportunity after all. Which means we may not
pick a plan with hash joins, not realizing it might be made faster.

This approach should not cause any issues. It is likely a reasonable way to
improve performance without expanding the optimisation scope, which would
increase planning time. We can always adjust it later if needed.
For example, I am designing the post-optimising NestLoop 'lazy join' [3] using
the 'gating' concept.

The bigger issue for me is that it may make the EXPLAIN ANALYZE output
way harder to understand. The estimated "rows" are calculated before the
filter push-down happens, while the actual "rows" are with the filter
probing, of course. But it seems pretty easy to get confused by this,
and think it's just an incorrect estimate.

People are often confused when trying to understand the correctness of
estimation for parallel plans and, in some cases, MergeJoin plans. Personally, I
don't think it's a big issue.

Overall, I think there are even more useful ways to apply bloom filters in the
planner:
1. Real-time partition pruning
2. FDW pushed-down filters, which are especially helpful for sharded tables.
3. Skipping storage layer blocks. I know of at least one attempt to use the
BRIN+FSM approach to avoid reading parts of a large table that definitely don't
match the filter. Bloom filters could be used here as well.

So, I'm excited about your proposal. Even if you start with a simple case, just
make it available for extension modules.

[1]: https://impala.apache.org/docs/build/html/topics/impala_runtime_filtering.html
[2]: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-performance.html
[3]: /messages/by-id/3d749085-72b6-46d6-a26a-7c95805c1adb@gmail.com

--
regards, Andrei Lepikhov,
pgEdge

Andrei Lepikhov

lepihov@gmail.com

about 2 months ago

In reply to: Tomas Vondra (#1)

Re: hashjoins vs. Bloom filters (yet again)

On 30/05/2026 02:55, Tomas Vondra wrote:

Earlier, I mentioned the push-down happens in create_hashjoin_plan().
Which means it happens *after* planning and costing. There are reasons
for that, but it has some unfortunate & annoying consequences.

Ideally, we'd know about the filters when constructing the scan nodes,
so we'd have a chance to estimate how many tuples will be eliminated by
probing the filters (which is about the same thing as estimating the
join sizes). But we can't do that, because our planner works bottom-up.
When constructing the scan nodes we know which tables we'll join with,
but we have no idea which of the join algorithms we'll pick.

We'll consider all three join types, and the scan node has no say which
of those will win. But the Bloom filter push-down is specific to hash
joins. So what should the scan node do? Either it can assume it's under
hash join (and set rows/cost as if there's a Bloom filter), or it can
set costs in a join-agnostic way (like now).

The only "correct" way I can think of dealing with this in the bottom-up
world is having two sets of paths - one set for a hash join, one set for
other joins. But that's not just for scans. We'd need that for all
paths, and for different combinations of joins. For the query with 3
joins, we'd end up with 2^3 combinations. That seems not great.

I overlooked this part of your first message, so let me add a quick comment.

In principle, the optimiser is not restricted to bottom-up planning. For
example, in extension modules, I sometimes use the create_upper_paths_hook to
add a 'Top-Down' iteration after 'bottom-up' planning [1].

This helps improve complex query plans, such as adding a Memoize node at the
head of a subplan when the number of distinct input parameter values is expected
to be low. It can also use the startup_cost-optimal subpaths in MergeJoin if
histogram comparisons indicate that only a small portion of the input will be
scanned. There are other possible cases involving LIMIT and sort propagation as
well.

I'm not sure whether this approach makes sense for the specific technique you
are developing, since it's already quite complex. Also, an additional planning
iteration is pure overhead in most cases, except for complex analytical queries.
However, it might provide an idea for future improvement.

[1]: https://github.com/danolivo/conf/blob/main/2025-MiddleOut/MiddleOut.pdf

--
regards, Andrei Lepikhov,
pgEdge

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 2 months ago

In reply to: Tomas Vondra (#1)

Re: hashjoins vs. Bloom filters (yet again)

Hi,

I kept thinking about the various issues discussed after I posted the v1
pushdown patch. Some of the issues are specific to the pushdown (to scan
nodes), but a lot of the issues seem to be shared with using Bloom
filters within the hashjoin (which is what the old threads were about).

We'd need to do something about these issues no matter where we place
the filter, so it's a bit of a prerequisite for using Bloom in hash joins
in general. And they seem somewhat more limited / easier to solve than
the planning/costing issues.

So I decided it'd be interesting to see how beneficial the Bloom
filters can be in the scope of a single hashjoin, without pushing it all the
way to the scan nodes, and see what we can do about the issues.

Attached is a PoC patch series optionally adding Bloom filters to a hash
join, both for serial and parallel joins. It's labeled as v2, but it's
really independent of the v1 pushdown patch posted last week. Some of
the ideas implemented in this could be applied to the pushdown patch too
(in particular all the adaptive behavior).

I'm not sure if we should try to merge these two things into a single
patch series, or whether it'd be better to split those into two threads
(otherwise it'll just keep confusing both people and cfbot).

how the patch works
-------------------

Anyway, let me briefly explain what the patch does (see the commit
messages and comments for more details, I tried to keep those
comprehensive). I suggest focusing on the serial case (in 0001), the
parallel joins are a direct extension of that - but inherently harder to
understand, due to the parallel hash build, shmem etc.

In principle, using Bloom filters is pretty simple - while adding tuples
from the inner relation to the hash table, build also a Bloom filter and
then use it to discard outer tuples cheaply, without having to do an
expensive lookup in a hash table. It does not matter whether the hash table
is in private or shared memory.

The difficulty is to figure out whether it makes sense to build/probe
the filter. For that to be the case, the filter needs to eliminate
enough outer tuples, so that the hash table lookup is not needed, and/or
the tuple can be discarded without spilling it to disk (with nbatch>1).

Note: With the pushdown, the benefits "compound" by combining multiple
filters (if there are multiple joins) and/or by skipping some
intermediate operators (between the scan and the hashjoin). So it's
maybe less risky, but the issue still exists.

adaptive build / probing
------------------------

I see two complementary ways to deal with this - during planning (based
on estimates and a cost model), and adaptively during execution (based
on probe/lookup stats). The v2 patch does the latter, mostly because I
think it's beneficial even if we eventually add some smarts to the
planning phase.

The adaptive behavior decides (a) when a filter is built, and (b) if a
filter is probed before hash table lookups.

For builds, we don't want to build filters when ~100% of lookups in the
hash table find a match. It'd not pay for itself. So when the hash table
fits into memory (nbatch=1), we wait for the first 1000 lookups, and
build the filter only if <90% have a match (and recheck once in a
while, so the filter may be built later).

But with batched joins (nbatch>1) we can't delay building the filter, we
have to decide before spilling some of the tuples to disk (otherwise the
filter would be incomplete, and we couldn't reject tuples from later
batches - which is the main benefit with batched joins). So with batched
joins we build the filter, and hope that either it helps, or the
overhead is negligible overall.
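
Put together, the build decision described in the last two paragraphs amounts to something like the following sketch. The function and parameter names are invented; only the 1000-lookup and 90% thresholds come from the description above.

```python
def should_build_filter(nbatch, lookups, matches,
                        min_lookups=1000, max_match_fraction=0.90):
    """Decide whether to build the Bloom filter (sketch of the rule above).

    Batched joins have to decide up front, before any outer tuple is spilled,
    so they always build. Unbatched joins wait until enough lookups were seen
    and build only if a meaningful fraction of them found no match.
    """
    if nbatch > 1:
        return True
    if lookups < min_lookups:
        return False           # too early to tell, ask again later
    return matches / lookups < max_match_fraction
```

The "recheck once in a while" part would simply be the caller invoking this again with updated counters, so a filter that was skipped at first may still get built later.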

Then when probing, we don't want to use a filter that does not reject any
tuples. To deal with this, the patch tracks the number of probes and number
of rejections, and if fewer than 10% of probes reject the tuple (i.e.
the filter is ineffective), it gets temporarily "disabled". When
disabled, a filter samples 1% of probes, and then may get enabled again
if the fraction of rejected tuples gets >20%.

Overall, this seems to work pretty well. Of course, it can be improved
in various ways. For example, the thresholds 10% and 20% are somewhat
arbitrary - it's based on earlier experiments, and it works OK on a
number of machines, with different queries / data types. But having a
more formal "cost model" for Bloom filters might help.

Another possible improvement is about maybe making some decisions during
planning, particularly when the decisions are reliable. I'm rather
skeptical about deciding to build a Bloom filter based on estimates. I
think it's better to do that decision during execution, as explained in
the preceding sections. We could still consider the "expected" Bloom
filter for costing purposes, but leave the decision for execution.

However, in some cases we may be able to know for sure a Bloom filter is
useless. For example, if we know a given join is on a FK, every outer
tuple will have a match. In that case the filter can't help. The patch
won't build it anyway (at least for nbatch=1), thanks to the adaptive
build heuristics. But we could short-circuit that entirely.

perf evaluation
---------------

Now, some numbers. Attached is a .tgz with a benchmark script running a
hashjoin on two tables (fact-dimension), varying the selectivity of the
join (5%-100%), work_mem, number of parallel workers, data types of the
join keys, and size of the tables. There's a .csv with more complete
results of the tests, I'll focus on results for scale 100, i.e. fact
100M rows, dimension 10M rows.

The two attached PDFs show timings for master + patched branch, with
enable_hashjoin_bloomm=on/off. And then columns showing timing relative
to master (<1.0 speedup / green, >1.0 regression / red). Green = good.
BTW this is from my ryzen machine (Ryzen 9 9900X).

The results for serial queries (workers=0) seem pretty nice. For
selective joins (>50% outer tuples discarded) it's about 20% faster, and
with 5% selectivity (95% discarded), it's ~2x faster. The adaptive
thresholds seem to roughly match reality.

For parallel queries it's a bit worse. There are some nice speedups, but
the benefits are clearly more limited. One interesting observation is
that for serial queries the cases that benefit most are those with
batching, while with parallel joins it's exactly the opposite. See
hashjoin-bloom-batched.pdf, which shows timings only for queries with
batched joins.

I'm not sure why that is, but it's entirely possible it's due to a bug
in the patch - the parallel join is fairly complex, I can't rule this
out. Or it might be due to some hardware bottlenecks or whatever?

I'd definitely welcome some review and ideas about what might be causing
this.

One thing I realized when looking at the results is that this may need
some different trade-offs regarding the size of the filter. The library
lib/bloomfilter.c aims for a 1-2% false positive rate, but we sometimes
end up with a filter like this:

Bloom Filter: Size: 16384kB Hash Functions: 10
False Positive Rate: 0.077%

This is for work_mem=64MB, with batched join:

Buckets: 2097152 Batches: 16 Memory Usage: 82784kB

so maybe it's not that large. But maybe it'd be better to accept
somewhat higher false-positive rate (e.g. ~10%) in exchange for a much
smaller filter, and fewer hash functions (i.e. fewer bits to check)?
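
To put rough numbers on that trade-off, here is a minimal sketch using the textbook formulas for an optimally configured Bloom filter, m/n = -ln(p) / (ln 2)^2 bits per element and k = (m/n) * ln 2 hash functions. It is not how lib/bloomfilter.c sizes its bitset, and the function name is made up for the example.

```python
import math


def bloom_parameters(fpr):
    """Return (bits per element, hash function count) for a target rate.

    Textbook sizing for an optimally configured Bloom filter; illustrative
    only, not the sizing logic of any particular implementation.
    """
    if not 0.0 < fpr < 1.0:
        raise ValueError("fpr must be between 0 and 1")
    bits_per_element = -math.log(fpr) / (math.log(2) ** 2)
    nhashes = max(1, round(bits_per_element * math.log(2)))
    return bits_per_element, nhashes


for target in (0.001, 0.01, 0.02, 0.10):
    bits, k = bloom_parameters(target)
    print(f"fpr={target:.3f} bits/element={bits:.2f} hashes={k}")
```

Under these formulas a 1% target needs about 9.6 bits per element and 7 hash functions, while a 10% target needs about 4.8 bits and 3 hash functions, so accepting 10% halves the filter and more than halves the bits checked per probe.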

regards

--
Tomas Vondra

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 2 months ago

In reply to: Andrew Dunstan (#4)

Re: hashjoins vs. Bloom filters (yet again)

On 5/31/26 17:03, Andrew Dunstan wrote:

...
Here are 3 patches (developed using Claude) that sit on top of your POC.

Patch 1 enables the pushdown filters for custom scans. As you say it's
fairly mechanical and is enabled by a CUSTOMPATH_SUPPORT_BLOOM_FILTERS
path flag.

Patch 2 provides for building per-key filters in addition to the multi-
key filter if that flag is set. There may be other cases that would want
it, but this would suit my immediate use case.

Patch 3 provides for eager creation of the filter(s) in such cases,
disabling the optimization you mentioned in point 1 above.

Thanks. I'll take a look when I have time.

FWIW I think the main difficulty for this PoC is going to be the
planning/costing stuff, and the impact on EXPLAIN.

I haven't dealt with that or other issues you raise, but I think this is
enough for me to begin testing. I have adapted my TAM to it and verified
that it acts as expected. I will start doing some benchmarks.

OK. I think it's enough for testing, i.e. to see if it's actually worth
pursuing further. But I think we'll eventually need to solve the
planning/costing issues somehow. I'm not sure it'll be committable
without some sort of solution.

I happened to find this 2025 paper:

Including Bloom Filters in Bottom-up Optimization
https://arxiv.org/html/2505.02994v1

I read it over the weekend, and interestingly enough it's exactly about
the planning issues I outlined last week, i.e. difficulties with costing
paths that might include pushed-down Bloom filters.

They even describe a solution that kinda looks a bit like "tracking a
separate set of paths" from my e-mail, although they use somewhat
different terminology (sub-plan == our path, etc.). But if you squint a
little bit, it talks about the costing issue, path explosion, etc.

Their solution is some sort of two-phase process, which I'm not sure we
can do. It'd require a fundamental rework of how we construct join rels
and all that, and TBH I don't have an ambition to do that.

But while reading the paper, I kept thinking about how we deal with
pathkeys. I wonder if we could do something similar to that? That is,
have a concept of "potentially interesting" filters, and construct the
extra paths only for those, to limit the number of extra paths.

Imagine we construct the baserels (essentially scan nodes), and then
do a pass over those. Each scan would look at what joins it participates
in, and which of those could benefit from a Bloom filter (some can't,
because it's a FK join, or we don't expect many rejected tuples, or
maybe it's a LEFT JOIN, ... etc.).

And then we'd maybe have some additional heuristics to pick which "Bloom
filters" to attach to the path. And then later, when planning that
particular join involving that path, we'd reject the join if it's not a
hash join. The scan would always have to construct a "clean path" not
requiring any filters, similarly to what custom scans need to do for
parallel paths.

It's just a rough idea, but I think it would work. Worth a try.

regards

--
Tomas Vondra

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 2 months ago

In reply to: Andrei Lepikhov (#5)

Re: hashjoins vs. Bloom filters (yet again)

On 6/1/26 11:30, Andrei Lepikhov wrote:

Postgres is still gaining ground in this area. It’s helpful to see how other
databases handle these challenges.

On 30/05/2026 02:55, Tomas Vondra wrote:

The patches got stuck mostly because deciding if it makes sense to
build/use the Bloom filter is somewhat hard. For cases where 100% of the
tuples have a match it's pointless - it's just pure cost, no benefit.
The regressions are relatively small, though (<10%).

We ran into the same problem when trying to estimate the number of 'generated'
NULLs on the nullable side. So, it makes sense to focus on the estimation method
for 'unmatched' tuples as a separate task.

Not sure how treating it as a separate task solves that? In any case,
the patches I posted a couple minutes ago (for filters in the scope of a
single hashjoin, but the problem is the same) deal with this by delaying
the build until execution time, when we have a better idea of how many
outer tuples match the hash table.

However, chances are the filter is too big. We can't get work_mem, the
join is already using that for the hash table etc. We can maybe use a
fraction of it, and that may not be enough to fit the "perfect" filter.
We could bail out and not use any Bloom filter at all, but that seems a
bit silly. Maybe we can't fit the 2% filter, but 5% or 10% would be OK?

Looking at DuckDB’s code, using bloom filters during hash table construction
solves this issue.
From what I can tell, Apache Impala [1] and Spark [2] use the same approach.

It's not clear how any of these solve the issue I described (about
sizing the filter and the trade-offs). The links just say that the
feature exists.

Of course, it depends on the selectivity of the joins (and thus how many
tuples get discarded by the filters). But because it moves all the
"cheap" filter probes *before* probing any of the hash tables, it has a
multiplication effect for the benefits.

In my experience, the outer side often has a complex subtree and is sometimes
capped by a GROUP BY statement, or even a HAVING clause, which can break all
estimations. A bloom filter might help if there is an accidental misestimate.

Perhaps, but with substantial misestimates all bets are off. Maybe it'd
be better to discuss a particular example.

So I tend to see this as an opportunistic optimization. We do the
planning assuming there's no Bloom filter push-down, and then after the
fact we see if there's an opportunity after all. Which means we may not
pick a plan with hash joins, not realizing it might be made faster.

This approach should not cause any issues. It is likely a reasonable way to
improve performance without expanding the optimisation scope, which would
increase planning time. We can always adjust it later if needed.
For example, I am designing the post-optimising NestLoop 'lazy join' [3] using
the 'gating' concept.

I agree, except that it also makes EXPLAIN pretty difficult to
interpret, because it "breaks" the row counts.

The bigger issue for me is that it may make the EXPLAIN ANALYZE output
way harder to understand. The estimated "rows" are calculated before the
filter push-down happens, while the actual "rows" are with the filter
probing, of course. But it seems pretty easy to get confused by this,
and think it's just an incorrect estimate.

People are often confused when trying to understand the correctness of
estimation for parallel plans and, in some cases, MergeJoin plans. Personally, I
don't think it's a big issue.

I disagree. The fact that people may be confused by plans does not mean
we can just make plans confusing for everyone.

Overall, I think there are even more useful ways to apply bloom filters in the
planner:
1. Real-time partition pruning
2. FDW pushed-down filters, which are especially helpful for sharded tables.
3. Skipping storage layer blocks. I know of at least one attempt to use the
BRIN+FSM approach to avoid reading parts of a large table that definitely don't
match the filter. Bloom filters could be used here as well.

Could be. I speculated about options (2) and (3) myself elsewhere in
this thread.

regards

--
Tomas Vondra

#10

Oleg Bartunov

oleg@sai.msu.su

about 2 months ago

In reply to: Tomas Vondra (#1)

Re: hashjoins vs. Bloom filters (yet again)

On Sat, May 30, 2026, 01:56 Tomas Vondra <tomas@vondra.me> wrote:

...

Bloom filters have two rather different roles here.

For a local Hash Join optimization, Bloom does not require any particular
physical ordering of the heap. It can be useful simply when the join is
selective enough, or when batching/spilling makes failed probes expensive:
the Bloom filter rejects many outer tuples before a full hash-table probe
or before writing them to temporary batches.

But once we talk about pushing a runtime filter down to the scan/storage
layer, the physical preconditions become crucial. To get more than a cheap
per-row check, the scan must have something coarse-grained to skip:
partitions, row groups, chunks, block ranges, dictionaries, min/max
metadata, BRIN-like summaries, etc. Without that, the filter is still
correct, but the benefit is mostly CPU/probe reduction rather than avoiding
data production.

So for me the most interesting part of this thread is not Bloom itself, but
the architectural idea: pushing runtime knowledge down to the scan node,
against the normal direction of data flow. The build side of a join
produces compact knowledge about admissible keys, and lower layers may use
it before rows are materialized and sent upward.

I saw this in my own experiments with zone/chunk-oriented storage for
Postgres: static predicates could prune zones nicely, but joins were the
hard case because the useful filtering knowledge was produced above the
scan. A runtime semi-join filter pushed from the Hash Join build side into
the scan could turn join-derived knowledge into scan-level pruning.

For example:

SELECT sum(e.cost)
FROM events e
JOIN accounts a ON e.account_id = a.id
WHERE a.region = 'NP'; -- Nepal

The events scan does not know which account_id values belong to accounts
in that region. That knowledge is produced above it, on the build side of
the join. A runtime semi-join filter pushed from the Hash Join build side
down into the events scan could let the scan reject impossible account_id
values before producing tuples.

For a plain heap scan this may mostly save hash probes. But with
zone/chunk-oriented storage, where chunks have dictionaries, min/max
metadata, Bloom summaries, or tenant ranges, the same runtime filter can
skip whole chunks. That is the part I find most interesting: turning
join-derived knowledge into scan-level pruning, against the normal
direction of data flow.
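
As a rough illustration of the chunk-skipping idea, the sketch below evaluates the example query over chunks that carry min/max metadata on account_id. Everything here is hypothetical: the Chunk layout, the function name, and the use of an exact key set in place of a Bloom filter are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A storage chunk with min/max metadata on the join key (assumed layout)."""
    min_key: int
    max_key: int
    rows: list  # (account_id, cost) pairs


def sum_cost_with_runtime_filter(chunks, build_keys):
    """Return (sum of cost, chunks skipped) for rows matching build_keys.

    build_keys stands in for the runtime filter produced by the build side
    of the join; a real filter might be a Bloom filter or a key range.
    """
    total = 0.0
    skipped = 0
    for chunk in chunks:
        # Coarse-grained check: can any build-side key fall in [min, max]?
        if not any(chunk.min_key <= key <= chunk.max_key for key in build_keys):
            skipped += 1  # whole chunk skipped, no rows produced
            continue
        # Fine-grained per-row check for the chunks that survive.
        for account_id, cost in chunk.rows:
            if account_id in build_keys:
                total += cost
    return total, skipped


chunks = [
    Chunk(1, 10, [(1, 5.0), (7, 2.0)]),
    Chunk(11, 20, [(12, 3.0)]),
    Chunk(21, 30, [(25, 4.0), (30, 1.0)]),
]
total, skipped = sum_cost_with_runtime_filter(chunks, {7, 25})
```

On a plain heap only the per-row check applies. The coarse check is what lets storage with chunk metadata avoid producing rows at all.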

Bloom is just one carrier for that knowledge. The real feature is a
pluggable runtime-filter mechanism that heap, CustomScan, FDW,
columnar/table AMs, partitioned storage, or chunk/cold storage can consume
at the level they understand.

This may be a topic for a separate thread, because it quickly becomes less
about Hash Join Bloom filters and more about runtime knowledge pushdown
into storage.


#11

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 2 months ago

In reply to: Oleg Bartunov (#10)

Re: hashjoins vs. Bloom filters (yet again)

On 6/3/26 11:20, Oleg Bartunov wrote:

...

Bloom filters have two rather different roles here.

For a local Hash Join optimization, Bloom does not require any
particular physical ordering of the heap. It can be useful simply when
the join is selective enough, or when batching/spilling makes failed
probes expensive: the Bloom filter rejects many outer tuples before a
full hash-table probe or before writing them to temporary batches.

Right. Adding a filter within a hash join is certainly less ambitious,
and the possible benefits are smaller.

But once we talk about pushing a runtime filter down to the scan/storage
layer, the physical preconditions become crucial. To get more than a
cheap per-row check, the scan must have something coarse-grained to
skip: partitions, row groups, chunks, block ranges, dictionaries,
min/max metadata, BRIN-like summaries, etc. Without that, the filter is
still correct, but the benefit is mostly CPU/probe reduction rather than
avoiding data production.

Maybe, but there's also ongoing work on adding batches to the executor,
in which case we'd eliminate "row groups" even when using a filter in
the scope of a hashjoin operator. Of course, the tuples will flow all
the way up to that operator.

So for me the most interesting part of this thread is not Bloom itself,
but the architectural idea: pushing runtime knowledge down to the scan
node, against the normal direction of data flow. The build side of a
join produces compact knowledge about admissible keys, and lower layers
may use it before rows are materialized and sent upward.

I saw this in my own experiments with zone/chunk-oriented storage for
Postgres: static predicates could prune zones nicely, but joins were the
hard case because the useful filtering knowledge was produced above the
scan. A runtime semi-join filter pushed from the Hash Join build side
into the scan could turn join-derived knowledge into scan-level pruning.

For example:

SELECT sum(e.cost)
FROM events e
JOIN accounts a ON e.account_id = a.id
WHERE a.region = 'NP'; -- Nepal

The events scan does not know which account_id values belong to accounts
in that region. That knowledge is produced above it, on the build side of
the join. A runtime semi-join filter pushed from the Hash Join build side
down into the events scan could let the scan reject impossible account_id
values before producing tuples.

Yes. This is known as "predicate transfer" in academic papers.

For a plain heap scan this may mostly save hash probes. But with
zone/chunk-oriented storage, where chunks have dictionaries, min/max
metadata, Bloom summaries, or tenant ranges, the same runtime filter can
skip whole chunks. That is the part I find most interesting: turning
join-derived knowledge into scan-level pruning, against the normal
direction of data flow.

Bloom is just one carrier for that knowledge. The real feature is a
pluggable runtime-filter mechanism that heap, CustomScan, FDW,
columnar/table AMs, partitioned storage, or chunk/cold storage can
consume at the level they understand.

This may be a topic for a separate thread, because it quickly becomes
less about Hash Join Bloom filters and more about runtime knowledge
pushdown into storage.

Right, there's a general concept of a "filter", and Bloom filters are
just one example of that. And maybe we could build other types of
filters more suitable for the scan. But I think it'll still be tied to a
hash join, because what other nodes / joins can build the filter?

regards

--
Tomas Vondra

#12

Ben Mejia

benjamin.arthur.mejia@gmail.com

about 1 month ago

In reply to: Tomas Vondra (#1)

Re: hashjoins vs. Bloom filters (yet again)

On 5/29/26 5:55 PM, Tomas Vondra wrote:

old patches
-----------

Those old patches tried to do a fairly small thing during a hash join,
and that's building a Bloom filter on the inner relation (the one that
gets hashed), and then use that filter before probing the hash table.

The benefits come from Bloom filters being (fairly) cheap, and a
negative answer (hash is not in the filter) may allow us to skip a much
more expensive operation.

The old patches focused especially on two hash join cases:

(a) A very selective join, i.e. a significant fraction of outer tuples
does not have a match in the hash table.

(b) A selective hash join forced to do batching because the hash table
is too large, and thus forced to spill outer tuples to temporary files.

For (a), the benefit comes from Bloom filters being much cheaper to
probe than a hash table. The exact cost depends on the implementation,
sizes, etc. We're in the ballpark of 50 vs. 500 cycles, maybe. But if
the filter discards 90% of tuples, it can be a big win.

For (b), the filter (for all the batches at once) allows us to discard
some of the outer tuples without writing them to temporary files. Which
is way more expensive than probing a hash table.

As it happens, I've been exploring the use of a bitmap filter for the
same two cases you mention. This has some relevance to the issues you
mention in your post about sizing, false-positive rate, etc.

Instead of a Bloom filter, I chose to use a bitmap filter, with one bit
per bucket on the build side. As the inner table is built, I set a bit
in the bitmap filter for every occupied bucket. If a bucket is empty,
there are no matching hashes and those hash values can be skipped where
appropriate. The advantages of this bitmap over a Bloom filter are:

- sizing is pre-determined by nbuckets
- small bitmaps (4k for 32k buckets)
- cheaper - nominal cost to set/check bits

A well-chosen Bloom filter will be more discriminating, but the bitmap
has the same no-false-negatives guarantee and costs much less space and
time to build.
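
A minimal sketch of such a bucket bitmap follows, in Python for brevity. The class and method names are made up for the example, and picking the bucket by masking the low bits of the hash is an assumption about how buckets are addressed, not a description of the actual patch.

```python
class BucketBitmap:
    """One bit per hash bucket; a set bit means the bucket is occupied."""

    def __init__(self, nbuckets):
        if nbuckets <= 0 or nbuckets & (nbuckets - 1):
            raise ValueError("nbuckets must be a power of two")
        self.nbuckets = nbuckets
        # One bit per bucket, rounded up to at least one byte.
        self.bits = bytearray(nbuckets // 8 or 1)

    def _bucket(self, hashvalue):
        # Assumed bucket addressing: low bits of the hash value.
        return hashvalue & (self.nbuckets - 1)

    def add(self, hashvalue):
        """Mark the bucket of an inner (build-side) tuple as occupied."""
        b = self._bucket(hashvalue)
        self.bits[b >> 3] |= 1 << (b & 7)

    def might_contain(self, hashvalue):
        """False means the bucket is empty, so no inner tuple can match."""
        b = self._bucket(hashvalue)
        return bool(self.bits[b >> 3] & (1 << (b & 7)))


bitmap = BucketBitmap(32768)
bitmap.add(12345)
```

With 32768 buckets the bitmap takes 4096 bytes, matching the "4k for 32k buckets" figure. Any hash that lands in an occupied bucket passes the check, so false positives are possible but false negatives are not.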

I implemented both of your cases:

Drop-before-spill: (Case b)
Build per-batch bitmaps during inner partition pass and drop tuples that
don't have a bit set. Saves I/O on tuples that will never match. This
only works for inner and semi joins.

Single-Batch probe: (Case a)
Only pays off for joins with a high miss rate and a bucket array larger
than the L2/L3 cache, where a hash table lookup costs more than the
in-cache bitmap check. It also works for multi-batch joins, but there
the I/O cost dominates and there is no gain.

I put runtime guards on both of these: I sample the drop rate over a
window and disable the filter for the rest of the pass if the rate falls
below a threshold (~5% for case b, ~25% for case a).

The benchmarks are encouraging:

For case a, I was able to see a best-case improvement of ~15% for
carefully chosen data (dependent on L2/L3 cache size).

For case b, I tested 3 cases with sparse, average and dense probe hits:

sparse probe (~95% miss): +18% to +36%
avg probe (~37% miss): +9% to +13%
dense probe (FK-like, ~0% miss): flat, within noise

(This was on an 8-core x86-64, L1 32KB/core, L2 4MB/core, L3 32MB, 31 GB
RAM. PostgreSQL 19devel, serial hash join,
max_parallel_workers_per_gather = 0, across work_mem = 1-8MB)

Happy to share the patch and full benchmark data if useful.

-Ben Mejia

#13

Tomas Vondra

tomas.vondra@2ndquadrant.com

30 days ago

In reply to: Ben Mejia (#12)

Re: hashjoins vs. Bloom filters (yet again)

On 6/24/26 00:36, Ben Mejia wrote:

On 5/29/26 5:55 PM, Tomas Vondra wrote:

old patches
-----------

Those old patches tried to do a fairly small thing during a hash join,
and that's building a Bloom filter on the inner relation (the one that
gets hashed), and then use that filter before probing the hash table.

The benefits come from Bloom filters being (fairly) cheap, and a
negative answer (hash is not in the filter) may allow us to skip a much
more expensive operation.

The old patches focused especially on two hash join cases:

(a) A very selective join, i.e. a significant fraction of outer tuples
does not have a match in the hash table.

(b) A selective hash join forced to do batching because the hash table
is too large, and thus forced to spill outer tuples to temporary files.

For (a), the benefit comes from Bloom filters being much cheaper to
probe than a hash table. The exact cost depends on the implementation,
sizes, etc. We're in the ballpark of 50 vs. 500 cycles, maybe. But if
the filter discards 90% of tuples, it can be a big win.

For (b), the filter (for all the batches at once) allows us to discard
some of the outer tuples without writing them to temporary files. Which
is way more expensive than probing a hash table.

As it happens, I've been exploring the use of a bitmap filter for the
same two cases you mention. This has some relevance to the issues you
mention in your post about sizing, false-positive rate, etc.

Instead of a Bloom filter, I chose to use a bitmap filter, with one bit
per bucket on the build side. As the inner table is built, I set a bit
in the bitmap filter for every occupied bucket. If a bucket is empty,
there are no matching hashes and those hash values can be skipped where
appropriate. The advantages of this bitmap over a Bloom filter are:

- sizing is pre-determined by nbuckets
- small bitmaps (4k for 32k buckets)
- cheaper - nominal cost to set/check bits

A well-chosen Bloom filter will be more discriminating, but the bitmap
has the same no-false-negatives guarantee and costs much less space and
time to build.

Isn't that pretty much the same thing as a Bloom filter with a single
hash function? So it has the same false positive properties, i.e. it may
be sacrificing some of the accuracy for not having to calculate any more
hashes. Could be a win.
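
A small sketch of the one-bit-per-bucket idea, in Python for brevity. It assumes nbuckets is a power of two and that the bucket number is the low bits of the hash value; the class and method names are made up for illustration. Seen this way it behaves like a Bloom filter with a single hash function: no false negatives, and a false positive whenever two hashes share a bucket.

```python
class BucketBitmap:
    """One bit per hash bucket; a set bit means the bucket is occupied."""

    def __init__(self, nbuckets):
        # Assumption: nbuckets is a power of two, so masking picks the bucket.
        assert nbuckets > 0 and nbuckets & (nbuckets - 1) == 0
        self.mask = nbuckets - 1
        self.bits = bytearray((nbuckets + 7) // 8)

    def add(self, hashvalue):
        """Mark the bucket of an inner (build-side) tuple as occupied."""
        bucket = hashvalue & self.mask
        self.bits[bucket >> 3] |= 1 << (bucket & 7)

    def might_contain(self, hashvalue):
        """False means no inner tuple can match; True may be a false positive."""
        bucket = hashvalue & self.mask
        return bool(self.bits[bucket >> 3] & (1 << (bucket & 7)))
```

For 32k buckets the bitmap takes 4096 bytes, which matches the "4k for 32k buckets" figure quoted above.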

I implemented both of your cases:

Drop-before-spill: (Case b)
Build per-batch bitmaps during inner partition pass and drop tuples that
don't have a bit set. Saves I/O on tuples that will never match. This
only works for inner and semi joins.

OK, makes sense.

Single-Batch probe: (Case a)
Only pays off in high-miss-rate joins and a bucket array larger than L2/
L3 cache. This case has a higher penalty for hash table lookup than the
in-cache bitmap check. This case works in multi-batch, but the I/O cost
dominates and there is no gain.

Right, it's hard to beat the hash table in this case.

I put runtime guards on both of these; I sampled the drop rate over a
window and disable the filter for the rest of the pass if the rate falls
below a threshold. (~5% for case b; ~25% for case a)

This seems similar to the adaptive behavior I implemented in v2. I
haven't thought of the idea of using different thresholds for the two
cases - I like that.

The benchmarks are encouraging:

For case a, I was able to see a best-case improvement of ~15% for
carefully chosen data (dependent on L2/L3 cache size).

For case b, I tested 3 cases with sparse, average and dense probe hits:

    sparse probe (~95% miss):           +18% to +36%
    avg probe    (~37% miss):            +9% to +13%
    dense probe (FK-like, ~0% miss):    flat, within noise

(This was on an 8-core x86-64, L1 32KB/core, L2 4MB/core, L3 32MB, 31 GB
RAM. PostgreSQL 19devel, serial hash join,
max_parallel_workers_per_gather = 0, across work_mem = 1-8MB)

Happy to share the patch and full benchmark data if useful.

I'd be happy to collaborate on some of this, if you're interested. Feel
free to post your patch here, or in a separate thread. Up to you.
Separate threads might be easier for cfbot to track.

I've spent some time hacking on v1, i.e. the pushdown - making the
planning work properly, etc. That's mostly orthogonal to this, with some
overlap. Ideally we'd want to do both I think.

regards

--
Tomas Vondra

#14

Tomas Vondra

tomas.vondra@2ndquadrant.com

24 days ago

In reply to: Tomas Vondra (#8)

Re: hashjoins vs. Bloom filters (yet again)

Hi,

On 6/2/26 17:46, Tomas Vondra wrote:

On 5/31/26 17:03, Andrew Dunstan wrote:

...

OK. I think it's enough for testing, i.e. to see if it's actually worth
pursuing further. But I think we'll eventually need to solve the issues
planning/costing somehow. I'm not sure it'll be committable without
having some sort of solution.

I happened to find this 2025 paper:

Including Bloom Filters in Bottom-up Optimization
https://arxiv.org/html/2505.02994v1

I read it over the weekend, and interestingly enough it's exactly about
the planning issues I outlined last week, i.e. difficulties with costing
paths that might include pushed-down Bloom filters.

They even describe a solution that kinda looks a bit like "tracking a
separate set of paths" from my e-mail, although they use somewhat
different terminology (sub-plan == our path, etc.). But if you squint a
little bit, it talks about the costing issue, path explosion, etc.

Their solution is some sort of two-phase process, which I'm not sure we
can do. It'd require a fundamental rework of how we construct join rels
and all that, and TBH I don't have an ambition to do that.

But while reading the paper, I kept thinking about how we deal with
pathkeys. I wonder if we could do something similar to that? That is,
have a concept of "potentially interesting" filters, and construct the
extra paths only for those, to limit the number of extra paths.

Imagine we construct the baserels (essentially scan nodes), and then
do a pass over those. Each scan would look at what joins it participates
in, and which of those could benefit from a Bloom filter (some can't,
because it's a FK join, or we don't expect many rejected tuples, or
maybe it's a LEFT JOIN, ... etc.).

And then we'd maybe have some additional heuristics to pick which "Bloom
filters" to attach to the path. And then later, when planning that
particular join involving that path, we'd reject the join if it's not a
hash join. The scan would always have to construct a "clean path" not
requiring any filters, similarly to what custom scans need to do for
parallel paths.

It's just a rough idea, but I think it would work. Worth a try.

I kept thinking about how we could improve the v1 patch (with filter
pushdown to scans deep below the join), so that it does the decisions
"properly" during planning, i.e. during the bottom-up phase when
constructing paths. I kept going back to PathKeys as a precedent for
considering this kind of Path feature, so I gave that a try.

Attached is a v3, a PoC patch reworking the v1 patch in this way. The
patch looks pretty large (~300kB), but the vast majority of that is
churn in regression tests. And also a lot of FIXME/XXX comments
explaining remaining issues or opportunities for improvement. The actual
code changes are rather modest (maybe ~1000 lines).

It also looks much larger because I chose to include v1 as 0001, with
0002 reworking it. But 0002 undoes a lot of the changes made by 0001,
and the resulting diff is way *smaller* than either of the parts.

So it really is not that large in the end.

Note: Most of the changes in regression tests are due to plan changes,
where a Bloom filter gets pushed down, and sometimes that triggers larger
plan changes (e.g. because it makes hashjoin cheaper, and so it wins).
There are also changes in results, because the ordering changes (which is
fine, if the plan switches to a hashjoin). I haven't investigated all
the individual changes in detail, but those that I looked at seem OK.

The v1 patch (still included as 0001) did the pushdown in createplan.c,
in create_hashjoin_plan. And that's problematic for the various reasons
I explained in the previous message (costing, missing better plans, ...)

The rework (in 0002) changes that to actually determine the filters
during the bottom-up phase.

In short:

(a) Path gets "expected_filters" list with filters it expects to be
satisfied by a later join.

(b) While creating paths for scans, we now also generate paths for
interesting filters. This happens in set_rel_pathlist(), it looks at
joins the relation participates in, decides which are selective enough,
and adds a couple paths for combinations of expected filters.

The default is to pick 3 selective filters, and generate all
combinations. So ~8 new paths for each scan path. There's a bunch of
open questions regarding which filters to pick, how many, which
combinations to create.

(c) While creating paths for joins, we also consider these additional
paths with expected filters. A join may either "satisfy" a filter (if
it's a hashjoin and the filter is pushed down by the join), or it can
propagate it (if it's pushed from a later join).

This means nestloop/mergejoin "discard" paths with filters matching that
join, but can still propagate the other filters.

The paths with filters don't compete with regular paths, so we still
have the regular paths with no filters. So there should be no risk of
not being able to find a plan, or something.

(d) At planning / execution this works similarly to v1, except that all
the decisions were already done. The code in createplan.c and executor
is more for book-keeping and allowing lookup of filters.
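
Steps (b) and (c) can be illustrated with a toy model in Python. This is only a sketch of the bookkeeping, not the patch's code: the function names, the representation of a filter as a join name, and the "fraction of rows kept" inputs are all assumptions.

```python
from itertools import combinations


def expected_filter_sets(candidates, max_filters=3):
    """Step (b): pick the most selective candidates, return every subset.

    candidates is a list of (join_name, fraction_of_rows_kept); a lower
    fraction means a more selective filter. The empty subset stands for
    the regular path that expects no filter.
    """
    picked = sorted(candidates, key=lambda c: c[1])[:max_filters]
    names = [name for name, _ in picked]
    subsets = []
    for k in range(len(names) + 1):
        subsets.extend(frozenset(c) for c in combinations(names, k))
    return subsets


def filters_after_join(expected, join_name, is_hash_join):
    """Step (c): filters still expected above this join, or None to reject.

    A hash join satisfies the filter it builds; any other join type cannot,
    so a path expecting that filter is discarded. Unrelated filters are
    propagated upwards unchanged.
    """
    if join_name not in expected:
        return expected
    if not is_hash_join:
        return None
    return expected - {join_name}
```

With three picked filters there are 2^3 = 8 subsets, one of which is the empty set, which is consistent with the "~8 new paths" estimate above.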

There's more details in the commit message and the various comments,
along with a bunch of XXX comments about needed improvements.

I haven't incorporated the two patches posted by Andrew:

1) making it work with CustomScans

2) supporting per-key filters

3) allow eager creation of filters (disable delayed Hash build)

I agree those seem like worthwhile improvements, and the patches
seemed to be OK too, but I was focusing on reworking the planning. Based
on some off-list discussion, Andrew (or one of his colleagues) should be
able to adjust those for this v3 patch.

While I think this v3 approach is generally correct, there certainly are
enough issues to solve.

For example, I really dislike how generate_expected_filter_paths()
constructs the new paths by "cloning" the existing paths. That works,
but does not seem right - for example, the scan may be able to do
something smart with the filter (I presume this is why Andrew wanted the
per-key filters). In which case it probably would like to adjust the
costing accordingly. But with the memcpy() clone that's not possible. So
we'd need something better.

v3 also does not do much regarding the filter tradeoffs (larger filter
vs. higher false positive). Or even setting a memory limit.

In fact, I'm thinking it might be better to abstract the "filter" and
stop thinking just about Bloom filters. There probably are other kinds
of interesting filters, so maybe we should not assume all filters are
Bloom filters. Those filters probably won't be that different, it's more
about not putting "Bloom" in naming etc.

There's a lot more XXX comments, discussing all kinds of ideas (or stuff
that needs improving). I'm not going to repeat all of that here.

While working on the v3 changes, I've been looking for papers discussing
this sort of thing. Because we're certainly not the only (or first)
database considering this. Aside from the paper I already mentioned in
my previous message:

Including Bloom Filters in Bottom-up Optimization
https://arxiv.org/html/2505.02994v1

I found two more related recent papers:

Predicate Transfer: Efficient Pre-Filtering on Multi-Join Queries
https://vldb.org/cidrdb/papers/2024/p22-yang.pdf

Debunking the Myth of Join Ordering: Toward Robust SQL Analytics
https://dl.acm.org/doi/pdf/10.1145/3725283

These papers talk about "predicate transfer", but that's really the
same thing as building and pushing down Bloom filters. Because that
effectively "transfers" the predicates (WHERE conditions) from the join
to the other part of the query.

I have not read those papers in detail yet, but I have a hunch they may
have answers to some of the open questions (e.g. better heuristics for
picking the interesting filters, etc.).

regards

--
Tomas Vondra

#15

Matheus Alcantara

matheusssilv97@gmail.com

22 days ago

In reply to: Tomas Vondra (#14)

Re: hashjoins vs. Bloom filters (yet again)

On Wed Jul 1, 2026 at 10:25 AM -03, Tomas Vondra wrote:

(d) At planning / execution this works similarly to v1, except that all
the decisions were already done. The code in createplan.c and executor
is more for book-keeping and allowing lookup of filters.

I'm wondering if the adaptive probing can mess up the execution of the
chosen plan. Let's say a plan was chosen with Bloom filters pushed down
because that would eliminate e.g. 50% of the rows. What if at runtime
the Bloom filter proves to be ineffective and gets disabled, making the
scan node produce 100% of the rows for a node above that is not
expecting them? Do we end up with the same "expected vs. actual rows"
issue?

I think that if we keep using the ineffective Bloom filter the scan node
will still produce more rows than expected, but perhaps this is easier
to understand?

I haven't incorporated the two patches posted by Andrew:

1) making it work with CustomScans

2) supporting per-key filters

3) allow eager creation of filters (disable delayed Hash build)

I agree those seem like worthwhile improvements, and the patches
seemed to be OK too, but I was focusing on reworking the planning. Based
on some off-list discussion, Andrew (or one of his colleagues) should be
able to adjust those for this v3 patch.

I'm attaching a new v4 patchset incorporating Andrew's patches with test
cases. 0001 and 0002 are your v3 untouched, 0003 adds some tests to
exercise the CustomScan path, and 0004 is Andrew's changes with a few
adjustments required by the v3 version:

Unlike the v1 PoC that pushed filters down in create_hashjoin_plan
(where it could simply walk the finished plan tree and accept any scan
node), the filters are now decided during bottom-up path construction,
so a scan only receives a filter if a filter-bearing path was generated
for its base relation. So the main change is teaching path generation
about the custom scan.

In fact, I'm thinking it might be better to abstract the "filter" and
stop thinking just about Bloom filters. There probably are other kinds
of interesting filters, so maybe we should not assume all filters are
Bloom filters. Those filters probably won't be that different, it's more
about not putting "Bloom" in naming etc.

I was also thinking about this when reading v1. I checked other
databases and found that Trino [1] and DuckDB (I didn't find any
documentation for DuckDB, but you can see this information in EXPLAIN
ANALYZE output) have "Dynamic Filtering", which IIUC is an
optimization similar to what this patch is about.

I'm wondering if we could name it "Dynamic Filtering" or "Runtime filter
pushdown" or something else instead of Bloom Filters to make it more
generic.

In EXPLAIN (ANALYZE) output we could show it as "Runtime filter" along
with the columns (perhaps the values?) used by the filter.

I think this abstracts the mechanism and lets the implementation decide
whether to use a Bloom filter or not. It's a rough idea, but what do you
think?

I'm still thinking about your other points and about the XXX comments.
I'll share more soon.

[1]: https://trino.io/docs/current/admin/dynamic-filtering.html

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

#16

Tomas Vondra

tomas.vondra@2ndquadrant.com

22 days ago

In reply to: Matheus Alcantara (#15)

Re: hashjoins vs. Bloom filters (yet again)

On 7/2/26 22:31, Matheus Alcantara wrote:

On Wed Jul 1, 2026 at 10:25 AM -03, Tomas Vondra wrote:

(d) At planning / execution this works similarly to v1, except that all
the decisions were already done. The code in createplan.c and executor
is more for book-keeping and allowing lookup of filters.

I'm wondering if the adaptive probing can mess up the execution of the
chosen plan. Let's say a plan was chosen with Bloom filters pushed down
because that would eliminate e.g. 50% of the rows. What if at runtime
the Bloom filter proves to be ineffective and gets disabled, making the
scan node produce 100% of the rows for a node above that is not
expecting them? Do we end up with the same "expected vs. actual rows"
issue?

I think that if we keep using the ineffective Bloom filter the scan node
will still produce more rows than expected, but perhaps this is easier
to understand?

Yes. If we pick a filter expecting it to eliminate 99% of tuples, but
then find it does not eliminate any tuples, that'll affect the counts in
explain. But that's simply how misestimates work, it's not specific to
filter pushdown. It may happen for WHERE clauses etc.

The estimate would hit us at some point anyway - it's the selectivity of
the join, so even without the pushdown the cardinality would be off
above the join.

Also, what else could we do? I don't think we can do planning assuming
the estimates are off - we have to assume the estimates are OK. We might
consider how "safe" a given estimate is, and then maybe not push down
filters for "risky" ones, but we don't have any such capability.

I haven't incorporated the two patches posted by Andrew:

1) making it work with CustomScans

2) supporting per-key filters

3) allow eager creation of filters (disable delayed Hash build)

I agree those seem like worthwhile improvements, and the patches
seemed to be OK too, but I was focusing on reworking the planning. Based
on some off-list discussion, Andrew (or one of his colleagues) should be
able to adjust those for this v3 patch.

I'm attaching a new v4 patchset incorporating Andrew's patches with test
cases. 0001 and 0002 are your v3 untouched, 0003 adds some tests to
exercise the CustomScan path, and 0004 is Andrew's changes with a few
adjustments required by the v3 version:

Unlike the v1 PoC that pushed filters down in create_hashjoin_plan
(where it could simply walk the finished plan tree and accept any scan
node), the filters are now decided during bottom-up path construction,
so a scan only receives a filter if a filter-bearing path was generated
for its base relation. So the main change is teaching path generation
about the custom scan.

Thanks, I'll take a look early next week. My plan is to create a branch
carrying all the pieces, and then gradually refine that.

One thing I'm a bit concerned about is the CustomScan support. We don't
have a single module in the tree, so how would we test the changes? I
think it might be necessary to introduce a minimal CustomScan module, so
that we can test the new pieces.

In fact, I'm thinking it might be better to abstract the "filter" and
stop thinking just about Bloom filters. There probably are other kinds
of interesting filters, so maybe we should not assume all filters are
Bloom filters. Those filters probably won't be that different, it's more
about not putting "Bloom" in naming etc.

I was also thinking about this when reading v1. I checked other
databases and found that Trino [1] and DuckDB (I didn't find any
documentation for DuckDB, but you can see this information in EXPLAIN
ANALYZE output) have "Dynamic Filtering", which IIUC is an
optimization similar to what this patch is about.

I'm wondering if we could name it "Dynamic Filtering" or "Runtime filter
pushdown" or something else instead of Bloom Filters to make it more
generic.

In EXPLAIN (ANALYZE) output we could show it as "Runtime filter" along
with the columns (perhaps the values?) used by the filter.

I think this abstracts the mechanism and lets the implementation decide
whether to use a Bloom filter or not. It's a rough idea, but what do you
think?

Yeah. I don't think the name is that important, it's more about what
kind of flexibility is needed.

I'm still thinking about your other points and about the XXX comments.
I'll share more soon.

Thanks!

--
Tomas Vondra

#17

Matheus Alcantara

matheusssilv97@gmail.com

22 days ago

In reply to: Tomas Vondra (#16)

Re: hashjoins vs. Bloom filters (yet again)

On Fri Jul 3, 2026 at 7:10 AM -03, Tomas Vondra wrote:

I'm wondering if the adaptive probing can mess up the execution of the
chosen plan. Let's say a plan was chosen with Bloom filters pushed down
because that would eliminate e.g. 50% of the rows. What if at runtime
the Bloom filter proves to be ineffective and gets disabled, making the
scan node produce 100% of the rows for a node above that is not
expecting them? Do we end up with the same "expected vs. actual rows"
issue?

I think that if we keep using the ineffective Bloom filter the scan node
will still produce more rows than expected, but perhaps this is easier
to understand?

Yes. If we pick a filter expecting it to eliminate 99% of tuples, but
then find it does not eliminate any tuples, that'll affect the counts in
explain. But that's simply how misestimates work, it's not specific to
filter pushdown. It may happen for WHERE clauses etc.

The estimate would hit us at some point anyway - it's the selectivity of
the join, so even without the pushdown the cardinality would be off
above the join.

Also, what else could we do? I don't think we can do planning assuming
the estimates are off - we have to assume the estimates are OK. We might
consider how "safe" a given estimate is, and then maybe not push down
filters for "risky" ones, but we don't have any such capability.

Yeah, I was just worried that this may add more confusion, but after
more thinking it does not seem to be the case. I was imagining a
scenario where the expected vs. actual rows are far apart and it
could be hard to know if it was because of outdated stats or because the
adaptive probing decided to turn off the filter pushdown. But I think
that in the end it's all the same problem: the adaptive probing may turn
off the filter because it's not very effective, and that can happen
because outdated stats made the planner think the pushed-down filters
would help when they don't.

I haven't incorporated the two patches posted by Andrew:

1) making it work with CustomScans

2) supporting per-key filters

3) allow eager creation of filters (disable delayed Hash build)

I agree those seem like worthwhile improvements, and the patches
seemed to be OK too, but I was focusing on reworking the planning. Based
on some off-list discussion, Andrew (or one of his colleagues) should be
able to adjust those for this v3 patch.

I'm attaching a new v4 patchset incorporating Andrew's patches with test
cases. 0001 and 0002 are your v3 untouched, 0003 adds some tests to
exercise the CustomScan path, and 0004 is Andrew's changes with a few
adjustments required by the v3 version:

Unlike the v1 PoC that pushed filters down in create_hashjoin_plan
(where it could simply walk the finished plan tree and accept any scan
node), the filters are now decided during bottom-up path construction,
so a scan only receives a filter if a filter-bearing path was generated
for its base relation. So the main change is teaching path generation
about the custom scan.

Thanks, I'll take a look early next week. My plan is to create a branch
carrying all the pieces, and then gradually refine that.

Good, thanks.

One thing I'm a bit concerned about is the CustomScan support. We don't
have a single module in the tree, so how would we test the changes? I
think it might be necessary to introduce a minimal CustomScan module, so
that we can test the new pieces.

In 0003 I've created a new test extension that registers a CustomScan
method. It also exposes some SQL functions to assert some behaviours of
filter pushdown. I think it's still at an early stage, but it may help.
Let me know what you think.

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

#18

Tomas Vondra

tomas.vondra@2ndquadrant.com

19 days ago

In reply to: Matheus Alcantara (#15)

Re: hashjoins vs. Bloom filters (yet again)

On 7/2/26 22:31, Matheus Alcantara wrote:

On Wed Jul 1, 2026 at 10:25 AM -03, Tomas Vondra wrote:

(d) At planning / execution this works similarly to v1, except that all
the decisions were already done. The code in createplan.c and executor
is more for book-keeping and allowing lookup of filters.

I'm wondering if the adaptive probing can mess up the execution of the
chosen plan. Let's say a plan was chosen with Bloom filters pushed down
because that would eliminate e.g. 50% of the rows. What if at runtime
the Bloom filter proves to be ineffective and gets disabled, making the
scan node produce 100% of the rows for a node above that is not
expecting them? Do we end up with the same "expected vs. actual rows"
issue?

I think that if we keep using the ineffective Bloom filter the scan node
will still produce more rows than expected, but perhaps this is easier
to understand?

I haven't incorporated the two patches posted by Andrew:

1) making it work with CustomScans

2) supporting per-key filters

3) allow eager creation of filters (disable delayed Hash build)

I agree those seem like worthwhile improvements, and the patches
seemed to be OK too, but I was focusing on reworking the planning. Based
on some off-list discussion, Andrew (or one of his colleagues) should be
able to adjust those for this v3 patch.

I'm attaching a new v4 patchset incorporating Andrew's patches with test
cases. 0001 and 0002 are your v3 untouched, 0003 adds some tests to
exercise the CustomScan path, and 0004 is Andrew's changes with a few
adjustments required by the v3 version:

Unlike the v1 PoC that pushed filters down in create_hashjoin_plan
(where it could simply walk the finished plan tree and accept any scan
node), the filters are now decided during bottom-up path construction,
so a scan only receives a filter if a filter-bearing path was generated
for its base relation. So the main change is teaching path generation
about the custom scan.

I took a quick look at the v4 patches adding the earlier patches from
Andrew, and I think it looks sensible enough for a PoC. Thanks!

I think it'd make sense to split some of these into separate commits, as
it's not necessarily tied to just CustomScan nodes. And even if it is,
it can be a separate easier-to-review change.

I propose we move these two "features" into separate commits:

- "eager" creation of filters

- requesting per-key filters

I think it might be useful for other scans (or even non-scan nodes, if
we choose to allow filter pushdown for these). And it should be a
"generic" thing so that find_bloom_filter_recipient does not need to
check if the recipient is CustomScan. In fact, I think this should be
part of the information about expected filters "propagated" up during
planning.

FWIW It might make sense to show these things in the explain, at least
for development. AFAICS the CustomScan (used by the tests in 0003) does
not actually print the filters, right? And I'm not sure if the per-key
filters will be printed at all. But that's minor / easy to fix.

To move this whole thing forward, we'll need to figure out what to do
about the *big* questions. The four main ones I can think of (based on
memory and skimming the XXX comments in the patches) are:

1) selection of 'interesting' filters

Currently, we pick 'candidate' filters based solely on the selectivity,
but that only works if the scans don't use the filter in some smart way
(i.e. when it's evaluated at the very end). But if the custom scan can
do something very smart with some filters, the selectivity alone may not
be enough. The per-key filters make it even harder, I think.

2) construction of paths with filters (adjusting cost)

Right now the paths are constructed by "cloning" existing scan paths,
which is what create_filtered_scan_path does. But as already mentioned
in some XXX comments, I don't think it can work like this.

First, there's the question of costing. Do we need to adjust the cost in
some path-specific way? Maybe we don't, at least for built-in core
paths? In that case the filters are evaluated on the tuples the path
would produce, so maybe we can just add a bit of CPU cost per tuple, to
account for the evaluation of the filter (and reduce the row estimate).
If the filter is selective it'd still win, as expected.
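
As a rough illustration of that generic adjustment, here is a Python sketch. The function name and inputs are hypothetical; the only real-world value used is 0.0025, the default of PostgreSQL's cpu_operator_cost setting, and charging exactly one operator cost per tuple per filter is an assumption, not something the patch prescribes.

```python
def filtered_scan_estimate(total_cost, rows, filter_selectivities,
                           cpu_operator_cost=0.0025):
    """Return (cost, rows) for a scan path with filters applied at the end.

    filter_selectivities holds the fraction of rows each filter lets
    through. Every filter is charged once per tuple the scan produces,
    which is a simple upper bound on the evaluation cost.
    """
    extra_cost = rows * cpu_operator_cost * len(filter_selectivities)
    out_rows = rows
    for selectivity in filter_selectivities:
        out_rows *= selectivity   # filters assumed independent
    return total_cost + extra_cost, out_rows
```

For a scan costing 1000 that produces 100000 rows, one filter passing 10% of tuples adds about 250 to the cost and lowers the row estimate to about 10000, so the saving has to come from the cheaper nodes above the scan.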

But it's a bit dubious even for built-in core paths, because we'd
operate only on paths that already went through add_path and survived
the selection. Maybe we already eliminated some interesting path?

With "smart" CustomScan it's a bigger problem, though. If the scan is
doing something smart with the filters, it likely needs to adjust the
cost in a much more complex way. But the cloning does not allow that.

But more importantly, we can't copy CustomPath like this, because the
comment in pathnodes.h says:

* Core code must avoid assuming that the CustomPath is only as
* large as the structure declared here; providers are allowed to
* make it the first element in a larger structure. (Since the
* planner never copies Paths, ...

And we'd violate that. (Maybe this is a problem for the in-core scan
paths too?)

Some particular CustomPaths might be OK, but we can't know which ones
are. Maybe we could require scans with CUSTOMPATH_SUPPORT_BLOOM_FILTERS
to allow such cloning, but it seems pretty fragile.

3) pushdown to multiple consumers (scans of partitioned tables)

Right now, we only allow pushdown to a single consumer. But with
partitioned tables that won't work - each partition scan is a consumer
of the same join (unless it's a partition-wise join, probably). We need
to make that work. It probably can't happen right now, because we don't
push through Append nodes (per find_bloom_filter_recipient), but that
seems like something we should allow.

Does not seem particularly hard to address, though. The
find_bloom_filter_recipient simply needs to collect a list of recipient
nodes, not just the one.
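
A toy version of "collect a list of recipients" might look like the Python sketch below. The PlanNode class and the rule "descend only through Append, accept any node whose kind ends in Scan" are assumptions for illustration; the real find_bloom_filter_recipient works on PostgreSQL plan nodes and may apply other checks.

```python
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    """Hypothetical stand-in for a plan node."""
    kind: str                                  # e.g. "SeqScan", "Append"
    name: str = ""
    children: list = field(default_factory=list)


def collect_filter_recipients(node, recipients=None):
    """Return every scan that may receive the filter, looking through Append."""
    if recipients is None:
        recipients = []
    if node.kind.endswith("Scan"):
        recipients.append(node)
    elif node.kind == "Append":
        # Each partition scan below the Append is a consumer of the same filter.
        for child in node.children:
            collect_filter_recipients(child, recipients)
    # Any other node kind stops the search in this sketch.
    return recipients
```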

4) parallel hash joins

There's some complexity due to having to put the filter into shmem, but
the v2 patch solved that I think, and we can mostly just copy that. I
don't think we need to solve that now.

I think (3) and (4) are somewhat mechanical. I know, it can easily be
much harder than I envision. We won't know until we try.

For (1) and (2), I think we need to make this work more like how we
build paths for pathkeys. Do some sort of initial selection of
interesting filters, without trying to be very smart. So we'd maybe
check the filter discards > 30% of tuples, but without trying to keep
only the "top three" filters.

Then we'd generate the paths for all the possible combinations of
filters, not by cloning but by passing the expected filters to the
create_ function, and constructing a new path. If the scan can do
something smart - cool, it'd calculate the cost. In the worst case it'd
apply the filter at the end, with the generic cost / rows adjustment.

The paths with matching expected filters would still compete (just like
for matching pathkeys), of course.

There still need to be some limits, though. If the table has 10 joins,
we can't generate paths for each possible combination of filters, that'd
be 2^10 combinations. But there are some inherent limits too, I think.
For example, once we discard ~90% tuples it probably does not make much
sense to keep applying additional filters. If each filter discards ~50%
tuples, that's 3 filters. So even with 10 joins we'd be limited to ~120
paths (which is still a lot, but way lower than 2^10).
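
The arithmetic behind these numbers can be checked with a few lines of Python. The helper names are made up, and the filters are assumed to be independent, which real joins need not be.

```python
from math import comb


def discarded_fraction(discard_per_filter, nfilters):
    """Fraction of tuples removed by nfilters independent filters."""
    return 1.0 - (1.0 - discard_per_filter) ** nfilters


def filter_sets_of_size(njoins, nfilters):
    """Number of ways to choose exactly nfilters filters out of njoins."""
    return comb(njoins, nfilters)


def filter_sets_up_to(njoins, max_filters):
    """Number of filter sets with at most max_filters members (incl. none)."""
    return sum(comb(njoins, k) for k in range(max_filters + 1))
```

Three filters at 50% each discard 87.5% of tuples, roughly the ~90% mentioned above, and choosing exactly 3 of 10 joins gives 120 sets. Counting the smaller sets as well (including the no-filter path) gives 176, still far below 2^10 = 1024.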

We could still have some "heuristics" limits too, e.g. don't keep all
these paths, but prune them very aggressively and pick only the best
ones - just like we do now by only picking 3 most selective filters,
except that we'd do that based on proper costing (and not selectivity).

I plan to experiment with (1) and (2) soon, but it'd be great if you
could experiment with that too, and/or think about alternative ideas.

regards

--
Tomas Vondra

#19

Robert Haas

robertmhaas@gmail.com

18 days ago

In reply to: Oleg Bartunov (#10)

Re: hashjoins vs. Bloom filters (yet again)

On Wed, Jun 3, 2026 at 5:21 AM Oleg Bartunov <obartunov@postgrespro.ru> wrote:

For a plain heap scan this may mostly save hash probes. But with zone/chunk-oriented storage, where chunks have dictionaries, min/max metadata, Bloom summaries, or tenant ranges, the same runtime filter can skip whole chunks. That is the part I find most interesting: turning join-derived knowledge into scan-level pruning, against the normal direction of data flow.

Bloom is just one carrier for that knowledge. The real feature is a pluggable runtime-filter mechanism that heap, CustomScan, FDW, columnar/table AMs, partitioned storage, or chunk/cold storage can consume at the level they understand.

+1. I think it's fine if the optimizer and executor decide to do
things strictly with Bloom filters, if that turns out to be a good
technique. But if we're talking about pushing things down into table
AM we should try to be more general.

Or at least, that's my current thinking.

--
Robert Haas
EDB: http://www.enterprisedb.com

#20

Robert Haas

robertmhaas@gmail.com

18 days ago

In reply to: Tomas Vondra (#11)

Re: hashjoins vs. Bloom filters (yet again)

On Wed, Jun 3, 2026 at 6:07 AM Tomas Vondra <tomas@vondra.me> wrote:

Right, there's a general concept of a "filter", and Bloom filters are
just one example of that. And maybe we could build other types of
filters more suitable for the scan. But I think it'll still be tied to a
hash join, because what other nodes / joins can build the filter?

Even if only hash joins build Bloom filters (or some other kind of
filters), those filters can be pushed through other joins. For
example, consider:

Hash Join
  -> Nested Loop
       -> Seq Scan on a
       -> Index Scan on b
  -> Hash
       -> Seq Scan on c

If the hash join to table c builds a Bloom filter, it's perfectly
valid to test against that filter before doing the index probe into
table b.
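
As a rough illustration of that idea, here is a toy sketch in Python. It is not executor code: the BloomFilter class, join_with_pushdown, the row layouts, and the filter sizes are all hypothetical, and it assumes the join key for c is a column of a, so it is available before b is probed.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; sizes are arbitrary illustration values."""

    def __init__(self, nbits: int = 1024, nhashes: int = 3):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = bytearray((nbits + 7) // 8)

    def _positions(self, key):
        # Derive nhashes bit positions by salting one hash function.
        for seed in range(self.nhashes):
            digest = hashlib.blake2b(
                repr(key).encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "little") % self.nbits

    def add(self, key) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key) -> bool:
        # False means "definitely absent"; True may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def join_with_pushdown(a_rows, b_index, c_rows):
    """(a JOIN b) JOIN c, testing c's filter before the index probe into b.

    a_rows:  iterable of (a_id, c_key)
    b_index: dict a_id -> list of b values (stands in for the index on b)
    c_rows:  iterable of (c_key, c_value)
    Returns (result rows, number of index probes actually performed).
    """
    c_hash, bloom = {}, BloomFilter()
    for c_key, c_val in c_rows:          # build side of the hash join
        c_hash.setdefault(c_key, []).append(c_val)
        bloom.add(c_key)

    out, index_probes = [], 0
    for a_id, c_key in a_rows:           # outer side of the nested loop
        if not bloom.might_contain(c_key):
            continue                     # cannot match c: skip the probe into b
        index_probes += 1
        for b_val in b_index.get(a_id, []):
            for c_val in c_hash.get(c_key, []):
                out.append((a_id, b_val, c_val))
    return out, index_probes
```

With 1000 rows in a, of which only those with c_key below 5 match c, the result is the same as without the filter, but most index probes into b are never issued. Filter false positives only cost an extra probe, because the hash table lookup still decides the final match.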

The hard part is the planning. I feel like doing this at path-to-plan
time, after costing decisions have been made, is probably going to be
a tough sell. In a plan tree with many joins, pushing down the Bloom
filtering (or other filtering) decision to the lowest possible level
could drastically change the row count estimates, and thus the
costing, for a whole bunch of intermediate nodes, potentially meaning
that the best plan is something quite different from what we
estimated. On the other hand, if we try to do the "right" thing at
planning time -- meaning constructing separate paths for each possible
place where we could put the filter -- so that we can do costing
properly, that sounds super-expensive, which will be really bad for
queries that are supposed to have short execution times, and also not
great for cases where the runtime is longer but our estimates are
inaccurate, so that the extra planning effort is an expensive way of
deciding what to do at random.

I haven't had time to read the paper to which you linked, but I do
sort of feel like knowing what the practical experience has been in
other systems might be important. Theoretically there are good
arguments for and against additional planning effort, but what happens
in practice is not clear to me.

--
Robert Haas
EDB: http://www.enterprisedb.com

#21

Tomas Vondra

tomas.vondra@2ndquadrant.com

18 days ago

In reply to: Robert Haas (#20)

#22

Robert Haas

robertmhaas@gmail.com

17 days ago

In reply to: Tomas Vondra (#21)

#23