A better way than tweaking NTUP_PER_BUCKET
Previous discussions of Hash Joins have noted that the performance
decreases when the average number of tuples per bucket increases.
O(N^2) effects are seen.
We've argued about this many different ways, yet all of those
discussions have centred around the constant NTUP_PER_BUCKET. I
believe that was a subtle mistake and I have a proposal.
The code in ExecChooseHashTableSize() line 460 says
/*
* Set nbuckets to achieve an average bucket load of NTUP_PER_BUCKET when
* memory is filled.
...
but the calculation then sets the number of buckets like this
dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
**This doesn't match the comment.** If we underestimate the number
of tuples and go on to fill the available memory, we then end up with
an average number of tuples per bucket higher than NTUP_PER_BUCKET. A
notational confusion that has been skewing the discussion.
The correct calculation that would match the objective set out in the
comment would be
dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
Which leads us to a much more useful value of dbuckets in the case
where using ntuples occupies much less space than is available. This
value is always the same or higher than previously because of the if-test
that surrounds it.
Given my experience that on larger tables we end up underestimating
ndistinct by 10-100-1000 times, I don't think this change is
unwarranted.
This solves the problem in earlier discussions since we get a lower
average number of tuples per bucket and yet we also get to keep the
current NTUP_PER_BUCKET value. Everybody wins.
Comments?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
rethink_num_hash_buckets.v1.patch (application/octet-stream)
Simon,
* Simon Riggs (simon@2ndQuadrant.com) wrote:
Previous discussions of Hash Joins have noted that the performance
decreases when the average number of tuples per bucket increases.
Having the hash table so small that we have hash bucket collisions with
different 32bit hash values is extremely painful, however..
The correct calculation that would match the objective set out in the
comment would be
dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
This looks to be driving the size of the hash table size off of "how
many of this size tuple can I fit into memory?" and ignoring how many
actual rows we have to hash. Consider a work_mem of 1GB with a small
number of rows to actually hash- say 50. With a tupsize of 8 bytes,
we'd be creating a hash table sized for some 13M buckets.
This solves the problem in earlier discussions since we get a lower
average number of tuples per bucket and yet we also get to keep the
current NTUP_PER_BUCKET value. Everybody wins.
I agree that this should reduce the number of tuples per bucket, but I
don't think it's doing what you expect and based on what it's doing, it
seems wasteful.
Comments?
Based on your argument that we want to have a bucket load which is, on
average, the size of NTUP_PER_BUCKET, I have to '-1' on this. What we
want is to size a table large enough that we never have any true
collisions (cases where different 32-bit hash values end up in the same
bucket) *for all tuples being hashed*, that includes both the side
building the hash table and the side doing the scan. This should be
done when we can avoid batching- if we don't have enough to avoid
batching while having such a large table, we should consider adjusting
the hash table size down some because, in those cases, we're memory
constrained.
When we have enough memory to avoid batching, we never want to have to
check down through a bucket for a tuple which doesn't exist- we should
simply be able to look in the hash table itself, determine (with a
single fetch) that there are no entries for that hash value, and throw
the tuple away and move on to the next.
Thanks,
Stephen
On 22 June 2013 14:48, Stephen Frost <sfrost@snowman.net> wrote:
Based on your argument that we want to have a bucket load which is, on
average, the size of NTUP_PER_BUCKET, I have to '-1' on this.
That is not my argument. I am pointing out that the comments claim the
code does that, and yet the code does not.
So either we apply the patch to make the code fit the comment, or we
change the comment.
Tom's wish for an average tuples per bucket of 10 may be connected to
that error; I can't say. But we definitely don't end up with an
average of 10 in many cases, which is the basis for previous poor
performance reports.
What we
want is to size a table large enough that we never have any true
collisions (cases where different 32-bit hash values end up in the same
bucket) *for all tuples being hashed*, that includes both the side
building the hash table and the side doing the scan. This should be
done when we can avoid batching- if we don't have enough to avoid
batching while having such a large table, we should consider adjusting
the hash table size down some because, in those cases, we're memory
constrained.
When we have enough memory to avoid batching, we never want to have to
check down through a bucket for a tuple which doesn't exist- we should
simply be able to look in the hash table itself, determine (with a
single fetch) that there are no entries for that hash value, and throw
the tuple away and move on to the next.
The bottom line is that you're hoping for perfection. If we knew
exactly how many tuples were in the hash table then we could size it
precisely for the hash join we are about to perform. We don't know
that, so we can't size it precisely. Worse than that is that we know
we badly estimate the number of buckets on a regular basis.
That gives us three choices: if we have a hash table fixed without
prior knowledge then we either (0) continue to underallocate buckets
in many cases or (1) potentially overallocate them. Or (2) we learn
to cope better with the variation in the number of entries. If we rule
out the first two we have only the last one remaining.
(1) will always be a heuristic, and as you point out could itself be
an extreme setting.
So I think that (2) is the best route: Given that we know with much
better certainty the number of rows in the scanned-relation, we should
be able to examine our hash table after it has been built and decide
whether it would be cheaper to rebuild the hash table with the right
number of buckets, or continue processing with what we have now. Which
is roughly what Heikki proposed already, in January.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sat, Jun 22, 2013 at 9:15 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Previous discussions of Hash Joins have noted that the performance
decreases when the average number of tuples per bucket increases.
O(N^2) effects are seen.
We've argued about this many different ways, yet all of those
discussions have centred around the constant NTUP_PER_BUCKET. I
believe that was a subtle mistake and I have a proposal.
The code in ExecChooseHashTableSize() line 460 says
/*
* Set nbuckets to achieve an average bucket load of NTUP_PER_BUCKET when
* memory is filled.
...
but the calculation then sets the number of buckets like this
dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
**This doesn't match the comment.** If we underestimate the number
of tuples and go on to fill the available memory, we then end up with
an average number of tuples per bucket higher than NTUP_PER_BUCKET. A
notational confusion that has been skewing the discussion.
The correct calculation that would match the objective set out in the
comment would be
dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
Which leads us to a much more useful value of dbuckets in the case
where using ntuples occupies much less space than is available. This
value is always the same or higher than previously because of the if-test
that surrounds it.
+1. I think this is very clever.
The other thing that seems to be distorting things here is that this
function doesn't account for the memory consumed by the bucket array.
So as things stand today, changing NTUP_PER_BUCKET can't ever increase
the required number of batches. But that's really nonsensical.
Setting NTUP_PER_BUCKET to a smaller value like 1 or even, in effect,
a value less than 1 would probably improve performance for large hash
tables, but there's no way to decide what value is too expensive
because the memory cost of changing it isn't accounted for in the
first place.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jun 22, 2013 at 9:48 AM, Stephen Frost <sfrost@snowman.net> wrote:
The correct calculation that would match the objective set out in the
comment would be
dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
This looks to be driving the size of the hash table size off of "how
many of this size tuple can I fit into memory?" and ignoring how many
actual rows we have to hash. Consider a work_mem of 1GB with a small
number of rows to actually hash- say 50. With a tupsize of 8 bytes,
we'd be creating a hash table sized for some 13M buckets.
This is a fair point, but I still think Simon's got a good point, too.
Letting the number of buckets ramp up when there's ample memory seems
like a broadly sensible strategy. We might need to put a floor on the
effective load factor, though.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Saturday, June 22, 2013, Simon Riggs wrote:
On 22 June 2013 14:48, Stephen Frost <sfrost@snowman.net> wrote:
Based on your argument that we want to have a bucket load which is, on
average, the size of NTUP_PER_BUCKET, I have to '-1' on this.
That is not my argument. I am pointing out that the comments claim the
code does that, and yet the code does not.
To be honest, I don't read that comment quite the same way as you do.
Tom's wish for an average tuples per bucket of 10 may be connected to
that error; I can't say. But we definitely don't end up with an
average of 10 in many cases, which is the basis for previous poor
performance reports.
I don't believe Tom wishes for an average of 10.. That's just what
NTUP_PER_BUCKET has always been set to.
The bottom line is that you're hoping for perfection. If we knew
exactly how many tuples were in the hash table then we could size it
precisely for the hash join we are about to perform. We don't know
that, so we can't size it precisely. Worse than that is that we know
we badly estimate the number of buckets on a regular basis.
I'm actually not trying for perfection. What I think we should start with
is to at least not shoot ourselves in the foot by cutting the bucket size
to one-tenth of our distinct tuple estimate and virtually guaranteeing lots
of true collisions which are expensive.
That gives us three choices: if we have a hash table fixed without
prior knowledge then we either (0) continue to underallocate buckets
in many cases or (1) potentially overallocate them. Or (2) we learn
to cope better with the variation in the number of entries. If we rule
out the first two we have only the last one remaining.
(1) will always be a heuristic, and as you point out could itself be
an extreme setting.
A heuristic based on our estimates is a good choice, imv. I do think we
can do better than we are now though.
So I think that (2) is the best route: Given that we know with much
better certainty the number of rows in the scanned-relation, we should
be able to examine our hash table after it has been built and decide
whether it would be cheaper to rebuild the hash table with the right
number of buckets, or continue processing with what we have now. Which
is roughly what Heikki proposed already, in January.
I'm actually not a huge fan of this as it's certainly not cheap to do. If
it can be shown to be better than an improved heuristic then perhaps it
would work but I'm not convinced.
One other way to address having a smaller actual hash table is to come up
with another way to detect empty parts of the hash space, perhaps by using
a bitmap to represent the hash space (obviously the individual bits would
represent some chunk of the 32bit hash space, perhaps as much as 1/64'th,
allowing our bitmap to fit into a 64bit int; that may end up being too
small to actually help though). This would mean that when we did get to
scanning a bucket we would at least be much more likely to find a match
instead of scanning it and finding nothing.
Thanks,
Stephen
On 22.06.2013 19:19, Simon Riggs wrote:
So I think that (2) is the best route: Given that we know with much
better certainty the number of rows in the scanned-relation, we should
be able to examine our hash table after it has been built and decide
whether it would be cheaper to rebuild the hash table with the right
number of buckets, or continue processing with what we have now. Which
is roughly what Heikki proposed already, in January.
Back in January, I wrote a quick patch to experiment with rehashing when
the hash table becomes too full. It was too late to make it into 9.3 so
I didn't pursue it further back then, but IIRC it worked. If we have the
capability to rehash, the accuracy of the initial guess becomes much
less important.
- Heikki
Attachments:
rehash-hashjoin-1.patch (text/x-diff)
On 22 June 2013 21:40, Stephen Frost <sfrost@snowman.net> wrote:
I'm actually not a huge fan of this as it's certainly not cheap to do. If it
can be shown to be better than an improved heuristic then perhaps it would
work but I'm not convinced.
We need two heuristics, it would seem:
* an initial heuristic to overestimate the number of buckets when we
have sufficient memory to do so
* a heuristic to determine whether it is cheaper to rebuild a dense
hash table into a better one.
Although I like Heikki's rebuild approach we can't do this every x2
overstretch. Given large underestimates exist we'll end up rehashing
5-12 times, which seems bad. Better to let the hash table build and
then re-hash once, if we can see it will be useful.
OK?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Saturday, June 22, 2013, Heikki Linnakangas wrote:
On 22.06.2013 19:19, Simon Riggs wrote:
So I think that (2) is the best route: Given that we know with much
better certainty the number of rows in the scanned-relation, we should
be able to examine our hash table after it has been built and decide
whether it would be cheaper to rebuild the hash table with the right
number of buckets, or continue processing with what we have now. Which
is roughly what Heikki proposed already, in January.
Back in January, I wrote a quick patch to experiment with rehashing when
the hash table becomes too full. It was too late to make it into 9.3 so I
didn't pursue it further back then, but IIRC it worked. If we have the
capability to rehash, the accuracy of the initial guess becomes much less
important.
What we're hashing isn't going to change mid-way through or be updated
after we've started doing lookups against it.
Why not simply scan and queue the data and then build the hash table right
the first time? Also, this patch doesn't appear to address dups and
therefore would rehash unnecessarily. There's no point rehashing into more
buckets if the buckets are only deep due to lots of dups. Figuring out how
many distinct values there are, in order to build the best hash table, is
actually pretty expensive compared to how quickly we can build the table
today. Lastly, this still encourages collisions due to too few buckets. If
we would simply start with more buckets outright we'd reduce the need to
rehash..
Thanks,
Stephen
On Saturday, June 22, 2013, Simon Riggs wrote:
On 22 June 2013 21:40, Stephen Frost <sfrost@snowman.net <javascript:;>>
wrote:I'm actually not a huge fan of this as it's certainly not cheap to do.
If it
can be shown to be better than an improved heuristic then perhaps it
would
work but I'm not convinced.
We need two heuristics, it would seem:
* an initial heuristic to overestimate the number of buckets when we
have sufficient memory to do so
* a heuristic to determine whether it is cheaper to rebuild a dense
hash table into a better one.
Although I like Heikki's rebuild approach we can't do this every x2
overstretch. Given large underestimates exist we'll end up rehashing
5-12 times, which seems bad. Better to let the hash table build and
then re-hash once, if we can see it will be useful.
OK?
I've been thinking a bit more on your notion of simply using as much memory
as we're permitted, but maybe adjust it down based on how big the input to
the hash table is (which I think we have better stats on, and even if we
don't, we could easily keep track of how many tuples we've seen and
consider rehashing as we go). Still doesn't really address the issue of
dups though. It may still be much larger than it should be if there's a lot
of duplicates in the input that hash into a much smaller set of buckets.
Will think on it more.
Thanks,
Stephen
On 23 June 2013 03:16, Stephen Frost <sfrost@snowman.net> wrote:
Still doesn't really address the issue of dups though.
Checking for duplicates in all cases would be wasteful, since often we
are joining to the PK of a smaller table.
If duplicates are possible at all for a join, then it would make sense
to build the hash table more carefully to remove dupes. I think we
should treat that as a separate issue.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 23 June 2013 03:16, Stephen Frost <sfrost@snowman.net> wrote:
Will think on it more.
Some other thoughts related to this...
* Why are we building a special kind of hash table? Why don't we just
use the hash table code that we use in every other place in the backend.
If that code is so bad why do we use it everywhere else? That is
extensible, so we could try just using that. (Has anyone actually
tried?)
* We're not thinking about cache locality and set correspondence
either. If the join is expected to hardly ever match, then we should
be using a bitmap as a bloom filter rather than assuming that a very
large hash table is easily accessible.
* The skew hash table will be hit frequently and would show good L2
cache usage. I think I'll try adding the skew table always to see if
that improves the speed of the hash join.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sunday, June 23, 2013, Simon Riggs wrote:
On 23 June 2013 03:16, Stephen Frost <sfrost@snowman.net> wrote:
Still doesn't really address the issue of dups though.
Checking for duplicates in all cases would be wasteful, since often we
are joining to the PK of a smaller table.
Well, that's what ndistinct is there to help us figure out. If we don't
trust that though...
If duplicates are possible at all for a join, then it would make sense
to build the hash table more carefully to remove dupes. I think we
should treat that as a separate issue.
We can't simply remove the dups... We have to return all the matching dups
in the join. I did write a patch which created a 2-level list structure
where the first level was uniques and the 2nd was dups, but it was
extremely expensive to create the hash table then and scanning it wasn't
much cheaper.
Thanks,
Stephen
On Sunday, June 23, 2013, Simon Riggs wrote:
On 23 June 2013 03:16, Stephen Frost <sfrost@snowman.net> wrote:
Will think on it more.
Some other thoughts related to this...
* Why are we building a special kind of hash table? Why don't we just
use the hash table code that we use in every other place in the backend.
If that code is so bad why do we use it everywhere else? That is
extensible, so we could try just using that. (Has anyone actually
tried?)
I've not looked at the hash table in the rest of the backend.
* We're not thinking about cache locality and set correspondence
either. If the join is expected to hardly ever match, then we should
be using a bitmap as a bloom filter rather than assuming that a very
large hash table is easily accessible.
That's what I was suggesting earlier, though I don't think it's technically
a bloom filter- doesn't that require multiple hash functions? I don't think
we want to require every data type to provide multiple hash functions.
* The skew hash table will be hit frequently and would show good L2
cache usage. I think I'll try adding the skew table always to see if
that improves the speed of the hash join.
The skew table is just for common values though... To be honest, I have
some doubts about that structure really being a terribly good approach for
anything which is completely in memory.
Thanks,
Stephen
On 23.06.2013 01:48, Simon Riggs wrote:
On 22 June 2013 21:40, Stephen Frost <sfrost@snowman.net> wrote:
I'm actually not a huge fan of this as it's certainly not cheap to do. If it
can be shown to be better than an improved heuristic then perhaps it would
work but I'm not convinced.
We need two heuristics, it would seem:
* an initial heuristic to overestimate the number of buckets when we
have sufficient memory to do so
* a heuristic to determine whether it is cheaper to rebuild a dense
hash table into a better one.
Although I like Heikki's rebuild approach we can't do this every x2
overstretch. Given large underestimates exist we'll end up rehashing
5-12 times, which seems bad.
It's not very expensive. The hash values of all tuples have already been
calculated, so rebuilding just means moving the tuples to the right bins.
Better to let the hash table build and
then re-hash once, if we can see it will be useful.
That sounds even less expensive, though.
- Heikki
On Sun, Jun 23, 2013 at 8:41 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 23.06.2013 01:48, Simon Riggs wrote:
On 22 June 2013 21:40, Stephen Frost <sfrost@snowman.net> wrote:
I'm actually not a huge fan of this as it's certainly not cheap to do. If
it can be shown to be better than an improved heuristic then perhaps it
would work but I'm not convinced.
We need two heuristics, it would seem:
* an initial heuristic to overestimate the number of buckets when we
have sufficient memory to do so
* a heuristic to determine whether it is cheaper to rebuild a dense
hash table into a better one.
Although I like Heikki's rebuild approach we can't do this every x2
overstretch. Given large underestimates exist we'll end up rehashing
5-12 times, which seems bad.
It's not very expensive. The hash values of all tuples have already been
calculated, so rebuilding just means moving the tuples to the right bins.
Better to let the hash table build and
then re-hash once, if we can see it will be useful.
That sounds even less expensive, though.
Hi all,
I just popped in here on Simon's advice to put an idea I had about
optimizing hash joins on this thread.
Essentially, I was thinking of using bloom filters in the part where
we match the actual hash values of the outer relation's tuple and the
hash table. We could do a bloom filter lookup first, and if it gives a
positive, then we can proceed like we do currently, since we could
have a false positive. However, if we have a negative from the bloom
filter, then, we can skip that tuple since bloom filters never give
false negatives.
Another thing we could potentially look at is that in the case when
the hash table doesn't fit into memory and we have to do disk lookups,
we could do bloom filter lookups first in order to select the
tuples to be read from disk (only the positive ones).
Thoughts/Comments please?
Regards,
Atri
--
Regards,
Atri
l'apprenant
Atri,
* Atri Sharma (atri.jiit@gmail.com) wrote:
I just popped in here on Simon's advice to put an idea I had about
optimizing hash joins on this thread.
I'd encourage reading the thread a bit first, in the future.. :)
Essentially, I was thinking of using bloom filters in the part where
we match the actual hash values of the outer relation's tuple and the
hash table. We could do a bloom filter lookup first, and if it gives a
positive, then we can proceed like we do currently, since we could
have a false positive. However, if we have a negative from the bloom
filter, then, we can skip that tuple since bloom filters never give
false negatives.
I suggested this up-thread already, but it's not really a bloom filter
as there's only one hash function available- I can't see us requiring
every data type to provide multiple hash functions. Still, I do think
breaking the single 32-bit hash key space up into fixed-sized chunks and
then having a bitfield array which we test against (very similar to how
the visibility map works) to see if there's any chance that a given hash
key exists might be valuable. The problem is that, because we don't
have multiple hash functions, it's not clear how much "empty" space we'd
actually end up with.
This needs to be tested with different data sets and different sizes for
the bitfield (leading to correspondingly different sizes for the amount
of key space covered by a single bit), along with performance metrics
which determine how expensive it is to build and use the bitfield.
Another thing we could potentially look at is that in the case when
the hash table doesn't fit into memory and we have to do disk lookups,
we could do bloom filter lookups first in order to select the
tuples to be read from disk (only the positive ones).
We could have a bitfield filter (as I described above) created for each
bucket and then test against that before considering if we actually have
to go look in that bucket, yes. I'm not sure if that's quite what you
were thinking, but I can see how a bitfield per bucket might work. If
you were suggesting something else, please clarify.
Thanks,
Stephen
On Wed, Jun 26, 2013 at 7:12 PM, Stephen Frost <sfrost@snowman.net> wrote:
Atri,
* Atri Sharma (atri.jiit@gmail.com) wrote:
I just popped in here on Simon's advice to put an idea I had about
optimizing hash joins on this thread.
I'd encourage reading the thread a bit first, in the future.. :)
Yeah, I actually read a bit (admittedly, not much) of the above thread. I
was following it a bit as well.
I suggested this up-thread already, but it's not really a bloom filter
as there's only one hash function available- I can't see us requiring
every data type to provide multiple hash functions. Still, I do think
breaking the single 32-bit hash key space up into fixed-sized chunks and
then having a bitfield array which we test against (very similar to how
the visibility map works) to see if there's any chance that a given hash
key exists might be valuable. The problem is that, because we don't
have multiple hash functions, it's not clear how much "empty" space we'd
actually end up with.
Agreed.
We could have a bitfield filter (as I described above) created for each
bucket and then test against that before considering if we actually have
to go look in that bucket, yes. I'm not sure if that's quite what you
were thinking, but I can see how a bitfield per bucket might work. If
you were suggesting something else, please clarify.
Yeah, this is what I wanted.
My point is that I would like to help in the implementation, if possible. :)
Regards,
Atri
--
Regards,
Atri
l'apprenant
* Atri Sharma (atri.jiit@gmail.com) wrote:
My point is that I would like to help in the implementation, if possible. :)
Feel free to go ahead and implement it.. I'm not sure when I'll have a
chance to (probably not in the next week or two anyway). Unfortunately,
the bigger issue here is really about testing the results and
determining if it's actually faster/better with various data sets
(including ones which have duplicates). I've got one test data set
which has some interesting characteristics (for one thing, hashing the
"large" side and then seq-scanning the "small" side is actually faster
than going the other way, which is quite 'odd' imv for a hashing
system): http://snowman.net/~sfrost/test_case2.sql
You might also look at the other emails that I sent regarding this
subject and NTUP_PER_BUCKET. Having someone confirm what I saw wrt
changing that parameter would be nice and it would be a good comparison
point against any kind of pre-filtering that we're doing.
One thing that re-reading the bloom filter description reminded me of is
that it's at least conceivable that we could take the existing hash
functions for each data type and do double-hashing or perhaps seed the
value to be hashed with additional data to produce an "independent" hash
result to use. Again, a lot of things that need to be tested and
measured to see if they improve overall performance.
Thanks,
Stephen
On Wed, Jun 26, 2013 at 9:20 PM, Stephen Frost <sfrost@snowman.net> wrote:
* Atri Sharma (atri.jiit@gmail.com) wrote:
My point is that I would like to help in the implementation, if possible. :)
Feel free to go ahead and implement it.. I'm not sure when I'll have a
chance to (probably not in the next week or two anyway). Unfortunately,
the bigger issue here is really about testing the results and
determining if it's actually faster/better with various data sets
(including ones which have duplicates). I've got one test data set
which has some interesting characteristics (for one thing, hashing the
"large" side and then seq-scanning the "small" side is actually faster
than going the other way, which is quite 'odd' imv for a hashing
system): http://snowman.net/~sfrost/test_case2.sql
You might also look at the other emails that I sent regarding this
subject and NTUP_PER_BUCKET. Having someone confirm what I saw wrt
changing that parameter would be nice and it would be a good comparison
point against any kind of pre-filtering that we're doing.
One thing that re-reading the bloom filter description reminded me of is
that it's at least conceivable that we could take the existing hash
functions for each data type and do double-hashing or perhaps seed the
value to be hashed with additional data to produce an "independent" hash
result to use. Again, a lot of things that need to be tested and
measured to see if they improve overall performance.
Right, let me look. I am pretty busy atm with ordered set functions,
though, so I will get it done maybe in the last week of this month.
Another thing I believe in is that we should have multiple hashing
functions for bloom filters, which generate different probability
values so that the coverage is good.
Regards,
Atri
--
Regards,
Atri
l'apprenant