Memory-Bounded Hash Aggregation

Started by Jeff Davis, almost 7 years ago, 81 messages, pgsql-hackers
#1 Jeff Davis
pgsql@j-davis.com

This is for design review. I have a patch (WIP) for Approach 1, and if
this discussion starts to converge on that approach I will polish and
post it.

Let's start at the beginning: why do we have two strategies -- hash
and sort -- for aggregating data? The two are more similar than they
first appear. A partitioned hash strategy writes randomly among the
partitions, and later reads the partitions sequentially; a sort will
write sorted runs sequentially, but then read among the runs
randomly during the merge phase. A hash is a convenient small
representation of the data that is cheaper to operate on; sort uses
abbreviated keys for the same reason.

Hash offers:

* Data is aggregated on-the-fly, effectively "compressing" the amount
of data that needs to go to disk. This is particularly important
when the data contains skewed groups (see below).

* Can output some groups after the first pass of the input data even
if other groups spilled.

* Some data types only support hashing; not sorting.

Sort+Group offers:

* Only one group is accumulating at once, so if the transition state
grows (like with ARRAY_AGG), it minimizes the memory needed.

* The input may already happen to be sorted.

* Some data types only support sorting; not hashing.

Currently, Hash Aggregation is only chosen if the optimizer believes
that all the groups (and their transition states) fit in
memory. Unfortunately, if the optimizer is wrong (often the case if the
input is not a base table), the hash table will
keep growing beyond work_mem, potentially bringing the entire system
to OOM. This patch fixes that problem by extending the Hash
Aggregation strategy to spill to disk when needed.

Previous discussions:

/messages/by-id/1407706010.6623.16.camel@jeff-desktop

/messages/by-id/1419326161.24895.13.camel@jeff-desktop

/messages/by-id/87be3bd5-6b13-d76e-5618-6db0a4db584d@iki.fi

A lot was discussed, which I will try to summarize and address here.

Digression: Skewed Groups:

Imagine the input tuples have the following grouping keys:

0, 1, 0, 2, 0, 3, 0, 4, ..., 0, N-1, 0, N

Group 0 is a skew group because it consists of 50% of all tuples in
the table, whereas every other group has a single tuple. If the
algorithm is able to keep group 0 in memory the whole time until
finalized, that means that it doesn't have to spill any group-0
tuples. In this example, that would amount to a 50% savings, and is a
major advantage of Hash Aggregation versus Sort+Group.
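The arithmetic behind that claim can be checked with a toy simulation (hypothetical, not PostgreSQL code): if group 0 entered the hash table before it filled and stays resident, only the singleton-group tuples ever need to spill.

```python
# Input 0, 1, 0, 2, ..., 0, N: group 0 holds half of all tuples.
N = 1000
keys = [k for i in range(1, N + 1) for k in (0, i)]   # 0,1,0,2,...,0,N

resident = {0}                      # group 0 entered the table first
spilled = [k for k in keys if k not in resident]

assert len(keys) == 2 * N
assert len(spilled) == N            # exactly half the input spills
```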

High-level approaches:

1. When the in-memory hash table fills, keep existing entries in the
hash table, and spill the raw tuples for all new groups in a
partitioned fashion. When all input tuples are read, finalize groups
in memory and emit. Now that the in-memory hash table is cleared (and
memory context reset), process a spill file the same as the original
input, but this time with a fraction of the group cardinality.
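Approach 1 can be sketched as follows (hypothetical and heavily simplified: the "tuple" is just its grouping key and the transition state is a count; a real implementation would use different hash bits at each recursion level to keep re-partitioning effective):

```python
NPARTITIONS = 4

def hashagg(tuples, limit):
    table = {}
    partitions = [[] for _ in range(NPARTITIONS)]
    for key in tuples:
        if key in table:
            table[key] += 1                  # advance existing transition state
        elif len(table) < limit:
            table[key] = 1                   # new group still fits in memory
        else:
            partitions[hash(key) % NPARTITIONS].append(key)  # spill raw tuple
    yield from table.items()                 # finalize and emit in-memory groups
    for part in partitions:                  # "memory context reset", then
        if part:                             # process each spill file like the
            yield from hashagg(part, limit)  # original input

result = dict(hashagg([0, 1, 0, 2, 0, 3, 0, 4], limit=2))
assert result == {0: 4, 1: 1, 2: 1, 3: 1, 4: 1}
# group 0 was seen before the table filled, so none of its tuples spilled
```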

2. When the in-memory hash table fills, partition the hash space, and
evict the groups from all partitions except one by writing out their
partial aggregate states to disk. Any input tuples belonging to an
evicted partition get spilled to disk. When the input is read
entirely, finalize the groups remaining in memory and emit. Now that
the in-memory hash table is cleared, process the next partition by
loading its partial states into the hash table, and then processing
its spilled tuples.
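Under the same toy assumptions, Approach 2 looks roughly like this (a sketch, not the real design): on overflow, keep one hash-space partition resident, write the other partitions' partial aggregate states out, and spill their later tuples.

```python
NPARTITIONS = 4

def hashagg2(tuples, limit):
    table = {}
    resident = set(range(NPARTITIONS))
    saved = {p: [] for p in range(NPARTITIONS)}    # evicted partial states
    spilled = {p: [] for p in range(NPARTITIONS)}  # raw tuples of evicted parts
    for key in tuples:
        p = hash(key) % NPARTITIONS
        if p not in resident:
            spilled[p].append(key)                 # belongs to an evicted partition
        elif key in table or len(table) < limit:
            table[key] = table.get(key, 0) + 1
        else:
            resident = {p}                         # evict all partitions but p
            for k in list(table):
                if hash(k) % NPARTITIONS != p:
                    saved[hash(k) % NPARTITIONS].append((k, table.pop(k)))
            table[key] = table.get(key, 0) + 1
    yield from table.items()                       # finalize resident partition
    for p in range(NPARTITIONS):
        if p in resident:
            continue
        table = dict(saved[p])                     # reload partial states
        for key in spilled[p]:                     # then replay spilled tuples
            table[key] = table.get(key, 0) + 1
        yield from table.items()

result = dict(hashagg2([0, 1, 0, 2, 0, 3, 0, 4], limit=2))
assert result == {0: 4, 1: 1, 2: 1, 3: 1, 4: 1}
# note the skew downside: the big group 0 was evicted early, so all of its
# remaining tuples had to spill
```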

3. Use some kind of hybrid[1][2] of hashing and sorting.

Evaluation of approaches:

Approach 1 is a nice incremental improvement on today's code. The
final patch may be around 1KLOC. It's a single kind of on-disk data
(spilled tuples), and a single algorithm (hashing). It also handles
skewed groups well because the skewed groups are likely to be
encountered before the hash table fills up the first time, and
therefore will stay in memory.

Approach 2 is nice because it resembles the approach of Hash Join, and
it can determine whether a tuple should be spilled without a hash
lookup. Unfortunately, those upsides are fairly mild, and it has
significant downsides:

* It doesn't handle skew values well because it's likely to evict
them.

* If we leave part of the hash table in memory, it's difficult to
ensure that we will be able to actually use the space freed by
eviction, because the freed memory may be fragmented. That could
force us to evict the entire in-memory hash table as soon as we
partition, reducing a lot of the benefit of hashing.

* It requires eviction for the algorithm to work. That may be
necessary for handling cases like ARRAY_AGG (see below) anyway, but
this approach constrains the specifics of eviction.

Approach 3 is interesting because it unifies the two approaches and
can get some of the benefits of both. It's only a single path, so it
avoids planner mistakes. I really like this idea and it's possible we
will end up with approach 3. However:

* It requires that all data types support sorting, or that we punt
somehow.

* Right now we are in a weird state because hash aggregation cheats,
so it's difficult to evaluate whether Approach 3 is moving us in the
right direction because we have no other correct implementation to
compare against. Even if Approach 3 is where we end up, it seems
like we should fix hash aggregation as a stepping stone first.

* It means we have a hash table and sort running concurrently, each
using memory. Andres said this might not be a problem[3], but I'm
not convinced that the problem is zero. If you use small work_mem
for the write phase of sorting, you'll end up with a lot of runs to
merge later and that has some kind of cost.

* The simplicity might start to evaporate when we consider grouping
sets and eviction strategy.

Main topics to consider:

ARRAY_AGG:

Some aggregates, like ARRAY_AGG, have a transition state that grows
proportionally with the group size. In other words, it is not a
summary like COUNT or AVG, it contains all of the input data in a new
form.

These aggregates are not a good candidate for hash aggregation. Hash
aggregation is about keeping many transition states running in
parallel, which is just a bad fit for large transition states. Sorting
is better because it advances one transition state at a time. We could:

* Let ARRAY_AGG continue to exceed work_mem like today.

* Block or pessimize use of hash aggregation for such aggregates.

* Evict groups from the hash table when it becomes too large. This
requires the ability to serialize and deserialize transition states,
and some approaches here might also need combine_func
specified. These requirements seem reasonable, but we still need
some answer of what to do for aggregates that grow like ARRAY_AGG
but don't have the required serialfunc, deserialfunc, or
combine_func.
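The third option can be sketched as a toy model (hypothetical; the real serialfunc/deserialfunc/combine_func are per-aggregate SQL-level functions, stood in for here by pickle and list concatenation):

```python
import pickle

def array_agg(tuples, limit):
    table, disk = {}, []
    for key, val in tuples:
        table.setdefault(key, []).append(val)      # state grows with group size
        if sum(len(s) for s in table.values()) >= limit:
            disk.append(pickle.dumps(table))       # "serialfunc": write all states
            table = {}                             # memory context reset
    result = {}
    for blob in disk + [pickle.dumps(table)]:
        partial = pickle.loads(blob)               # "deserialfunc"
        for key, state in partial.items():
            result.setdefault(key, []).extend(state)  # "combine_func"
    return result

assert array_agg([("a", 1), ("b", 2), ("a", 3)], limit=2) == {"a": [1, 3], "b": [2]}
```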

GROUPING SETS:

With grouping sets, there are multiple hash tables and each hash table
has its own hash function, so that makes partitioning more
complex. In Approach 1, that means we need to either (a) not partition
the spilled tuples; or (b) have a different set of partitions for each
hash table and spill the same tuple multiple times. In Approach 2, we
would be required to partition each hash table separately and spill
tuples multiple times. In Approach 3 (depending on the exact approach
but taking a guess here) we would need to add a set of phases (one
extra phase for each hash table) for spilled tuples.
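Option (b) for Approach 1 can be sketched like this (hypothetical names; each grouping set keys the tuple differently, so one spilled tuple is written once per grouping set, into that set's own partitions):

```python
NPARTITIONS = 4
GROUPING_SETS = [                      # e.g. GROUPING SETS ((a), (b), (a, b))
    lambda t: (t[0],),
    lambda t: (t[1],),
    lambda t: (t[0], t[1]),
]

def spill_tuple(tup, spill_files):
    for i, keyfunc in enumerate(GROUPING_SETS):
        p = hash(keyfunc(tup)) % NPARTITIONS
        spill_files[i][p].append(tup)  # one write per grouping set

spill_files = [[[] for _ in range(NPARTITIONS)] for _ in GROUPING_SETS]
spill_tuple(("x", 1), spill_files)
# the tuple now appears once in each grouping set's partition set
assert sum(len(part) for fs in spill_files for part in fs) == len(GROUPING_SETS)
```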

MEMORY TRACKING:

I have a patch to track the total allocated memory by
incrementing/decrementing it when blocks are malloc'd/free'd. This
doesn't do bookkeeping for each chunk, only each block. Previously,
Robert Haas raised some concerns[4] about performance, which were
mitigated[5] but perhaps not entirely eliminated (but did become
elusive).
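A toy model of that scheme (hypothetical, not the actual patch): the running total changes only when a whole block is malloc'd or the context is reset, never per chunk, which keeps the bookkeeping off the hot allocation path.

```python
class BlockContext:
    BLKSZ = 8192

    def __init__(self):
        self.nblocks = 0
        self.used_in_block = self.BLKSZ    # force a block on first alloc
        self.total = 0                     # the tracked number

    def alloc(self, size):
        if self.used_in_block + size > self.BLKSZ:
            self.nblocks += 1              # "malloc" a new block
            self.total += self.BLKSZ       # accounting happens here only
            self.used_in_block = 0
        self.used_in_block += size         # carving a chunk: no accounting

    def reset(self):
        self.total -= self.BLKSZ * self.nblocks  # blocks are free'd
        self.nblocks, self.used_in_block = 0, self.BLKSZ

ctx = BlockContext()
for _ in range(100):
    ctx.alloc(100)                         # 100 chunks, 10,000 bytes requested
assert ctx.total == 2 * 8192               # but only whole blocks are counted
ctx.reset()
assert ctx.total == 0
```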

The only alternative is estimation, which is ugly and seems like a bad
idea. Memory usage isn't just driven by inputs, it's also driven by
patterns of use. Misestimates in the planner are fine (within reason)
because we don't have any other choice, and a small-factor misestimate
might not change the plan anyway. But in the executor, a small-factor
misestimate seems like it's just not doing the job. If a user found
that hash aggregation was using 3X work_mem, and my only explanation
is "well, it's just an estimate", I would be pretty embarrassed and
the user would likely lose confidence in the feature. I don't mean
that we must track memory perfectly everywhere, but using an estimate
seems like a mediocre improvement of the current state.

We should proceed with memory context tracking and try to eliminate or
mitigate performance concerns. I would not like to make any herculean
effort as a part of the hash aggregation work though; I think it's
basically just something a memory manager in a database system should
have supported all along. I think we will find other uses for it as
time goes on. We have more and more things happening in the executor
and having a cheap way to check "how much memory is this thing using?"
seems very likely to be useful.

Other points:

* Someone brought up the idea of using logtapes.c instead of writing
separate files for each partition. That seems reasonable, but it's
using logtapes.c slightly outside of its intended purpose. Also,
it's awkward to need to specify the number of tapes up-front. Worth
experimenting with to see if it's a win.

* Tomas did some experiments regarding the number of batches to choose
and how to choose them. It seems like there's room for improvement
over the simple calculation I'm doing now.

* A lot of discussion about a smart eviction strategy. I don't see
strong evidence that it's worth the complexity at this time. The
smarter we try to be, the more bookkeeping and memory fragmentation
problems we will have. If we evict something, we should probably
evict the whole hash table or some large part of it.

Regards,
Jeff Davis

[1]: /messages/by-id/20180604185205.epue25jzpavokupf@alap3.anarazel.de
[2]: /messages/by-id/message-id/CAGTBQpa__-NP7=kKwze_enkqw18vodRxKkOmNhxAPzqkruc-8g@mail.gmail.com
[3]: /messages/by-id/20180605175209.vavuqe4idovcpeie@alap3.anarazel.de
[4]: /messages/by-id/CA+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajP3Cro9dBRaA@mail.gmail.com
[5]: /messages/by-id/1413422787.18615.18.camel@jeff-desktop

#2 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#1)
Re: Memory-Bounded Hash Aggregation

Hi Jeff,

On Mon, Jul 01, 2019 at 12:13:53PM -0700, Jeff Davis wrote:

This is for design review. I have a patch (WIP) for Approach 1, and if
this discussion starts to converge on that approach I will polish and
post it.

Thanks for working on this.

Let's start at the beginning: why do we have two strategies -- hash
and sort -- for aggregating data? The two are more similar than they
first appear. A partitioned hash strategy writes randomly among the
partitions, and later reads the partitions sequentially; a sort will
write sorted runs sequentially, but then read among the runs
randomly during the merge phase. A hash is a convenient small
representation of the data that is cheaper to operate on; sort uses
abbreviated keys for the same reason.

What does "partitioned hash strategy" do? It's probably explained in one
of the historical discussions, but I'm not sure which one. I assume it
simply hashes the group keys and uses that to partition the data, and then
passes it to hash aggregate.

Hash offers:

* Data is aggregated on-the-fly, effectively "compressing" the amount
of data that needs to go to disk. This is particularly important
when the data contains skewed groups (see below).

* Can output some groups after the first pass of the input data even
if other groups spilled.

* Some data types only support hashing; not sorting.

Sort+Group offers:

* Only one group is accumulating at once, so if the transition state
grows (like with ARRAY_AGG), it minimizes the memory needed.

* The input may already happen to be sorted.

* Some data types only support sorting; not hashing.

Currently, Hash Aggregation is only chosen if the optimizer believes
that all the groups (and their transition states) fit in
memory. Unfortunately, if the optimizer is wrong (often the case if the
input is not a base table), the hash table will
keep growing beyond work_mem, potentially bringing the entire system
to OOM. This patch fixes that problem by extending the Hash
Aggregation strategy to spill to disk when needed.

OK, makes sense.

Previous discussions:

/messages/by-id/1407706010.6623.16.camel@jeff-desktop

/messages/by-id/1419326161.24895.13.camel@jeff-desktop

/messages/by-id/87be3bd5-6b13-d76e-5618-6db0a4db584d@iki.fi

A lot was discussed, which I will try to summarize and address here.

Digression: Skewed Groups:

Imagine the input tuples have the following grouping keys:

0, 1, 0, 2, 0, 3, 0, 4, ..., 0, N-1, 0, N

Group 0 is a skew group because it consists of 50% of all tuples in
the table, whereas every other group has a single tuple. If the
algorithm is able to keep group 0 in memory the whole time until
finalized, that means that it doesn't have to spill any group-0
tuples. In this example, that would amount to a 50% savings, and is a
major advantage of Hash Aggregation versus Sort+Group.

Right. I agree efficiently handling skew is important and may be crucial
for achieving good performance.

High-level approaches:

1. When the in-memory hash table fills, keep existing entries in the
hash table, and spill the raw tuples for all new groups in a
partitioned fashion. When all input tuples are read, finalize groups
in memory and emit. Now that the in-memory hash table is cleared (and
memory context reset), process a spill file the same as the original
input, but this time with a fraction of the group cardinality.

2. When the in-memory hash table fills, partition the hash space, and
evict the groups from all partitions except one by writing out their
partial aggregate states to disk. Any input tuples belonging to an
evicted partition get spilled to disk. When the input is read
entirely, finalize the groups remaining in memory and emit. Now that
the in-memory hash table is cleared, process the next partition by
loading its partial states into the hash table, and then processing
its spilled tuples.

3. Use some kind of hybrid[1][2] of hashing and sorting.

Unfortunately the second link does not work :-(

Evaluation of approaches:

Approach 1 is a nice incremental improvement on today's code. The
final patch may be around 1KLOC. It's a single kind of on-disk data
(spilled tuples), and a single algorithm (hashing). It also handles
skewed groups well because the skewed groups are likely to be
encountered before the hash table fills up the first time, and
therefore will stay in memory.

I'm not going to block Approach 1, although I'd really like to see
something that helps with array_agg.

Approach 2 is nice because it resembles the approach of Hash Join, and
it can determine whether a tuple should be spilled without a hash
lookup. Unfortunately, those upsides are fairly mild, and it has
significant downsides:

* It doesn't handle skew values well because it's likely to evict
them.

* If we leave part of the hash table in memory, it's difficult to
ensure that we will be able to actually use the space freed by
eviction, because the freed memory may be fragmented. That could
force us to evict the entire in-memory hash table as soon as we
partition, reducing a lot of the benefit of hashing.

Yeah, and it may not work well with the memory accounting if we only track
the size of allocated blocks, not chunks (because pfree likely won't free
the blocks).

* It requires eviction for the algorithm to work. That may be
necessary for handling cases like ARRAY_AGG (see below) anyway, but
this approach constrains the specifics of eviction.

Approach 3 is interesting because it unifies the two approaches and
can get some of the benefits of both. It's only a single path, so it
avoids planner mistakes. I really like this idea and it's possible we
will end up with approach 3. However:

* It requires that all data types support sorting, or that we punt
somehow.

* Right now we are in a weird state because hash aggregation cheats,
so it's difficult to evaluate whether Approach 3 is moving us in the
right direction because we have no other correct implementation to
compare against. Even if Approach 3 is where we end up, it seems
like we should fix hash aggregation as a stepping stone first.

Aren't all three approaches a way to "fix" hash aggregate? In any case,
it's certainly reasonable to make incremental changes. The question is
whether "approach 1" is a sensible step towards some form of "approach 3".

* It means we have a hash table and sort running concurrently, each
using memory. Andres said this might not be a problem[3], but I'm
not convinced that the problem is zero. If you use small work_mem
for the write phase of sorting, you'll end up with a lot of runs to
merge later and that has some kind of cost.

Why would we need to do both concurrently? I thought we'd empty the hash
table before doing the sort, no?

* The simplicity might start to evaporate when we consider grouping
sets and eviction strategy.

Hmm, yeah :-/

Main topics to consider:

ARRAY_AGG:

Some aggregates, like ARRAY_AGG, have a transition state that grows
proportionally with the group size. In other words, it is not a
summary like COUNT or AVG, it contains all of the input data in a new
form.

Strictly speaking the state may grow even for count/avg aggregates, e.g.
for numeric types, but it's far less serious than array_agg etc.

These aggregates are not a good candidate for hash aggregation. Hash
aggregation is about keeping many transition states running in
parallel, which is just a bad fit for large transition states. Sorting
is better because it advances one transition state at a time. We could:

* Let ARRAY_AGG continue to exceed work_mem like today.

* Block or pessimize use of hash aggregation for such aggregates.

* Evict groups from the hash table when it becomes too large. This
requires the ability to serialize and deserialize transition states,
and some approaches here might also need combine_func
specified. These requirements seem reasonable, but we still need
some answer of what to do for aggregates that grow like ARRAY_AGG
but don't have the required serialfunc, deserialfunc, or
combine_func.

Do we actually need to handle that case? How many such aggregates are
there? I think it's OK to just ignore that case (and keep doing what we do
now), and require serial/deserial functions for anything better.

GROUPING SETS:

With grouping sets, there are multiple hash tables and each hash table
has its own hash function, so that makes partitioning more
complex. In Approach 1, that means we need to either (a) not partition
the spilled tuples; or (b) have a different set of partitions for each
hash table and spill the same tuple multiple times. In Approach 2, we
would be required to partition each hash table separately and spill
tuples multiple times. In Approach 3 (depending on the exact approach
but taking a guess here) we would need to add a set of phases (one
extra phase for each hash table) for spilled tuples.

No thoughts about this yet.

MEMORY TRACKING:

I have a patch to track the total allocated memory by
incrementing/decrementing it when blocks are malloc'd/free'd. This
doesn't do bookkeeping for each chunk, only each block. Previously,
Robert Haas raised some concerns[4] about performance, which were
mitigated[5] but perhaps not entirely eliminated (but did become
elusive).

The only alternative is estimation, which is ugly and seems like a bad
idea. Memory usage isn't just driven by inputs, it's also driven by
patterns of use. Misestimates in the planner are fine (within reason)
because we don't have any other choice, and a small-factor misestimate
might not change the plan anyway. But in the executor, a small-factor
misestimate seems like it's just not doing the job. If a user found
that hash aggregation was using 3X work_mem, and my only explanation
is "well, it's just an estimate", I would be pretty embarrassed and
the user would likely lose confidence in the feature. I don't mean
that we must track memory perfectly everywhere, but using an estimate
seems like a mediocre improvement of the current state.

I agree estimates are not the right tool here.

We should proceed with memory context tracking and try to eliminate or
mitigate performance concerns. I would not like to make any herculean
effort as a part of the hash aggregation work though; I think it's
basically just something a memory manager in a database system should
have supported all along. I think we will find other uses for it as
time goes on. We have more and more things happening in the executor
and having a cheap way to check "how much memory is this thing using?"
seems very likely to be useful.

IMO we should just use the cheapest memory accounting (tracking the amount
of memory allocated for blocks). I agree it's a feature we need, I don't
think we can devise anything cheaper than this.

Other points:

* Someone brought up the idea of using logtapes.c instead of writing
separate files for each partition. That seems reasonable, but it's
using logtapes.c slightly outside of its intended purpose. Also,
it's awkward to need to specify the number of tapes up-front. Worth
experimenting with to see if it's a win.

* Tomas did some experiments regarding the number of batches to choose
and how to choose them. It seems like there's room for improvement
over the simple calculation I'm doing now.

Me? I don't recall such benchmarks, but maybe I did. But I think we'll
need to repeat those with the new patches etc. I think the question is
whether we see this as an emergency solution - in that case I wouldn't
obsess about getting the best possible parameters.

* A lot of discussion about a smart eviction strategy. I don't see
strong evidence that it's worth the complexity at this time. The
smarter we try to be, the more bookkeeping and memory fragmentation
problems we will have. If we evict something, we should probably
evict the whole hash table or some large part of it.

Maybe. For each "smart" eviction strategy there is a (trivial) example
of data on which it performs poorly.

I think it's the same thing as with the number of partitions - if we
consider this to be an emergency solution, it's OK if the performance is
not entirely perfect when it kicks in.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3 Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#1)
Re: Memory-Bounded Hash Aggregation

On Mon, 2019-07-01 at 12:13 -0700, Jeff Davis wrote:

This is for design review. I have a patch (WIP) for Approach 1, and if
this discussion starts to converge on that approach I will polish and
post it.

WIP patch attached (based on 9a81c9fa); targeting September CF.

Not intended for detailed review yet, but it seems to work in enough
cases (including grouping sets and JIT) to be a good proof-of-concept
for the algorithm and its complexity.

Initial performance numbers put it at 2X slower than sort for grouping
10M distinct integers. There are quite a few optimizations I haven't
tried yet and quite a few tunables I haven't tuned yet, so hopefully I
can close the gap a bit for the small-groups case.

I will offer more details soon when I have more confidence in the
numbers.

It does not attempt to spill ARRAY_AGG at all yet.

Regards,
Jeff Davis

Attachments:

hashagg-20190703.patch (text/x-patch; charset=UTF-8; +767, -38)
#4 Jeff Davis
pgsql@j-davis.com
In reply to: Tomas Vondra (#2)
Re: Memory-Bounded Hash Aggregation

On Wed, 2019-07-03 at 02:17 +0200, Tomas Vondra wrote:

What does "partitioned hash strategy" do? It's probably explained in one
of the historical discussions, but I'm not sure which one. I assume it
simply hashes the group keys and uses that to partition the data, and then
passes it to hash aggregate.

Yes. When spilling, it is cheap to partition on the hash value at the
same time, which dramatically reduces the need to spill multiple times.
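That idea can be sketched as follows (hypothetical constants): the hash already computed for the table lookup is reused to pick a spill partition, and each recursion depth consumes a different slice of the hash bits, so tuples that collided into one partition can still be separated on the next pass.

```python
PARTITION_BITS = 2                       # 2^2 = 4 partitions per level

def partition_of(hashval, depth):
    shift = depth * PARTITION_BITS       # deeper levels use higher bits
    return (hashval >> shift) & ((1 << PARTITION_BITS) - 1)

assert partition_of(0b1101, 0) == 0b01   # level 0 uses the low bits
assert partition_of(0b1101, 1) == 0b11   # level 1 uses the next bits
```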
Previous discussions:

Unfortunately the second link does not work :-(

It's supposed to be:

/messages/by-id/CAGTBQpa__-NP7=kKwze_enkqw18vodRxKkOmNhxAPzqkruc-8g@mail.gmail.com

I'm not going to block Approach 1, although I'd really like to see
something that helps with array_agg.

I have a WIP patch that I just posted. It doesn't yet work with
ARRAY_AGG, but I think it can be made to work by evicting the entire
hash table, serializing the transition states, and then later combining
them.

Aren't all three approaches a way to "fix" hash aggregate? In any case,
it's certainly reasonable to make incremental changes. The question is
whether "approach 1" is a sensible step towards some form of "approach 3".

Disk-based hashing certainly seems like a reasonable algorithm on paper
that has some potential advantages over sorting. It certainly seems
sensible to me that we explore the disk-based hashing strategy first,
and then we would at least know what we are missing (if anything) by
going with the hybrid approach later.

There's also a fair amount of design space to explore in the hybrid
strategy. That could take a while to converge, especially if we don't
have anything in place to compare against.

* It means we have a hash table and sort running concurrently, each
using memory. Andres said this might not be a problem[3], but I'm
not convinced that the problem is zero. If you use small work_mem
for the write phase of sorting, you'll end up with a lot of runs to
merge later and that has some kind of cost.

Why would we need to do both concurrently? I thought we'd empty the hash
table before doing the sort, no?

So you are saying we spill the tuples into a tuplestore, then feed the
tuplestore through a tuplesort? Seems inefficient, but I guess we can.

Do we actually need to handle that case? How many such aggregates are
there? I think it's OK to just ignore that case (and keep doing what we do
now), and require serial/deserial functions for anything better.

Punting on a few cases is fine with me, if the user has a way to fix
it.

Regards,
Jeff Davis

#5 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#4)
Re: Memory-Bounded Hash Aggregation

On Wed, Jul 03, 2019 at 07:03:06PM -0700, Jeff Davis wrote:

On Wed, 2019-07-03 at 02:17 +0200, Tomas Vondra wrote:

What does "partitioned hash strategy" do? It's probably explained in one
of the historical discussions, but I'm not sure which one. I assume it
simply hashes the group keys and uses that to partition the data, and then
passes it to hash aggregate.

Yes. When spilling, it is cheap to partition on the hash value at the
same time, which dramatically reduces the need to spill multiple times.
Previous discussions:

Unfortunately the second link does not work :-(

It's supposed to be:

/messages/by-id/CAGTBQpa__-NP7=kKwze_enkqw18vodRxKkOmNhxAPzqkruc-8g@mail.gmail.com

I'm not going to block Approach 1, although I'd really like to see
something that helps with array_agg.

I have a WIP patch that I just posted. It doesn't yet work with
ARRAY_AGG, but I think it can be made to work by evicting the entire
hash table, serializing the transition states, and then later combining
them.

Aren't all three approaches a way to "fix" hash aggregate? In any case,
it's certainly reasonable to make incremental changes. The question is
whether "approach 1" is a sensible step towards some form of "approach 3".

Disk-based hashing certainly seems like a reasonable algorithm on paper
that has some potential advantages over sorting. It certainly seems
sensible to me that we explore the disk-based hashing strategy first,
and then we would at least know what we are missing (if anything) by
going with the hybrid approach later.

There's also a fair amount of design space to explore in the hybrid
strategy. That could take a while to converge, especially if we don't
have anything in place to compare against.

Makes sense. I haven't thought about how the hybrid approach would be
implemented very much, so I can't quite judge how complicated it would be
to extend "approach 1" later. But if you think it's a sensible first step,
I trust you. And I certainly agree we need something to compare the other
approaches against.

* It means we have a hash table and sort running concurrently, each
using memory. Andres said this might not be a problem[3], but I'm
not convinced that the problem is zero. If you use small work_mem
for the write phase of sorting, you'll end up with a lot of runs to
merge later and that has some kind of cost.

Why would we need to do both concurrently? I thought we'd empty the hash
table before doing the sort, no?

So you are saying we spill the tuples into a tuplestore, then feed the
tuplestore through a tuplesort? Seems inefficient, but I guess we can.

I think the question is whether we see this as "emergency fix" (for cases
that are misestimated and could/would fail with OOM at runtime), or as
something that is meant to make "hash agg" more widely applicable.

I personally see it as an emergency fix, in which cases it's perfectly
fine if it's not 100% efficient, assuming it kicks in only rarely.
Effectively, we're betting on hash agg, and from time to time we lose.

But even if we see it as a general optimization technique it does not have
to be perfectly efficient, as long as it's properly costed (so the planner
only uses it when appropriate).

If we have a better solution (in terms of efficiency, code complexity,
etc.) then sure - let's use that. But considering we've started this
discussion in ~2015 and we still don't have anything, I wouldn't hold my
breath. Let's do something good enough, and maybe improve it later.

Do we actually need to handle that case? How many such aggregates are
there? I think it's OK to just ignore that case (and keep doing what we do
now), and require serial/deserial functions for anything better.

Punting on a few cases is fine with me, if the user has a way to fix
it.

+1 to doing that

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#4)
Re: Memory-Bounded Hash Aggregation

On Wed, Jul 03, 2019 at 07:03:06PM -0700, Jeff Davis wrote:

On Wed, 2019-07-03 at 02:17 +0200, Tomas Vondra wrote:

What does "partitioned hash strategy" do? It's probably explained in
one
of the historical discussions, but I'm not sure which one. I assume
it
simply hashes the group keys and uses that to partition the data, and
then
passing it to hash aggregate.

Yes. When spilling, it is cheap to partition on the hash value at the
same time, which dramatically reduces the need to spill multiple times.
Previous discussions:

Unfortunately the second link does not work :-(

It's supposed to be:

/messages/by-id/CAGTBQpa__-NP7=kKwze_enkqw18vodRxKkOmNhxAPzqkruc-8g@mail.gmail.com

I'm not going to block Approach 1, althought I'd really like to see
something that helps with array_agg.

I have a WIP patch that I just posted. It doesn't yet work with
ARRAY_AGG, but I think it can be made to work by evicting the entire
hash table, serializing the transition states, and then later combining
them.

Aren't all three approaches a way to "fix" hash aggregate? In any case,
it's certainly reasonable to make incremental changes. The question is
whether "approach 1" is a sensible step towards some form of "approach 3".

Disk-based hashing certainly seems like a reasonable algorithm on paper
that has some potential advantages over sorting. It certainly seems
sensible to me that we explore the disk-based hashing strategy first,
and then we would at least know what we are missing (if anything) by
going with the hybrid approach later.

There's also a fair amount of design space to explore in the hybrid
strategy. That could take a while to converge, especially if we don't
have anything in place to compare against.

Makes sense. I haven't thought about how the hybrid approach would be
implemented very much, so I can't quite judge how complicated it would be
to extend "approach 1" later. But if you think it's a sensible first step,
I trust you. And I certainly agree we need something to compare the other
approaches against.

* It means we have a hash table and sort running concurrently, each
using memory. Andres said this might not be a problem[3], but I'm
not convinced that the problem is zero. If you use small work_mem
for the write phase of sorting, you'll end up with a lot of runs to
merge later and that has some kind of cost.

Why would we need to do both concurrently? I thought we'd empty the
hash table before doing the sort, no?

So you are saying we spill the tuples into a tuplestore, then feed the
tuplestore through a tuplesort? Seems inefficient, but I guess we can.

I think the question is whether we see this as "emergency fix" (for cases
that are misestimated and could/would fail with OOM at runtime), or as
something that is meant to make "hash agg" more widely applicable.

I personally see it as an emergency fix, in which cases it's perfectly
fine if it's not 100% efficient, assuming it kicks in only rarely.
Effectively, we're betting on hash agg, and from time to time we lose.

But even if we see it as a general optimization technique it does not have
to be perfectly efficient, as long as it's properly costed (so the planner
only uses it when appropriate).

If we have a better solution (in terms of efficiency, code complexity,
etc.) then sure - let's use that. But considering we've started this
discussion in ~2015 and we still don't have anything, I wouldn't hold my
breath. Let's do something good enough, and maybe improve it later.

Do we actually need to handle that case? How many such aggregates are
there? I think it's OK to just ignore that case (and keep doing what
we do now), and require serial/deserial functions for anything better.

Punting on a few cases is fine with me, if the user has a way to fix
it.

+1 to doing that

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7Jeff Davis
pgsql@j-davis.com
In reply to: Tomas Vondra (#6)
Re: Memory-Bounded Hash Aggregation

On Thu, 2019-07-11 at 17:55 +0200, Tomas Vondra wrote:

Makes sense. I haven't thought about how the hybrid approach would be
implemented very much, so I can't quite judge how complicated it would
be to extend "approach 1" later. But if you think it's a sensible first
step, I trust you. And I certainly agree we need something to compare
the other approaches against.

Is this a duplicate of your previous email?

I'm slightly confused but I will use the opportunity to put out another
WIP patch. The patch could use a few rounds of cleanup and quality
work, but the functionality is there and the performance seems
reasonable.

I rebased on master and fixed a few bugs, and most importantly, added
tests.

It seems to be working with grouping sets fine. It will take a little
longer to get good performance numbers, but even for group size of one,
I'm seeing HashAgg get close to Sort+Group in some cases.

You are right that the missed lookups appear to be costly, at least
when the data all fits in system memory. I think it's the cache misses,
because sometimes reducing work_mem improves performance. I'll try
tuning the number of buckets for the hash table and see if that helps.
If not, then the performance still seems pretty good to me.

Of course, HashAgg can beat sort for larger group sizes, but I'll try
to gather some more data on the cross-over point.

Regards,
Jeff Davis

Attachments:

hashagg-20190711.patchtext/x-patch; charset=UTF-8; name=hashagg-20190711.patchDownload+1444-120
#8Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#7)
Re: Memory-Bounded Hash Aggregation

On Thu, Jul 11, 2019 at 06:06:33PM -0700, Jeff Davis wrote:

On Thu, 2019-07-11 at 17:55 +0200, Tomas Vondra wrote:

Makes sense. I haven't thought about how the hybrid approach would be
implemented very much, so I can't quite judge how complicated it would
be to extend "approach 1" later. But if you think it's a sensible first
step, I trust you. And I certainly agree we need something to compare
the other approaches against.

Is this a duplicate of your previous email?

Yes. I don't know how I managed to send it again. Sorry.

I'm slightly confused but I will use the opportunity to put out another
WIP patch. The patch could use a few rounds of cleanup and quality
work, but the functionality is there and the performance seems
reasonable.

I rebased on master and fixed a few bugs, and most importantly, added
tests.

It seems to be working with grouping sets fine. It will take a little
longer to get good performance numbers, but even for group size of one,
I'm seeing HashAgg get close to Sort+Group in some cases.

Nice! That's very nice progress!

You are right that the missed lookups appear to be costly, at least
when the data all fits in system memory. I think it's the cache misses,
because sometimes reducing work_mem improves performance. I'll try
tuning the number of buckets for the hash table and see if that helps.
If not, then the performance still seems pretty good to me.

Of course, HashAgg can beat sort for larger group sizes, but I'll try
to gather some more data on the cross-over point.

Yes, makes sense. I think it's acceptable as long as we consider this
during costing (when we know in advance we'll need this) or treat it as
an emergency measure.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9Adam Lee
ali@pivotal.io
In reply to: Jeff Davis (#7)
Re: Memory-Bounded Hash Aggregation

High-level approaches:

1. When the in-memory hash table fills, keep existing entries in the
hash table, and spill the raw tuples for all new groups in a
partitioned fashion. When all input tuples are read, finalize groups
in memory and emit. Now that the in-memory hash table is cleared (and
memory context reset), process a spill file the same as the original
input, but this time with a fraction of the group cardinality.

2. When the in-memory hash table fills, partition the hash space, and
evict the groups from all partitions except one by writing out their
partial aggregate states to disk. Any input tuples belonging to an
evicted partition get spilled to disk. When the input is read
entirely, finalize the groups remaining in memory and emit. Now that
the in-memory hash table is cleared, process the next partition by
loading its partial states into the hash table, and then processing
its spilled tuples.
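The hash-partitioned spilling common to both approaches can be sketched in a few lines. This is a toy illustration (names are not from the patch): the partition is chosen from the upper bits of a group's hash, so a reloaded partition still has fresh lower bits available for the in-memory hash table, and for further partitioning if it re-spills.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Choose a spill partition from the upper partition_bits of the
 * 32-bit group hash.  partition_bits = log2(number of partitions).
 */
static int
spill_partition(uint32_t hash, int partition_bits)
{
    return (int) (hash >> (32 - partition_bits));
}
```

With 4 partitions (2 bits), hashes map evenly onto partitions 0..3 regardless of how the lower bits are distributed.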

I'm late to the party.

These two approaches both spill the input tuples; what if the skewed
groups are not encountered before the hash table fills up? The spill
files' size and disk I/O could be downsides.

Greenplum spills all the groups by writing the partial aggregate states,
resets the memory context, processes incoming tuples to build the
in-memory hash table, then reloads and combines the spilled partial
states at the end. How does this sound?

--
Adam Lee

#10Jeff Davis
pgsql@j-davis.com
In reply to: Adam Lee (#9)
Re: Memory-Bounded Hash Aggregation

On Fri, 2019-08-02 at 14:44 +0800, Adam Lee wrote:

I'm late to the party.

You are welcome to join any time!

These two approaches both spill the input tuples, what if the skewed
groups are not encountered before the hash table fills up? The spill
files' size and disk I/O could be downsides.

Let's say the worst case is that we encounter 10 million groups of size
one first; just enough to fill up memory. Then, we encounter a single
additional group of size 20 million, and need to write out all of those
20 million raw tuples. That's still not worse than Sort+GroupAgg which
would need to write out all 30 million raw tuples (in practice Sort is
pretty fast so may still win in some cases, but not by any huge
amount).

Greenplum spills all the groups by writing the partial aggregate states,
reset the memory context, process incoming tuples and build in-memory
hash table, then reload and combine the spilled partial states at last,
how does this sound?

That can be done as an add-on to approach #1 by evicting the entire
hash table (writing out the partial states), then resetting the memory
context.

It does add to the complexity though, and would only work for the
aggregates that support serializing and combining partial states. It
also might be a net loss to do the extra work of initializing and
evicting a partial state if we don't have large enough groups to
benefit.

Given that the worst case isn't worse than Sort+GroupAgg, I think it
should be left as a future optimization. That would give us time to
tune the process to work well in a variety of cases.

Regards,
Jeff Davis

#11Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#10)
Re: Memory-Bounded Hash Aggregation

On Fri, Aug 02, 2019 at 08:11:19AM -0700, Jeff Davis wrote:

On Fri, 2019-08-02 at 14:44 +0800, Adam Lee wrote:

I'm late to the party.

You are welcome to join any time!

These two approaches both spill the input tuples, what if the skewed
groups are not encountered before the hash table fills up? The spill
files' size and disk I/O could be downsides.

Let's say the worst case is that we encounter 10 million groups of size
one first; just enough to fill up memory. Then, we encounter a single
additional group of size 20 million, and need to write out all of those
20 million raw tuples. That's still not worse than Sort+GroupAgg which
would need to write out all 30 million raw tuples (in practice Sort is
pretty fast so may still win in some cases, but not by any huge
amount).

Greenplum spills all the groups by writing the partial aggregate states,
reset the memory context, process incoming tuples and build in-memory
hash table, then reload and combine the spilled partial states at last,
how does this sound?

That can be done as an add-on to approach #1 by evicting the entire
hash table (writing out the partial states), then resetting the memory
context.

It does add to the complexity though, and would only work for the
aggregates that support serializing and combining partial states. It
also might be a net loss to do the extra work of initializing and
evicting a partial state if we don't have large enough groups to
benefit.

Given that the worst case isn't worse than Sort+GroupAgg, I think it
should be left as a future optimization. That would give us time to
tune the process to work well in a variety of cases.

+1 to leaving that as a future optimization

I think it's clear there's no perfect eviction strategy - for every
algorithm we came up with we can construct a data set on which it
performs terribly (I'm sure we could do that for the approach used by
Greenplum, for example).

So I think it makes sense to do what Jeff proposed, and then maybe try
improving that in the future with a switch to different eviction
strategy based on some heuristics.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#12Taylor Vesely
tvesely@pivotal.io
In reply to: Tomas Vondra (#11)
Re: Memory-Bounded Hash Aggregation

I started to review this patch yesterday with Melanie Plageman, so we
rebased this patch over the current master. The main conflicts were
due to a simplehash patch that has been committed separately[1]. I've
attached the rebased patch.

I was playing with the code, and if one of the table's most common
values isn't placed into the initial hash table it spills a whole lot
of tuples to disk that might have been avoided if we had some way to
'seed' the hash table with MCVs from the statistics. Seems to me that
you would need some way of dealing with values that are in the MCV
list, but ultimately don't show up in the scan. I imagine that this
kind of optimization would be most useful for aggregates on a full
table scan.

Some questions:

Right now the patch always initializes 32 spill partitions. Have you given
any thought into how to intelligently pick an optimal number of
partitions yet?

That can be done as an add-on to approach #1 by evicting the entire
hash table (writing out the partial states), then resetting the memory
context.

By add-on approach, do you mean to say that you have something in mind
to combine the two strategies? Or do you mean that it could be implemented
as a separate strategy?

I think it's clear there's no perfect eviction strategy - for every
algorithm we came up with we can construct a data set on which it
performs terribly (I'm sure we could do that for the approach used by
Greenplum, for example).

So I think it makes sense to do what Jeff proposed, and then maybe try
improving that in the future with a switch to different eviction
strategy based on some heuristics.

I agree. It definitely feels like both spilling strategies have their
own use case.

That said, I think it's worth mentioning that with parallel aggregates
it might actually be more useful to spill the trans values instead,
and have them combined in a Gather or Finalize stage.

[1]: /messages/by-id/48abe675e1330f0c264ab2fe0d4ff23eb244f9ef.camel@j-davis.com

Attachments:

v1-0001-Rebased-memory-bounded-hash-aggregation.patchapplication/octet-stream; name=v1-0001-Rebased-memory-bounded-hash-aggregation.patchDownload+1420-119
#13Jeff Davis
pgsql@j-davis.com
In reply to: Taylor Vesely (#12)
Re: Memory-Bounded Hash Aggregation

On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote:

I started to review this patch yesterday with Melanie Plageman, so we
rebased this patch over the current master. The main conflicts were
due to a simplehash patch that has been committed separately[1]. I've
attached the rebased patch.

Great, thanks!

I was playing with the code, and if one of the table's most common
values isn't placed into the initial hash table it spills a whole lot
of tuples to disk that might have been avoided if we had some way to
'seed' the hash table with MCVs from the statistics. Seems to me that
you would need some way of dealing with values that are in the MCV
list, but ultimately don't show up in the scan. I imagine that this
kind of optimization would be most useful for aggregates on a full
table scan.

Interesting idea, I didn't think of that.

Some questions:

Right now the patch always initializes 32 spill partitions. Have you
given any thought into how to intelligently pick an optimal number of
partitions yet?

Yes. The idea is to guess how many groups are remaining, then guess how
much space they will need in memory, then divide by work_mem. I just
didn't get around to it yet. (Same with the costing work.)

By add-on approach, do you mean to say that you have something in mind
to combine the two strategies? Or do you mean that it could be
implemented as a separate strategy?

It would be an extension of the existing patch, but would add a fair
amount of complexity (dealing with partial states, etc.) and the
benefit would be fairly modest. We can do it later if justified.

That said, I think it's worth mentioning that with parallel aggregates
it might actually be more useful to spill the trans values instead,
and have them combined in a Gather or Finalize stage.

That's a good point.

Regards,
Jeff Davis

#14Jeff Davis
pgsql@j-davis.com
In reply to: Taylor Vesely (#12)
Re: Memory-Bounded Hash Aggregation

On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote:

Right now the patch always initializes 32 spill partitions. Have you
given any thought into how to intelligently pick an optimal number of
partitions yet?

Attached a new patch that addresses this.

1. Divide hash table memory used by the number of groups in the hash
table to get the average memory used per group.
2. Multiply by the number of groups spilled -- which I pessimistically
estimate as the number of tuples spilled -- to get the total amount of
memory that we'd like to have to process all spilled tuples at once.
3. Divide the desired amount of memory by work_mem to get the number of
partitions we'd like to have such that each partition can be processed
in work_mem without spilling.
4. Apply a few sanity checks, fudge factors, and limits.

Using this runtime information should be substantially better than
using estimates and projections.
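The four steps might look roughly like this. This is a sketch under assumed names and limits (MIN_PARTITIONS, MAX_PARTITIONS, choose_num_partitions are illustrative, not the patch's actual identifiers or values):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIN_PARTITIONS 4    /* illustrative floor */
#define MAX_PARTITIONS 256  /* illustrative cap */

/*
 * Estimate how many spill partitions are needed so that each
 * partition's groups could later be processed within work_mem.
 * Spilled groups are pessimistically assumed to equal spilled tuples.
 */
static int
choose_num_partitions(size_t mem_used, uint64_t ngroups,
                      uint64_t ntuples_spilled, size_t work_mem_bytes)
{
    /* step 1: average memory per group in the current hash table */
    double mem_per_group = (double) mem_used / (double) ngroups;
    /* step 2: memory we'd want to hold all spilled groups at once */
    double mem_wanted = mem_per_group * (double) ntuples_spilled;
    /* step 3: partitions such that each could fit in work_mem */
    int npartitions = (int) (mem_wanted / (double) work_mem_bytes) + 1;

    /* step 4: sanity limits */
    if (npartitions < MIN_PARTITIONS)
        npartitions = MIN_PARTITIONS;
    if (npartitions > MAX_PARTITIONS)
        npartitions = MAX_PARTITIONS;
    return npartitions;
}
```

For example, 1 MB used by 1024 groups gives 1 kB per group; a million spilled tuples then suggest ~1 GB, or about 245 partitions at a 4 MB work_mem.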

Additionally, I removed some branches from the common path. I think I
still have more work to do there.

I also rebased of course, and fixed a few other things.

Regards,
Jeff Davis

Attachments:

hashagg-20191127.difftext/x-patch; charset=UTF-8; name=hashagg-20191127.diffDownload+1457-79
#15Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#14)
Re: Memory-Bounded Hash Aggregation

On Wed, Nov 27, 2019 at 02:58:04PM -0800, Jeff Davis wrote:

On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote:

Right now the patch always initializes 32 spill partitions. Have you
given any thought into how to intelligently pick an optimal number of
partitions yet?

Attached a new patch that addresses this.

1. Divide hash table memory used by the number of groups in the hash
table to get the average memory used per group.
2. Multiply by the number of groups spilled -- which I pessimistically
estimate as the number of tuples spilled -- to get the total amount of
memory that we'd like to have to process all spilled tuples at once.

Isn't the "number of tuples = number of groups" estimate likely to be
way too pessimistic? IIUC the consequence is that it pushes us to pick
more partitions than necessary, correct?

Could we instead track how many tuples we actually consumed for the
in-memory groups, and then use this information to improve the estimate
of number of groups? I mean, if we know we've consumed 1000 tuples which
created 100 groups, then we know there's ~1:10 ratio.
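That ratio-based estimate could be sketched as below (illustrative names, not proposed code; note it assumes the spilled tuples follow roughly the same distribution as the in-memory ones, which skew can break):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimate distinct groups among the spilled tuples from the
 * tuples:groups ratio observed while filling the in-memory table.
 * E.g. 1000 tuples forming 100 groups suggests a ~1:10 ratio.
 */
static uint64_t
estimate_spilled_groups(uint64_t tuples_seen, uint64_t groups_seen,
                        uint64_t tuples_spilled)
{
    double groups_per_tuple = (double) groups_seen / (double) tuples_seen;
    return (uint64_t) ((double) tuples_spilled * groups_per_tuple);
}
```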

3. Divide the desired amount of memory by work_mem to get the number of
partitions we'd like to have such that each partition can be processed
in work_mem without spilling.
4. Apply a few sanity checks, fudge factors, and limits.

Using this runtime information should be substantially better than
using estimates and projections.

Additionally, I removed some branches from the common path. I think I
still have more work to do there.

I also rebased of course, and fixed a few other things.

A couple of comments based on eye-balling the patch:

1) Shouldn't the hashagg_mem_overflow use the other GUC naming, i.e.
maybe it should be enable_hashagg_mem_overflow or something similar?

2) I'm a bit puzzled by this code in ExecInterpExpr (there are multiple
such blocks, this is just an example)

aggstate = op->d.agg_init_trans.aggstate;
pergroup_allaggs = aggstate->all_pergroups[op->d.agg_init_trans.setoff];
pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];

/* If transValue has not yet been initialized, do so now. */
if (pergroup_allaggs != NULL && pergroup->noTransValue)
{ ... }

How could the (pergroup_allaggs != NULL) protect against anything? Let's
assume the pointer really is NULL. Surely we'll get a segfault on the
preceding line which does dereference it

pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];

Or am I missing anything?

3) execGrouping.c

A couple of functions would deserve a comment, explaining what it does.

- LookupTupleHashEntryHash
- prepare_hash_slot
- calculate_hash

And it's not clear to me why we should remove part of the comment before
TupleHashTableHash.

4) I'm not sure I agree with this reasoning that HASH_PARTITION_FACTOR
making the hash tables smaller is desirable - it may be, but if that was
generally the case we'd just use small hash tables all the time. It's a
bit annoying to give the user the capability to set work_mem and then
kinda override that.

* ... Another benefit of having more, smaller partitions is that small
* hash tables may perform better than large ones due to memory caching
* effects.

5) Not sure what "directly" means in this context?

* partitions at the time we need to spill, and because this algorithm
* shouldn't depend too directly on the internal memory needs of a
* BufFile.

#define HASH_PARTITION_MEM (HASH_MIN_PARTITIONS * BLCKSZ)

Does that mean we don't want to link to PGAlignedBlock, or what?

6) I think we should have some protection against underflows in this
piece of code:

- this would probably deserve some protection against underflow if HASH_PARTITION_MEM gets too big

if (hashagg_mem_overflow)
aggstate->hash_mem_limit = SIZE_MAX;
else
aggstate->hash_mem_limit = (work_mem * 1024L) - HASH_PARTITION_MEM;

At the moment it's safe because work_mem is 64kB at least, and
HASH_PARTITION_MEM is 32kB (4 partitions, 8kB each). But if we happen to
bump HASH_MIN_PARTITIONS up, this can underflow.
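One possible shape for that guard, as a sketch with illustrative constants and names (the fallback behavior is an assumption, not taken from the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* illustrative: 4 partitions x 8kB reserved for spill buffers */
#define PARTITION_MEM ((size_t) 4 * 8192)

static size_t
hash_mem_limit(size_t work_mem_bytes, bool mem_overflow)
{
    if (mem_overflow)
        return SIZE_MAX;
    /*
     * Guard the subtraction: if the reserved partition memory ever
     * grows past work_mem, fall back rather than underflowing size_t.
     * (The half-of-work_mem fallback is arbitrary, for illustration.)
     */
    if (work_mem_bytes <= PARTITION_MEM)
        return work_mem_bytes / 2;
    return work_mem_bytes - PARTITION_MEM;
}
```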

7) Shouldn't lookup_hash_entry briefly explain why/how it handles the
memory limit?

8) The comment before lookup_hash_entries says:

...
* Return false if hash table has exceeded its memory limit.
..

But that's clearly bogus, because that's a void function.

9) Shouldn't the hash_finish_initial_spills calls in agg_retrieve_direct
have a comment, similar to the surrounding code? Might be an overkill,
not sure.

10) The comment for agg_refill_hash_table says

* Should only be called after all in memory hash table entries have been
* consumed.

Can we enforce that with an assert, somehow?

11) The hash_spill_npartitions naming seems a bit confusing, because it
seems to imply it's about the "spill" while in practice it just chooses
the number of spill partitions. Maybe hash_choose_num_spill_partitions
would be better?

12) It's not clear to me why we need HASH_MAX_PARTITIONS? What's the
reasoning behind the current value (256)? Not wanting to pick too many
partitions? Comment?

if (npartitions > HASH_MAX_PARTITIONS)
npartitions = HASH_MAX_PARTITIONS;

13) As for this:

/* make sure that we don't exhaust the hash bits */
if (partition_bits + input_bits >= 32)
partition_bits = 32 - input_bits;

We already ran into this issue (exhausting bits in a hash value) in
hashjoin batching, we should be careful to use the same approach in both
places (not the same code, just general approach).
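The quoted clamp, extracted into a tiny helper to make the bit budget visible (illustrative; input_bits here stands for the hash bits already consumed by earlier spill generations):

```c
#include <assert.h>

/*
 * Clamp the partition bits so that, together with the bits already
 * used by earlier spill generations, we never consume more than the
 * 32 bits available in the hash value.
 */
static int
clamp_partition_bits(int partition_bits, int used_bits)
{
    if (partition_bits + used_bits >= 32)
        partition_bits = 32 - used_bits;
    return partition_bits;
}
```

With 8 bits per generation, the budget is gone after four levels of respill; the clamp then degrades gracefully to fewer (eventually zero) partition bits.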

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#16Melanie Plageman
melanieplageman@gmail.com
In reply to: Tomas Vondra (#15)
Re: Memory-Bounded Hash Aggregation

On Thu, Nov 28, 2019 at 9:47 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On Wed, Nov 27, 2019 at 02:58:04PM -0800, Jeff Davis wrote:

On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote:

Right now the patch always initializes 32 spill partitions. Have you
given any thought into how to intelligently pick an optimal number of
partitions yet?

Attached a new patch that addresses this.

1. Divide hash table memory used by the number of groups in the hash
table to get the average memory used per group.
2. Multiply by the number of groups spilled -- which I pessimistically
estimate as the number of tuples spilled -- to get the total amount of
memory that we'd like to have to process all spilled tuples at once.

Isn't the "number of tuples = number of groups" estimate likely to be
way too pessimistic? IIUC the consequence is that it pushes us to pick
more partitions than necessary, correct?

Could we instead track how many tuples we actually consumed for the
in-memory groups, and then use this information to improve the estimate
of number of groups? I mean, if we know we've consumed 1000 tuples which
created 100 groups, then we know there's ~1:10 ratio.

What would the cost be of having many small partitions? Some of the
spill files created may not be used if the estimate was pessimistic,
but that seems better than the alternative of re-spilling, since every
spill writes every tuple again.

Also, number of groups = number of tuples is only for re-spilling.
This is a little bit unclear from the variable naming.

It looks like the parameter input_tuples passed to hash_spill_init()
in lookup_hash_entries() is the number of groups estimated by planner.
However, when reloading a spill file, if we run out of memory and
re-spill, hash_spill_init() is passed batch->input_groups (which is
actually set from input_ngroups which is the number of tuples in the
spill file). So, input_tuples is groups and input_groups is
input_tuples. It may be helpful to rename this.

4) I'm not sure I agree with this reasoning that HASH_PARTITION_FACTOR
making the hash tables smaller is desirable - it may be, but if that was
generally the case we'd just use small hash tables all the time. It's a
bit annoying to give the user the capability to set work_mem and then kinda
override that.

* ... Another benefit of having more, smaller partitions is that small
* hash tables may perform better than large ones due to memory caching
* effects.

So, it looks like the HASH_PARTITION_FACTOR is only used when
re-spilling. The initial hashtable will use work_mem.
It seems like the reason for using it when re-spilling is to be very
conservative to avoid more than one re-spill and make sure each spill
file fits in a hashtable in memory.
The comment does seem to point to some other reason, though...

11) The hash_spill_npartitions naming seems a bit confusing, because it
seems to imply it's about the "spill" while in practice it just choses
number of spill partitions. Maybe hash_choose_num_spill_partitions would
be better?

Agreed that a name with "choose" or "calculate" as the verb would be
more clear.

12) It's not clear to me why we need HASH_MAX_PARTITIONS? What's the
reasoning behind the current value (256)? Not wanting to pick too many
partitions? Comment?

if (npartitions > HASH_MAX_PARTITIONS)
npartitions = HASH_MAX_PARTITIONS;

256 actually seems very large. hash_spill_npartitions() will be called
for every respill, so HASH_MAX_PARTITIONS is not the total number of
spill files permitted but, actually, the number of respill
files in a given spill (a spill set). So if you made X partitions
initially and every partition re-spills, now you would have (at most)
X * 256 partitions.
If HASH_MAX_PARTITIONS is 256, wouldn't the metadata from the spill
files take up a lot of memory at that point?

Melanie & Adam Lee

#17Jeff Davis
pgsql@j-davis.com
In reply to: Tomas Vondra (#15)
Re: Memory-Bounded Hash Aggregation

Thanks very much for a great review! I've attached a new patch.

There are some significant changes in the new version also:

In the non-spilling path, removed the extra nullcheck branch in the
compiled evaltrans expression. When the first tuple is spilled, the
branch becomes necessary, so I recompile the expression using a new
opcode that includes that branch.

I also changed the read-from-spill path to use a slot with
TTSOpsMinimalTuple (avoiding the need to make it into a virtual slot
right away), which means I need to recompile the evaltrans expression
for that case, as well.

I also improved the way we initialize the hash tables to use a better
estimate for the number of groups. And I made it only initialize one
hash table in the read-from-spill path.

With all of the changes I made (thanks to some suggestions from Andres)
the performance is looking pretty good. It's pretty easy to beat
Sort+Group when the group size is 10+. Even for average group size of
~1, HashAgg is getting really close to Sort in some cases.

There are still a few things to do, most notably costing. I also need
to project before spilling to avoid wasting disk. And I'm sure my
changes have created some more problems, so I have some significant
work to do on quality.

My answers to your questions inline:

On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote:

Could we instead track how many tuples we actually consumed for the
in-memory groups, and then use this information to improve the estimate
of number of groups? I mean, if we know we've consumed 1000 tuples
which created 100 groups, then we know there's ~1:10 ratio.

That would be a good estimate for an even distribution, but not
necessarily for a skewed distribution. I'm not opposed to it, but it's
generally my philosophy to overpartition as it seems there's not a big
downside.

A couple of comments based on eye-balling the patch:

1) Shouldn't the hashagg_mem_overflow use the other GUC naming, i.e.
maybe it should be enable_hashagg_mem_overflow or something similar?

The enable_* naming is for planner GUCs. hashagg_mem_overflow is an
execution-time GUC that disables spilling and overflows work_mem (that
is, it reverts to the old behavior).

assume the pointer really is NULL. Surely we'll get a segfault on the
preceding line which does dereference it

pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];

Or am I missing anything?

That's not actually dereferencing anything, it's just doing a pointer
calculation. You are probably right that it's not a good thing to rely
on, or at least not quite as readable, so I changed the order to put
the NULL check first.

3) execGrouping.c

A couple of functions would deserve a comment, explaining what it
does.

- LookupTupleHashEntryHash
- prepare_hash_slot
- calculate_hash

Done, thank you.

And it's not clear to me why we should remove part of the comment
before
TupleHashTableHash.

Trying to remember back to when I first did that, but IIRC the comment
was not updated from a previous change, and I was cleaning it up. I
will check over that again to be sure it's an improvement.

4) I'm not sure I agree with this reasoning that HASH_PARTITION_FACTOR
making the hash tables smaller is desirable - it may be, but if that
was generally the case we'd just use small hash tables all the time.
It's a bit annoying to give the user the capability to set work_mem
and then kinda override that.

I think adding some kind of headroom is reasonable to avoid recursively
spilling, but perhaps it's not critical. I see this as a tuning
question more than anything else. I don't see it as "overriding"
work_mem, but I can see where you're coming from.

5) Not sure what "directly" means in this context?

* partitions at the time we need to spill, and because this algorithm
* shouldn't depend too directly on the internal memory needs of a
* BufFile.

#define HASH_PARTITION_MEM (HASH_MIN_PARTITIONS * BLCKSZ)

Does that mean we don't want to link to PGAlignedBlock, or what?

That's what I meant, yes, but I reworded the comment to not say that.

6) I think we should have some protection against underflows in this
piece of code:

- this would probably deserve some protection against underflow if
HASH_PARTITION_MEM gets too big

if (hashagg_mem_overflow)
aggstate->hash_mem_limit = SIZE_MAX;
else
aggstate->hash_mem_limit = (work_mem * 1024L) - HASH_PARTITION_MEM;

At the moment it's safe because work_mem is 64kB at least, and
HASH_PARTITION_MEM is 32kB (4 partitions, 8kB each). But if we happen
to bump HASH_MIN_PARTITIONS up, this can underflow.

Thank you, done.

> 7) Shouldn't lookup_hash_entry briefly explain why/how it handles the
> memory limit?

Improved.

> 8) The comment before lookup_hash_entries says:
>
>     ...
>      * Return false if hash table has exceeded its memory limit.
>     ...
>
> But that's clearly bogus, because that's a void function.

Thank you, improved comment.

> 9) Shouldn't the hash_finish_initial_spills calls in
> agg_retrieve_direct have a comment, similar to the surrounding code?
> Might be overkill, not sure.

Sure, done.

> 10) The comment for agg_refill_hash_table says
>
>  * Should only be called after all in memory hash table entries have
>  * been consumed.
>
> Can we enforce that with an assert, somehow?

It's a bit awkward. Simplehash doesn't expose the number of groups, and
we would also have to check each hash table. Not a bad idea to add an
interface to simplehash to make that work, though.

> 11) The hash_spill_npartitions naming seems a bit confusing, because
> it seems to imply it's about the "spill" while in practice it just
> chooses the number of spill partitions. Maybe
> hash_choose_num_spill_partitions would be better?

Done.

> 12) It's not clear to me why we need HASH_MAX_PARTITIONS? What's the
> reasoning behind the current value (256)? Not wanting to pick too
> many partitions? Comment?
>
>     if (npartitions > HASH_MAX_PARTITIONS)
>         npartitions = HASH_MAX_PARTITIONS;

Added a comment. There's no deep reasoning there -- I just don't want
it to choose to create 5000 files and surprise a user.

> 13) As for this:
>
>     /* make sure that we don't exhaust the hash bits */
>     if (partition_bits + input_bits >= 32)
>         partition_bits = 32 - input_bits;
>
> We already ran into this issue (exhausting bits in a hash value) in
> hashjoin batching; we should be careful to use the same approach in
> both places (not the same code, just the general approach).

I didn't investigate this yet, but will do.

Regards,
Jeff Davis

Attachments:

hashagg-20191204.diff (text/x-patch; charset=UTF-8) +1863 -123
#18Adam Lee
ali@pivotal.io
In reply to: Jeff Davis (#17)
Re: Memory-Bounded Hash Aggregation

On Wed, Dec 04, 2019 at 06:55:43PM -0800, Jeff Davis wrote:

> Thanks very much for a great review! I've attached a new patch.

Hi,

About the `TODO: project needed attributes only` in your patch, when
would the input tuple contain columns not needed? It seems like anything
you can project has to be in the group or aggregates.

--
Melanie Plageman & Adam

#19Jeff Davis
pgsql@j-davis.com
In reply to: Adam Lee (#18)
Re: Memory-Bounded Hash Aggregation

On Wed, 2019-12-04 at 19:50 -0800, Adam Lee wrote:

> On Wed, Dec 04, 2019 at 06:55:43PM -0800, Jeff Davis wrote:
>
> > Thanks very much for a great review! I've attached a new patch.
>
> Hi,
>
> About the `TODO: project needed attributes only` in your patch, when
> would the input tuple contain columns not needed? It seems like
> anything you can project has to be in the group or aggregates.

If you have a table like:

CREATE TABLE foo(i int, j int, x int, y int, z int);

And do:

SELECT i, SUM(j) FROM foo GROUP BY i;

At least from a logical standpoint, you might expect that we project
only the attributes we need from foo before feeding them into the
HashAgg. But that's not quite how postgres works. Instead, it leaves
the tuples intact (which, in this case, means they have 5 attributes)
until after aggregation and lazily fetches whatever attributes are
referenced. Tuples are spilled straight from the input, at which point
they still have all 5 attributes, so naively copying them to disk is
wasteful.

I'm not sure how often this laziness is really a win in practice,
especially after the expression evaluation has changed so much in
recent releases. So it might be better to just project all the
attributes eagerly, and then none of this would be a problem. If we
still wanted to be lazy about attribute fetching, that should still be
possible even if we did a kind of "logical" projection of the tuple so
that the useless attributes would not be relevant. Regardless, that's
outside the scope of the patch I'm currently working on.

What I'd like to do is copy just the attributes needed into a new
virtual slot, leave the unneeded ones NULL, and then write it out to
the tuplestore as a MinimalTuple. I just need to be sure to get the
right attributes.

Regards,
Jeff Davis

#20Jeff Davis
pgsql@j-davis.com
In reply to: Melanie Plageman (#16)
Re: Memory-Bounded Hash Aggregation

On Wed, 2019-12-04 at 17:24 -0800, Melanie Plageman wrote:

> It looks like the parameter input_tuples passed to hash_spill_init()
> in lookup_hash_entries() is the number of groups estimated by the
> planner. However, when reloading a spill file, if we run out of
> memory and re-spill, hash_spill_init() is passed batch->input_groups
> (which is actually set from input_ngroups, which is the number of
> tuples in the spill file). So, input_tuples is groups and
> input_groups is input_tuples. It may be helpful to rename these.

You're right; this is confusing. I will clarify this in the next patch.

> So, it looks like the HASH_PARTITION_FACTOR is only used when
> re-spilling. The initial hashtable will use work_mem. It seems like
> the reason for using it when re-spilling is to be very conservative,
> to avoid more than one re-spill and make sure each spill file fits in
> a hashtable in memory.

It's used any time a spill happens, even the first spill. I'm flexible
on the use of HASH_PARTITION_FACTOR though... it seems not everyone
thinks it's a good idea. To me it's just a knob to tune and I tend to
think over-partitioning is the safer bet most of the time.

> The comment does seem to point to some other reason, though...

I have observed some anomalies where smaller work_mem values (for
already-low values of work_mem) result in faster runtimes. The only
explanation I have is caching effects.

> 256 actually seems very large. hash_spill_npartitions() will be
> called for every re-spill, so HASH_MAX_PARTITIONS is not the total
> number of spill files permitted but, actually, the number of re-spill
> files in a given spill (a spill set). So if you made X partitions
> initially and every partition re-spills, now you would have (at most)
> X * 256 partitions.

Right. Though I'm not sure there's any theoretical max... given enough
input tuples, it will just keep getting deeper. If this is a serious
concern, maybe I should make it depth-first recursion by prepending
new work items rather than appending. That would still not bound the
theoretical max, but it would slow the growth.

> If HASH_MAX_PARTITIONS is 256, wouldn't the metadata from the spill
> files take up a lot of memory at that point?

Yes. Each file keeps a BLCKSZ buffer, plus some other metadata. And it
does create a file, so it's offloading some work to the OS to manage
that new file.

It's annoying to properly account for these costs because the memory
needs to be reserved at the time we are building the hash table, but we
don't know how many partitions we want until it comes time to spill.
And for that matter, we don't even know whether we will need to spill
or not.

There are two alternative approaches which sidestep this problem:

1. Reserve a fixed fraction of work_mem, say, 1/8, to make space for
however many partitions that memory can handle. We would still have a
min and max, but the logic for reserving the space would be easy and so
would choosing the number of partitions to create.
* Pro: simple
* Con: lose the ability to choose the number of partitions

2. Use logtape.c instead (suggestion from Heikki). Supporting more
logical tapes doesn't impose costs on the OS, and we can potentially
use a lot of logical tapes.
* Pro: can use lots of partitions without making lots of files
* Con: buffering still needs to happen somewhere, so we still need
memory for each logical tape. Also, we risk losing locality of read
access when reading the tapes, or perhaps confusing readahead.
Fundamentally, logtape.c was designed for sequential write, random
read; but we are going to do random write and sequential read.

Regards,
Jeff Davis

#21Jeff Davis
pgsql@j-davis.com
In reply to: Tomas Vondra (#15)
#22Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#21)
#23Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#21)
#24Adam Lee
ali@pivotal.io
In reply to: Jeff Davis (#19)
#25Jeff Davis
pgsql@j-davis.com
In reply to: Tomas Vondra (#15)
#26Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#25)
#27Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#14)
#28Adam Lee
ali@pivotal.io
In reply to: Tomas Vondra (#27)
#29Jeff Davis
pgsql@j-davis.com
In reply to: Adam Lee (#24)
#30Jeff Davis
pgsql@j-davis.com
In reply to: Tomas Vondra (#27)
#31Jeff Davis
pgsql@j-davis.com
In reply to: Tomas Vondra (#27)
#32Adam Lee
ali@pivotal.io
In reply to: Jeff Davis (#29)
#33Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jeff Davis (#31)
#34Jeff Davis
pgsql@j-davis.com
In reply to: Heikki Linnakangas (#33)
In reply to: Jeff Davis (#34)
#36Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#35)
#37Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#36)
#38Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#37)
#39Adam Lee
ali@pivotal.io
In reply to: Jeff Davis (#38)
#40Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jeff Davis (#37)
In reply to: Jeff Davis (#38)
#42Thomas Munro
thomas.munro@gmail.com
In reply to: Peter Geoghegan (#41)
#43Jeff Davis
pgsql@j-davis.com
In reply to: Adam Lee (#39)
#44Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#41)
In reply to: Jeff Davis (#44)
#46Jeff Davis
pgsql@j-davis.com
In reply to: Heikki Linnakangas (#40)
#47Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#46)
#48Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#34)
#49Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jeff Davis (#46)
In reply to: Heikki Linnakangas (#49)
#51Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#38)
#52Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#51)
#53Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#52)
#54Melanie Plageman
melanieplageman@gmail.com
In reply to: Heikki Linnakangas (#33)
#55Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#52)
#56Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Melanie Plageman (#54)
#57Jeff Davis
pgsql@j-davis.com
In reply to: Tomas Vondra (#56)
#58Melanie Plageman
melanieplageman@gmail.com
In reply to: Tomas Vondra (#56)
#59Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#57)
#60Adam Lee
ali@pivotal.io
In reply to: Tomas Vondra (#59)
#61Adam Lee
ali@pivotal.io
In reply to: Tomas Vondra (#59)
#62Jeff Davis
pgsql@j-davis.com
In reply to: Tomas Vondra (#59)
#63Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#59)
#64Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#63)
#65Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#64)
#66Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#65)
#67Jeff Davis
pgsql@j-davis.com
In reply to: Tomas Vondra (#59)
#68Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#62)
#69Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#66)
#70Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#69)
#71Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#70)
#72Justin Pryzby
pryzby@telsasoft.com
In reply to: Jeff Davis (#71)
#73Jeff Davis
pgsql@j-davis.com
In reply to: Justin Pryzby (#72)
#74Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#73)
#75Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#74)
#76Justin Pryzby
pryzby@telsasoft.com
In reply to: Jeff Davis (#73)
#77Pengzhou Tang
ptang@pivotal.io
In reply to: Jeff Davis (#74)
#78Pengzhou Tang
ptang@pivotal.io
In reply to: Pengzhou Tang (#77)
#79Richard Guo
guofenglinux@gmail.com
In reply to: Jeff Davis (#74)
#80Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Richard Guo (#79)
#81Jeff Davis
pgsql@j-davis.com
In reply to: Tomas Vondra (#80)