Memory-Bounded Hash Aggregation
This is for design review. I have a patch (WIP) for Approach 1, and if
this discussion starts to converge on that approach I will polish and
post it.
Let's start at the beginning: why do we have two strategies -- hash
and sort -- for aggregating data? The two are more similar than they
first appear. A partitioned hash strategy writes randomly among the
partitions, and later reads the partitions sequentially; a sort will
write sorted runs sequentially, but then read among the runs
randomly during the merge phase. A hash is a convenient small
representation of the data that is cheaper to operate on; sort uses
abbreviated keys for the same reason.
Hash offers:
* Data is aggregated on-the-fly, effectively "compressing" the amount
of data that needs to go to disk. This is particularly important
when the data contains skewed groups (see below).
* Can output some groups after the first pass of the input data even
if other groups spilled.
* Some data types only support hashing; not sorting.
Sort+Group offers:
* Only one group is accumulating at once, so if the transition state
grows (like with ARRAY_AGG), it minimizes the memory needed.
* The input may already happen to be sorted.
* Some data types only support sorting; not hashing.
Currently, Hash Aggregation is only chosen if the optimizer believes
that all the groups (and their transition states) fit in
memory. Unfortunately, if the optimizer is wrong (often the case if the
input is not a base table), the hash table will
keep growing beyond work_mem, potentially bringing the entire system
to OOM. This patch fixes that problem by extending the Hash
Aggregation strategy to spill to disk when needed.
Previous discussions:
/messages/by-id/1407706010.6623.16.camel@jeff-desktop
/messages/by-id/1419326161.24895.13.camel@jeff-desktop
/messages/by-id/87be3bd5-6b13-d76e-5618-6db0a4db584d@iki.fi
A lot was discussed, which I will try to summarize and address here.
Digression: Skewed Groups:
Imagine the input tuples have the following grouping keys:
0, 1, 0, 2, 0, 3, 0, 4, ..., 0, N-1, 0, N
Group 0 is a skew group because it consists of 50% of all tuples in
the table, whereas every other group has a single tuple. If the
algorithm is able to keep group 0 in memory the whole time until
finalized, that means that it doesn't have to spill any group-0
tuples. In this example, that would amount to a 50% savings, and is a
major advantage of Hash Aggregation versus Sort+Group.
High-level approaches:
1. When the in-memory hash table fills, keep existing entries in the
hash table, and spill the raw tuples for all new groups in a
partitioned fashion. When all input tuples are read, finalize groups
in memory and emit. Now that the in-memory hash table is cleared (and
memory context reset), process a spill file the same as the original
input, but this time with a fraction of the group cardinality.
2. When the in-memory hash table fills, partition the hash space, and
evict the groups from all partitions except one by writing out their
partial aggregate states to disk. Any input tuples belonging to an
evicted partition get spilled to disk. When the input is read
entirely, finalize the groups remaining in memory and emit. Now that
the in-memory hash table is cleared, process the next partition by
loading its partial states into the hash table, and then processing
its spilled tuples.
3. Use some kind of hybrid[1][2] of hashing and sorting.
Evaluation of approaches:
Approach 1 is a nice incremental improvement on today's code. The
final patch may be around 1KLOC. It's a single kind of on-disk data
(spilled tuples), and a single algorithm (hashing). It also handles
skewed groups well because the skewed groups are likely to be
encountered before the hash table fills up the first time, and
therefore will stay in memory.
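To make that control flow concrete, here is a small standalone toy (not
patch code; the fixed-size table, hash function, and all names are
invented for illustration) that processes the skewed stream from the
digression above the way Approach 1 would: groups already in the table
keep aggregating in memory, and once the table is full, tuples for new
groups go to a partition chosen from the high bits of their hash.

/*
 * Toy model of Approach 1: keep groups already in the table, spill raw
 * tuples for new groups once the table is full, and pick the spill
 * partition from the high bits of the hash value.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define TABLE_CAPACITY 4        /* stand-in for the work_mem limit */
#define PARTITION_BITS 2        /* 2^2 = 4 spill partitions */

static uint32_t toy_hash(uint32_t key)
{
    /* any reasonable mixing function works for the illustration */
    key ^= key >> 16;
    key *= 0x45d9f3b;
    key ^= key >> 16;
    return key;
}

int main(void)
{
    uint32_t groups[TABLE_CAPACITY];
    int      ngroups = 0;
    long     part_counts[1 << PARTITION_BITS] = {0};
    long     kept = 0, spilled = 0;

    /* the skewed stream from the digression: 0, 1, 0, 2, 0, 3, ... */
    for (uint32_t i = 1; i <= 1000; i++)
    {
        uint32_t keys[2] = {0, i};

        for (int k = 0; k < 2; k++)
        {
            uint32_t key = keys[k];
            bool     found = false;

            for (int g = 0; g < ngroups; g++)
                if (groups[g] == key)
                    found = true;

            if (found)
                kept++;                      /* advance in-memory state */
            else if (ngroups < TABLE_CAPACITY)
            {
                groups[ngroups++] = key;     /* still room for a new group */
                kept++;
            }
            else
            {
                /* table full: spill the raw tuple, partitioned by hash */
                uint32_t h = toy_hash(key);

                part_counts[h >> (32 - PARTITION_BITS)]++;
                spilled++;
            }
        }
    }

    printf("absorbed in memory: %ld, spilled: %ld\n", kept, spilled);
    for (int p = 0; p < (1 << PARTITION_BITS); p++)
        printf("partition %d: %ld tuples\n", p, part_counts[p]);
    return 0;
}

Even with room for only four groups, the skewed group 0 never spills, so
roughly half of the input is absorbed on the first pass -- the 50%
savings from the digression. Each spill file is then reprocessed the
same way, with the next few hash bits (input_bits in the patch)
selecting sub-partitions, so each pass sees only a fraction of the group
cardinality.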
Approach 2 is nice because it resembles the approach of Hash Join, and
it can determine whether a tuple should be spilled without a hash
lookup. Unfortunately, those upsides are fairly mild, and it has
significant downsides:
* It doesn't handle skew values well because it's likely to evict
them.
* If we leave part of the hash table in memory, it's difficult to
ensure that we will be able to actually use the space freed by
eviction, because the freed memory may be fragmented. That could
force us to evict the entire in-memory hash table as soon as we
partition, reducing a lot of the benefit of hashing.
* It requires eviction for the algorithm to work. That may be
necessary for handling cases like ARRAY_AGG (see below) anyway, but
this approach constrains the specifics of eviction.
Approach 3 is interesting because it unifies the two approaches and
can get some of the benefits of both. It's only a single path, so it
avoids planner mistakes. I really like this idea and it's possible we
will end up with approach 3. However:
* It requires that all data types support sorting, or that we punt
somehow.
* Right now we are in a weird state because hash aggregation cheats,
so it's difficult to evaluate whether Approach 3 is moving us in the
right direction because we have no other correct implementation to
compare against. Even if Approach 3 is where we end up, it seems
like we should fix hash aggregation as a stepping stone first.
* It means we have a hash table and sort running concurrently, each
using memory. Andres said this might not be a problem[3], but I'm
not convinced that the problem is zero. If you use small work_mem
for the write phase of sorting, you'll end up with a lot of runs to
merge later and that has some kind of cost.
* The simplicity might start to evaporate when we consider grouping
sets and eviction strategy.
Main topics to consider:
ARRAY_AGG:
Some aggregates, like ARRAY_AGG, have a transition state that grows
proportionally with the group size. In other words, it is not a
summary like COUNT or AVG, it contains all of the input data in a new
form.
These aggregates are not good candidates for hash aggregation. Hash
aggregation is about keeping many transition states running in
parallel, which is just a bad fit for large transition states. Sorting
is better because it advances one transition state at a time. We could:
* Let ARRAY_AGG continue to exceed work_mem like today.
* Block or pessimize use of hash aggregation for such aggregates.
* Evict groups from the hash table when it becomes too large. This
requires the ability to serialize and deserialize transition states,
and some approaches here might also need combine_func
specified. These requirements seem reasonable, but we still need
some answer of what to do for aggregates that grow like ARRAY_AGG
but don't have the required serialfunc, deserialfunc, or
combine_func.
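To make the last option concrete, here is a schematic sketch (toy types
and functions only; none of this is PostgreSQL API, and it ignores where
the reloaded state lives) of the cycle that eviction requires: serialize
a partly-built, growing transition state to a spill file, free it, and
later deserialize it and combine it with the state accumulated since.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct ToyArrayAggState    /* grows with the group, like ARRAY_AGG */
{
    size_t  nvalues;
    int    *values;
} ToyArrayAggState;

/* serialfunc: flatten the state into a spill file */
static void
toy_serialize(const ToyArrayAggState *state, FILE *spill)
{
    fwrite(&state->nvalues, sizeof(size_t), 1, spill);
    fwrite(state->values, sizeof(int), state->nvalues, spill);
}

/* deserialfunc: rebuild a state from the spill file */
static ToyArrayAggState *
toy_deserialize(FILE *spill)
{
    ToyArrayAggState *state = malloc(sizeof(ToyArrayAggState));

    fread(&state->nvalues, sizeof(size_t), 1, spill);
    state->values = malloc(sizeof(int) * state->nvalues);
    fread(state->values, sizeof(int), state->nvalues, spill);
    return state;
}

/* combine_func: merge two partial states for the same group */
static void
toy_combine(ToyArrayAggState *into, const ToyArrayAggState *from)
{
    into->values = realloc(into->values,
                           sizeof(int) * (into->nvalues + from->nvalues));
    memcpy(into->values + into->nvalues, from->values,
           sizeof(int) * from->nvalues);
    into->nvalues += from->nvalues;
}

int main(void)
{
    ToyArrayAggState before = {3, malloc(3 * sizeof(int))};
    ToyArrayAggState after  = {2, malloc(2 * sizeof(int))};
    FILE            *spill  = tmpfile();

    memcpy(before.values, (int[]){1, 2, 3}, 3 * sizeof(int));
    memcpy(after.values,  (int[]){4, 5},    2 * sizeof(int));

    /* evict: write the partial state out and release its memory */
    toy_serialize(&before, spill);
    free(before.values);

    /* later: reload the partial state and merge in the newer one */
    rewind(spill);
    ToyArrayAggState *reloaded = toy_deserialize(spill);
    toy_combine(reloaded, &after);

    printf("combined group size: %zu\n", reloaded->nvalues);    /* prints 5 */
    return 0;
}

The combine step is what makes it safe to hold only part of a group's
state in memory at a time; without it (or without the serial/deserial
pair), the only options are the first two bullets above.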
GROUPING SETS:
With grouping sets, there are multiple hash tables and each hash table
has its own hash function, so that makes partitioning more
complex. In Approach 1, that means we need to either (a) not partition
the spilled tuples; or (b) have a different set of partitions for each
hash table and spill the same tuple multiple times. In Approach 2, we
would be required to partition each hash table separately and spill
tuples multiple times. In Approach 3 (depending on the exact approach
but taking a guess here) we would need to add a set of phases (one
extra phase for each hash table) for spilled tuples.
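To illustrate, option (b) would look roughly like the toy below (purely
hypothetical and simplified, not patch code): each grouping set hashes
its own subset of the grouping columns and owns its own partition files,
so a spilled input tuple is written once per set, usually landing in a
different partition number for each.

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define NUM_SETS        2
#define PARTITION_BITS  2

typedef struct { int a; int b; } Tuple;

static uint32_t mix(uint32_t h, uint32_t v)
{
    h ^= v + 0x9e3779b9u + (h << 6) + (h >> 2);
    return h;
}

/* per-set hash: set 0 groups by (a, b), set 1 groups by (a) alone */
static uint32_t set_hash(int setno, const Tuple *t)
{
    uint32_t h = mix(0, (uint32_t) t->a);

    if (setno == 0)
        h = mix(h, (uint32_t) t->b);
    return h;
}

int main(void)
{
    Tuple t = {42, 7};

    for (int setno = 0; setno < NUM_SETS; setno++)
    {
        uint32_t h = set_hash(setno, &t);
        int      partition = h >> (32 - PARTITION_BITS);

        /* the real thing would append the tuple to that set's spill file */
        printf("set %d: hash %08" PRIx32 " -> partition %d\n",
               setno, h, partition);
    }
    return 0;
}

Option (a) avoids the duplicate writes, but gives up the cardinality
reduction that partitioning buys when a spill file is reprocessed.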
MEMORY TRACKING:
I have a patch to track the total allocated memory by
incrementing/decrementing it when blocks are malloc'd/free'd. This
doesn't do bookkeeping for each chunk, only each block. Previously,
Robert Haas raised some concerns[4] about performance, which were
mitigated[5] but perhaps not entirely eliminated (but did become
elusive).
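To show the granularity, here is a toy allocator (all names invented; it
is not aset.c) that accounts the same way: the counter moves only when a
whole block is malloc'd or free'd, while the many small chunk
allocations carved out of those blocks never touch it.

#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE 8192

typedef struct ToyBlock
{
    struct ToyBlock *next;
    size_t           used;
    char             data[BLOCK_SIZE];
} ToyBlock;

typedef struct ToyContext
{
    ToyBlock *blocks;
    size_t    mem_allocated;    /* total bytes malloc'd for blocks */
} ToyContext;

static void *
toy_alloc(ToyContext *cxt, size_t size)
{
    ToyBlock *block = cxt->blocks;

    if (block == NULL || block->used + size > BLOCK_SIZE)
    {
        block = malloc(sizeof(ToyBlock));
        block->next = cxt->blocks;
        block->used = 0;
        cxt->blocks = block;
        cxt->mem_allocated += sizeof(ToyBlock);    /* counted per block... */
    }

    void *chunk = block->data + block->used;       /* ...not per chunk */

    block->used += size;
    return chunk;
}

static void
toy_reset(ToyContext *cxt)
{
    while (cxt->blocks)
    {
        ToyBlock *block = cxt->blocks;

        cxt->blocks = block->next;
        cxt->mem_allocated -= sizeof(ToyBlock);
        free(block);
    }
}

int main(void)
{
    ToyContext cxt = {NULL, 0};

    for (int i = 0; i < 10000; i++)
        (void) toy_alloc(&cxt, 64);

    /* cheap to answer "how much memory is this thing using?" */
    printf("allocated: %zu bytes\n", cxt.mem_allocated);
    toy_reset(&cxt);
    printf("after reset: %zu bytes\n", cxt.mem_allocated);
    return 0;
}

That is why the cost shows up only on the comparatively rare block
allocations, not on every palloc.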
The only alternative is estimation, which is ugly and seems like a bad
idea. Memory usage isn't just driven by inputs, it's also driven by
patterns of use. Misestimates in the planner are fine (within reason)
because we don't have any other choice, and a small-factor misestimate
might not change the plan anyway. But in the executor, a small-factor
misestimate seems like it's just not doing the job. If a user found
that hash aggregation was using 3X work_mem, and my only explanation
is "well, it's just an estimate", I would be pretty embarrassed and
the user would likely lose confidence in the feature. I don't mean
that we must track memory perfectly everywhere, but using an estimate
seems like a mediocre improvement of the current state.
We should proceed with memory context tracking and try to eliminate or
mitigate performance concerns. I would not like to make any herculean
effort as a part of the hash aggregation work though; I think it's
basically just something a memory manager in a database system should
have supported all along. I think we will find other uses for it as
time goes on. We have more and more things happening in the executor
and having a cheap way to check "how much memory is this thing using?"
seems very likely to be useful.
Other points:
* Someone brought up the idea of using logtapes.c instead of writing
separate files for each partition. That seems reasonable, but it's
using logtapes.c slightly outside of its intended purpose. Also,
it's awkward to need to specify the number of tapes up-front. Worth
experimenting with to see if it's a win.
* Tomas did some experiments regarding the number of batches to choose
and how to choose them. It seems like there's room for improvement
over the simple calculation I'm doing now (a possible calculation is
sketched after this list).
* A lot of discussion about a smart eviction strategy. I don't see
strong evidence that it's worth the complexity at this time. The
smarter we try to be, the more bookkeeping and memory fragmentation
problems we will have. If we evict something, we should probably
evict the whole hash table or some large part of it.
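On the batch-count point above: the WIP patch currently just uses a
fixed 32 partitions (see the TODO in hash_spill_tuple), so there is
plenty of room to be smarter here. Purely as an illustration of the
shape such a calculation could take -- an invented heuristic, not
something the patch or Tomas's experiments prescribe -- the partition
count could be derived from the estimated spill volume:

#include <stdio.h>

/*
 * Invented heuristic: pick enough partitions that each spill file has a
 * chance of fitting within work_mem on the next pass, clamped and kept a
 * power of two so partition selection stays a simple shift of hash bits.
 */
static int
choose_num_partitions(double estimated_spill_bytes, double work_mem_bytes)
{
    int npartitions = 2;

    while (npartitions < 256 &&
           estimated_spill_bytes / npartitions > work_mem_bytes)
        npartitions *= 2;

    return npartitions;
}

int main(void)
{
    /* e.g. expecting ~1GB of spilled tuples with work_mem = 4MB */
    printf("partitions: %d\n",
           choose_num_partitions(1024.0 * 1024 * 1024, 4.0 * 1024 * 1024));
    return 0;
}

The power-of-two rounding matches how the patch derives partition_bits,
and the upper clamp bounds the number of BufFiles held open at once.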
Regards,
Jeff Davis
[1]: /messages/by-id/20180604185205.epue25jzpavokupf@alap3.anarazel.de
[2]: /messages/by-id/message-id/CAGTBQpa__-NP7=kKwze_enkqw18vodRxKkOmNhxAPzqkruc-8g@mail.gmail.com
[3]: /messages/by-id/20180605175209.vavuqe4idovcpeie@alap3.anarazel.de
[4]: /messages/by-id/CA+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajP3Cro9dBRaA@mail.gmail.com
[5]: /messages/by-id/1413422787.18615.18.camel@jeff-desktop
Hi Jeff,
On Mon, Jul 01, 2019 at 12:13:53PM -0700, Jeff Davis wrote:
This is for design review. I have a patch (WIP) for Approach 1, and if
this discussion starts to converge on that approach I will polish and
post it.
Thanks for working on this.
Let's start at the beginning: why do we have two strategies -- hash
and sort -- for aggregating data? The two are more similar than they
first appear. A partitioned hash strategy writes randomly among the
partitions, and later reads the partitions sequentially; a sort will
write sorted runs sequentially, but then read among the runs
randomly during the merge phase. A hash is a convenient small
representation of the data that is cheaper to operate on; sort uses
abbreviated keys for the same reason.
What does "partitioned hash strategy" do? It's probably explained in one
of the historical discussions, but I'm not sure which one. I assume it
simply hashes the group keys and uses that to partition the data, and then
passes it to the hash aggregate.
Hash offers:
* Data is aggregated on-the-fly, effectively "compressing" the amount
of data that needs to go to disk. This is particularly important
when the data contains skewed groups (see below).
* Can output some groups after the first pass of the input data even
if other groups spilled.
* Some data types only support hashing; not sorting.
Sort+Group offers:
* Only one group is accumulating at once, so if the transition state
grows (like with ARRAY_AGG), it minimizes the memory needed.
* The input may already happen to be sorted.
* Some data types only support sorting; not hashing.
Currently, Hash Aggregation is only chosen if the optimizer believes
that all the groups (and their transition states) fit in
memory. Unfortunately, if the optimizer is wrong (often the case if the
input is not a base table), the hash table will
keep growing beyond work_mem, potentially bringing the entire system
to OOM. This patch fixes that problem by extending the Hash
Aggregation strategy to spill to disk when needed.
OK, makes sense.
Previous discussions:
/messages/by-id/1407706010.6623.16.camel@jeff-desktop
/messages/by-id/1419326161.24895.13.camel@jeff-desktop
/messages/by-id/87be3bd5-6b13-d76e-5618-6db0a4db584d@iki.fi
A lot was discussed, which I will try to summarize and address here.
Digression: Skewed Groups:
Imagine the input tuples have the following grouping keys:
0, 1, 0, 2, 0, 3, 0, 4, ..., 0, N-1, 0, N
Group 0 is a skew group because it consists of 50% of all tuples in
the table, whereas every other group has a single tuple. If the
algorithm is able to keep group 0 in memory the whole time until
finalized, that means that it doesn't have to spill any group-0
tuples. In this example, that would amount to a 50% savings, and is a
major advantage of Hash Aggregation versus Sort+Group.
Right. I agree efficiently handling skew is important and may be crucial
for achieving good performance.
High-level approaches:
1. When the in-memory hash table fills, keep existing entries in the
hash table, and spill the raw tuples for all new groups in a
partitioned fashion. When all input tuples are read, finalize groups
in memory and emit. Now that the in-memory hash table is cleared (and
memory context reset), process a spill file the same as the original
input, but this time with a fraction of the group cardinality.
2. When the in-memory hash table fills, partition the hash space, and
evict the groups from all partitions except one by writing out their
partial aggregate states to disk. Any input tuples belonging to an
evicted partition get spilled to disk. When the input is read
entirely, finalize the groups remaining in memory and emit. Now that
the in-memory hash table is cleared, process the next partition by
loading its partial states into the hash table, and then processing
its spilled tuples.
3. Use some kind of hybrid[1][2] of hashing and sorting.
Unfortunately the second link does not work :-(
Evaluation of approaches:
Approach 1 is a nice incremental improvement on today's code. The
final patch may be around 1KLOC. It's a single kind of on-disk data
(spilled tuples), and a single algorithm (hashing). It also handles
skewed groups well because the skewed groups are likely to be
encountered before the hash table fills up the first time, and
therefore will stay in memory.
I'm not going to block Approach 1, although I'd really like to see
something that helps with array_agg.
Approach 2 is nice because it resembles the approach of Hash Join, and
it can determine whether a tuple should be spilled without a hash
lookup. Unfortunately, those upsides are fairly mild, and it has
significant downsides:
* It doesn't handle skew values well because it's likely to evict
them.
* If we leave part of the hash table in memory, it's difficult to
ensure that we will be able to actually use the space freed by
eviction, because the freed memory may be fragmented. That could
force us to evict the entire in-memory hash table as soon as we
partition, reducing a lot of the benefit of hashing.
Yeah, and it may not work well with the memory accounting if we only track
the size of allocated blocks, not chunks (because pfree likely won't free
the blocks).
* It requires eviction for the algorithm to work. That may be
necessary for handling cases like ARRAY_AGG (see below) anyway, but
this approach constrains the specifics of eviction.
Approach 3 is interesting because it unifies the two approaches and
can get some of the benefits of both. It's only a single path, so it
avoids planner mistakes. I really like this idea and it's possible we
will end up with approach 3. However:
* It requires that all data types support sorting, or that we punt
somehow.
* Right now we are in a weird state because hash aggregation cheats,
so it's difficult to evaluate whether Approach 3 is moving us in the
right direction because we have no other correct implementation to
compare against. Even if Approach 3 is where we end up, it seems
like we should fix hash aggregation as a stepping stone first.
Aren't all three approaches a way to "fix" hash aggregate? In any case,
it's certainly reasonable to make incremental changes. The question is
whether "approach 1" is sensible step towards some form of "approach 3"
* It means we have a hash table and sort running concurrently, each
using memory. Andres said this might not be a problem[3], but I'm
not convinced that the problem is zero. If you use small work_mem
for the write phase of sorting, you'll end up with a lot of runs to
merge later and that has some kind of cost.
Why would we need to do both concurrently? I thought we'd empty the hash
table before doing the sort, no?
* The simplicity might start to evaporate when we consider grouping
sets and eviction strategy.
Hmm, yeah :-/
Main topics to consider:
ARRAY_AGG:
Some aggregates, like ARRAY_AGG, have a transition state that grows
proportionally with the group size. In other words, it is not a
summary like COUNT or AVG, it contains all of the input data in a new
form.
Strictly speaking the state may grow even for count/avg aggregates, e.g.
for numeric types, but it's far less serious than array_agg etc.
These aggregates are not good candidates for hash aggregation. Hash
aggregation is about keeping many transition states running in
parallel, which is just a bad fit for large transition states. Sorting
is better because it advances one transition state at a time. We could:
* Let ARRAY_AGG continue to exceed work_mem like today.
* Block or pessimize use of hash aggregation for such aggregates.
* Evict groups from the hash table when it becomes too large. This
requires the ability to serialize and deserialize transition states,
and some approaches here might also need combine_func
specified. These requirements seem reasonable, but we still need
some answer of what to do for aggregates that grow like ARRAY_AGG
but don't have the required serialfunc, deserialfunc, or
combine_func.
Do we actually need to handle that case? How many such aggregates are
there? I think it's OK to just ignore that case (and keep doing what we do
now), and require serial/deserial functions for anything better.
GROUPING SETS:
With grouping sets, there are multiple hash tables and each hash table
has its own hash function, so that makes partitioning more
complex. In Approach 1, that means we need to either (a) not partition
the spilled tuples; or (b) have a different set of partitions for each
hash table and spill the same tuple multiple times. In Approach 2, we
would be required to partition each hash table separately and spill
tuples multiple times. In Approach 3 (depending on the exact approach
but taking a guess here) we would need to add a set of phases (one
extra phase for each hash table) for spilled tuples.
No thoughts about this yet.
MEMORY TRACKING:
I have a patch to track the total allocated memory by
incrementing/decrementing it when blocks are malloc'd/free'd. This
doesn't do bookkeeping for each chunk, only each block. Previously,
Robert Haas raised some concerns[4] about performance, which were
mitigated[5] but perhaps not entirely eliminated (but did become
elusive).
The only alternative is estimation, which is ugly and seems like a bad
idea. Memory usage isn't just driven by inputs, it's also driven by
patterns of use. Misestimates in the planner are fine (within reason)
because we don't have any other choice, and a small-factor misestimate
might not change the plan anyway. But in the executor, a small-factor
misestimate seems like it's just not doing the job. If a user found
that hash aggregation was using 3X work_mem, and my only explanation
is "well, it's just an estimate", I would be pretty embarrassed and
the user would likely lose confidence in the feature. I don't mean
that we must track memory perfectly everywhere, but using an estimate
seems like a mediocre improvement of the current state.
I agree estimates are not the right tool here.
We should proceed with memory context tracking and try to eliminate or
mitigate performance concerns. I would not like to make any herculean
effort as a part of the hash aggregation work though; I think it's
basically just something a memory manager in a database system should
have supported all along. I think we will find other uses for it as
time goes on. We have more and more things happening in the executor
and having a cheap way to check "how much memory is this thing using?"
seems very likely to be useful.
IMO we should just use the cheapest memory accounting (tracking the amount
of memory allocated for blocks). I agree it's a feature we need, I don't
think we can devise anything cheaper than this.
Other points:
* Someone brought up the idea of using logtapes.c instead of writing
separate files for each partition. That seems reasonable, but it's
using logtapes.c slightly outside of its intended purpose. Also,
it's awkward to need to specify the number of tapes up-front. Worth
experimenting with to see if it's a win.
* Tomas did some experiments regarding the number of batches to choose
and how to choose them. It seems like there's room for improvement
over the simple calculation I'm doing now.
Me? I don't recall such benchmarks, but maybe I did. But I think we'll
need to repeat those with the new patches etc. I think the question is
whether we see this as an emergency solution - in that case I wouldn't
obsess about getting the best possible parameters.
* A lot of discussion about a smart eviction strategy. I don't see
strong evidence that it's worth the complexity at this time. The
smarter we try to be, the more bookkeeping and memory fragmentation
problems we will have. If we evict something, we should probably
evict the whole hash table or some large part of it.
Maybe. For each "smart" eviction strategy there is a (trivial) example
of data on which it performs poorly.
I think it's the same thing as with the number of partitions - if we
consider this to be an emergency solution, it's OK if the performance is
not entirely perfect when it kicks in.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, 2019-07-01 at 12:13 -0700, Jeff Davis wrote:
This is for design review. I have a patch (WIP) for Approach 1, and if
this discussion starts to converge on that approach I will polish and
post it.
WIP patch attached (based on 9a81c9fa); targeting September CF.
Not intended for detailed review yet, but it seems to work in enough
cases (including grouping sets and JIT) to be a good proof-of-concept
for the algorithm and its complexity.
Initial performance numbers put it at 2X slower than sort for grouping
10M distinct integers. There are quite a few optimizations I haven't
tried yet and quite a few tunables I haven't tuned yet, so hopefully I
can close the gap a bit for the small-groups case.
I will offer more details soon when I have more confidence in the
numbers.
It does not attempt to spill ARRAY_AGG at all yet.
Regards,
Jeff Davis
Attachments:
hashagg-20190703.patch (text/x-patch)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..9f978e5a90 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1702,6 +1702,23 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-hashagg-mem-overflow" xreflabel="hashagg_mem_overflow">
+ <term><varname>hashagg_mem_overflow</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>hashagg_mem_overflow</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ If hash aggregation exceeds <varname>work_mem</varname> at query
+ execution time, and <varname>hashagg_mem_overflow</varname> is set
+ to <literal>on</literal>, continue consuming more memory rather than
+ performing disk-based hash aggregation. The default
+ is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
@@ -4354,6 +4371,24 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-hashagg-spill" xreflabel="enable_hashagg_spill">
+ <term><varname>enable_hashagg_spill</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_hashagg_spill</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the memory usage is expected to
+ exceed <varname>work_mem</varname>. This only affects the planner
+ choice; actual behavior at execution time is dictated by
+ <xref linkend="guc-hashagg-mem-overflow"/>. The default
+ is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
<term><varname>enable_hashjoin</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 92969636b7..6d6481a75f 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -102,6 +102,7 @@ static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_hashagg_info(AggState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
static void show_instrumentation_count(const char *qlabel, int which,
@@ -1826,6 +1827,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Agg:
show_agg_keys(castNode(AggState, planstate), ancestors, es);
show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+ show_hashagg_info((AggState *) planstate, es);
if (plan->qual)
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
@@ -2715,6 +2717,55 @@ show_hash_info(HashState *hashstate, ExplainState *es)
}
}
+/*
+ * Show information on hash aggregate buckets and batches
+ */
+static void
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+ long diskKb = (aggstate->hash_disk_used + 1023) / 1024;
+
+
+ Assert(IsA(aggstate, AggState));
+
+ if (agg->aggstrategy != AGG_HASHED &&
+ agg->aggstrategy != AGG_MIXED)
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(
+ es->str,
+ "Memory Usage: %ldkB",
+ memPeakKb);
+
+ if (aggstate->hash_batches_used > 0)
+ {
+ appendStringInfo(
+ es->str,
+ " Batches: %d Disk Usage:%ldkB",
+ aggstate->hash_batches_used, diskKb);
+ }
+
+ appendStringInfo(
+ es->str,
+ "\n");
+ }
+ else
+ {
+ ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
+ if (aggstate->hash_batches_used > 0)
+ {
+ ExplainPropertyInteger("HashAgg Batches", NULL,
+ aggstate->hash_batches_used, es);
+ ExplainPropertyInteger("Disk Usage", "kB", diskKb, es);
+ }
+ }
+}
+
/*
* If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
*/
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 66a67c72b2..19e1127627 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -1570,7 +1570,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
[op->d.agg_init_trans.transno];
/* If transValue has not yet been initialized, do so now. */
- if (pergroup->noTransValue)
+ if (pergroup != NULL && pergroup->noTransValue)
{
AggStatePerTrans pertrans = op->d.agg_init_trans.pertrans;
@@ -1597,7 +1597,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
[op->d.agg_strict_trans_check.setoff]
[op->d.agg_strict_trans_check.transno];
- if (unlikely(pergroup->transValueIsNull))
+ if (pergroup == NULL ||
+ unlikely(pergroup->transValueIsNull))
EEO_JUMP(op->d.agg_strict_trans_check.jumpnull);
EEO_NEXT();
@@ -1624,6 +1625,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
[op->d.agg_trans.setoff]
[op->d.agg_trans.transno];
+ if (pergroup == NULL)
+ EEO_NEXT();
+
Assert(pertrans->transtypeByVal);
fcinfo = pertrans->transfn_fcinfo;
@@ -1675,6 +1679,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
[op->d.agg_trans.setoff]
[op->d.agg_trans.transno];
+ if (pergroup == NULL)
+ EEO_NEXT();
+
Assert(!pertrans->transtypeByVal);
fcinfo = pertrans->transfn_fcinfo;
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 14ee8db3f9..91714664d6 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -25,7 +25,6 @@
#include "utils/hashutils.h"
#include "utils/memutils.h"
-static uint32 TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple);
static int TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
/*
@@ -371,17 +370,12 @@ FindTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
/*
* Compute the hash value for a tuple
*
- * The passed-in key is a pointer to TupleHashEntryData. In an actual hash
- * table entry, the firstTuple field points to a tuple (in MinimalTuple
- * format). LookupTupleHashEntry sets up a dummy TupleHashEntryData with a
- * NULL firstTuple field --- that cues us to look at the inputslot instead.
- * This convention avoids the need to materialize virtual input tuples unless
- * they actually need to get copied into the table.
+ * If tuple is NULL, use the input slot instead.
*
* Also, the caller must select an appropriate memory context for running
* the hash functions. (dynahash.c doesn't change CurrentMemoryContext.)
*/
-static uint32
+uint32
TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
{
TupleHashTable hashtable = (TupleHashTable) tb->private_data;
@@ -402,9 +396,6 @@ TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
{
/*
* Process a tuple already stored in the table.
- *
- * (this case never actually occurs due to the way simplehash.h is
- * used, as the hash-value is stored in the entries)
*/
slot = hashtable->tableslot;
ExecStoreMinimalTuple(tuple, slot, false);
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 6b8ef40599..1548de220e 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -229,14 +229,40 @@
#include "optimizer/optimizer.h"
#include "parser/parse_agg.h"
#include "parser/parse_coerce.h"
+#include "storage/buffile.h"
#include "utils/acl.h"
#include "utils/builtins.h"
+#include "utils/dynahash.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
#include "utils/datum.h"
+/*
+ * Represents partitioned spill data for a single hashtable.
+ */
+typedef struct HashAggSpill
+{
+ int n_partitions; /* number of output partitions */
+ int partition_bits; /* number of bits for partition mask
+ log2(n_partitions) parent partition bits */
+ BufFile **partitions; /* output partition files */
+ int64 *ntuples; /* number of tuples in each partition */
+} HashAggSpill;
+
+/*
+ * Represents work to be done for one pass of hash aggregation. Initially,
+ * only the input fields are set. If spilled to disk, also set the spill data.
+ */
+typedef struct HashAggBatch
+{
+ BufFile *input_file; /* input partition */
+ int input_bits; /* number of bits for input partition mask */
+ int64 input_groups; /* estimated number of input groups */
+ int setno; /* grouping set */
+ HashAggSpill spill; /* spill output */
+} HashAggBatch;
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
static void initialize_phase(AggState *aggstate, int newphase);
@@ -272,11 +298,24 @@ static TupleTableSlot *project_aggregates(AggState *aggstate);
static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
static void build_hash_table(AggState *aggstate);
-static TupleHashEntryData *lookup_hash_entry(AggState *aggstate);
+static AggStatePerGroup lookup_hash_entry(AggState *aggstate);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
+static bool agg_refill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
+static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
+static Size hash_spill_tuple(TupleHashTable hashtable, HashAggSpill *spill,
+ int input_bits, TupleTableSlot *slot);
+static void hash_spill_tuples(AggState *aggstate, TupleTableSlot *slot);
+static MinimalTuple hash_read_spilled(BufFile *file);
+static HashAggBatch *hash_batch_new(BufFile *input_file, int setno,
+ int64 input_groups, int input_bits);
+static void hash_finish_initial_spills(AggState *aggstate);
+static void hash_spill_finish(AggState *aggstate, HashAggSpill *spill,
+ int setno, int input_bits);
+static void hash_reset_spill(HashAggSpill *spill);
+static void hash_reset_spills(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
AggState *aggstate, EState *estate,
@@ -1269,6 +1308,10 @@ build_hash_table(AggState *aggstate)
Assert(aggstate->aggstrategy == AGG_HASHED || aggstate->aggstrategy == AGG_MIXED);
+ /* TODO: work harder to find a good nGroups for each hash table. We don't
+ * want the hash table itself to fill up work_mem with no room for
+ * out-of-line transition values. Also, we need to consider that there are
+ * multiple hash tables for grouping sets. */
additionalsize = aggstate->numtrans * sizeof(AggStatePerGroupData);
for (i = 0; i < aggstate->num_hashes; ++i)
@@ -1294,6 +1337,15 @@ build_hash_table(AggState *aggstate)
tmpmem,
DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
}
+
+ /*
+ * Set initial size to be that of an empty hash table. This ensures that
+ * at least one entry can be added before it exceeds work_mem; otherwise
+ * the algorithm might not make progress.
+ */
+ aggstate->hash_mem_init = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_mem_current = aggstate->hash_mem_init;
}
/*
@@ -1462,14 +1514,14 @@ hash_agg_entry_size(int numAggs)
*
* When called, CurrentMemoryContext should be the per-query context.
*/
-static TupleHashEntryData *
+static AggStatePerGroup
lookup_hash_entry(AggState *aggstate)
{
TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
TupleTableSlot *hashslot = perhash->hashslot;
TupleHashEntryData *entry;
- bool isnew;
+ bool isnew = false;
int i;
/* transfer just the needed columns into hashslot */
@@ -1486,12 +1538,26 @@ lookup_hash_entry(AggState *aggstate)
ExecStoreVirtualTuple(hashslot);
/* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntry(perhash->hashtable, hashslot, &isnew);
+ if (aggstate->hash_can_spill &&
+ aggstate->hash_mem_current > work_mem * 1024L &&
+ aggstate->hash_mem_current > aggstate->hash_mem_init)
+ entry = LookupTupleHashEntry(perhash->hashtable, hashslot, NULL);
+ else
+ entry = LookupTupleHashEntry(perhash->hashtable, hashslot, &isnew);
+
+ if (entry == NULL)
+ return NULL;
if (isnew)
{
- AggStatePerGroup pergroup;
- int transno;
+ AggStatePerGroup pergroup;
+ int transno;
+
+ aggstate->hash_mem_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+
+ if (aggstate->hash_mem_current > aggstate->hash_mem_peak)
+ aggstate->hash_mem_peak = aggstate->hash_mem_current;
pergroup = (AggStatePerGroup)
MemoryContextAlloc(perhash->hashtable->tablecxt,
@@ -1511,7 +1577,7 @@ lookup_hash_entry(AggState *aggstate)
}
}
- return entry;
+ return entry->additional;
}
/*
@@ -1519,6 +1585,8 @@ lookup_hash_entry(AggState *aggstate)
* returning an array of pergroup pointers suitable for advance_aggregates.
*
* Be aware that lookup_hash_entry can reset the tmpcontext.
+ *
+ * An entry in the pergroup array will be NULL if the hash table has exceeded
+ * its memory limit and the tuple's group is not already present.
*/
static void
lookup_hash_entries(AggState *aggstate)
@@ -1530,7 +1598,7 @@ lookup_hash_entries(AggState *aggstate)
for (setno = 0; setno < numHashes; setno++)
{
select_current_set(aggstate, setno, true);
- pergroup[setno] = lookup_hash_entry(aggstate)->additional;
+ pergroup[setno] = lookup_hash_entry(aggstate);
}
}
@@ -1841,6 +1909,8 @@ agg_retrieve_direct(AggState *aggstate)
aggstate->current_phase == 1)
{
lookup_hash_entries(aggstate);
+ hash_spill_tuples(
+ aggstate, aggstate->tmpcontext->ecxt_outertuple);
}
/* Advance the aggregates (or combine functions) */
@@ -1852,6 +1922,10 @@ agg_retrieve_direct(AggState *aggstate)
outerslot = fetch_input_tuple(aggstate);
if (TupIsNull(outerslot))
{
+ if (aggstate->aggstrategy == AGG_MIXED &&
+ aggstate->current_phase == 1)
+ hash_finish_initial_spills(aggstate);
+
/* no more outer-plan tuples available */
if (hasGroupingSets)
{
@@ -1944,6 +2018,7 @@ agg_fill_hash_table(AggState *aggstate)
/* Find or build hashtable entries */
lookup_hash_entries(aggstate);
+ hash_spill_tuples(aggstate, outerslot);
/* Advance the aggregates (or combine functions) */
advance_aggregates(aggstate);
@@ -1955,6 +2030,8 @@ agg_fill_hash_table(AggState *aggstate)
ResetExprContext(aggstate->tmpcontext);
}
+ hash_finish_initial_spills(aggstate);
+
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
@@ -1962,11 +2039,125 @@ agg_fill_hash_table(AggState *aggstate)
&aggstate->perhash[0].hashiter);
}
+/*
+ * If any data was spilled during hash aggregation, reset the hash table and
+ * reprocess one batch of spilled data. After reprocessing a batch, the hash
+ * table will again contain data, ready to be consumed by
+ * agg_retrieve_hash_table_in_memory().
+ *
+ * Should only be called after all in memory hash table entries have been
+ * consumed.
+ *
+ * Return false when input is exhausted and there's no more work to be done;
+ * otherwise return true.
+ */
+static bool
+agg_refill_hash_table(AggState *aggstate)
+{
+ HashAggBatch *batch;
+
+ if (aggstate->hash_batches == NIL)
+ return false;
+
+ /* free memory */
+ ReScanExprContext(aggstate->hashcontext);
+ /* Rebuild an empty hash table */
+ build_hash_table(aggstate);
+
+ batch = linitial(aggstate->hash_batches);
+ aggstate->hash_batches = list_delete_first(aggstate->hash_batches);
+
+ Assert(aggstate->current_phase == 0);
+
+ if (aggstate->phase->aggstrategy == AGG_MIXED)
+ {
+ aggstate->current_phase = 1;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+ }
+
+ for (;;) {
+ TupleTableSlot *slot = aggstate->hash_spill_slot;
+ MinimalTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tuple = hash_read_spilled(batch->input_file);
+ if (tuple == NULL)
+ break;
+
+ /*
+ * TODO: Should we re-compile the expressions to use a minimal tuple
+ * slot so that we don't have to create the virtual tuple here? If we
+ * project the tuple before writing, then perhaps this is not
+ * important.
+ */
+ ExecForceStoreMinimalTuple(tuple, slot, true);
+ aggstate->tmpcontext->ecxt_outertuple = slot;
+
+ /* Find or build hashtable entries */
+ memset(aggstate->hash_pergroup, 0,
+ sizeof(AggStatePerGroup) * aggstate->num_hashes);
+ select_current_set(aggstate, batch->setno, true);
+ aggstate->hash_pergroup[batch->setno] = lookup_hash_entry(aggstate);
+ if (aggstate->hash_pergroup[batch->setno] == NULL)
+ aggstate->hash_disk_used += hash_spill_tuple(
+ aggstate->perhash[batch->setno].hashtable,
+ &batch->spill, batch->input_bits, slot);
+
+ /* Advance the aggregates (or combine functions) */
+ advance_aggregates(aggstate);
+
+ /*
+ * Reset per-input-tuple context after each tuple, but note that the
+ * hash lookups do this too
+ */
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ BufFileClose(batch->input_file);
+
+ aggstate->current_phase = 0;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+
+ hash_spill_finish(aggstate, &batch->spill, batch->setno,
+ batch->input_bits);
+
+ pfree(batch);
+
+ /* Initialize to walk the first hash table */
+ select_current_set(aggstate, 0, true);
+ ResetTupleHashIterator(aggstate->perhash[0].hashtable,
+ &aggstate->perhash[0].hashiter);
+
+ return true;
+}
+
/*
* ExecAgg for hashed case: retrieving groups from hash table
*/
static TupleTableSlot *
agg_retrieve_hash_table(AggState *aggstate)
+{
+ TupleTableSlot *result = NULL;
+
+ while (result == NULL)
+ {
+ result = agg_retrieve_hash_table_in_memory(aggstate);
+ if (result == NULL)
+ {
+ if (!agg_refill_hash_table(aggstate))
+ {
+ aggstate->agg_done = true;
+ break;
+ }
+ }
+ }
+
+ return result;
+}
+
+static TupleTableSlot *
+agg_retrieve_hash_table_in_memory(AggState *aggstate)
{
ExprContext *econtext;
AggStatePerAgg peragg;
@@ -1995,7 +2186,7 @@ agg_retrieve_hash_table(AggState *aggstate)
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
- while (!aggstate->agg_done)
+ for (;;)
{
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -2026,8 +2217,6 @@ agg_retrieve_hash_table(AggState *aggstate)
}
else
{
- /* No more hashtables, so done */
- aggstate->agg_done = true;
return NULL;
}
}
@@ -2084,6 +2273,296 @@ agg_retrieve_hash_table(AggState *aggstate)
return NULL;
}
+/*
+ * hash_spill_tuple
+ *
+ * Not enough memory to add tuple as new entry in hash table. Save for later
+ * in the appropriate partition.
+ */
+static Size
+hash_spill_tuple(TupleHashTable hashtable, HashAggSpill *spill,
+ int input_bits, TupleTableSlot *slot)
+{
+ int partition;
+ MinimalTuple tuple;
+ BufFile *file;
+ int written;
+ uint32 hashvalue;
+ bool shouldFree;
+
+ /* initialize output partitions */
+ if (spill->partitions == NULL)
+ {
+ int npartitions;
+ int partition_bits;
+
+ /*TODO: be smarter */
+ npartitions = 32;
+
+ partition_bits = my_log2(npartitions);
+
+ /* make sure that we don't exhaust the hash bits */
+ if (partition_bits + input_bits >= 32)
+ partition_bits = 32 - input_bits;
+
+ /* number of partitions will be a power of two */
+ npartitions = 1L << partition_bits;
+
+ spill->partition_bits = partition_bits;
+ spill->n_partitions = npartitions;
+ spill->partitions = palloc0(sizeof(BufFile *) * npartitions);
+ spill->ntuples = palloc0(sizeof(int64) * npartitions);
+ }
+
+ /*
+ * TODO: should we project only needed attributes from the tuple before
+ * writing it?
+ */
+ tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+ /*
+ * TODO: should we store the hash along with the tuple to avoid
+ * calculating the hash value multiple times?
+ */
+ hashvalue = TupleHashTableHash(hashtable->hashtab, tuple);
+
+ if (spill->partition_bits == 0)
+ partition = 0;
+ else
+ partition = (hashvalue << input_bits) >>
+ (32 - spill->partition_bits);
+
+ spill->ntuples[partition]++;
+
+ /*
+ * TODO: use logtape.c instead?
+ */
+ if (spill->partitions[partition] == NULL)
+ spill->partitions[partition] = BufFileCreateTemp(false);
+ file = spill->partitions[partition];
+
+ written = BufFileWrite(file, (void *) tuple, tuple->t_len);
+ if (written != tuple->t_len)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+
+ if (shouldFree)
+ pfree(tuple);
+
+ return written;
+}
+
+/*
+ * hash_spill_tuples
+ *
+ * For each grouping set whose hash table had no room for the tuple's group,
+ * spill the tuple to that set's partitions.
+ */
+static void
+hash_spill_tuples(AggState *aggstate, TupleTableSlot *slot)
+{
+ int setno;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ {
+ AggStatePerGroup pergroup = aggstate->hash_pergroup[setno];
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ HashAggSpill *spill;
+
+ if (pergroup == NULL)
+ {
+ if (aggstate->hash_spills == NULL)
+ aggstate->hash_spills = palloc0(
+ sizeof(HashAggSpill) * aggstate->num_hashes);
+ aggstate->hash_spilled = true;
+
+ spill = &aggstate->hash_spills[setno];
+
+ aggstate->hash_disk_used += hash_spill_tuple(
+ perhash->hashtable, spill, 0, slot);
+ }
+ }
+}
+
+/*
+ * hash_read_spilled
+ * read the next tuple from a batch file. Return NULL if no more.
+ */
+static MinimalTuple
+hash_read_spilled(BufFile *file)
+{
+ MinimalTuple tuple;
+ uint32 t_len;
+ size_t nread;
+
+ nread = BufFileRead(file, &t_len, sizeof(t_len));
+ if (nread == 0)
+ return NULL;
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+
+ tuple = (MinimalTuple) palloc(t_len);
+ tuple->t_len = t_len;
+
+ nread = BufFileRead(file, (void *)((char *)tuple + sizeof(uint32)),
+ t_len - sizeof(uint32));
+ if (nread != t_len - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+
+ return tuple;
+}
+
+/*
+ * hash_batch_new
+ *
+ * Construct a HashAggBatch item, which represents one iteration of HashAgg to
+ * be done. Should be called in the aggregate's memory context.
+ */
+static HashAggBatch *
+hash_batch_new(BufFile *input_file, int setno, int64 input_groups,
+ int input_bits)
+{
+ HashAggBatch *batch = palloc0(sizeof(HashAggBatch));
+
+ batch->input_file = input_file;
+ batch->input_bits = input_bits;
+ batch->input_groups = input_groups;
+ batch->setno = setno;
+
+ /* batch->spill will be set only after spilling this batch */
+
+ return batch;
+}
+
+/*
+ * hash_finish_initial_spills
+ *
+ * After a HashAggBatch has been processed, it may have spilled tuples to
+ * disk. If so, turn the spilled partitions into new batches that must later
+ * be executed.
+ */
+static void
+hash_finish_initial_spills(AggState *aggstate)
+{
+ int setno;
+
+ if (aggstate->hash_spills == NULL)
+ return;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hash_spill_finish(aggstate, &aggstate->hash_spills[setno], setno, 0);
+
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+}
+
+/*
+ * hash_spill_finish
+ *
+ * Turn this spill's partitions into new batches to be processed later.
+ */
+static void
+hash_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno, int input_bits)
+{
+ int i;
+
+ if (spill->n_partitions == 0)
+ return; /* didn't spill */
+
+ for (i = 0; i < spill->n_partitions; i++)
+ {
+ BufFile *file = spill->partitions[i];
+ MemoryContext oldContext;
+ HashAggBatch *new_batch;
+ int64 input_ngroups;
+
+ /* partition is empty */
+ if (file == NULL)
+ continue;
+
+ /* rewind file for reading */
+ if (BufFileSeek(file, 0, 0L, SEEK_SET))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rewind HashAgg temporary file: %m")));
+
+ /*
+ * Estimate the number of input groups for this new work item as the
+ * total number of tuples in its input file. Although that's a worst
+ * case, it's not bad here for two reasons: (1) overestimating is
+ * better than underestimating; and (2) we've already scanned the
+ * relation once, so it's likely that we've already finalized many of
+ * the common values.
+ */
+ input_ngroups = spill->ntuples[i];
+
+ oldContext = MemoryContextSwitchTo(aggstate->ss.ps.state->es_query_cxt);
+ new_batch = hash_batch_new(file, setno, input_ngroups,
+ spill->partition_bits + input_bits);
+ aggstate->hash_batches = lappend(aggstate->hash_batches, new_batch);
+ aggstate->hash_batches_used++;
+ MemoryContextSwitchTo(oldContext);
+ }
+
+ pfree(spill->ntuples);
+ pfree(spill->partitions);
+}
+
+/*
+ * Clear a HashAggSpill, free its memory, and close its files.
+ */
+static void
+hash_reset_spill(HashAggSpill *spill)
+{
+ int i;
+ for (i = 0; i < spill->n_partitions; i++)
+ {
+ BufFile *file = spill->partitions[i];
+
+ if (file != NULL)
+ BufFileClose(file);
+ }
+ if (spill->ntuples != NULL)
+ pfree(spill->ntuples);
+ if (spill->partitions != NULL)
+ pfree(spill->partitions);
+}
+
+/*
+ * Find and reset all active HashAggSpills.
+ */
+static void
+hash_reset_spills(AggState *aggstate)
+{
+ ListCell *lc;
+
+ if (aggstate->hash_spills != NULL)
+ {
+ int setno;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hash_reset_spill(&aggstate->hash_spills[setno]);
+
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+ }
+
+ foreach(lc, aggstate->hash_batches)
+ {
+ HashAggBatch *batch = (HashAggBatch*) lfirst(lc);
+ hash_reset_spill(&batch->spill);
+ pfree(batch);
+ }
+ list_free(aggstate->hash_batches);
+ aggstate->hash_batches = NIL;
+}
+
+
/* -----------------
* ExecInitAgg
*
@@ -2213,6 +2692,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
{
ExecAssignExprContext(estate, &aggstate->ss.ps);
aggstate->hashcontext = aggstate->ss.ps.ps_ExprContext;
+
+ /* will set to false if there are aggs with transtype == INTERNALOID */
+ if (!hashagg_mem_overflow)
+ aggstate->hash_can_spill = true;
}
ExecAssignExprContext(estate, &aggstate->ss.ps);
@@ -2238,6 +2721,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->sort_slot = ExecInitExtraTupleSlot(estate, scanDesc,
&TTSOpsMinimalTuple);
+ if (use_hashing)
+ aggstate->hash_spill_slot = ExecInitExtraTupleSlot(estate, scanDesc,
+ &TTSOpsVirtual);
+
/*
* Initialize result type, slot and projection.
*/
@@ -2661,6 +3148,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
elog(ERROR, "deserialfunc not provided for deserialization aggregation");
deserialfn_oid = aggform->aggdeserialfn;
}
+
+ aggstate->hash_can_spill = false;
}
/* Check that aggregate owner has permission to call component fns */
@@ -3368,6 +3857,8 @@ ExecEndAgg(AggState *node)
if (node->sort_out)
tuplesort_end(node->sort_out);
+ hash_reset_spills(node);
+
for (transno = 0; transno < node->numtrans; transno++)
{
AggStatePerTrans pertrans = &node->pertrans[transno];
@@ -3423,12 +3914,13 @@ ExecReScanAgg(AggState *node)
return;
/*
- * If we do have the hash table, and the subplan does not have any
- * parameter changes, and none of our own parameter changes affect
- * input expressions of the aggregated functions, then we can just
- * rescan the existing hash table; no need to build it again.
+ * If we do have the hash table, and it never spilled, and the subplan
+ * does not have any parameter changes, and none of our own parameter
+ * changes affect input expressions of the aggregated functions, then
+ * we can just rescan the existing hash table; no need to build it
+ * again.
*/
- if (outerPlan->chgParam == NULL &&
+ if (outerPlan->chgParam == NULL && !node->hash_spilled &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
ResetTupleHashIterator(node->perhash[0].hashtable,
@@ -3485,6 +3977,16 @@ ExecReScanAgg(AggState *node)
*/
if (node->aggstrategy == AGG_HASHED || node->aggstrategy == AGG_MIXED)
{
+ hash_reset_spills(node);
+
+ /* reset stats */
+ node->hash_spilled = false;
+ node->hash_mem_init = 0;
+ node->hash_mem_peak = 0;
+ node->hash_mem_current = 0;
+ node->hash_disk_used = 0;
+ node->hash_batches_used = 0;
+
ReScanExprContext(node->hashcontext);
/* Rebuild an empty hash table */
build_hash_table(node);
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 30133634c7..6dc175eabf 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -2100,6 +2100,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_notransvalue;
+ LLVMBasicBlockRef b_check_notransvalue;
LLVMBasicBlockRef b_init;
aggstate = op->d.agg_init_trans.aggstate;
@@ -2126,6 +2127,19 @@ llvm_compile_expr(ExprState *state)
l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
&v_transno, 1, "");
+ b_check_notransvalue = l_bb_before_v(
+ opblocks[i + 1], "op.%d.check_notransvalue", i);
+
+ LLVMBuildCondBr(b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroupp,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[i + 1],
+ b_check_notransvalue);
+
+ LLVMPositionBuilderAtEnd(b, b_check_notransvalue);
+
v_notransvalue =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_NOTRANSVALUE,
@@ -2193,6 +2207,8 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_transnull;
LLVMValueRef v_pergroupp;
+ LLVMBasicBlockRef b_check_transnull;
+
int jumpnull = op->d.agg_strict_trans_check.jumpnull;
aggstate = op->d.agg_strict_trans_check.aggstate;
@@ -2216,6 +2232,19 @@ llvm_compile_expr(ExprState *state)
l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
&v_transno, 1, "");
+ b_check_transnull = l_bb_before_v(opblocks[i + 1],
+ "op.%d.check_transnull", i);
+
+ LLVMBuildCondBr(b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroupp,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[jumpnull],
+ b_check_transnull);
+
+ LLVMPositionBuilderAtEnd(b, b_check_transnull);
+
v_transnull =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_TRANSVALUEISNULL,
@@ -2263,6 +2292,8 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_tmpcontext;
LLVMValueRef v_oldcontext;
+ LLVMBasicBlockRef b_advance_transval;
+
aggstate = op->d.agg_trans.aggstate;
pertrans = op->d.agg_trans.pertrans;
@@ -2289,6 +2320,19 @@ llvm_compile_expr(ExprState *state)
l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
&v_transno, 1, "");
+ b_advance_transval = l_bb_before_v(opblocks[i + 1],
+ "op.%d.advance_transval", i);
+
+ LLVMBuildCondBr(b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroupp,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[i + 1],
+ b_advance_transval);
+
+ LLVMPositionBuilderAtEnd(b, b_advance_transval);
+
v_fcinfo = l_ptr_const(fcinfo,
l_ptr(StructFunctionCallInfoData));
v_aggcontext = l_ptr_const(op->d.agg_trans.aggcontext,
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a2a9b1f7be..3cfc299947 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
bool enable_hashagg = true;
+bool enable_hashagg_spill = true;
bool enable_nestloop = true;
bool enable_material = true;
bool enable_mergejoin = true;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index cb897cc7f4..7dc0855461 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4250,8 +4250,8 @@ consider_groupingsets_paths(PlannerInfo *root,
* with. Override work_mem in that case; otherwise, we'll rely on the
* sorted-input case to generate usable mixed paths.
*/
- if (hashsize > work_mem * 1024L && gd->rollups)
- return; /* nope, won't fit */
+ if (!enable_hashagg_spill && hashsize > work_mem * 1024L && gd->rollups)
+ return; /* nope, won't fit */
/*
* We need to burst the existing rollups list into individual grouping
@@ -6522,7 +6522,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
* were unable to sort above, then we'd better generate a Path, so
* that we at least have one.
*/
- if (hashaggtablesize < work_mem * 1024L ||
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L ||
grouped_rel->pathlist == NIL)
{
/*
@@ -6555,7 +6556,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
agg_final_costs,
dNumGroups);
- if (hashaggtablesize < work_mem * 1024L)
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6824,7 +6826,7 @@ create_partial_grouping_paths(PlannerInfo *root,
* Tentatively produce a partial HashAgg Path, depending on if it
* looks as if the hash table will fit in work_mem.
*/
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_total_path != NULL)
{
add_path(partially_grouped_rel, (Path *)
@@ -6851,7 +6853,7 @@ create_partial_grouping_paths(PlannerInfo *root,
dNumPartialPartialGroups);
/* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_partial_path != NULL)
{
add_partial_path(partially_grouped_rel, (Path *)
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 3bf96de256..b0cb1d7e6b 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -120,6 +120,7 @@ bool enableFsync = true;
bool allowSystemTableMods = false;
int work_mem = 1024;
int maintenance_work_mem = 16384;
+bool hashagg_mem_overflow = false;
int max_parallel_maintenance_workers = 2;
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1208eb9a68..90883c7efd 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -952,6 +952,26 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_hashagg_spill", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of hashed aggregation plans that are expected to exceed work_mem."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_hashagg_spill,
+ true,
+ NULL, NULL, NULL
+ },
+ {
+ {"hashagg_mem_overflow", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables hashed aggregation to overflow work_mem at execution time."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &hashagg_mem_overflow,
+ false,
+ NULL, NULL, NULL
+ },
{
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of materialization."),
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 6e4a343439..46e12c359e 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -458,6 +458,8 @@ AllocSetContextCreateInternal(MemoryContext parent,
parent,
name);
+ ((MemoryContext) set)->mem_allocated = set->keeper->endptr - ((char *)set->keeper) + MAXALIGN(sizeof(AllocSetContext));
+
return (MemoryContext) set;
}
}
@@ -546,6 +548,8 @@ AllocSetContextCreateInternal(MemoryContext parent,
parent,
name);
+ ((MemoryContext) set)->mem_allocated = set->keeper->endptr - ((char *)set->keeper) + MAXALIGN(sizeof(AllocSetContext));
+
return (MemoryContext) set;
}
@@ -604,6 +608,8 @@ AllocSetReset(MemoryContext context)
else
{
/* Normal case, release the block */
+ context->mem_allocated -= block->endptr - ((char*) block);
+
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
@@ -688,11 +694,16 @@ AllocSetDelete(MemoryContext context)
#endif
if (block != set->keeper)
+ {
+ context->mem_allocated -= block->endptr - ((char *) block);
free(block);
+ }
block = next;
}
+ Assert(context->mem_allocated == 0);
+
/* Finally, free the context header, including the keeper block */
free(set);
}
@@ -733,6 +744,9 @@ AllocSetAlloc(MemoryContext context, Size size)
block = (AllocBlock) malloc(blksize);
if (block == NULL)
return NULL;
+
+ context->mem_allocated += blksize;
+
block->aset = set;
block->freeptr = block->endptr = ((char *) block) + blksize;
@@ -928,6 +942,8 @@ AllocSetAlloc(MemoryContext context, Size size)
if (block == NULL)
return NULL;
+ context->mem_allocated += blksize;
+
block->aset = set;
block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
block->endptr = ((char *) block) + blksize;
@@ -1028,6 +1044,9 @@ AllocSetFree(MemoryContext context, void *pointer)
set->blocks = block->next;
if (block->next)
block->next->prev = block->prev;
+
+ context->mem_allocated -= block->endptr - ((char*) block);
+
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
@@ -1144,6 +1163,7 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
AllocBlock block = (AllocBlock) (((char *) chunk) - ALLOC_BLOCKHDRSZ);
Size chksize;
Size blksize;
+ Size oldblksize;
/*
* Try to verify that we have a sane block pointer: it should
@@ -1159,6 +1179,8 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
/* Do the realloc */
chksize = MAXALIGN(size);
blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+ oldblksize = block->endptr - ((char *)block);
+
block = (AllocBlock) realloc(block, blksize);
if (block == NULL)
{
@@ -1166,6 +1188,9 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN);
return NULL;
}
+
+ context->mem_allocated += blksize - oldblksize;
+
block->freeptr = block->endptr = ((char *) block) + blksize;
/* Update pointers since block has likely been moved */
@@ -1383,6 +1408,7 @@ AllocSetCheck(MemoryContext context)
const char *name = set->header.name;
AllocBlock prevblock;
AllocBlock block;
+ int64 total_allocated = 0;
for (prevblock = NULL, block = set->blocks;
block != NULL;
@@ -1393,6 +1419,10 @@ AllocSetCheck(MemoryContext context)
long blk_data = 0;
long nchunks = 0;
+ total_allocated += block->endptr - ((char *)block);
+ if (set->keeper == block)
+ total_allocated += MAXALIGN(sizeof(AllocSetContext));
+
/*
* Empty block - empty can be keeper-block only
*/
@@ -1479,6 +1509,8 @@ AllocSetCheck(MemoryContext context)
elog(WARNING, "problem in alloc set %s: found inconsistent memory block %p",
name, block);
}
+
+ Assert(total_allocated == context->mem_allocated);
}
#endif /* MEMORY_CONTEXT_CHECKING */
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index b07be12236..27417af548 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -462,6 +462,29 @@ MemoryContextIsEmpty(MemoryContext context)
return context->methods->is_empty(context);
}
+/*
+ * Find the memory allocated to blocks for this memory context. If recurse is
+ * true, also include children.
+ */
+int64
+MemoryContextMemAllocated(MemoryContext context, bool recurse)
+{
+ int64 total = context->mem_allocated;
+
+ AssertArg(MemoryContextIsValid(context));
+
+ if (recurse)
+ {
+ MemoryContext child = context->firstchild;
+ for (child = context->firstchild;
+ child != NULL;
+ child = child->nextchild)
+ total += MemoryContextMemAllocated(child, true);
+ }
+
+ return total;
+}
+
/*
* MemoryContextStats
* Print statistics about the named context and all its descendants.
@@ -736,6 +759,7 @@ MemoryContextCreate(MemoryContext node,
node->methods = methods;
node->parent = parent;
node->firstchild = NULL;
+ node->mem_allocated = 0;
node->prevchild = NULL;
node->name = name;
node->ident = NULL;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index d056fd6151..265e5ffe62 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -143,6 +143,8 @@ extern TupleHashEntry FindTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
ExprState *eqcomp,
FmgrInfo *hashfunctions);
+extern uint32 TupleHashTableHash(struct tuplehash_hash *tb,
+ const MinimalTuple tuple);
extern void ResetTupleHashTable(TupleHashTable hashtable);
/*
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 61a24c2e3c..be9ae1028d 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -244,6 +244,7 @@ extern bool enableFsync;
extern PGDLLIMPORT bool allowSystemTableMods;
extern PGDLLIMPORT int work_mem;
extern PGDLLIMPORT int maintenance_work_mem;
+extern PGDLLIMPORT bool hashagg_mem_overflow;
extern PGDLLIMPORT int max_parallel_maintenance_workers;
extern int VacuumCostPageHit;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 99b9fa414f..dd7378d4ca 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2022,13 +2022,25 @@ typedef struct AggState
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
/* these fields are used in AGG_HASHED and AGG_MIXED modes: */
bool table_filled; /* hash table filled yet? */
- int num_hashes;
+ int num_hashes; /* number of hash tables active at once */
+ bool hash_can_spill; /* nothing disqualifies the hash from spilling? */
+ bool hash_spilled; /* any hash table ever spilled? */
+ struct HashAggSpill *hash_spills; /* HashAggSpill for each hash table,
+ exists only during first pass if spilled */
+ TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
+ uint64 hash_mem_init; /* initial hash table memory usage */
+ uint64 hash_mem_peak; /* peak hash table memory usage */
+ uint64 hash_mem_current; /* current hash table memory usage */
+ uint64 hash_disk_used; /* bytes of disk space used */
+ int hash_batches_used; /* batches used during entire execution */
+ List *hash_batches; /* hash batches remaining to be processed */
+
AggStatePerHash perhash; /* array of per-hashtable data */
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 34
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 44
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
index dbae98d3d9..df0ae3625c 100644
--- a/src/include/nodes/memnodes.h
+++ b/src/include/nodes/memnodes.h
@@ -79,6 +79,7 @@ typedef struct MemoryContextData
/* these two fields are placed here to minimize alignment wastage: */
bool isReset; /* T = no space alloced since last reset */
bool allowInCritSection; /* allow palloc in critical section */
+ int64 mem_allocated; /* track memory allocated for this context */
const MemoryContextMethods *methods; /* virtual function table */
MemoryContext parent; /* NULL if no parent (toplevel context) */
MemoryContext firstchild; /* head of linked list of children */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9b6bdbc518..78e24be7b6 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -54,6 +54,7 @@ extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
extern PGDLLIMPORT bool enable_hashagg;
+extern PGDLLIMPORT bool enable_hashagg_spill;
extern PGDLLIMPORT bool enable_nestloop;
extern PGDLLIMPORT bool enable_material;
extern PGDLLIMPORT bool enable_mergejoin;
diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
index ffe6de536e..6a837bc990 100644
--- a/src/include/utils/memutils.h
+++ b/src/include/utils/memutils.h
@@ -82,6 +82,7 @@ extern void MemoryContextSetParent(MemoryContext context,
extern Size GetMemoryChunkSpace(void *pointer);
extern MemoryContext MemoryContextGetParent(MemoryContext context);
extern bool MemoryContextIsEmpty(MemoryContext context);
+extern int64 MemoryContextMemAllocated(MemoryContext context, bool recurse);
extern void MemoryContextStats(MemoryContext context);
extern void MemoryContextStatsDetail(MemoryContext context, int max_children);
extern void MemoryContextAllowInCriticalSection(MemoryContext context,
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..c40bf6c16e 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -75,6 +75,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
+ enable_hashagg_spill | on
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
On Wed, 2019-07-03 at 02:17 +0200, Tomas Vondra wrote:
What does "partitioned hash strategy" do? It's probably explained in
one
of the historical discussions, but I'm not sure which one. I assume
it
simply hashes the group keys and uses that to partition the data, and
then
passing it to hash aggregate.
Yes. When spilling, it is cheap to partition on the hash value at the
same time, which dramatically reduces the need to spill multiple times.
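To make that concrete, here is a minimal C sketch (not taken verbatim
from the patch; the function wrapper is invented for illustration) of
how a spill partition can be chosen from the hash value that was
already computed for the hash table lookup. It mirrors the expression
used in hash_spill_tuple() in the attached patch: the bits already
consumed by the parent batch are shifted away, so recursive
repartitioning keeps drawing on fresh hash bits.

#include <stdint.h>

int
choose_spill_partition(uint32_t hash, int input_bits, int partition_bits)
{
    if (partition_bits == 0)
        return 0;

    /* drop the bits the parent batch already used, keep the next ones */
    return (int) ((hash << input_bits) >> (32 - partition_bits));
}

With 5 partition bits and no parent bits consumed, a tuple's partition
is simply its top 5 hash bits, so every tuple of a given group lands in
the same spill file.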
Previous discussions:
Unfortunately the second link does not work :-(
It's supposed to be:
/messages/by-id/CAGTBQpa__-NP7=kKwze_enkqw18vodRxKkOmNhxAPzqkruc-8g@mail.gmail.com
I'm not going to block Approach 1, although I'd really like to see
something that helps with array_agg.
I have a WIP patch that I just posted. It doesn't yet work with
ARRAY_AGG, but I think it can be made to work by evicting the entire
hash table, serializing the transition states, and then later combining
them.
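To illustrate that evict-serialize-combine idea in isolation, here is a
toy, self-contained C sketch. It is not code from the patch; the
PartialState type and helper names are invented for illustration, and a
real implementation would go through the aggregate's serial/deserial
and combine functions (which is also why types lacking those would
still need to be punted on). The point is only the shape of the data
flow: flatten a group's transition state to bytes when evicting it, and
on a later pass deserialize it and merge it into the surviving state
rather than replaying raw input tuples.

#include <stdint.h>
#include <string.h>

typedef struct PartialState
{
    int64_t count;
    int64_t sum;
} PartialState;

/* "serialize": flatten an in-memory transition state into a byte buffer */
size_t
state_serialize(const PartialState *s, unsigned char *buf)
{
    memcpy(buf, s, sizeof(*s));
    return sizeof(*s);
}

/* "deserialize": rebuild a transition state from spilled bytes */
void
state_deserialize(const unsigned char *buf, PartialState *s)
{
    memcpy(s, buf, sizeof(*s));
}

/* "combine": merge a previously spilled partial state into the current one */
void
state_combine(PartialState *dst, const PartialState *src)
{
    dst->count += src->count;
    dst->sum += src->sum;
}

int
main(void)
{
    unsigned char buf[sizeof(PartialState)];
    PartialState survivor = {2, 10};     /* state still in the hash table */
    PartialState evicted = {3, 32};      /* state that had to be evicted */
    PartialState restored;

    state_serialize(&evicted, buf);      /* eviction: flatten and spill */
    state_deserialize(buf, &restored);   /* later pass: read it back */
    state_combine(&survivor, &restored); /* merge instead of re-aggregating */

    return (survivor.count == 5 && survivor.sum == 42) ? 0 : 1;
}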
Aren't all three approaches a way to "fix" hash aggregate? In any case,
it's certainly reasonable to make incremental changes. The question is
whether "approach 1" is a sensible step towards some form of "approach 3".
Disk-based hashing certainly seems like a reasonable algorithm on paper
that has some potential advantages over sorting. It certainly seems
sensible to me that we explore the disk-based hashing strategy first,
and then we would at least know what we are missing (if anything) by
going with the hybrid approach later.
There's also a fair amount of design space to explore in the hybrid
strategy. That could take a while to converge, especially if we don't
have anything in place to compare against.
* It means we have a hash table and sort running concurrently, each
using memory. Andres said this might not be a problem[3], but I'm
not convinced that the problem is zero. If you use small work_mem
for the write phase of sorting, you'll end up with a lot of runs to
merge later and that has some kind of cost.
Why would we need to do both concurrently? I thought we'd empty the
hash table before doing the sort, no?
So you are saying we spill the tuples into a tuplestore, then feed the
tuplestore through a tuplesort? Seems inefficient, but I guess we can.
Do we actually need to handle that case? How many such aggregates are
there? I think it's OK to just ignore that case (and keep doing what
we do now), and require serial/deserial functions for anything better.
Punting on a few cases is fine with me, if the user has a way to fix
it.
Regards,
Jeff Davis
On Wed, Jul 03, 2019 at 07:03:06PM -0700, Jeff Davis wrote:
On Wed, 2019-07-03 at 02:17 +0200, Tomas Vondra wrote:
What does "partitioned hash strategy" do? It's probably explained in
one
of the historical discussions, but I'm not sure which one. I assume
it
simply hashes the group keys and uses that to partition the data, and
then
passing it to hash aggregate.Yes. When spilling, it is cheap to partition on the hash value at the
same time, which dramatically reduces the need to spill multiple times.
Previous discussions:Unfortunately the second link does not work :-(
It's supposed to be:
/messages/by-id/CAGTBQpa__-NP7=kKwze_enkqw18vodRxKkOmNhxAPzqkruc-8g@mail.gmail.com
I'm not going to block Approach 1, although I'd really like to see
something that helps with array_agg.
I have a WIP patch that I just posted. It doesn't yet work with
ARRAY_AGG, but I think it can be made to work by evicting the entire
hash table, serializing the transition states, and then later combining
them.
Aren't all three approaches a way to "fix" hash aggregate? In any case,
it's certainly reasonable to make incremental changes. The question is
whether "approach 1" is a sensible step towards some form of "approach 3".
Disk-based hashing certainly seems like a reasonable algorithm on paper
that has some potential advantages over sorting. It certainly seems
sensible to me that we explore the disk-based hashing strategy first,
and then we would at least know what we are missing (if anything) by
going with the hybrid approach later.
There's also a fair amount of design space to explore in the hybrid
strategy. That could take a while to converge, especially if we don't
have anything in place to compare against.
Makes sense. I haven't thought about how the hybrid approach would be
implemented very much, so I can't quite judge how complicated it would be
to extend "approach 1" later. But if you think it's a sensible first step,
I trust you. And I certainly agree we need something to compare the other
approaches against.
* It means we have a hash table and sort running concurrently, each
using memory. Andres said this might not be a problem[3], but I'm
not convinced that the problem is zero. If you use small work_mem
for the write phase of sorting, you'll end up with a lot of runs to
merge later and that has some kind of cost.
Why would we need to do both concurrently? I thought we'd empty the
hash table before doing the sort, no?
So you are saying we spill the tuples into a tuplestore, then feed the
tuplestore through a tuplesort? Seems inefficient, but I guess we can.
I think the question is whether we see this as an "emergency fix" (for cases
that are misestimated and could/would fail with OOM at runtime), or as
something that is meant to make "hash agg" more widely applicable.
I personally see it as an emergency fix, in which case it's perfectly
fine if it's not 100% efficient, assuming it kicks in only rarely.
Effectively, we're betting on hash agg, and from time to time we lose.
But even if we see it as a general optimization technique it does not have
to be perfectly efficient, as long as it's properly costed (so the planner
only uses it when appropriate).
If we have a better solution (in terms of efficiency, code complexity,
etc.) then sure - let's use that. But considering we've started this
discussion in ~2015 and we still don't have anything, I wouldn't hold my
breath. Let's do something good enough, and maybe improve it later.
Do we actually need to handle that case? How many such aggregates are
there? I think it's OK to just ignore that case (and keep doing what
we do now), and require serial/deserial functions for anything better.
Punting on a few cases is fine with me, if the user has a way to fix
it.
+1 to doing that
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, 2019-07-11 at 17:55 +0200, Tomas Vondra wrote:
Makes sense. I haven't thought about how the hybrid approach would be
implemented very much, so I can't quite judge how complicated it would be
to extend "approach 1" later. But if you think it's a sensible first
step, I trust you. And I certainly agree we need something to compare
the other approaches against.
Is this a duplicate of your previous email?
I'm slightly confused, but I will use the opportunity to put out another
WIP patch. The patch could use a few rounds of cleanup and quality
work, but the functionality is there and the performance seems
reasonable.
I rebased on master and fixed a few bugs, and most importantly, added
tests.
It seems to be working with grouping sets fine. It will take a little
longer to get good performance numbers, but even for a group size of one,
I'm seeing HashAgg get close to Sort+Group in some cases.
You are right that the missed lookups appear to be costly, at least
when the data all fits in system memory. I think it's the cache misses,
because sometimes reducing work_mem improves performance. I'll try
tuning the number of buckets for the hash table and see if that helps.
If not, then the performance still seems pretty good to me.
Of course, HashAgg can beat sort for larger group sizes, but I'll try
to gather some more data on the cross-over point.
Regards,
Jeff Davis
Attachments:
hashagg-20190711.patch (text/x-patch; charset=UTF-8)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c91e3e1550..d2f97d5fce 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1702,6 +1702,23 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-hashagg-mem-overflow" xreflabel="hashagg_mem_overflow">
+ <term><varname>hashagg_mem_overflow</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>hashagg_mem_overflow</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ If hash aggregation exceeds <varname>work_mem</varname> at query
+ execution time, and <varname>hashagg_mem_overflow</varname> is set
+ to <literal>on</literal>, continue consuming more memory rather than
+ performing disk-based hash aggregation. The default
+ is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
@@ -4354,6 +4371,24 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-hashagg-spill" xreflabel="enable_hashagg_spill">
+ <term><varname>enable_hashagg_spill</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_hashagg_spill</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the memory usage is expected to
+ exceed <varname>work_mem</varname>. This only affects the planner
+ choice; actual behavior at execution time is dictated by
+ <xref linkend="guc-hashagg-mem-overflow"/>. The default
+ is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
<term><varname>enable_hashjoin</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index dff2ed3f97..a5b7b73b13 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -102,6 +102,7 @@ static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_hashagg_info(AggState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
static void show_instrumentation_count(const char *qlabel, int which,
@@ -1826,6 +1827,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Agg:
show_agg_keys(castNode(AggState, planstate), ancestors, es);
show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+ if (es->analyze)
+ show_hashagg_info((AggState *) planstate, es);
if (plan->qual)
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
@@ -2715,6 +2718,56 @@ show_hash_info(HashState *hashstate, ExplainState *es)
}
}
+/*
+ * If EXPLAIN ANALYZE, show information on hash aggregate memory usage and
+ * batches.
+ */
+static void
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+ long diskKb = (aggstate->hash_disk_used + 1023) / 1024;
+
+
+ Assert(IsA(aggstate, AggState));
+
+ if (agg->aggstrategy != AGG_HASHED &&
+ agg->aggstrategy != AGG_MIXED)
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(
+ es->str,
+ "Memory Usage: %ldkB",
+ memPeakKb);
+
+ if (aggstate->hash_batches_used > 0)
+ {
+ appendStringInfo(
+ es->str,
+ " Batches: %d Disk Usage:%ldkB",
+ aggstate->hash_batches_used, diskKb);
+ }
+
+ appendStringInfo(
+ es->str,
+ "\n");
+ }
+ else
+ {
+ ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
+ if (aggstate->hash_batches_used > 0)
+ {
+ ExplainPropertyInteger("HashAgg Batches", NULL,
+ aggstate->hash_batches_used, es);
+ ExplainPropertyInteger("Disk Usage", "kB", diskKb, es);
+ }
+ }
+}
+
/*
* If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
*/
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 66a67c72b2..62014e4ffb 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -1563,14 +1563,14 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate;
AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
aggstate = op->d.agg_init_trans.aggstate;
- pergroup = &aggstate->all_pergroups
- [op->d.agg_init_trans.setoff]
- [op->d.agg_init_trans.transno];
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_init_trans.setoff];
+ pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
/* If transValue has not yet been initialized, do so now. */
- if (pergroup->noTransValue)
+ if (pergroup_allaggs != NULL && pergroup->noTransValue)
{
AggStatePerTrans pertrans = op->d.agg_init_trans.pertrans;
@@ -1591,13 +1591,14 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate;
AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
aggstate = op->d.agg_strict_trans_check.aggstate;
- pergroup = &aggstate->all_pergroups
- [op->d.agg_strict_trans_check.setoff]
- [op->d.agg_strict_trans_check.transno];
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_strict_trans_check.setoff];
+ pergroup = &pergroup_allaggs[op->d.agg_strict_trans_check.transno];
- if (unlikely(pergroup->transValueIsNull))
+ if (pergroup_allaggs == NULL ||
+ unlikely(pergroup->transValueIsNull))
EEO_JUMP(op->d.agg_strict_trans_check.jumpnull);
EEO_NEXT();
@@ -1613,6 +1614,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
AggState *aggstate;
AggStatePerTrans pertrans;
AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
FunctionCallInfo fcinfo;
MemoryContext oldContext;
Datum newVal;
@@ -1620,9 +1622,11 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
aggstate = op->d.agg_trans.aggstate;
pertrans = op->d.agg_trans.pertrans;
- pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
- [op->d.agg_trans.transno];
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
Assert(pertrans->transtypeByVal);
@@ -1664,6 +1668,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
AggState *aggstate;
AggStatePerTrans pertrans;
AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
FunctionCallInfo fcinfo;
MemoryContext oldContext;
Datum newVal;
@@ -1671,9 +1676,11 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
aggstate = op->d.agg_trans.aggstate;
pertrans = op->d.agg_trans.pertrans;
- pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
- [op->d.agg_trans.transno];
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
Assert(!pertrans->transtypeByVal);
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 14ee8db3f9..8f5404b3d6 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -25,7 +25,6 @@
#include "utils/hashutils.h"
#include "utils/memutils.h"
-static uint32 TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple);
static int TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
/*
@@ -288,6 +287,28 @@ ResetTupleHashTable(TupleHashTable hashtable)
TupleHashEntry
LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
bool *isnew)
+{
+ MemoryContext oldContext;
+ uint32 hash;
+
+ /* Need to run the hash functions in short-lived context */
+ oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
+
+ /* set up data needed by hash and match functions */
+ hashtable->inputslot = slot;
+ hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ hashtable->cur_eq_func = hashtable->tab_eq_func;
+
+ hash = TupleHashTableHash(hashtable->hashtab, NULL);
+
+ MemoryContextSwitchTo(oldContext);
+
+ return LookupTupleHashEntryHash(hashtable, slot, isnew, hash);
+}
+
+TupleHashEntry
+LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot,
+ bool *isnew, uint32 hash)
{
TupleHashEntryData *entry;
MemoryContext oldContext;
@@ -306,7 +327,7 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
if (isnew)
{
- entry = tuplehash_insert(hashtable->hashtab, key, &found);
+ entry = tuplehash_insert_hash(hashtable->hashtab, key, hash, &found);
if (found)
{
@@ -326,7 +347,7 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
}
else
{
- entry = tuplehash_lookup(hashtable->hashtab, key);
+ entry = tuplehash_lookup_hash(hashtable->hashtab, key, hash);
}
MemoryContextSwitchTo(oldContext);
@@ -371,17 +392,12 @@ FindTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
/*
* Compute the hash value for a tuple
*
- * The passed-in key is a pointer to TupleHashEntryData. In an actual hash
- * table entry, the firstTuple field points to a tuple (in MinimalTuple
- * format). LookupTupleHashEntry sets up a dummy TupleHashEntryData with a
- * NULL firstTuple field --- that cues us to look at the inputslot instead.
- * This convention avoids the need to materialize virtual input tuples unless
- * they actually need to get copied into the table.
+ * If tuple is NULL, use the input slot instead.
*
* Also, the caller must select an appropriate memory context for running
* the hash functions. (dynahash.c doesn't change CurrentMemoryContext.)
*/
-static uint32
+uint32
TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
{
TupleHashTable hashtable = (TupleHashTable) tb->private_data;
@@ -402,9 +418,6 @@ TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
{
/*
* Process a tuple already stored in the table.
- *
- * (this case never actually occurs due to the way simplehash.h is
- * used, as the hash-value is stored in the entries)
*/
slot = hashtable->tableslot;
ExecStoreMinimalTuple(tuple, slot, false);
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 6b8ef40599..6cb3e32767 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -229,14 +229,40 @@
#include "optimizer/optimizer.h"
#include "parser/parse_agg.h"
#include "parser/parse_coerce.h"
+#include "storage/buffile.h"
#include "utils/acl.h"
#include "utils/builtins.h"
+#include "utils/dynahash.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
#include "utils/datum.h"
+/*
+ * Represents partitioned spill data for a single hashtable.
+ */
+typedef struct HashAggSpill
+{
+ int n_partitions; /* number of output partitions */
+ int partition_bits; /* number of bits for partition mask
+ log2(n_partitions) parent partition bits */
+ BufFile **partitions; /* output partition files */
+ int64 *ntuples; /* number of tuples in each partition */
+} HashAggSpill;
+
+/*
+ * Represents work to be done for one pass of hash aggregation. Initially,
+ * only the input fields are set. If spilled to disk, also set the spill data.
+ */
+typedef struct HashAggBatch
+{
+ BufFile *input_file; /* input partition */
+ int input_bits; /* number of bits for input partition mask */
+ int64 input_groups; /* estimated number of input groups */
+ int setno; /* grouping set */
+ HashAggSpill spill; /* spill output */
+} HashAggBatch;
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
static void initialize_phase(AggState *aggstate, int newphase);
@@ -272,11 +298,25 @@ static TupleTableSlot *project_aggregates(AggState *aggstate);
static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
static void build_hash_table(AggState *aggstate);
-static TupleHashEntryData *lookup_hash_entry(AggState *aggstate);
+static void prepare_hash_slot(AggState *aggstate);
+static uint32 calculate_hash(AggState *aggstate);
+static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
+static bool agg_refill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
+static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
+static Size hash_spill_tuple(HashAggSpill *spill, int input_bits,
+ TupleTableSlot *slot, uint32 hash);
+static MinimalTuple hash_read_spilled(BufFile *file, uint32 *hashp);
+static HashAggBatch *hash_batch_new(BufFile *input_file, int setno,
+ int64 input_groups, int input_bits);
+static void hash_finish_initial_spills(AggState *aggstate);
+static void hash_spill_finish(AggState *aggstate, HashAggSpill *spill,
+ int setno, int input_bits);
+static void hash_reset_spill(HashAggSpill *spill);
+static void hash_reset_spills(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
AggState *aggstate, EState *estate,
@@ -1269,6 +1309,10 @@ build_hash_table(AggState *aggstate)
Assert(aggstate->aggstrategy == AGG_HASHED || aggstate->aggstrategy == AGG_MIXED);
+ /* TODO: work harder to find a good nGroups for each hash table. We don't
+ * want the hash table itself to fill up work_mem with no room for
+ * out-of-line transition values. Also, we need to consider that there are
+ * multiple hash tables for grouping sets. */
additionalsize = aggstate->numtrans * sizeof(AggStatePerGroupData);
for (i = 0; i < aggstate->num_hashes; ++i)
@@ -1294,6 +1338,15 @@ build_hash_table(AggState *aggstate)
tmpmem,
DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
}
+
+ /*
+ * Set initial size to be that of an empty hash table. This ensures that
+ * at least one entry can be added before it exceeds work_mem; otherwise
+ * the algorithm might not make progress.
+ */
+ aggstate->hash_mem_init = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_mem_current = aggstate->hash_mem_init;
}
/*
@@ -1454,23 +1507,13 @@ hash_agg_entry_size(int numAggs)
return entrysize;
}
-/*
- * Find or create a hashtable entry for the tuple group containing the current
- * tuple (already set in tmpcontext's outertuple slot), in the current grouping
- * set (which the caller must have selected - note that initialize_aggregate
- * depends on this).
- *
- * When called, CurrentMemoryContext should be the per-query context.
- */
-static TupleHashEntryData *
-lookup_hash_entry(AggState *aggstate)
+void
+prepare_hash_slot(AggState *aggstate)
{
- TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
- TupleTableSlot *hashslot = perhash->hashslot;
- TupleHashEntryData *entry;
- bool isnew;
- int i;
+ TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ int i;
/* transfer just the needed columns into hashslot */
slot_getsomeattrs(inputslot, perhash->largestGrpColIdx);
@@ -1484,14 +1527,70 @@ lookup_hash_entry(AggState *aggstate)
hashslot->tts_isnull[i] = inputslot->tts_isnull[varNumber];
}
ExecStoreVirtualTuple(hashslot);
+}
+
+uint32
+calculate_hash(AggState *aggstate)
+{
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleHashTable hashtable = perhash->hashtable;
+ MemoryContext oldContext;
+ uint32 hash;
+
+ /* set up data needed by hash and match functions */
+ hashtable->inputslot = perhash->hashslot;
+ hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ hashtable->cur_eq_func = hashtable->tab_eq_func;
+
+ /* Need to run the hash functions in short-lived context */
+ oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
+
+ hash = TupleHashTableHash(hashtable->hashtab, NULL);
+
+ MemoryContextSwitchTo(oldContext);
+
+ return hash;
+}
+
+/*
+ * Find or create a hashtable entry for the tuple group containing the current
+ * tuple (already set in tmpcontext's outertuple slot), in the current grouping
+ * set (which the caller must have selected - note that initialize_aggregate
+ * depends on this).
+ *
+ * When called, CurrentMemoryContext should be the per-query context.
+ */
+static AggStatePerGroup
+lookup_hash_entry(AggState *aggstate, uint32 hash)
+{
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ TupleHashEntryData *entry;
+ bool isnew = false;
/* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntry(perhash->hashtable, hashslot, &isnew);
+ if (!hashagg_mem_overflow &&
+ aggstate->hash_mem_current > work_mem * 1024L &&
+ aggstate->hash_mem_current > aggstate->hash_mem_init)
+ entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot,
+ NULL, hash);
+ else
+ entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot,
+ &isnew, hash);
+
+ if (entry == NULL)
+ return NULL;
if (isnew)
{
- AggStatePerGroup pergroup;
- int transno;
+ AggStatePerGroup pergroup;
+ int transno;
+
+ aggstate->hash_mem_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+
+ if (aggstate->hash_mem_current > aggstate->hash_mem_peak)
+ aggstate->hash_mem_peak = aggstate->hash_mem_current;
pergroup = (AggStatePerGroup)
MemoryContextAlloc(perhash->hashtable->tablecxt,
@@ -1511,7 +1610,7 @@ lookup_hash_entry(AggState *aggstate)
}
}
- return entry;
+ return entry->additional;
}
/*
@@ -1519,18 +1618,38 @@ lookup_hash_entry(AggState *aggstate)
* returning an array of pergroup pointers suitable for advance_aggregates.
*
* Be aware that lookup_hash_entry can reset the tmpcontext.
+ *
+ * Return false if hash table has exceeded its memory limit.
*/
static void
lookup_hash_entries(AggState *aggstate)
{
- int numHashes = aggstate->num_hashes;
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
int setno;
- for (setno = 0; setno < numHashes; setno++)
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
{
+ uint32 hash;
+
select_current_set(aggstate, setno, true);
- pergroup[setno] = lookup_hash_entry(aggstate)->additional;
+ prepare_hash_slot(aggstate);
+ hash = calculate_hash(aggstate);
+ pergroup[setno] = lookup_hash_entry(aggstate, hash);
+
+ if (pergroup[setno] == NULL)
+ {
+ HashAggSpill *spill;
+ TupleTableSlot *slot = aggstate->tmpcontext->ecxt_outertuple;
+
+ if (aggstate->hash_spills == NULL)
+ aggstate->hash_spills = palloc0(
+ sizeof(HashAggSpill) * aggstate->num_hashes);
+ aggstate->hash_spilled = true;
+
+ spill = &aggstate->hash_spills[setno];
+
+ aggstate->hash_disk_used += hash_spill_tuple(spill, 0, slot, hash);
+ }
}
}
@@ -1852,6 +1971,10 @@ agg_retrieve_direct(AggState *aggstate)
outerslot = fetch_input_tuple(aggstate);
if (TupIsNull(outerslot))
{
+ if (aggstate->aggstrategy == AGG_MIXED &&
+ aggstate->current_phase == 1)
+ hash_finish_initial_spills(aggstate);
+
/* no more outer-plan tuples available */
if (hasGroupingSets)
{
@@ -1955,6 +2078,8 @@ agg_fill_hash_table(AggState *aggstate)
ResetExprContext(aggstate->tmpcontext);
}
+ hash_finish_initial_spills(aggstate);
+
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
@@ -1962,11 +2087,136 @@ agg_fill_hash_table(AggState *aggstate)
&aggstate->perhash[0].hashiter);
}
+/*
+ * If any data was spilled during hash aggregation, reset the hash table and
+ * reprocess one batch of spilled data. After reprocessing a batch, the hash
+ * table will again contain data, ready to be consumed by
+ * agg_retrieve_hash_table_in_memory().
+ *
+ * Should only be called after all in memory hash table entries have been
+ * consumed.
+ *
+ * Return false when input is exhausted and there's no more work to be done;
+ * otherwise return true.
+ */
+static bool
+agg_refill_hash_table(AggState *aggstate)
+{
+ HashAggBatch *batch;
+ AggStatePerGroup *pergroup;
+
+ if (aggstate->hash_batches == NIL)
+ return false;
+
+ pergroup = aggstate->all_pergroups;
+ while(pergroup != aggstate->hash_pergroup) {
+ *pergroup = NULL;
+ pergroup++;
+ }
+
+ /* free memory */
+ ReScanExprContext(aggstate->hashcontext);
+ /* Rebuild an empty hash table */
+ build_hash_table(aggstate);
+
+ batch = linitial(aggstate->hash_batches);
+ aggstate->hash_batches = list_delete_first(aggstate->hash_batches);
+
+ Assert(aggstate->current_phase == 0);
+
+ /*
+ * TODO: what should be done here to set up for advance_aggregates?
+ */
+ if (aggstate->phase->aggstrategy == AGG_MIXED)
+ {
+ aggstate->current_phase = 1;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+ }
+
+ for (;;) {
+ TupleTableSlot *slot = aggstate->hash_spill_slot;
+ MinimalTuple tuple;
+ uint32 hash;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tuple = hash_read_spilled(batch->input_file, &hash);
+ if (tuple == NULL)
+ break;
+
+ /*
+ * TODO: Should we re-compile the expressions to use a minimal tuple
+ * slot so that we don't have to create the virtual tuple here? If we
+ * project the tuple before writing, then perhaps this is not
+ * important.
+ */
+ ExecForceStoreMinimalTuple(tuple, slot, true);
+ aggstate->tmpcontext->ecxt_outertuple = slot;
+
+ /* Find or build hashtable entries */
+ memset(aggstate->hash_pergroup, 0,
+ sizeof(AggStatePerGroup) * aggstate->num_hashes);
+ select_current_set(aggstate, batch->setno, true);
+ prepare_hash_slot(aggstate);
+ aggstate->hash_pergroup[batch->setno] = lookup_hash_entry(aggstate, hash);
+ if (aggstate->hash_pergroup[batch->setno] == NULL)
+ aggstate->hash_disk_used += hash_spill_tuple(
+ &batch->spill, batch->input_bits, slot, hash);
+
+ /* Advance the aggregates (or combine functions) */
+ advance_aggregates(aggstate);
+
+ /*
+ * Reset per-input-tuple context after each tuple, but note that the
+ * hash lookups do this too
+ */
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ BufFileClose(batch->input_file);
+
+ aggstate->current_phase = 0;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+
+ hash_spill_finish(aggstate, &batch->spill, batch->setno,
+ batch->input_bits);
+
+ pfree(batch);
+
+ /* Initialize to walk the first hash table */
+ select_current_set(aggstate, 0, true);
+ ResetTupleHashIterator(aggstate->perhash[0].hashtable,
+ &aggstate->perhash[0].hashiter);
+
+ return true;
+}
+
/*
* ExecAgg for hashed case: retrieving groups from hash table
*/
static TupleTableSlot *
agg_retrieve_hash_table(AggState *aggstate)
+{
+ TupleTableSlot *result = NULL;
+
+ while (result == NULL)
+ {
+ result = agg_retrieve_hash_table_in_memory(aggstate);
+ if (result == NULL)
+ {
+ if (!agg_refill_hash_table(aggstate))
+ {
+ aggstate->agg_done = true;
+ break;
+ }
+ }
+ }
+
+ return result;
+}
+
+static TupleTableSlot *
+agg_retrieve_hash_table_in_memory(AggState *aggstate)
{
ExprContext *econtext;
AggStatePerAgg peragg;
@@ -1995,7 +2245,7 @@ agg_retrieve_hash_table(AggState *aggstate)
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
- while (!aggstate->agg_done)
+ for (;;)
{
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -2026,8 +2276,6 @@ agg_retrieve_hash_table(AggState *aggstate)
}
else
{
- /* No more hashtables, so done */
- aggstate->agg_done = true;
return NULL;
}
}
@@ -2084,6 +2332,276 @@ agg_retrieve_hash_table(AggState *aggstate)
return NULL;
}
+/*
+ * hash_spill_tuple
+ *
+ * Not enough memory to add tuple as new entry in hash table. Save for later
+ * in the appropriate partition.
+ */
+static Size
+hash_spill_tuple(HashAggSpill *spill, int input_bits, TupleTableSlot *slot,
+ uint32 hash)
+{
+ int partition;
+ MinimalTuple tuple;
+ BufFile *file;
+ int written;
+ int total_written = 0;
+ bool shouldFree;
+
+ /* initialize output partitions */
+ if (spill->partitions == NULL)
+ {
+ int npartitions;
+ int partition_bits;
+
+ /*TODO: be smarter */
+ npartitions = 32;
+
+ partition_bits = my_log2(npartitions);
+
+ /* make sure that we don't exhaust the hash bits */
+ if (partition_bits + input_bits >= 32)
+ partition_bits = 32 - input_bits;
+
+ /* number of partitions will be a power of two */
+ npartitions = 1L << partition_bits;
+
+ spill->partition_bits = partition_bits;
+ spill->n_partitions = npartitions;
+ spill->partitions = palloc0(sizeof(BufFile *) * npartitions);
+ spill->ntuples = palloc0(sizeof(int64) * npartitions);
+ }
+
+ /*
+ * TODO: should we project only needed attributes from the tuple before
+ * writing it?
+ */
+ tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+ if (spill->partition_bits == 0)
+ partition = 0;
+ else
+ partition = (hash << input_bits) >>
+ (32 - spill->partition_bits);
+
+ spill->ntuples[partition]++;
+
+ /*
+ * TODO: use logtape.c instead?
+ */
+ if (spill->partitions[partition] == NULL)
+ spill->partitions[partition] = BufFileCreateTemp(false);
+ file = spill->partitions[partition];
+
+
+ written = BufFileWrite(file, (void *) &hash, sizeof(uint32));
+ if (written != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+ total_written += written;
+
+ written = BufFileWrite(file, (void *) tuple, tuple->t_len);
+ if (written != tuple->t_len)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+ total_written += written;
+
+ if (shouldFree)
+ pfree(tuple);
+
+ return total_written;
+}
+
+/*
+ * read_spilled_tuple
+ * read the next tuple from a batch file. Return NULL if no more.
+ */
+static MinimalTuple
+hash_read_spilled(BufFile *file, uint32 *hashp)
+{
+ MinimalTuple tuple;
+ uint32 t_len;
+ size_t nread;
+ uint32 hash;
+
+ nread = BufFileRead(file, &hash, sizeof(uint32));
+ if (nread == 0)
+ return NULL;
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+ if (hashp != NULL)
+ *hashp = hash;
+
+ nread = BufFileRead(file, &t_len, sizeof(t_len));
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+
+ tuple = (MinimalTuple) palloc(t_len);
+ tuple->t_len = t_len;
+
+ nread = BufFileRead(file, (void *)((char *)tuple + sizeof(uint32)),
+ t_len - sizeof(uint32));
+ if (nread != t_len - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+
+ return tuple;
+}
+
+/*
+ * new_hashagg_batch
+ *
+ * Construct a HashAggBatch item, which represents one iteration of HashAgg to
+ * be done. Should be called in the aggregate's memory context.
+ */
+static HashAggBatch *
+hash_batch_new(BufFile *input_file, int setno, int64 input_groups,
+ int input_bits)
+{
+ HashAggBatch *batch = palloc0(sizeof(HashAggBatch));
+
+ batch->input_file = input_file;
+ batch->input_bits = input_bits;
+ batch->input_groups = input_groups;
+ batch->setno = setno;
+
+ /* batch->spill will be set only after spilling this batch */
+
+ return batch;
+}
+
+/*
+ * hash_finish_initial_spills
+ *
+ * After a HashAggBatch has been processed, it may have spilled tuples to
+ * disk. If so, turn the spilled partitions into new batches that must later
+ * be executed.
+ */
+static void
+hash_finish_initial_spills(AggState *aggstate)
+{
+ int setno;
+
+ if (aggstate->hash_spills == NULL)
+ return;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hash_spill_finish(aggstate, &aggstate->hash_spills[setno], setno, 0);
+
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+}
+
+/*
+ * hash_spill_finish
+ *
+ *
+ */
+static void
+hash_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno, int input_bits)
+{
+ int i;
+
+ if (spill->n_partitions == 0)
+ return; /* didn't spill */
+
+ for (i = 0; i < spill->n_partitions; i++)
+ {
+ BufFile *file = spill->partitions[i];
+ MemoryContext oldContext;
+ HashAggBatch *new_batch;
+ int64 input_ngroups;
+
+ /* partition is empty */
+ if (file == NULL)
+ continue;
+
+ /* rewind file for reading */
+ if (BufFileSeek(file, 0, 0L, SEEK_SET))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rewind HashAgg temporary file: %m")));
+
+ /*
+ * Estimate the number of input groups for this new work item as the
+ * total number of tuples in its input file. Although that's a worst
+ * case, it's not bad here for two reasons: (1) overestimating is
+ * better than underestimating; and (2) we've already scanned the
+ * relation once, so it's likely that we've already finalized many of
+ * the common values.
+ */
+ input_ngroups = spill->ntuples[i];
+
+ oldContext = MemoryContextSwitchTo(aggstate->ss.ps.state->es_query_cxt);
+ new_batch = hash_batch_new(file, setno, input_ngroups,
+ spill->partition_bits + input_bits);
+ aggstate->hash_batches = lappend(aggstate->hash_batches, new_batch);
+ aggstate->hash_batches_used++;
+ MemoryContextSwitchTo(oldContext);
+ }
+
+ pfree(spill->ntuples);
+ pfree(spill->partitions);
+}
+
+/*
+ * Clear a HashAggSpill, free its memory, and close its files.
+ */
+static void
+hash_reset_spill(HashAggSpill *spill)
+{
+ int i;
+ for (i = 0; i < spill->n_partitions; i++)
+ {
+ BufFile *file = spill->partitions[i];
+
+ if (file != NULL)
+ BufFileClose(file);
+ }
+ if (spill->ntuples != NULL)
+ pfree(spill->ntuples);
+ if (spill->partitions != NULL)
+ pfree(spill->partitions);
+}
+
+/*
+ * Find and reset all active HashAggSpills.
+ */
+static void
+hash_reset_spills(AggState *aggstate)
+{
+ ListCell *lc;
+
+ if (aggstate->hash_spills != NULL)
+ {
+ int setno;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hash_reset_spill(&aggstate->hash_spills[setno]);
+
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+ }
+
+ foreach(lc, aggstate->hash_batches)
+ {
+ HashAggBatch *batch = (HashAggBatch*) lfirst(lc);
+ hash_reset_spill(&batch->spill);
+ pfree(batch);
+ }
+ list_free(aggstate->hash_batches);
+ aggstate->hash_batches = NIL;
+}
+
+
/* -----------------
* ExecInitAgg
*
@@ -2238,6 +2756,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->sort_slot = ExecInitExtraTupleSlot(estate, scanDesc,
&TTSOpsMinimalTuple);
+ if (use_hashing)
+ aggstate->hash_spill_slot = ExecInitExtraTupleSlot(estate, scanDesc,
+ &TTSOpsVirtual);
+
/*
* Initialize result type, slot and projection.
*/
@@ -3368,6 +3890,8 @@ ExecEndAgg(AggState *node)
if (node->sort_out)
tuplesort_end(node->sort_out);
+ hash_reset_spills(node);
+
for (transno = 0; transno < node->numtrans; transno++)
{
AggStatePerTrans pertrans = &node->pertrans[transno];
@@ -3423,12 +3947,13 @@ ExecReScanAgg(AggState *node)
return;
/*
- * If we do have the hash table, and the subplan does not have any
- * parameter changes, and none of our own parameter changes affect
- * input expressions of the aggregated functions, then we can just
- * rescan the existing hash table; no need to build it again.
+ * If we do have the hash table, and it never spilled, and the subplan
+ * does not have any parameter changes, and none of our own parameter
+ * changes affect input expressions of the aggregated functions, then
+ * we can just rescan the existing hash table; no need to build it
+ * again.
*/
- if (outerPlan->chgParam == NULL &&
+ if (outerPlan->chgParam == NULL && !node->hash_spilled &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
ResetTupleHashIterator(node->perhash[0].hashtable,
@@ -3485,6 +4010,16 @@ ExecReScanAgg(AggState *node)
*/
if (node->aggstrategy == AGG_HASHED || node->aggstrategy == AGG_MIXED)
{
+ hash_reset_spills(node);
+
+ /* reset stats */
+ node->hash_spilled = false;
+ node->hash_mem_init = 0;
+ node->hash_mem_peak = 0;
+ node->hash_mem_current = 0;
+ node->hash_disk_used = 0;
+ node->hash_batches_used = 0;
+
ReScanExprContext(node->hashcontext);
/* Rebuild an empty hash table */
build_hash_table(node);
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 30133634c7..14764e9c1d 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -2094,12 +2094,14 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_allpergroupsp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_setoff,
v_transno;
LLVMValueRef v_notransvalue;
+ LLVMBasicBlockRef b_check_notransvalue;
LLVMBasicBlockRef b_init;
aggstate = op->d.agg_init_trans.aggstate;
@@ -2121,11 +2123,22 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_init_trans.setoff);
v_transno = l_int32_const(op->d.agg_init_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+ b_check_notransvalue = l_bb_before_v(
+ opblocks[i + 1], "op.%d.check_notransvalue", i);
+
+ LLVMBuildCondBr(b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[i + 1],
+ b_check_notransvalue);
+
+ LLVMPositionBuilderAtEnd(b, b_check_notransvalue);
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_notransvalue =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_NOTRANSVALUE,
@@ -2192,6 +2205,9 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_transnull;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
+
+ LLVMBasicBlockRef b_check_transnull;
int jumpnull = op->d.agg_strict_trans_check.jumpnull;
@@ -2211,11 +2227,22 @@ llvm_compile_expr(ExprState *state)
l_int32_const(op->d.agg_strict_trans_check.setoff);
v_transno =
l_int32_const(op->d.agg_strict_trans_check.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ b_check_transnull = l_bb_before_v(opblocks[i + 1],
+ "op.%d.check_transnull", i);
+ LLVMBuildCondBr(b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[jumpnull],
+ b_check_transnull);
+
+ LLVMPositionBuilderAtEnd(b, b_check_transnull);
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_transnull =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_TRANSVALUEISNULL,
@@ -2257,12 +2284,15 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_pertransp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_retval;
LLVMValueRef v_tmpcontext;
LLVMValueRef v_oldcontext;
+ LLVMBasicBlockRef b_advance_transval;
+
aggstate = op->d.agg_trans.aggstate;
pertrans = op->d.agg_trans.pertrans;
@@ -2284,10 +2314,22 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_trans.setoff);
v_transno = l_int32_const(op->d.agg_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ b_advance_transval = l_bb_before_v(opblocks[i + 1],
+ "op.%d.advance_transval", i);
+
+ LLVMBuildCondBr(b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[i + 1],
+ b_advance_transval);
+
+ LLVMPositionBuilderAtEnd(b, b_advance_transval);
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_fcinfo = l_ptr_const(fcinfo,
l_ptr(StructFunctionCallInfoData));
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a2a9b1f7be..3cfc299947 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
bool enable_hashagg = true;
+bool enable_hashagg_spill = true;
bool enable_nestloop = true;
bool enable_material = true;
bool enable_mergejoin = true;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 401299e542..b3c1043c78 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4255,8 +4255,8 @@ consider_groupingsets_paths(PlannerInfo *root,
* with. Override work_mem in that case; otherwise, we'll rely on the
* sorted-input case to generate usable mixed paths.
*/
- if (hashsize > work_mem * 1024L && gd->rollups)
- return; /* nope, won't fit */
+ if (!enable_hashagg_spill && hashsize > work_mem * 1024L && gd->rollups)
+ return; /* nope, won't fit */
/*
* We need to burst the existing rollups list into individual grouping
@@ -6527,7 +6527,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
* were unable to sort above, then we'd better generate a Path, so
* that we at least have one.
*/
- if (hashaggtablesize < work_mem * 1024L ||
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L ||
grouped_rel->pathlist == NIL)
{
/*
@@ -6560,7 +6561,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
agg_final_costs,
dNumGroups);
- if (hashaggtablesize < work_mem * 1024L)
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6829,7 +6831,7 @@ create_partial_grouping_paths(PlannerInfo *root,
* Tentatively produce a partial HashAgg Path, depending on if it
* looks as if the hash table will fit in work_mem.
*/
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_total_path != NULL)
{
add_path(partially_grouped_rel, (Path *)
@@ -6856,7 +6858,7 @@ create_partial_grouping_paths(PlannerInfo *root,
dNumPartialPartialGroups);
/* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_partial_path != NULL)
{
add_partial_path(partially_grouped_rel, (Path *)
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 3bf96de256..b0cb1d7e6b 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -120,6 +120,7 @@ bool enableFsync = true;
bool allowSystemTableMods = false;
int work_mem = 1024;
int maintenance_work_mem = 16384;
+bool hashagg_mem_overflow = false;
int max_parallel_maintenance_workers = 2;
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fc463601ff..c8b44569df 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -951,6 +951,26 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_hashagg_spill", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of hashed aggregation plans that are expected to exceed work_mem."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_hashagg_spill,
+ true,
+ NULL, NULL, NULL
+ },
+ {
+ {"hashagg_mem_overflow", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables hashed aggregation to overflow work_mem at execution time."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &hashagg_mem_overflow,
+ false,
+ NULL, NULL, NULL
+ },
{
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of materialization."),
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 6e4a343439..46e12c359e 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -458,6 +458,8 @@ AllocSetContextCreateInternal(MemoryContext parent,
parent,
name);
+ ((MemoryContext) set)->mem_allocated = set->keeper->endptr - ((char *)set->keeper) + MAXALIGN(sizeof(AllocSetContext));
+
return (MemoryContext) set;
}
}
@@ -546,6 +548,8 @@ AllocSetContextCreateInternal(MemoryContext parent,
parent,
name);
+ ((MemoryContext) set)->mem_allocated = set->keeper->endptr - ((char *)set->keeper) + MAXALIGN(sizeof(AllocSetContext));
+
return (MemoryContext) set;
}
@@ -604,6 +608,8 @@ AllocSetReset(MemoryContext context)
else
{
/* Normal case, release the block */
+ context->mem_allocated -= block->endptr - ((char*) block);
+
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
@@ -688,11 +694,16 @@ AllocSetDelete(MemoryContext context)
#endif
if (block != set->keeper)
+ {
+ context->mem_allocated -= block->endptr - ((char *) block);
free(block);
+ }
block = next;
}
+ Assert(context->mem_allocated == 0);
+
/* Finally, free the context header, including the keeper block */
free(set);
}
@@ -733,6 +744,9 @@ AllocSetAlloc(MemoryContext context, Size size)
block = (AllocBlock) malloc(blksize);
if (block == NULL)
return NULL;
+
+ context->mem_allocated += blksize;
+
block->aset = set;
block->freeptr = block->endptr = ((char *) block) + blksize;
@@ -928,6 +942,8 @@ AllocSetAlloc(MemoryContext context, Size size)
if (block == NULL)
return NULL;
+ context->mem_allocated += blksize;
+
block->aset = set;
block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
block->endptr = ((char *) block) + blksize;
@@ -1028,6 +1044,9 @@ AllocSetFree(MemoryContext context, void *pointer)
set->blocks = block->next;
if (block->next)
block->next->prev = block->prev;
+
+ context->mem_allocated -= block->endptr - ((char*) block);
+
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
@@ -1144,6 +1163,7 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
AllocBlock block = (AllocBlock) (((char *) chunk) - ALLOC_BLOCKHDRSZ);
Size chksize;
Size blksize;
+ Size oldblksize;
/*
* Try to verify that we have a sane block pointer: it should
@@ -1159,6 +1179,8 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
/* Do the realloc */
chksize = MAXALIGN(size);
blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+ oldblksize = block->endptr - ((char *)block);
+
block = (AllocBlock) realloc(block, blksize);
if (block == NULL)
{
@@ -1166,6 +1188,9 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN);
return NULL;
}
+
+ context->mem_allocated += blksize - oldblksize;
+
block->freeptr = block->endptr = ((char *) block) + blksize;
/* Update pointers since block has likely been moved */
@@ -1383,6 +1408,7 @@ AllocSetCheck(MemoryContext context)
const char *name = set->header.name;
AllocBlock prevblock;
AllocBlock block;
+ int64 total_allocated = 0;
for (prevblock = NULL, block = set->blocks;
block != NULL;
@@ -1393,6 +1419,10 @@ AllocSetCheck(MemoryContext context)
long blk_data = 0;
long nchunks = 0;
+ total_allocated += block->endptr - ((char *)block);
+ if (set->keeper == block)
+ total_allocated += MAXALIGN(sizeof(AllocSetContext));
+
/*
* Empty block - empty can be keeper-block only
*/
@@ -1479,6 +1509,8 @@ AllocSetCheck(MemoryContext context)
elog(WARNING, "problem in alloc set %s: found inconsistent memory block %p",
name, block);
}
+
+ Assert(total_allocated == context->mem_allocated);
}
#endif /* MEMORY_CONTEXT_CHECKING */
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index b07be12236..27417af548 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -462,6 +462,29 @@ MemoryContextIsEmpty(MemoryContext context)
return context->methods->is_empty(context);
}
+/*
+ * Find the memory allocated to blocks for this memory context. If recurse is
+ * true, also include children.
+ */
+int64
+MemoryContextMemAllocated(MemoryContext context, bool recurse)
+{
+ int64 total = context->mem_allocated;
+
+ AssertArg(MemoryContextIsValid(context));
+
+ if (recurse)
+ {
+ MemoryContext child = context->firstchild;
+ for (child = context->firstchild;
+ child != NULL;
+ child = child->nextchild)
+ total += MemoryContextMemAllocated(child, true);
+ }
+
+ return total;
+}
+
/*
* MemoryContextStats
* Print statistics about the named context and all its descendants.
@@ -736,6 +759,7 @@ MemoryContextCreate(MemoryContext node,
node->methods = methods;
node->parent = parent;
node->firstchild = NULL;
+ node->mem_allocated = 0;
node->prevchild = NULL;
node->name = name;
node->ident = NULL;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 1fb28b4596..6f1c2f9c73 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -139,10 +139,15 @@ extern TupleHashTable BuildTupleHashTableExt(PlanState *parent,
extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
bool *isnew);
+extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable,
+ TupleTableSlot *slot,
+ bool *isnew, uint32 hash);
extern TupleHashEntry FindTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
ExprState *eqcomp,
FmgrInfo *hashfunctions);
+extern uint32 TupleHashTableHash(struct tuplehash_hash *tb,
+ const MinimalTuple tuple);
extern void ResetTupleHashTable(TupleHashTable hashtable);
/*
diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index 5c6bd93bc7..d51c1ea022 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -74,8 +74,10 @@
#define SH_DESTROY SH_MAKE_NAME(destroy)
#define SH_RESET SH_MAKE_NAME(reset)
#define SH_INSERT SH_MAKE_NAME(insert)
+#define SH_INSERT_HASH SH_MAKE_NAME(insert_hash)
#define SH_DELETE SH_MAKE_NAME(delete)
#define SH_LOOKUP SH_MAKE_NAME(lookup)
+#define SH_LOOKUP_HASH SH_MAKE_NAME(lookup_hash)
#define SH_GROW SH_MAKE_NAME(grow)
#define SH_START_ITERATE SH_MAKE_NAME(start_iterate)
#define SH_START_ITERATE_AT SH_MAKE_NAME(start_iterate_at)
@@ -144,7 +146,11 @@ SH_SCOPE void SH_DESTROY(SH_TYPE * tb);
SH_SCOPE void SH_RESET(SH_TYPE * tb);
SH_SCOPE void SH_GROW(SH_TYPE * tb, uint32 newsize);
SH_SCOPE SH_ELEMENT_TYPE *SH_INSERT(SH_TYPE * tb, SH_KEY_TYPE key, bool *found);
+SH_SCOPE SH_ELEMENT_TYPE *SH_INSERT_HASH(SH_TYPE * tb, SH_KEY_TYPE key,
+ uint32 hash, bool *found);
SH_SCOPE SH_ELEMENT_TYPE *SH_LOOKUP(SH_TYPE * tb, SH_KEY_TYPE key);
+SH_SCOPE SH_ELEMENT_TYPE *SH_LOOKUP_HASH(SH_TYPE * tb, SH_KEY_TYPE key,
+ uint32 hash);
SH_SCOPE bool SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key);
SH_SCOPE void SH_START_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter);
SH_SCOPE void SH_START_ITERATE_AT(SH_TYPE * tb, SH_ITERATOR * iter, uint32 at);
@@ -499,7 +505,14 @@ SH_GROW(SH_TYPE * tb, uint32 newsize)
SH_SCOPE SH_ELEMENT_TYPE *
SH_INSERT(SH_TYPE * tb, SH_KEY_TYPE key, bool *found)
{
- uint32 hash = SH_HASH_KEY(tb, key);
+ uint32 hash = SH_HASH_KEY(tb, key);
+
+ return SH_INSERT_HASH(tb, key, hash, found);
+}
+
+SH_SCOPE SH_ELEMENT_TYPE *
+SH_INSERT_HASH(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash, bool *found)
+{
uint32 startelem;
uint32 curelem;
SH_ELEMENT_TYPE *data;
@@ -669,7 +682,14 @@ restart:
SH_SCOPE SH_ELEMENT_TYPE *
SH_LOOKUP(SH_TYPE * tb, SH_KEY_TYPE key)
{
- uint32 hash = SH_HASH_KEY(tb, key);
+ uint32 hash = SH_HASH_KEY(tb, key);
+
+ return SH_LOOKUP_HASH(tb, key, hash);
+}
+
+SH_SCOPE SH_ELEMENT_TYPE *
+SH_LOOKUP_HASH(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash)
+{
const uint32 startelem = SH_INITIAL_BUCKET(tb, hash);
uint32 curelem = startelem;
@@ -971,8 +991,10 @@ SH_STAT(SH_TYPE * tb)
#undef SH_DESTROY
#undef SH_RESET
#undef SH_INSERT
+#undef SH_INSERT_HASH
#undef SH_DELETE
#undef SH_LOOKUP
+#undef SH_LOOKUP_HASH
#undef SH_GROW
#undef SH_START_ITERATE
#undef SH_START_ITERATE_AT
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 61a24c2e3c..be9ae1028d 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -244,6 +244,7 @@ extern bool enableFsync;
extern PGDLLIMPORT bool allowSystemTableMods;
extern PGDLLIMPORT int work_mem;
extern PGDLLIMPORT int maintenance_work_mem;
+extern PGDLLIMPORT bool hashagg_mem_overflow;
extern PGDLLIMPORT int max_parallel_maintenance_workers;
extern int VacuumCostPageHit;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 98bdcbcef5..419de41170 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2022,13 +2022,24 @@ typedef struct AggState
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
/* these fields are used in AGG_HASHED and AGG_MIXED modes: */
bool table_filled; /* hash table filled yet? */
- int num_hashes;
+ int num_hashes; /* number of hash tables active at once */
+ bool hash_spilled; /* any hash table ever spilled? */
+ struct HashAggSpill *hash_spills; /* HashAggSpill for each hash table,
+ exists only during first pass if spilled */
+ TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
+ uint64 hash_mem_init; /* initial hash table memory usage */
+ uint64 hash_mem_peak; /* peak hash table memory usage */
+ uint64 hash_mem_current; /* current hash table memory usage */
+ uint64 hash_disk_used; /* bytes of disk space used */
+ int hash_batches_used; /* batches used during entire execution */
+ List *hash_batches; /* hash batches remaining to be processed */
+
AggStatePerHash perhash; /* array of per-hashtable data */
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 34
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 43
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
@@ -2200,7 +2211,7 @@ typedef struct HashInstrumentation
int nbuckets_original; /* planned number of buckets */
int nbatch; /* number of batches at end of execution */
int nbatch_original; /* planned number of batches */
- size_t space_peak; /* speak memory usage in bytes */
+ size_t space_peak; /* peak memory usage in bytes */
} HashInstrumentation;
/* ----------------
diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
index dbae98d3d9..df0ae3625c 100644
--- a/src/include/nodes/memnodes.h
+++ b/src/include/nodes/memnodes.h
@@ -79,6 +79,7 @@ typedef struct MemoryContextData
/* these two fields are placed here to minimize alignment wastage: */
bool isReset; /* T = no space alloced since last reset */
bool allowInCritSection; /* allow palloc in critical section */
+ int64 mem_allocated; /* track memory allocated for this context */
const MemoryContextMethods *methods; /* virtual function table */
MemoryContext parent; /* NULL if no parent (toplevel context) */
MemoryContext firstchild; /* head of linked list of children */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b3d0b4f6fb..b72e2d0829 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -54,6 +54,7 @@ extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
extern PGDLLIMPORT bool enable_hashagg;
+extern PGDLLIMPORT bool enable_hashagg_spill;
extern PGDLLIMPORT bool enable_nestloop;
extern PGDLLIMPORT bool enable_material;
extern PGDLLIMPORT bool enable_mergejoin;
diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
index ffe6de536e..6a837bc990 100644
--- a/src/include/utils/memutils.h
+++ b/src/include/utils/memutils.h
@@ -82,6 +82,7 @@ extern void MemoryContextSetParent(MemoryContext context,
extern Size GetMemoryChunkSpace(void *pointer);
extern MemoryContext MemoryContextGetParent(MemoryContext context);
extern bool MemoryContextIsEmpty(MemoryContext context);
+extern int64 MemoryContextMemAllocated(MemoryContext context, bool recurse);
extern void MemoryContextStats(MemoryContext context);
extern void MemoryContextStatsDetail(MemoryContext context, int max_children);
extern void MemoryContextAllowInCriticalSection(MemoryContext context,
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index ef8eec3fbf..8fa4c7466b 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2331,3 +2331,95 @@ explain (costs off)
-> Seq Scan on onek
(8 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+set work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------------
+ GroupAggregate
+ Group Key: ((g % 100000))
+ -> Sort
+ Sort Key: ((g % 100000))
+ -> Function Scan on generate_series g
+(5 rows)
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+set jit_above_cost to default;
+create table agg_group_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_group_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+-- Produce results with hash aggregation
+set enable_hashagg = true;
+set enable_sort = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 100000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+set jit_above_cost to default;
+create table agg_hash_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_hash_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+set enable_sort = true;
+set work_mem to default;
+-- Compare group aggregation results to hash aggregation results
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index 5d92b08d20..7d7fc929c7 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1494,22 +1494,18 @@ explain (costs off)
count(hundred), count(thousand), count(twothousand),
count(*)
from tenk1 group by grouping sets (unique1,twothousand,thousand,hundred,ten,four,two);
- QUERY PLAN
--------------------------------
- MixedAggregate
- Hash Key: two
- Hash Key: four
- Hash Key: ten
+ QUERY PLAN
+-------------------------
+ HashAggregate
+ Hash Key: unique1
+ Hash Key: twothousand
+ Hash Key: thousand
Hash Key: hundred
- Group Key: unique1
- Sort Key: twothousand
- Group Key: twothousand
- Sort Key: thousand
- Group Key: thousand
- -> Sort
- Sort Key: unique1
- -> Seq Scan on tenk1
-(13 rows)
+ Hash Key: ten
+ Hash Key: four
+ Hash Key: two
+ -> Seq Scan on tenk1
+(9 rows)
explain (costs off)
select unique1,
@@ -1517,18 +1513,16 @@ explain (costs off)
count(hundred), count(thousand), count(twothousand),
count(*)
from tenk1 group by grouping sets (unique1,hundred,ten,four,two);
- QUERY PLAN
--------------------------------
- MixedAggregate
- Hash Key: two
- Hash Key: four
- Hash Key: ten
+ QUERY PLAN
+-------------------------
+ HashAggregate
+ Hash Key: unique1
Hash Key: hundred
- Group Key: unique1
- -> Sort
- Sort Key: unique1
- -> Seq Scan on tenk1
-(9 rows)
+ Hash Key: ten
+ Hash Key: four
+ Hash Key: two
+ -> Seq Scan on tenk1
+(7 rows)
set work_mem = '384kB';
explain (costs off)
@@ -1537,21 +1531,18 @@ explain (costs off)
count(hundred), count(thousand), count(twothousand),
count(*)
from tenk1 group by grouping sets (unique1,twothousand,thousand,hundred,ten,four,two);
- QUERY PLAN
--------------------------------
- MixedAggregate
- Hash Key: two
- Hash Key: four
- Hash Key: ten
- Hash Key: hundred
+ QUERY PLAN
+-------------------------
+ HashAggregate
+ Hash Key: unique1
+ Hash Key: twothousand
Hash Key: thousand
- Group Key: unique1
- Sort Key: twothousand
- Group Key: twothousand
- -> Sort
- Sort Key: unique1
- -> Seq Scan on tenk1
-(12 rows)
+ Hash Key: hundred
+ Hash Key: ten
+ Hash Key: four
+ Hash Key: two
+ -> Seq Scan on tenk1
+(9 rows)
-- check collation-sensitive matching between grouping expressions
-- (similar to a check for aggregates, but there are additional code
@@ -1578,4 +1569,123 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
| 1 | 2
(4 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+-- Produce results with hash aggregation.
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------
+ MixedAggregate
+ Hash Key: (g.g % 1000), (g.g % 100), (g.g % 10)
+ Hash Key: (g.g % 1000), (g.g % 100)
+ Hash Key: (g.g % 1000)
+ Hash Key: (g.g % 100), (g.g % 10)
+ Hash Key: (g.g % 100)
+ Hash Key: (g.g % 10), (g.g % 1000)
+ Hash Key: (g.g % 10)
+ Group Key: ()
+ -> Function Scan on generate_series g
+(10 rows)
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+set enable_sort = true;
+set work_mem to default;
+-- Compare results
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+ g100 | g10 | unnest | c | m
+------+-----+--------+---+---
+(0 rows)
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
-- end
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..11c6f50fbf 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -148,6 +148,68 @@ SELECT count(*) FROM
4
(1 row)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+SET enable_hashagg=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------------
+ Unique
+ -> Sort
+ Sort Key: ((g % 1000))
+ -> Function Scan on generate_series g
+(4 rows)
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_hashagg=TRUE;
+-- Produce results with hash aggregation.
+SET enable_sort=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 1000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_sort=TRUE;
+SET work_mem TO DEFAULT;
+-- Compare results
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..c40bf6c16e 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -75,6 +75,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
+ enable_hashagg_spill | on
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 17fb256aec..bcd336c581 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1017,3 +1017,91 @@ select v||'a', case when v||'a' = 'aa' then 1 else 0 end, count(*)
explain (costs off)
select 1 from tenk1
where (hundred, thousand) in (select twothousand, twothousand from onek);
+
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+set work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+set jit_above_cost to default;
+
+create table agg_group_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_group_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+-- Produce results with hash aggregation
+
+set enable_hashagg = true;
+set enable_sort = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+set jit_above_cost to default;
+
+create table agg_hash_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_hash_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare group aggregation results to hash aggregation results
+
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index d8f78fcc00..264c3ab5c2 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -429,4 +429,103 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
from unnest(array[1,1], array['a','b']) u(i,v)
group by rollup(i, v||'a') order by 1,3;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+-- Produce results with hash aggregation.
+
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare results
+
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
+
-- end
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..33102744eb 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -45,6 +45,68 @@ SELECT count(*) FROM
SELECT count(*) FROM
(SELECT DISTINCT two, four, two FROM tenk1) ss;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+SET enable_hashagg=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_hashagg=TRUE;
+
+-- Produce results with hash aggregation.
+
+SET enable_sort=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_sort=TRUE;
+
+SET work_mem TO DEFAULT;
+
+-- Compare results
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
+
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
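To make the new knobs concrete, here is a minimal usage sketch based on the
GUCs and EXPLAIN ANALYZE output the patch adds; the figures and exact
formatting shown in the comments are illustrative only:

  set work_mem = '64kB';
  set enable_hashagg_spill = on;    -- planner may pick HashAgg even if it expects to exceed work_mem
  set hashagg_mem_overflow = off;   -- executor spills to disk instead of overflowing work_mem

  explain (analyze, costs off)
  select g % 100000 as c1, sum(g::numeric) as c2, count(*) as c3
    from generate_series(0, 199999) g
   group by g % 100000;
  -- The HashAggregate node then reports something like:
  --   Memory Usage: NNNkB  Batches: N  Disk Usage: NNNkB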
On Thu, Jul 11, 2019 at 06:06:33PM -0700, Jeff Davis wrote:
On Thu, 2019-07-11 at 17:55 +0200, Tomas Vondra wrote:
Makes sense. I haven't thought about how the hybrid approach would be
implemented very much, so I can't quite judge how complicated would
it be
to extend "approach 1" later. But if you think it's a sensible first
step,
I trust you. And I certainly agree we need something to compare the
other approaches against.

Is this a duplicate of your previous email?
Yes. I don't know how I managed to send it again. Sorry.
I'm slightly confused but I will use the opportunity to put out another
WIP patch. The patch could use a few rounds of cleanup and quality
work, but the functionality is there and the performance seems
reasonable.

I rebased on master and fixed a few bugs, and most importantly, added
tests.

It seems to be working with grouping sets fine. It will take a little
longer to get good performance numbers, but even for group size of one,
I'm seeing HashAgg get close to Sort+Group in some cases.
Nice! That's very nice progress!
You are right that the missed lookups appear to be costly, at least
when the data all fits in system memory. I think it's the cache misses,
because sometimes reducing work_mem improves performance. I'll try
tuning the number of buckets for the hash table and see if that helps.
If not, then the performance still seems pretty good to me.

Of course, HashAgg can beat sort for larger group sizes, but I'll try
to gather some more data on the cross-over point.
Yes, makes sense. I think it's acceptable as long as we consider this
during costing (when we know in advance we'll need this) or treat it as an
emergency measure.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
High-level approaches:
1. When the in-memory hash table fills, keep existing entries in the
hash table, and spill the raw tuples for all new groups in a
partitioned fashion. When all input tuples are read, finalize groups
in memory and emit. Now that the in-memory hash table is cleared (and
memory context reset), process a spill file the same as the original
input, but this time with a fraction of the group cardinality.2. When the in-memory hash table fills, partition the hash space, and
evict the groups from all partitions except one by writing out their
partial aggregate states to disk. Any input tuples belonging to an
evicted partition get spilled to disk. When the input is read
entirely, finalize the groups remaining in memory and emit. Now that
the in-memory hash table is cleared, process the next partition by
loading its partial states into the hash table, and then processing
its spilled tuples.
I'm late to the party.
These two approaches both spill the input tuples, what if the skewed
groups are not encountered before the hash table fills up? The spill
files' size and disk I/O could be downsides.
Greenplum spills all the groups by writing the partial aggregate states,
reset the memory context, process incoming tuples and build in-memory
hash table, then reload and combine the spilled partial states at last,
how does this sound?
--
Adam Lee
On Fri, 2019-08-02 at 14:44 +0800, Adam Lee wrote:
I'm late to the party.
You are welcome to join any time!
These two approaches both spill the input tuples, what if the skewed
groups are not encountered before the hash table fills up? The spill
files' size and disk I/O could be downsides.
Let's say the worst case is that we encounter 10 million groups of size
one first; just enough to fill up memory. Then, we encounter a single
additional group of size 20 million, and need to write out all of those
20 million raw tuples. That's still not worse than Sort+GroupAgg which
would need to write out all 30 million raw tuples (in practice Sort is
pretty fast so may still win in some cases, but not by any huge
amount).
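For concreteness, a data set with roughly that shape could be built along
these lines (the table and column names are made up, the sizes can be
scaled down, and whether the big group is really seen late depends on the
plan):

  create table skew_demo as
    select g as k from generate_series(1, 10000000) g   -- 10 million groups of size one
    union all
    select 0 from generate_series(1, 20000000);         -- one skewed group of 20 million tuples

  -- select k, count(*) from skew_demo group by k;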
Greenplum spills all the groups by writing the partial aggregate states,
reset the memory context, process incoming tuples and build in-memory
hash table, then reload and combine the spilled partial states at last,
how does this sound?
That can be done as an add-on to approach #1 by evicting the entire
hash table (writing out the partial states), then resetting the memory
context.
It does add to the complexity though, and would only work for the
aggregates that support serializing and combining partial states. It
also might be a net loss to do the extra work of initializing and
evicting a partial state if we don't have large enough groups to
benefit.
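(For reference, the aggregates with that support are visible in the
catalog; a rough sketch using the standard pg_aggregate columns:)

  -- Aggregates that can combine partial states; those with an 'internal'
  -- transition type additionally need serial/deserial functions.
  select aggfnoid, aggcombinefn, aggserialfn, aggdeserialfn
    from pg_aggregate
   where aggcombinefn <> 0;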
Given that the worst case isn't worse than Sort+GroupAgg, I think it
should be left as a future optimization. That would give us time to
tune the process to work well in a variety of cases.
Regards,
Jeff Davis
On Fri, Aug 02, 2019 at 08:11:19AM -0700, Jeff Davis wrote:
On Fri, 2019-08-02 at 14:44 +0800, Adam Lee wrote:
I'm late to the party.
You are welcome to join any time!
These two approaches both spill the input tuples, what if the skewed
groups are not encountered before the hash table fills up? The spill
files' size and disk I/O could be downsides.

Let's say the worst case is that we encounter 10 million groups of size
one first; just enough to fill up memory. Then, we encounter a single
additional group of size 20 million, and need to write out all of those
20 million raw tuples. That's still not worse than Sort+GroupAgg which
would need to write out all 30 million raw tuples (in practice Sort is
pretty fast so may still win in some cases, but not by any huge
amount).

Greenplum spills all the groups by writing the partial aggregate states,
reset the memory context, process incoming tuples and build in-memory
hash table, then reload and combine the spilled partial states at
last, how does this sound?

That can be done as an add-on to approach #1 by evicting the entire
hash table (writing out the partial states), then resetting the memory
context.

It does add to the complexity though, and would only work for the
aggregates that support serializing and combining partial states. It
also might be a net loss to do the extra work of initializing and
evicting a partial state if we don't have large enough groups to
benefit.

Given that the worst case isn't worse than Sort+GroupAgg, I think it
should be left as a future optimization. That would give us time to
tune the process to work well in a variety of cases.
+1 to leaving that as a future optimization
I think it's clear there's no perfect eviction strategy - for every
algorithm we come up with, we can construct a data set on which it
performs terribly (I'm sure we could do that for the approach used by
Greenplum, for example).
So I think it makes sense to do what Jeff proposed, and then maybe try
improving that in the future with a switch to a different eviction
strategy based on some heuristics.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I started to review this patch yesterday with Melanie Plageman, so we
rebased it over the current master. The main conflicts were due to a
simplehash patch that has been committed separately[1]. I've
attached the rebased patch.
I was playing with the code, and if one of the table's most common
values isn't placed into the initial hash table, it spills a whole lot
of tuples to disk that might have been avoided if we had some way to
'seed' the hash table with MCVs from the statistics. Seems to me that
you would need some way of dealing with values that are in the MCV
list, but ultimately don't show up in the scan. I imagine that this
kind of optimization would be most useful for aggregates on a full table
scan.
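The statistics such a seeding step could draw from are already exposed in
pg_stats, e.g. (any analyzed table and column would do):

  select most_common_vals, most_common_freqs
    from pg_stats
   where tablename = 'tenk1' and attname = 'hundred';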
Some questions:
Right now the patch always initializes 32 spill partitions. Have you given
any thought yet to how to intelligently pick an optimal number of
partitions?
That can be done as an add-on to approach #1 by evicting the entire
hash table (writing out the partial states), then resetting the memory
context.
By add-on approach, do you mean to say that you have something in mind
to combine the two strategies? Or do you mean that it could be implemented
as a separate strategy?
I think it's clear there's no perfect eviction strategy - for every
algorithm we come up with, we can construct a data set on which it
performs terribly (I'm sure we could do that for the approach used by
Greenplum, for example).

So I think it makes sense to do what Jeff proposed, and then maybe try
improving that in the future with a switch to a different eviction
strategy based on some heuristics.
I agree. It definitely feels like both spilling strategies have their
own use case.
That said, I think it's worth mentioning that with parallel aggregates
it might actually be more useful to spill the trans values instead,
and have them combined in a Gather or Finalize stage.
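For reference, that is essentially the plan shape parallel aggregation
already produces today; a sketch (whether the planner actually chooses it
depends on costs and table size, and small tables may need
parallel_setup_cost / min_parallel_table_scan_size lowered):

  set max_parallel_workers_per_gather = 2;
  explain (costs off)
  select hundred, count(*) from tenk1 group by hundred;
  --  Finalize HashAggregate
  --    Group Key: hundred
  --    ->  Gather
  --          Workers Planned: 2
  --          ->  Partial HashAggregate
  --                Group Key: hundred
  --                ->  Parallel Seq Scan on tenk1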
[1]: /messages/by-id/48abe675e1330f0c264ab2fe0d4ff23eb244f9ef.camel@j-davis.com
Attachments:
v1-0001-Rebased-memory-bounded-hash-aggregation.patch
From 62c9c7a6213310504edaae4e47c4391f40034e2a Mon Sep 17 00:00:00 2001
From: Jeff Davis <jdavis@postgresql.org>
Date: Tue, 27 Aug 2019 17:21:38 +0000
Subject: [PATCH v1] Rebased memory bounded hash aggregation
---
doc/src/sgml/config.sgml | 35 +
src/backend/commands/explain.c | 53 ++
src/backend/executor/execExprInterp.c | 35 +-
src/backend/executor/execGrouping.c | 39 +-
src/backend/executor/nodeAgg.c | 599 +++++++++++++++++-
src/backend/jit/llvm/llvmjit_expr.c | 66 +-
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/plan/planner.c | 14 +-
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc.c | 20 +
src/backend/utils/mmgr/aset.c | 32 +
src/backend/utils/mmgr/mcxt.c | 24 +
src/include/executor/executor.h | 5 +
src/include/miscadmin.h | 1 +
src/include/nodes/execnodes.h | 17 +-
src/include/nodes/memnodes.h | 1 +
src/include/optimizer/cost.h | 1 +
src/include/utils/memutils.h | 1 +
src/test/regress/expected/aggregates.out | 92 +++
src/test/regress/expected/groupingsets.out | 190 ++++--
src/test/regress/expected/select_distinct.out | 62 ++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/aggregates.sql | 88 +++
src/test/regress/sql/groupingsets.sql | 99 +++
src/test/regress/sql/select_distinct.sql | 62 ++
25 files changed, 1420 insertions(+), 121 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 89284dc5c0..f525e64b68 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1715,6 +1715,23 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-hashagg-mem-overflow" xreflabel="hashagg_mem_overflow">
+ <term><varname>hashagg_mem_overflow</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>hashagg_mem_overflow</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ If hash aggregation exceeds <varname>work_mem</varname> at query
+ execution time, and <varname>hashagg_mem_overflow</varname> is set
+ to <literal>on</literal>, continue consuming more memory rather than
+ performing disk-based hash aggregation. The default
+ is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
@@ -4367,6 +4384,24 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-hashagg-spill" xreflabel="enable_hashagg_spill">
+ <term><varname>enable_hashagg_spill</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_hashagg_spill</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the memory usage is expected to
+ exceed <varname>work_mem</varname>. This only affects the planner
+ choice; actual behavior at execution time is dictated by
+ <xref linkend="guc-hashagg-mem-overflow"/>. The default
+ is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
<term><varname>enable_hashjoin</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 62fb3434a3..092a79ea14 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -102,6 +102,7 @@ static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_hashagg_info(AggState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
static void show_instrumentation_count(const char *qlabel, int which,
@@ -1826,6 +1827,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Agg:
show_agg_keys(castNode(AggState, planstate), ancestors, es);
show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+ if (es->analyze)
+ show_hashagg_info((AggState *) planstate, es);
if (plan->qual)
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
@@ -2715,6 +2718,56 @@ show_hash_info(HashState *hashstate, ExplainState *es)
}
}
+/*
+ * If EXPLAIN ANALYZE, show information on hash aggregate memory usage and
+ * batches.
+ */
+static void
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+ long diskKb = (aggstate->hash_disk_used + 1023) / 1024;
+
+
+ Assert(IsA(aggstate, AggState));
+
+ if (agg->aggstrategy != AGG_HASHED &&
+ agg->aggstrategy != AGG_MIXED)
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(
+ es->str,
+ "Memory Usage: %ldkB",
+ memPeakKb);
+
+ if (aggstate->hash_batches_used > 0)
+ {
+ appendStringInfo(
+ es->str,
+ " Batches: %d Disk Usage:%ldkB",
+ aggstate->hash_batches_used, diskKb);
+ }
+
+ appendStringInfo(
+ es->str,
+ "\n");
+ }
+ else
+ {
+ ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
+ if (aggstate->hash_batches_used > 0)
+ {
+ ExplainPropertyInteger("HashAgg Batches", NULL,
+ aggstate->hash_batches_used, es);
+ ExplainPropertyInteger("Disk Usage", "kB", diskKb, es);
+ }
+ }
+}
+
/*
* If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
*/
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index d61f75bc3b..a477eb05e1 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -1564,14 +1564,14 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate;
AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
aggstate = op->d.agg_init_trans.aggstate;
- pergroup = &aggstate->all_pergroups
- [op->d.agg_init_trans.setoff]
- [op->d.agg_init_trans.transno];
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_init_trans.setoff];
+ pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
/* If transValue has not yet been initialized, do so now. */
- if (pergroup->noTransValue)
+ if (pergroup_allaggs != NULL && pergroup->noTransValue)
{
AggStatePerTrans pertrans = op->d.agg_init_trans.pertrans;
@@ -1592,13 +1592,14 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate;
AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
aggstate = op->d.agg_strict_trans_check.aggstate;
- pergroup = &aggstate->all_pergroups
- [op->d.agg_strict_trans_check.setoff]
- [op->d.agg_strict_trans_check.transno];
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_strict_trans_check.setoff];
+ pergroup = &pergroup_allaggs[op->d.agg_strict_trans_check.transno];
- if (unlikely(pergroup->transValueIsNull))
+ if (pergroup_allaggs == NULL ||
+ unlikely(pergroup->transValueIsNull))
EEO_JUMP(op->d.agg_strict_trans_check.jumpnull);
EEO_NEXT();
@@ -1614,6 +1615,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
AggState *aggstate;
AggStatePerTrans pertrans;
AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
FunctionCallInfo fcinfo;
MemoryContext oldContext;
Datum newVal;
@@ -1621,9 +1623,11 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
aggstate = op->d.agg_trans.aggstate;
pertrans = op->d.agg_trans.pertrans;
- pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
- [op->d.agg_trans.transno];
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
Assert(pertrans->transtypeByVal);
@@ -1665,6 +1669,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
AggState *aggstate;
AggStatePerTrans pertrans;
AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
FunctionCallInfo fcinfo;
MemoryContext oldContext;
Datum newVal;
@@ -1672,9 +1677,11 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
aggstate = op->d.agg_trans.aggstate;
pertrans = op->d.agg_trans.pertrans;
- pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
- [op->d.agg_trans.transno];
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
Assert(!pertrans->transtypeByVal);
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 14ee8db3f9..8f5404b3d6 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -25,7 +25,6 @@
#include "utils/hashutils.h"
#include "utils/memutils.h"
-static uint32 TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple);
static int TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
/*
@@ -288,6 +287,28 @@ ResetTupleHashTable(TupleHashTable hashtable)
TupleHashEntry
LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
bool *isnew)
+{
+ MemoryContext oldContext;
+ uint32 hash;
+
+ /* Need to run the hash functions in short-lived context */
+ oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
+
+ /* set up data needed by hash and match functions */
+ hashtable->inputslot = slot;
+ hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ hashtable->cur_eq_func = hashtable->tab_eq_func;
+
+ hash = TupleHashTableHash(hashtable->hashtab, NULL);
+
+ MemoryContextSwitchTo(oldContext);
+
+ return LookupTupleHashEntryHash(hashtable, slot, isnew, hash);
+}
+
+TupleHashEntry
+LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot,
+ bool *isnew, uint32 hash)
{
TupleHashEntryData *entry;
MemoryContext oldContext;
@@ -306,7 +327,7 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
if (isnew)
{
- entry = tuplehash_insert(hashtable->hashtab, key, &found);
+ entry = tuplehash_insert_hash(hashtable->hashtab, key, hash, &found);
if (found)
{
@@ -326,7 +347,7 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
}
else
{
- entry = tuplehash_lookup(hashtable->hashtab, key);
+ entry = tuplehash_lookup_hash(hashtable->hashtab, key, hash);
}
MemoryContextSwitchTo(oldContext);
@@ -371,17 +392,12 @@ FindTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
/*
* Compute the hash value for a tuple
*
- * The passed-in key is a pointer to TupleHashEntryData. In an actual hash
- * table entry, the firstTuple field points to a tuple (in MinimalTuple
- * format). LookupTupleHashEntry sets up a dummy TupleHashEntryData with a
- * NULL firstTuple field --- that cues us to look at the inputslot instead.
- * This convention avoids the need to materialize virtual input tuples unless
- * they actually need to get copied into the table.
+ * If tuple is NULL, use the input slot instead.
*
* Also, the caller must select an appropriate memory context for running
* the hash functions. (dynahash.c doesn't change CurrentMemoryContext.)
*/
-static uint32
+uint32
TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
{
TupleHashTable hashtable = (TupleHashTable) tb->private_data;
@@ -402,9 +418,6 @@ TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
{
/*
* Process a tuple already stored in the table.
- *
- * (this case never actually occurs due to the way simplehash.h is
- * used, as the hash-value is stored in the entries)
*/
slot = hashtable->tableslot;
ExecStoreMinimalTuple(tuple, slot, false);
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 58c376aeb7..d60ac3d47c 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -229,8 +229,10 @@
#include "optimizer/optimizer.h"
#include "parser/parse_agg.h"
#include "parser/parse_coerce.h"
+#include "storage/buffile.h"
#include "utils/acl.h"
#include "utils/builtins.h"
+#include "utils/dynahash.h"
#include "utils/expandeddatum.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -238,6 +240,30 @@
#include "utils/tuplesort.h"
#include "utils/datum.h"
+/*
+ * Represents partitioned spill data for a single hashtable.
+ */
+typedef struct HashAggSpill
+{
+ int n_partitions; /* number of output partitions */
+ int partition_bits; /* number of bits for partition mask;
+ equal to log2(n_partitions) */
+ BufFile **partitions; /* output partition files */
+ int64 *ntuples; /* number of tuples in each partition */
+} HashAggSpill;
+
+/*
+ * Represents work to be done for one pass of hash aggregation. Initially,
+ * only the input fields are set. If spilled to disk, also set the spill data.
+ */
+typedef struct HashAggBatch
+{
+ BufFile *input_file; /* input partition */
+ int input_bits; /* number of bits for input partition mask */
+ int64 input_groups; /* estimated number of input groups */
+ int setno; /* grouping set */
+ HashAggSpill spill; /* spill output */
+} HashAggBatch;
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
static void initialize_phase(AggState *aggstate, int newphase);
@@ -273,11 +299,25 @@ static TupleTableSlot *project_aggregates(AggState *aggstate);
static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
static void build_hash_table(AggState *aggstate);
-static TupleHashEntryData *lookup_hash_entry(AggState *aggstate);
+static void prepare_hash_slot(AggState *aggstate);
+static uint32 calculate_hash(AggState *aggstate);
+static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
+static bool agg_refill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
+static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
+static Size hash_spill_tuple(HashAggSpill *spill, int input_bits,
+ TupleTableSlot *slot, uint32 hash);
+static MinimalTuple hash_read_spilled(BufFile *file, uint32 *hashp);
+static HashAggBatch *hash_batch_new(BufFile *input_file, int setno,
+ int64 input_groups, int input_bits);
+static void hash_finish_initial_spills(AggState *aggstate);
+static void hash_spill_finish(AggState *aggstate, HashAggSpill *spill,
+ int setno, int input_bits);
+static void hash_reset_spill(HashAggSpill *spill);
+static void hash_reset_spills(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
AggState *aggstate, EState *estate,
@@ -1270,6 +1310,10 @@ build_hash_table(AggState *aggstate)
Assert(aggstate->aggstrategy == AGG_HASHED || aggstate->aggstrategy == AGG_MIXED);
+ /* TODO: work harder to find a good nGroups for each hash table. We don't
+ * want the hash table itself to fill up work_mem with no room for
+ * out-of-line transition values. Also, we need to consider that there are
+ * multiple hash tables for grouping sets. */
additionalsize = aggstate->numtrans * sizeof(AggStatePerGroupData);
for (i = 0; i < aggstate->num_hashes; ++i)
@@ -1295,6 +1339,15 @@ build_hash_table(AggState *aggstate)
tmpmem,
DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
}
+
+ /*
+ * Set initial size to be that of an empty hash table. This ensures that
+ * at least one entry can be added before it exceeds work_mem; otherwise
+ * the algorithm might not make progress.
+ */
+ aggstate->hash_mem_init = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_mem_current = aggstate->hash_mem_init;
}
/*
@@ -1455,23 +1508,13 @@ hash_agg_entry_size(int numAggs)
return entrysize;
}
-/*
- * Find or create a hashtable entry for the tuple group containing the current
- * tuple (already set in tmpcontext's outertuple slot), in the current grouping
- * set (which the caller must have selected - note that initialize_aggregate
- * depends on this).
- *
- * When called, CurrentMemoryContext should be the per-query context.
- */
-static TupleHashEntryData *
-lookup_hash_entry(AggState *aggstate)
+static void
+prepare_hash_slot(AggState *aggstate)
{
- TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
- TupleTableSlot *hashslot = perhash->hashslot;
- TupleHashEntryData *entry;
- bool isnew;
- int i;
+ TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ int i;
/* transfer just the needed columns into hashslot */
slot_getsomeattrs(inputslot, perhash->largestGrpColIdx);
@@ -1485,14 +1528,70 @@ lookup_hash_entry(AggState *aggstate)
hashslot->tts_isnull[i] = inputslot->tts_isnull[varNumber];
}
ExecStoreVirtualTuple(hashslot);
+}
+
+static uint32
+calculate_hash(AggState *aggstate)
+{
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleHashTable hashtable = perhash->hashtable;
+ MemoryContext oldContext;
+ uint32 hash;
+
+ /* set up data needed by hash and match functions */
+ hashtable->inputslot = perhash->hashslot;
+ hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ hashtable->cur_eq_func = hashtable->tab_eq_func;
+
+ /* Need to run the hash functions in short-lived context */
+ oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
+
+ hash = TupleHashTableHash(hashtable->hashtab, NULL);
+
+ MemoryContextSwitchTo(oldContext);
+
+ return hash;
+}
+
+/*
+ * Find or create a hashtable entry for the tuple group containing the current
+ * tuple (already set in tmpcontext's outertuple slot), in the current grouping
+ * set (which the caller must have selected - note that initialize_aggregate
+ * depends on this).
+ *
+ * When called, CurrentMemoryContext should be the per-query context.
+ */
+static AggStatePerGroup
+lookup_hash_entry(AggState *aggstate, uint32 hash)
+{
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ TupleHashEntryData *entry;
+ bool isnew = false;
/* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntry(perhash->hashtable, hashslot, &isnew);
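+ /*
+ * If the hash table is already past its memory limit (and we are not
+ * allowed to overflow), look the entry up without creating it: passing
+ * isnew = NULL means an unknown group returns NULL, which tells the
+ * caller to spill this tuple instead of growing the table further.
+ */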
+ if (!hashagg_mem_overflow &&
+ aggstate->hash_mem_current > work_mem * 1024L &&
+ aggstate->hash_mem_current > aggstate->hash_mem_init)
+ entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot,
+ NULL, hash);
+ else
+ entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot,
+ &isnew, hash);
+
+ if (entry == NULL)
+ return NULL;
if (isnew)
{
- AggStatePerGroup pergroup;
- int transno;
+ AggStatePerGroup pergroup;
+ int transno;
+
+ aggstate->hash_mem_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+
+ if (aggstate->hash_mem_current > aggstate->hash_mem_peak)
+ aggstate->hash_mem_peak = aggstate->hash_mem_current;
pergroup = (AggStatePerGroup)
MemoryContextAlloc(perhash->hashtable->tablecxt,
@@ -1512,7 +1611,7 @@ lookup_hash_entry(AggState *aggstate)
}
}
- return entry;
+ return entry->additional;
}
/*
@@ -1520,18 +1619,38 @@ lookup_hash_entry(AggState *aggstate)
* returning an array of pergroup pointers suitable for advance_aggregates.
*
* Be aware that lookup_hash_entry can reset the tmpcontext.
+ *
+ * If a group's hash entry cannot be created because the hash table has
+ * exceeded its memory limit, the tuple is spilled to disk instead and the
+ * corresponding pergroup pointer is left NULL.
*/
static void
lookup_hash_entries(AggState *aggstate)
{
- int numHashes = aggstate->num_hashes;
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
int setno;
- for (setno = 0; setno < numHashes; setno++)
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
{
+ uint32 hash;
+
select_current_set(aggstate, setno, true);
- pergroup[setno] = lookup_hash_entry(aggstate)->additional;
+ prepare_hash_slot(aggstate);
+ hash = calculate_hash(aggstate);
+ pergroup[setno] = lookup_hash_entry(aggstate, hash);
+
+ if (pergroup[setno] == NULL)
+ {
+ HashAggSpill *spill;
+ TupleTableSlot *slot = aggstate->tmpcontext->ecxt_outertuple;
+
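+ /*
+ * The array of per-set spill states is allocated lazily, on the first
+ * tuple that fails to find room in any hash table.
+ */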
+ if (aggstate->hash_spills == NULL)
+ aggstate->hash_spills = palloc0(
+ sizeof(HashAggSpill) * aggstate->num_hashes);
+ aggstate->hash_spilled = true;
+
+ spill = &aggstate->hash_spills[setno];
+
+ aggstate->hash_disk_used += hash_spill_tuple(spill, 0, slot, hash);
+ }
}
}
@@ -1853,6 +1972,10 @@ agg_retrieve_direct(AggState *aggstate)
outerslot = fetch_input_tuple(aggstate);
if (TupIsNull(outerslot))
{
+ if (aggstate->aggstrategy == AGG_MIXED &&
+ aggstate->current_phase == 1)
+ hash_finish_initial_spills(aggstate);
+
/* no more outer-plan tuples available */
if (hasGroupingSets)
{
@@ -1956,6 +2079,8 @@ agg_fill_hash_table(AggState *aggstate)
ResetExprContext(aggstate->tmpcontext);
}
+ hash_finish_initial_spills(aggstate);
+
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
@@ -1963,11 +2088,136 @@ agg_fill_hash_table(AggState *aggstate)
&aggstate->perhash[0].hashiter);
}
+/*
+ * If any data was spilled during hash aggregation, reset the hash table and
+ * reprocess one batch of spilled data. After reprocessing a batch, the hash
+ * table will again contain data, ready to be consumed by
+ * agg_retrieve_hash_table_in_memory().
+ *
+ * Should only be called after all in memory hash table entries have been
+ * consumed.
+ *
+ * Return false when input is exhausted and there's no more work to be done;
+ * otherwise return true.
+ */
+static bool
+agg_refill_hash_table(AggState *aggstate)
+{
+ HashAggBatch *batch;
+ AggStatePerGroup *pergroup;
+
+ if (aggstate->hash_batches == NIL)
+ return false;
+
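+ /*
+ * NULL out the per-group states for the sort-based phases (everything in
+ * all_pergroups ahead of hash_pergroup); the pergroup_allaggs NULL checks
+ * in execExprInterp.c and llvmjit_expr.c make the transition steps skip
+ * them while spilled tuples are re-aggregated.
+ */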
+ pergroup = aggstate->all_pergroups;
+ while (pergroup != aggstate->hash_pergroup)
+ {
+ *pergroup = NULL;
+ pergroup++;
+ }
+
+ /* free memory */
+ ReScanExprContext(aggstate->hashcontext);
+ /* Rebuild an empty hash table */
+ build_hash_table(aggstate);
+
+ batch = linitial(aggstate->hash_batches);
+ aggstate->hash_batches = list_delete_first(aggstate->hash_batches);
+
+ Assert(aggstate->current_phase == 0);
+
+ /*
+ * TODO: what should be done here to set up for advance_aggregates?
+ */
+ if (aggstate->phase->aggstrategy == AGG_MIXED)
+ {
+ aggstate->current_phase = 1;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+ }
+
+ for (;;)
+ {
+ TupleTableSlot *slot = aggstate->hash_spill_slot;
+ MinimalTuple tuple;
+ uint32 hash;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tuple = hash_read_spilled(batch->input_file, &hash);
+ if (tuple == NULL)
+ break;
+
+ /*
+ * TODO: Should we re-compile the expressions to use a minimal tuple
+ * slot so that we don't have to create the virtual tuple here? If we
+ * project the tuple before writing, then perhaps this is not
+ * important.
+ */
+ ExecForceStoreMinimalTuple(tuple, slot, true);
+ aggstate->tmpcontext->ecxt_outertuple = slot;
+
+ /* Find or build hashtable entries */
+ memset(aggstate->hash_pergroup, 0,
+ sizeof(AggStatePerGroup) * aggstate->num_hashes);
+ select_current_set(aggstate, batch->setno, true);
+ prepare_hash_slot(aggstate);
+ aggstate->hash_pergroup[batch->setno] = lookup_hash_entry(aggstate, hash);
+ if (aggstate->hash_pergroup[batch->setno] == NULL)
+ aggstate->hash_disk_used += hash_spill_tuple(
+ &batch->spill, batch->input_bits, slot, hash);
+
+ /* Advance the aggregates (or combine functions) */
+ advance_aggregates(aggstate);
+
+ /*
+ * Reset per-input-tuple context after each tuple, but note that the
+ * hash lookups do this too
+ */
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ BufFileClose(batch->input_file);
+
+ aggstate->current_phase = 0;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+
+ hash_spill_finish(aggstate, &batch->spill, batch->setno,
+ batch->input_bits);
+
+ pfree(batch);
+
+ /* Initialize to walk the first hash table */
+ select_current_set(aggstate, 0, true);
+ ResetTupleHashIterator(aggstate->perhash[0].hashtable,
+ &aggstate->perhash[0].hashiter);
+
+ return true;
+}
+
/*
* ExecAgg for hashed case: retrieving groups from hash table
*/
static TupleTableSlot *
agg_retrieve_hash_table(AggState *aggstate)
+{
+ TupleTableSlot *result = NULL;
+
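+ /*
+ * Drain the in-memory hash table; once it is exhausted, refill it from
+ * the next spilled batch. We are only done when no batches remain.
+ */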
+ while (result == NULL)
+ {
+ result = agg_retrieve_hash_table_in_memory(aggstate);
+ if (result == NULL)
+ {
+ if (!agg_refill_hash_table(aggstate))
+ {
+ aggstate->agg_done = true;
+ break;
+ }
+ }
+ }
+
+ return result;
+}
+
+static TupleTableSlot *
+agg_retrieve_hash_table_in_memory(AggState *aggstate)
{
ExprContext *econtext;
AggStatePerAgg peragg;
@@ -1996,7 +2246,7 @@ agg_retrieve_hash_table(AggState *aggstate)
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
- while (!aggstate->agg_done)
+ for (;;)
{
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -2027,8 +2277,6 @@ agg_retrieve_hash_table(AggState *aggstate)
}
else
{
- /* No more hashtables, so done */
- aggstate->agg_done = true;
return NULL;
}
}
@@ -2085,6 +2333,276 @@ agg_retrieve_hash_table(AggState *aggstate)
return NULL;
}
+/*
+ * hash_spill_tuple
+ *
+ * Not enough memory to add tuple as new entry in hash table. Save for later
+ * in the appropriate partition.
+ */
+static Size
+hash_spill_tuple(HashAggSpill *spill, int input_bits, TupleTableSlot *slot,
+ uint32 hash)
+{
+ int partition;
+ MinimalTuple tuple;
+ BufFile *file;
+ int written;
+ int total_written = 0;
+ bool shouldFree;
+
+ /* initialize output partitions */
+ if (spill->partitions == NULL)
+ {
+ int npartitions;
+ int partition_bits;
+
+ /* TODO: be smarter */
+ npartitions = 32;
+
+ partition_bits = my_log2(npartitions);
+
+ /* make sure that we don't exhaust the hash bits */
+ if (partition_bits + input_bits >= 32)
+ partition_bits = 32 - input_bits;
+
+ /* number of partitions will be a power of two */
+ npartitions = 1L << partition_bits;
+
+ spill->partition_bits = partition_bits;
+ spill->n_partitions = npartitions;
+ spill->partitions = palloc0(sizeof(BufFile *) * npartitions);
+ spill->ntuples = palloc0(sizeof(int64) * npartitions);
+ }
+
+ /*
+ * TODO: should we project only needed attributes from the tuple before
+ * writing it?
+ */
+ tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
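+ /*
+ * The partition is chosen from the hash bits just below the input_bits
+ * already consumed by parent batches, starting from the most-significant
+ * end. For example (hypothetical sizes), with partition_bits = 5 the
+ * initial pass (input_bits = 0) uses the top 5 bits of the hash, and a
+ * child batch (input_bits = 5) uses the next 5 bits below those.
+ */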
+ if (spill->partition_bits == 0)
+ partition = 0;
+ else
+ partition = (hash << input_bits) >>
+ (32 - spill->partition_bits);
+
+ spill->ntuples[partition]++;
+
+ /*
+ * TODO: use logtape.c instead?
+ */
+ if (spill->partitions[partition] == NULL)
+ spill->partitions[partition] = BufFileCreateTemp(false);
+ file = spill->partitions[partition];
+
+ written = BufFileWrite(file, (void *) &hash, sizeof(uint32));
+ if (written != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+ total_written += written;
+
+ written = BufFileWrite(file, (void *) tuple, tuple->t_len);
+ if (written != tuple->t_len)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+ total_written += written;
+
+ if (shouldFree)
+ pfree(tuple);
+
+ return total_written;
+}
+
+/*
+ * hash_read_spilled
+ * read the next tuple from a batch file. Return NULL if no more.
+ */
+static MinimalTuple
+hash_read_spilled(BufFile *file, uint32 *hashp)
+{
+ MinimalTuple tuple;
+ uint32 t_len;
+ size_t nread;
+ uint32 hash;
+
+ nread = BufFileRead(file, &hash, sizeof(uint32));
+ if (nread == 0)
+ return NULL;
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+ if (hashp != NULL)
+ *hashp = hash;
+
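+ /*
+ * The tuple was written to disk with its full MinimalTuple length up
+ * front, so t_len doubles as the on-disk length header: only the
+ * remaining t_len - sizeof(uint32) bytes follow it in the file.
+ */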
+ nread = BufFileRead(file, &t_len, sizeof(t_len));
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+
+ tuple = (MinimalTuple) palloc(t_len);
+ tuple->t_len = t_len;
+
+ nread = BufFileRead(file, (void *)((char *)tuple + sizeof(uint32)),
+ t_len - sizeof(uint32));
+ if (nread != t_len - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+
+ return tuple;
+}
+
+/*
+ * hash_batch_new
+ *
+ * Construct a HashAggBatch item, which represents one iteration of HashAgg to
+ * be done. Should be called in the aggregate's memory context.
+ */
+static HashAggBatch *
+hash_batch_new(BufFile *input_file, int setno, int64 input_groups,
+ int input_bits)
+{
+ HashAggBatch *batch = palloc0(sizeof(HashAggBatch));
+
+ batch->input_file = input_file;
+ batch->input_bits = input_bits;
+ batch->input_groups = input_groups;
+ batch->setno = setno;
+
+ /* batch->spill will be set only after spilling this batch */
+
+ return batch;
+}
+
+/*
+ * hash_finish_initial_spills
+ *
+ * After the initial pass over the input has spilled tuples to disk, turn
+ * each grouping set's spilled partitions into new batches that must later
+ * be executed.
+ */
+static void
+hash_finish_initial_spills(AggState *aggstate)
+{
+ int setno;
+
+ if (aggstate->hash_spills == NULL)
+ return;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hash_spill_finish(aggstate, &aggstate->hash_spills[setno], setno, 0);
+
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+}
+
+/*
+ * hash_spill_finish
+ *
+ * Transform the partition files of a HashAggSpill into a list of
+ * HashAggBatch items to be processed in later passes, then release the
+ * spill's bookkeeping arrays.
+ */
+static void
+hash_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno, int input_bits)
+{
+ int i;
+
+ if (spill->n_partitions == 0)
+ return; /* didn't spill */
+
+ for (i = 0; i < spill->n_partitions; i++)
+ {
+ BufFile *file = spill->partitions[i];
+ MemoryContext oldContext;
+ HashAggBatch *new_batch;
+ int64 input_ngroups;
+
+ /* partition is empty */
+ if (file == NULL)
+ continue;
+
+ /* rewind file for reading */
+ if (BufFileSeek(file, 0, 0L, SEEK_SET))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rewind HashAgg temporary file: %m")));
+
+ /*
+ * Estimate the number of input groups for this new work item as the
+ * total number of tuples in its input file. Although that's a worst
+ * case, it's not bad here for two reasons: (1) overestimating is
+ * better than underestimating; and (2) we've already scanned the
+ * relation once, so it's likely that we've already finalized many of
+ * the common values.
+ */
+ input_ngroups = spill->ntuples[i];
+
+ oldContext = MemoryContextSwitchTo(aggstate->ss.ps.state->es_query_cxt);
+ new_batch = hash_batch_new(file, setno, input_ngroups,
+ spill->partition_bits + input_bits);
+ aggstate->hash_batches = lappend(aggstate->hash_batches, new_batch);
+ aggstate->hash_batches_used++;
+ MemoryContextSwitchTo(oldContext);
+ }
+
+ pfree(spill->ntuples);
+ pfree(spill->partitions);
+}
+
+/*
+ * Clear a HashAggSpill, free its memory, and close its files.
+ */
+static void
+hash_reset_spill(HashAggSpill *spill)
+{
+ int i;
+ for (i = 0; i < spill->n_partitions; i++)
+ {
+ BufFile *file = spill->partitions[i];
+
+ if (file != NULL)
+ BufFileClose(file);
+ }
+ if (spill->ntuples != NULL)
+ pfree(spill->ntuples);
+ if (spill->partitions != NULL)
+ pfree(spill->partitions);
+}
+
+/*
+ * Find and reset all active HashAggSpills.
+ */
+static void
+hash_reset_spills(AggState *aggstate)
+{
+ ListCell *lc;
+
+ if (aggstate->hash_spills != NULL)
+ {
+ int setno;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hash_reset_spill(&aggstate->hash_spills[setno]);
+
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+ }
+
+ foreach(lc, aggstate->hash_batches)
+ {
+ HashAggBatch *batch = (HashAggBatch*) lfirst(lc);
+ hash_reset_spill(&batch->spill);
+ pfree(batch);
+ }
+ list_free(aggstate->hash_batches);
+ aggstate->hash_batches = NIL;
+}
+
+
/* -----------------
* ExecInitAgg
*
@@ -2269,6 +2787,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->ss.ps.outeropsfixed = false;
}
+ if (use_hashing)
+ aggstate->hash_spill_slot = ExecInitExtraTupleSlot(estate, scanDesc,
+ &TTSOpsVirtual);
+
/*
* Initialize result type, slot and projection.
*/
@@ -3399,6 +3921,8 @@ ExecEndAgg(AggState *node)
if (node->sort_out)
tuplesort_end(node->sort_out);
+ hash_reset_spills(node);
+
for (transno = 0; transno < node->numtrans; transno++)
{
AggStatePerTrans pertrans = &node->pertrans[transno];
@@ -3454,12 +3978,13 @@ ExecReScanAgg(AggState *node)
return;
/*
- * If we do have the hash table, and the subplan does not have any
- * parameter changes, and none of our own parameter changes affect
- * input expressions of the aggregated functions, then we can just
- * rescan the existing hash table; no need to build it again.
+ * If we do have the hash table, and it never spilled, and the subplan
+ * does not have any parameter changes, and none of our own parameter
+ * changes affect input expressions of the aggregated functions, then
+ * we can just rescan the existing hash table; no need to build it
+ * again.
*/
- if (outerPlan->chgParam == NULL &&
+ if (outerPlan->chgParam == NULL && !node->hash_spilled &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
ResetTupleHashIterator(node->perhash[0].hashtable,
@@ -3516,6 +4041,16 @@ ExecReScanAgg(AggState *node)
*/
if (node->aggstrategy == AGG_HASHED || node->aggstrategy == AGG_MIXED)
{
+ hash_reset_spills(node);
+
+ /* reset stats */
+ node->hash_spilled = false;
+ node->hash_mem_init = 0;
+ node->hash_mem_peak = 0;
+ node->hash_mem_current = 0;
+ node->hash_disk_used = 0;
+ node->hash_batches_used = 0;
+
ReScanExprContext(node->hashcontext);
/* Rebuild an empty hash table */
build_hash_table(node);
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 30133634c7..14764e9c1d 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -2094,12 +2094,14 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_allpergroupsp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_setoff,
v_transno;
LLVMValueRef v_notransvalue;
+ LLVMBasicBlockRef b_check_notransvalue;
LLVMBasicBlockRef b_init;
aggstate = op->d.agg_init_trans.aggstate;
@@ -2121,11 +2123,22 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_init_trans.setoff);
v_transno = l_int32_const(op->d.agg_init_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+ b_check_notransvalue = l_bb_before_v(
+ opblocks[i + 1], "op.%d.check_notransvalue", i);
+
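+ /*
+ * If this set's pergroup pointer is NULL (the group was spilled, or the
+ * set is not being advanced in this pass), skip straight to the next
+ * opcode; this mirrors the pergroup_allaggs check in ExecInterpExpr().
+ */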
+ LLVMBuildCondBr(b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[i + 1],
+ b_check_notransvalue);
+
+ LLVMPositionBuilderAtEnd(b, b_check_notransvalue);
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_notransvalue =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_NOTRANSVALUE,
@@ -2192,6 +2205,9 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_transnull;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
+
+ LLVMBasicBlockRef b_check_transnull;
int jumpnull = op->d.agg_strict_trans_check.jumpnull;
@@ -2211,11 +2227,22 @@ llvm_compile_expr(ExprState *state)
l_int32_const(op->d.agg_strict_trans_check.setoff);
v_transno =
l_int32_const(op->d.agg_strict_trans_check.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ b_check_transnull = l_bb_before_v(opblocks[i + 1],
+ "op.%d.check_transnull", i);
+ LLVMBuildCondBr(b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[jumpnull],
+ b_check_transnull);
+
+ LLVMPositionBuilderAtEnd(b, b_check_transnull);
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_transnull =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_TRANSVALUEISNULL,
@@ -2257,12 +2284,15 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_pertransp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_retval;
LLVMValueRef v_tmpcontext;
LLVMValueRef v_oldcontext;
+ LLVMBasicBlockRef b_advance_transval;
+
aggstate = op->d.agg_trans.aggstate;
pertrans = op->d.agg_trans.pertrans;
@@ -2284,10 +2314,22 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_trans.setoff);
v_transno = l_int32_const(op->d.agg_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ b_advance_transval = l_bb_before_v(opblocks[i + 1],
+ "op.%d.advance_transval", i);
+
+ LLVMBuildCondBr(b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[i + 1],
+ b_advance_transval);
+
+ LLVMPositionBuilderAtEnd(b, b_advance_transval);
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_fcinfo = l_ptr_const(fcinfo,
l_ptr(StructFunctionCallInfoData));
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c5f6593485..3f0d289963 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
bool enable_hashagg = true;
+bool enable_hashagg_spill = true;
bool enable_nestloop = true;
bool enable_material = true;
bool enable_mergejoin = true;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 17c5f086fb..93b4fa1c5b 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4257,8 +4257,8 @@ consider_groupingsets_paths(PlannerInfo *root,
* with. Override work_mem in that case; otherwise, we'll rely on the
* sorted-input case to generate usable mixed paths.
*/
- if (hashsize > work_mem * 1024L && gd->rollups)
- return; /* nope, won't fit */
+ if (!enable_hashagg_spill && hashsize > work_mem * 1024L && gd->rollups)
+ return; /* nope, won't fit */
/*
* We need to burst the existing rollups list into individual grouping
@@ -6528,7 +6528,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
* were unable to sort above, then we'd better generate a Path, so
* that we at least have one.
*/
- if (hashaggtablesize < work_mem * 1024L ||
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L ||
grouped_rel->pathlist == NIL)
{
/*
@@ -6561,7 +6562,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
agg_final_costs,
dNumGroups);
- if (hashaggtablesize < work_mem * 1024L)
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6830,7 +6832,7 @@ create_partial_grouping_paths(PlannerInfo *root,
* Tentatively produce a partial HashAgg Path, depending on if it
* looks as if the hash table will fit in work_mem.
*/
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_total_path != NULL)
{
add_path(partially_grouped_rel, (Path *)
@@ -6857,7 +6859,7 @@ create_partial_grouping_paths(PlannerInfo *root,
dNumPartialPartialGroups);
/* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_partial_path != NULL)
{
add_partial_path(partially_grouped_rel, (Path *)
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 3bf96de256..b0cb1d7e6b 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -120,6 +120,7 @@ bool enableFsync = true;
bool allowSystemTableMods = false;
int work_mem = 1024;
int maintenance_work_mem = 16384;
+bool hashagg_mem_overflow = false;
int max_parallel_maintenance_workers = 2;
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 90ffd89339..a4b8efb848 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -950,6 +950,26 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_hashagg_spill", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of hashed aggregation plans that are expected to exceed work_mem."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_hashagg_spill,
+ true,
+ NULL, NULL, NULL
+ },
+ {
+ {"hashagg_mem_overflow", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables hashed aggregation to overflow work_mem at execution time."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &hashagg_mem_overflow,
+ false,
+ NULL, NULL, NULL
+ },
{
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of materialization."),
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 6b63d6f85d..d49e8d40bd 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -458,6 +458,8 @@ AllocSetContextCreateInternal(MemoryContext parent,
parent,
name);
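+ /*
+ * Start the accounting with what has already been allocated for the
+ * context itself: the (MAXALIGNed) header plus its keeper block.
+ */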
+ ((MemoryContext) set)->mem_allocated = set->keeper->endptr - ((char *)set->keeper) + MAXALIGN(sizeof(AllocSetContext));
+
return (MemoryContext) set;
}
}
@@ -546,6 +548,8 @@ AllocSetContextCreateInternal(MemoryContext parent,
parent,
name);
+ ((MemoryContext) set)->mem_allocated = set->keeper->endptr - ((char *)set->keeper) + MAXALIGN(sizeof(AllocSetContext));
+
return (MemoryContext) set;
}
@@ -604,6 +608,8 @@ AllocSetReset(MemoryContext context)
else
{
/* Normal case, release the block */
+ context->mem_allocated -= block->endptr - ((char*) block);
+
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
@@ -688,11 +694,16 @@ AllocSetDelete(MemoryContext context)
#endif
if (block != set->keeper)
+ {
+ context->mem_allocated -= block->endptr - ((char *) block);
free(block);
+ }
block = next;
}
+ Assert(context->mem_allocated == 0);
+
/* Finally, free the context header, including the keeper block */
free(set);
}
@@ -733,6 +744,9 @@ AllocSetAlloc(MemoryContext context, Size size)
block = (AllocBlock) malloc(blksize);
if (block == NULL)
return NULL;
+
+ context->mem_allocated += blksize;
+
block->aset = set;
block->freeptr = block->endptr = ((char *) block) + blksize;
@@ -928,6 +942,8 @@ AllocSetAlloc(MemoryContext context, Size size)
if (block == NULL)
return NULL;
+ context->mem_allocated += blksize;
+
block->aset = set;
block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
block->endptr = ((char *) block) + blksize;
@@ -1028,6 +1044,9 @@ AllocSetFree(MemoryContext context, void *pointer)
set->blocks = block->next;
if (block->next)
block->next->prev = block->prev;
+
+ context->mem_allocated -= block->endptr - ((char*) block);
+
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
@@ -1144,6 +1163,7 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
AllocBlock block = (AllocBlock) (((char *) chunk) - ALLOC_BLOCKHDRSZ);
Size chksize;
Size blksize;
+ Size oldblksize;
/*
* Try to verify that we have a sane block pointer: it should
@@ -1159,6 +1179,8 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
/* Do the realloc */
chksize = MAXALIGN(size);
blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+ oldblksize = block->endptr - ((char *)block);
+
block = (AllocBlock) realloc(block, blksize);
if (block == NULL)
{
@@ -1166,6 +1188,9 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN);
return NULL;
}
+
+ context->mem_allocated += blksize - oldblksize;
+
block->freeptr = block->endptr = ((char *) block) + blksize;
/* Update pointers since block has likely been moved */
@@ -1383,6 +1408,7 @@ AllocSetCheck(MemoryContext context)
const char *name = set->header.name;
AllocBlock prevblock;
AllocBlock block;
+ int64 total_allocated = 0;
for (prevblock = NULL, block = set->blocks;
block != NULL;
@@ -1393,6 +1419,10 @@ AllocSetCheck(MemoryContext context)
long blk_data = 0;
long nchunks = 0;
+ total_allocated += block->endptr - ((char *)block);
+ if (set->keeper == block)
+ total_allocated += MAXALIGN(sizeof(AllocSetContext));
+
/*
* Empty block - empty can be keeper-block only
*/
@@ -1479,6 +1509,8 @@ AllocSetCheck(MemoryContext context)
elog(WARNING, "problem in alloc set %s: found inconsistent memory block %p",
name, block);
}
+
+ Assert(total_allocated == context->mem_allocated);
}
#endif /* MEMORY_CONTEXT_CHECKING */
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index b07be12236..27417af548 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -462,6 +462,29 @@ MemoryContextIsEmpty(MemoryContext context)
return context->methods->is_empty(context);
}
+/*
+ * Find the memory allocated to blocks for this memory context. If recurse is
+ * true, also include children.
+ */
+int64
+MemoryContextMemAllocated(MemoryContext context, bool recurse)
+{
+ int64 total = context->mem_allocated;
+
+ AssertArg(MemoryContextIsValid(context));
+
+ if (recurse)
+ {
+ MemoryContext child;
+
+ for (child = context->firstchild;
+ child != NULL;
+ child = child->nextchild)
+ total += MemoryContextMemAllocated(child, true);
+ }
+
+ return total;
+}
+
/*
* MemoryContextStats
* Print statistics about the named context and all its descendants.
@@ -736,6 +759,7 @@ MemoryContextCreate(MemoryContext node,
node->methods = methods;
node->parent = parent;
node->firstchild = NULL;
+ node->mem_allocated = 0;
node->prevchild = NULL;
node->name = name;
node->ident = NULL;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index affe6ad698..9ea4d0558d 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -140,10 +140,15 @@ extern TupleHashTable BuildTupleHashTableExt(PlanState *parent,
extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
bool *isnew);
+extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable,
+ TupleTableSlot *slot,
+ bool *isnew, uint32 hash);
extern TupleHashEntry FindTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
ExprState *eqcomp,
FmgrInfo *hashfunctions);
+extern uint32 TupleHashTableHash(struct tuplehash_hash *tb,
+ const MinimalTuple tuple);
extern void ResetTupleHashTable(TupleHashTable hashtable);
/*
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bc6e03fbc7..321759ead5 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -244,6 +244,7 @@ extern bool enableFsync;
extern PGDLLIMPORT bool allowSystemTableMods;
extern PGDLLIMPORT int work_mem;
extern PGDLLIMPORT int maintenance_work_mem;
+extern PGDLLIMPORT bool hashagg_mem_overflow;
extern PGDLLIMPORT int max_parallel_maintenance_workers;
extern int VacuumCostPageHit;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f42189d2bf..246b64dedd 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2020,13 +2020,24 @@ typedef struct AggState
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
/* these fields are used in AGG_HASHED and AGG_MIXED modes: */
bool table_filled; /* hash table filled yet? */
- int num_hashes;
+ int num_hashes; /* number of hash tables active at once */
+ bool hash_spilled; /* any hash table ever spilled? */
+ struct HashAggSpill *hash_spills; /* HashAggSpill for each hash table;
+ exists only during first pass if spilled */
+ TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
+ uint64 hash_mem_init; /* initial hash table memory usage */
+ uint64 hash_mem_peak; /* peak hash table memory usage */
+ uint64 hash_mem_current; /* current hash table memory usage */
+ uint64 hash_disk_used; /* bytes of disk space used */
+ int hash_batches_used; /* batches used during entire execution */
+ List *hash_batches; /* hash batches remaining to be processed */
+
AggStatePerHash perhash; /* array of per-hashtable data */
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 34
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 43
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
@@ -2198,7 +2209,7 @@ typedef struct HashInstrumentation
int nbuckets_original; /* planned number of buckets */
int nbatch; /* number of batches at end of execution */
int nbatch_original; /* planned number of batches */
- size_t space_peak; /* speak memory usage in bytes */
+ size_t space_peak; /* peak memory usage in bytes */
} HashInstrumentation;
/* ----------------
diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
index dbae98d3d9..df0ae3625c 100644
--- a/src/include/nodes/memnodes.h
+++ b/src/include/nodes/memnodes.h
@@ -79,6 +79,7 @@ typedef struct MemoryContextData
/* these two fields are placed here to minimize alignment wastage: */
bool isReset; /* T = no space alloced since last reset */
bool allowInCritSection; /* allow palloc in critical section */
+ int64 mem_allocated; /* track memory allocated for this context */
const MemoryContextMethods *methods; /* virtual function table */
MemoryContext parent; /* NULL if no parent (toplevel context) */
MemoryContext firstchild; /* head of linked list of children */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b3d0b4f6fb..b72e2d0829 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -54,6 +54,7 @@ extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
extern PGDLLIMPORT bool enable_hashagg;
+extern PGDLLIMPORT bool enable_hashagg_spill;
extern PGDLLIMPORT bool enable_nestloop;
extern PGDLLIMPORT bool enable_material;
extern PGDLLIMPORT bool enable_mergejoin;
diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
index ffe6de536e..6a837bc990 100644
--- a/src/include/utils/memutils.h
+++ b/src/include/utils/memutils.h
@@ -82,6 +82,7 @@ extern void MemoryContextSetParent(MemoryContext context,
extern Size GetMemoryChunkSpace(void *pointer);
extern MemoryContext MemoryContextGetParent(MemoryContext context);
extern bool MemoryContextIsEmpty(MemoryContext context);
+extern int64 MemoryContextMemAllocated(MemoryContext context, bool recurse);
extern void MemoryContextStats(MemoryContext context);
extern void MemoryContextStatsDetail(MemoryContext context, int max_children);
extern void MemoryContextAllowInCriticalSection(MemoryContext context,
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index be4ddf86a4..8b64d15368 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2331,3 +2331,95 @@ explain (costs off)
-> Seq Scan on onek
(8 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+set work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------------
+ GroupAggregate
+ Group Key: ((g % 100000))
+ -> Sort
+ Sort Key: ((g % 100000))
+ -> Function Scan on generate_series g
+(5 rows)
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+set jit_above_cost to default;
+create table agg_group_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_group_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+-- Produce results with hash aggregation
+set enable_hashagg = true;
+set enable_sort = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 100000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+set jit_above_cost to default;
+create table agg_hash_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_hash_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+set enable_sort = true;
+set work_mem to default;
+-- Compare group aggregation results to hash aggregation results
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index c1f802c88a..71e6e2407a 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1549,22 +1549,18 @@ explain (costs off)
count(hundred), count(thousand), count(twothousand),
count(*)
from tenk1 group by grouping sets (unique1,twothousand,thousand,hundred,ten,four,two);
- QUERY PLAN
--------------------------------
- MixedAggregate
- Hash Key: two
- Hash Key: four
- Hash Key: ten
+ QUERY PLAN
+-------------------------
+ HashAggregate
+ Hash Key: unique1
+ Hash Key: twothousand
+ Hash Key: thousand
Hash Key: hundred
- Group Key: unique1
- Sort Key: twothousand
- Group Key: twothousand
- Sort Key: thousand
- Group Key: thousand
- -> Sort
- Sort Key: unique1
- -> Seq Scan on tenk1
-(13 rows)
+ Hash Key: ten
+ Hash Key: four
+ Hash Key: two
+ -> Seq Scan on tenk1
+(9 rows)
explain (costs off)
select unique1,
@@ -1572,18 +1568,16 @@ explain (costs off)
count(hundred), count(thousand), count(twothousand),
count(*)
from tenk1 group by grouping sets (unique1,hundred,ten,four,two);
- QUERY PLAN
--------------------------------
- MixedAggregate
- Hash Key: two
- Hash Key: four
- Hash Key: ten
+ QUERY PLAN
+-------------------------
+ HashAggregate
+ Hash Key: unique1
Hash Key: hundred
- Group Key: unique1
- -> Sort
- Sort Key: unique1
- -> Seq Scan on tenk1
-(9 rows)
+ Hash Key: ten
+ Hash Key: four
+ Hash Key: two
+ -> Seq Scan on tenk1
+(7 rows)
set work_mem = '384kB';
explain (costs off)
@@ -1592,21 +1586,18 @@ explain (costs off)
count(hundred), count(thousand), count(twothousand),
count(*)
from tenk1 group by grouping sets (unique1,twothousand,thousand,hundred,ten,four,two);
- QUERY PLAN
--------------------------------
- MixedAggregate
- Hash Key: two
- Hash Key: four
- Hash Key: ten
- Hash Key: hundred
+ QUERY PLAN
+-------------------------
+ HashAggregate
+ Hash Key: unique1
+ Hash Key: twothousand
Hash Key: thousand
- Group Key: unique1
- Sort Key: twothousand
- Group Key: twothousand
- -> Sort
- Sort Key: unique1
- -> Seq Scan on tenk1
-(12 rows)
+ Hash Key: hundred
+ Hash Key: ten
+ Hash Key: four
+ Hash Key: two
+ -> Seq Scan on tenk1
+(9 rows)
-- check collation-sensitive matching between grouping expressions
-- (similar to a check for aggregates, but there are additional code
@@ -1633,4 +1624,123 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
| 1 | 2
(4 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+-- Produce results with hash aggregation.
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------
+ MixedAggregate
+ Hash Key: (g.g % 1000), (g.g % 100), (g.g % 10)
+ Hash Key: (g.g % 1000), (g.g % 100)
+ Hash Key: (g.g % 1000)
+ Hash Key: (g.g % 100), (g.g % 10)
+ Hash Key: (g.g % 100)
+ Hash Key: (g.g % 10), (g.g % 1000)
+ Hash Key: (g.g % 10)
+ Group Key: ()
+ -> Function Scan on generate_series g
+(10 rows)
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+set enable_sort = true;
+set work_mem to default;
+-- Compare results
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+ g100 | g10 | unnest | c | m
+------+-----+--------+---+---
+(0 rows)
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
-- end
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..11c6f50fbf 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -148,6 +148,68 @@ SELECT count(*) FROM
4
(1 row)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+SET enable_hashagg=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------------
+ Unique
+ -> Sort
+ Sort Key: ((g % 1000))
+ -> Function Scan on generate_series g
+(4 rows)
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_hashagg=TRUE;
+-- Produce results with hash aggregation.
+SET enable_sort=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 1000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_sort=TRUE;
+SET work_mem TO DEFAULT;
+-- Compare results
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+(SELECT * FROM distinct_hash_2 EXCEPT SELECT * FROM distinct_group_2)
+ UNION ALL
+(SELECT * FROM distinct_group_2 EXCEPT SELECT * FROM distinct_hash_2);
+ text
+------
+(0 rows)
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..c40bf6c16e 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -75,6 +75,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
+ enable_hashagg_spill | on
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 17fb256aec..bcd336c581 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1017,3 +1017,91 @@ select v||'a', case when v||'a' = 'aa' then 1 else 0 end, count(*)
explain (costs off)
select 1 from tenk1
where (hundred, thousand) in (select twothousand, twothousand from onek);
+
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+set work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+set jit_above_cost to default;
+
+create table agg_group_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_group_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+-- Produce results with hash aggregation
+
+set enable_hashagg = true;
+set enable_sort = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+set jit_above_cost to default;
+
+create table agg_hash_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_hash_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare group aggregation results to hash aggregation results
+
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 95ac3fb52f..bf8bce6ed3 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -441,4 +441,103 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
from unnest(array[1,1], array['a','b']) u(i,v)
group by rollup(i, v||'a') order by 1,3;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+-- Produce results with hash aggregation.
+
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare results
+
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
+
-- end
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..33102744eb 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -45,6 +45,68 @@ SELECT count(*) FROM
SELECT count(*) FROM
(SELECT DISTINCT two, four, two FROM tenk1) ss;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+SET enable_hashagg=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_hashagg=TRUE;
+
+-- Produce results with hash aggregation.
+
+SET enable_sort=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_sort=TRUE;
+
+SET work_mem TO DEFAULT;
+
+-- Compare results
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+(SELECT * FROM distinct_hash_2 EXCEPT SELECT * FROM distinct_group_2)
+ UNION ALL
+(SELECT * FROM distinct_group_2 EXCEPT SELECT * FROM distinct_hash_2);
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
+
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
--
2.17.1
On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote:
> I started to review this patch yesterday with Melanie Plageman, so we
> rebased this patch over the current master. The main conflicts were
> due to a simplehash patch that has been committed separately[1]. I've
> attached the rebased patch.

Great, thanks!

> I was playing with the code, and if one of the table's most common
> values isn't placed into the initial hash table it spills a whole lot
> of tuples to disk that might have been avoided if we had some way to
> 'seed' the hash table with MCVs from the statistics. Seems to me that
> you would need some way of dealing with values that are in the MCV
> list, but ultimately don't show up in the scan. I imagine that this
> kind of optimization would be most useful for aggregates on a full
> table scan.

Interesting idea, I didn't think of that.

> Some questions:
>
> Right now the patch always initializes 32 spill partitions. Have you
> given any thought into how to intelligently pick an optimal number of
> partitions yet?

Yes. The idea is to guess how many groups are remaining, then guess how
much space they will need in memory, then divide by work_mem. I just
didn't get around to it yet. (Same with the costing work.)

> By add-on approach, do you mean to say that you have something in
> mind to combine the two strategies? Or do you mean that it could be
> implemented as a separate strategy?

It would be an extension of the existing patch, but would add a fair
amount of complexity (dealing with partial states, etc.) and the
benefit would be fairly modest. We can do it later if justified.

> That said, I think it's worth mentioning that with parallel
> aggregates it might actually be more useful to spill the trans
> values instead, and have them combined in a Gather or Finalize
> stage.

That's a good point.
Regards,
Jeff Davis
On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote:
> Right now the patch always initializes 32 spill partitions. Have you
> given any thought into how to intelligently pick an optimal number of
> partitions yet?

Attached a new patch that addresses this.
1. Divide hash table memory used by the number of groups in the hash
table to get the average memory used per group.
2. Multiply by the number of groups spilled -- which I pessimistically
estimate as the number of tuples spilled -- to get the total amount of
memory that we'd like to have to process all spilled tuples at once.
3. Divide the desired amount of memory by work_mem to get the number of
partitions we'd like to have such that each partition can be processed
in work_mem without spilling.
4. Apply a few sanity checks, fudge factors, and limits.
Using this runtime information should be substantially better than
using estimates and projections.
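
To make that concrete, here is a rough sketch of the calculation (the
function and parameter names below are made up for illustration; the
real logic lives in hash_spill_npartitions() in the attached patch,
which additionally caps the partition count so that the partitions'
file buffers don't consume more than a quarter of work_mem):

    /* illustrative sketch only, not the patch's actual code */
    static int
    choose_spill_partitions(double hash_mem, double hash_ngroups,
                            double tuples_spilled, double work_mem_bytes)
    {
        /* step 1: average memory used per group so far */
        double entry_size = (hash_ngroups > 0) ? hash_mem / hash_ngroups : 0.0;

        /* step 2: memory wanted to hold all spilled groups at once,
         * pessimistically treating every spilled tuple as a new group,
         * with a 1.5x headroom factor */
        double mem_wanted = 1.5 * tuples_spilled * entry_size;

        /* step 3: number of work_mem-sized partitions that implies */
        int npartitions = 1 + (int) (mem_wanted / work_mem_bytes);

        /* step 4: clamp to the limits used in the patch (4..256) */
        if (npartitions < 4)
            npartitions = 4;
        if (npartitions > 256)
            npartitions = 256;

        return npartitions;
    }

For example, if the hash table has been using about 80 bytes per group
and one million tuples were spilled, the sketch asks for roughly 120MB;
with work_mem at 4MB that works out to about 30 partitions before the
clamps are applied.
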
Additionally, I removed some branches from the common path. I think I
still have more work to do there.
I also rebased of course, and fixed a few other things.
Regards,
Jeff Davis
Attachments:
hashagg-20191127.diff (text/x-patch; charset=UTF-8)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d4d1fe45cc1..6ddbadb2abd 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1753,6 +1753,23 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-hashagg-mem-overflow" xreflabel="hashagg_mem_overflow">
+ <term><varname>hashagg_mem_overflow</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>hashagg_mem_overflow</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ If hash aggregation exceeds <varname>work_mem</varname> at query
+ execution time, and <varname>hashagg_mem_overflow</varname> is set
+ to <literal>on</literal>, continue consuming more memory rather than
+ performing disk-based hash aggregation. The default
+ is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
@@ -4453,6 +4470,24 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-hashagg-spill" xreflabel="enable_hashagg_spill">
+ <term><varname>enable_hashagg_spill</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_hashagg_spill</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the memory usage is expected to
+ exceed <varname>work_mem</varname>. This only affects the planner
+ choice; actual behavior at execution time is dictated by
+ <xref linkend="guc-hashagg-mem-overflow"/>. The default
+ is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
<term><varname>enable_hashjoin</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 62fb3434a32..092a79ea14f 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -102,6 +102,7 @@ static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_hashagg_info(AggState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
static void show_instrumentation_count(const char *qlabel, int which,
@@ -1826,6 +1827,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Agg:
show_agg_keys(castNode(AggState, planstate), ancestors, es);
show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+ if (es->analyze)
+ show_hashagg_info((AggState *) planstate, es);
if (plan->qual)
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
@@ -2715,6 +2718,56 @@ show_hash_info(HashState *hashstate, ExplainState *es)
}
}
+/*
+ * If EXPLAIN ANALYZE, show information on hash aggregate memory usage and
+ * batches.
+ */
+static void
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+ long diskKb = (aggstate->hash_disk_used + 1023) / 1024;
+
+
+ Assert(IsA(aggstate, AggState));
+
+ if (agg->aggstrategy != AGG_HASHED &&
+ agg->aggstrategy != AGG_MIXED)
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(
+ es->str,
+ "Memory Usage: %ldkB",
+ memPeakKb);
+
+ if (aggstate->hash_batches_used > 0)
+ {
+ appendStringInfo(
+ es->str,
+ " Batches: %d Disk Usage:%ldkB",
+ aggstate->hash_batches_used, diskKb);
+ }
+
+ appendStringInfo(
+ es->str,
+ "\n");
+ }
+ else
+ {
+ ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
+ if (aggstate->hash_batches_used > 0)
+ {
+ ExplainPropertyInteger("HashAgg Batches", NULL,
+ aggstate->hash_batches_used, es);
+ ExplainPropertyInteger("Disk Usage", "kB", diskKb, es);
+ }
+ }
+}
+
/*
* If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
*/
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index dbed5978162..07ac8e96fdf 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -1603,14 +1603,14 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate;
AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
aggstate = op->d.agg_init_trans.aggstate;
- pergroup = &aggstate->all_pergroups
- [op->d.agg_init_trans.setoff]
- [op->d.agg_init_trans.transno];
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_init_trans.setoff];
+ pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
/* If transValue has not yet been initialized, do so now. */
- if (pergroup->noTransValue)
+ if (pergroup_allaggs != NULL && pergroup->noTransValue)
{
AggStatePerTrans pertrans = op->d.agg_init_trans.pertrans;
@@ -1631,13 +1631,14 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate;
AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
aggstate = op->d.agg_strict_trans_check.aggstate;
- pergroup = &aggstate->all_pergroups
- [op->d.agg_strict_trans_check.setoff]
- [op->d.agg_strict_trans_check.transno];
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_strict_trans_check.setoff];
+ pergroup = &pergroup_allaggs[op->d.agg_strict_trans_check.transno];
- if (unlikely(pergroup->transValueIsNull))
+ if (pergroup_allaggs == NULL ||
+ unlikely(pergroup->transValueIsNull))
EEO_JUMP(op->d.agg_strict_trans_check.jumpnull);
EEO_NEXT();
@@ -1653,6 +1654,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
AggState *aggstate;
AggStatePerTrans pertrans;
AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
FunctionCallInfo fcinfo;
MemoryContext oldContext;
Datum newVal;
@@ -1660,9 +1662,11 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
aggstate = op->d.agg_trans.aggstate;
pertrans = op->d.agg_trans.pertrans;
- pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
- [op->d.agg_trans.transno];
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
Assert(pertrans->transtypeByVal);
@@ -1704,6 +1708,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
AggState *aggstate;
AggStatePerTrans pertrans;
AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
FunctionCallInfo fcinfo;
MemoryContext oldContext;
Datum newVal;
@@ -1711,9 +1716,11 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
aggstate = op->d.agg_trans.aggstate;
pertrans = op->d.agg_trans.pertrans;
- pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
- [op->d.agg_trans.transno];
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
Assert(!pertrans->transtypeByVal);
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 7bc5e405bcc..7c831831b5d 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -25,7 +25,6 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
-static uint32 TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple);
static int TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
/*
@@ -299,6 +298,28 @@ ResetTupleHashTable(TupleHashTable hashtable)
TupleHashEntry
LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
bool *isnew)
+{
+ MemoryContext oldContext;
+ uint32 hash;
+
+ /* Need to run the hash functions in short-lived context */
+ oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
+
+ /* set up data needed by hash and match functions */
+ hashtable->inputslot = slot;
+ hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ hashtable->cur_eq_func = hashtable->tab_eq_func;
+
+ hash = TupleHashTableHash(hashtable->hashtab, NULL);
+
+ MemoryContextSwitchTo(oldContext);
+
+ return LookupTupleHashEntryHash(hashtable, slot, isnew, hash);
+}
+
+TupleHashEntry
+LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot,
+ bool *isnew, uint32 hash)
{
TupleHashEntryData *entry;
MemoryContext oldContext;
@@ -317,7 +338,7 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
if (isnew)
{
- entry = tuplehash_insert(hashtable->hashtab, key, &found);
+ entry = tuplehash_insert_hash(hashtable->hashtab, key, hash, &found);
if (found)
{
@@ -337,7 +358,7 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
}
else
{
- entry = tuplehash_lookup(hashtable->hashtab, key);
+ entry = tuplehash_lookup_hash(hashtable->hashtab, key, hash);
}
MemoryContextSwitchTo(oldContext);
@@ -382,17 +403,12 @@ FindTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
/*
* Compute the hash value for a tuple
*
- * The passed-in key is a pointer to TupleHashEntryData. In an actual hash
- * table entry, the firstTuple field points to a tuple (in MinimalTuple
- * format). LookupTupleHashEntry sets up a dummy TupleHashEntryData with a
- * NULL firstTuple field --- that cues us to look at the inputslot instead.
- * This convention avoids the need to materialize virtual input tuples unless
- * they actually need to get copied into the table.
+ * If tuple is NULL, use the input slot instead.
*
* Also, the caller must select an appropriate memory context for running
* the hash functions. (dynahash.c doesn't change CurrentMemoryContext.)
*/
-static uint32
+uint32
TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
{
TupleHashTable hashtable = (TupleHashTable) tb->private_data;
@@ -413,9 +429,6 @@ TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
{
/*
* Process a tuple already stored in the table.
- *
- * (this case never actually occurs due to the way simplehash.h is
- * used, as the hash-value is stored in the entries)
*/
slot = hashtable->tableslot;
ExecStoreMinimalTuple(tuple, slot, false);
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 6ee24eab3d2..a70151cf7da 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -194,6 +194,18 @@
* transition values. hashcontext is the single context created to support
* all hash tables.
*
+ * When the hash table memory exceeds work_mem, we advance the transition
+ * states only for groups already in the hash table. For tuples that would
+ * need to create new hash table entries (and initialize new transition
+ * states), we spill them to disk to be processed later. The tuples are
+ * spilled in a partitioned manner, so that subsequent batches are smaller
+ * and less likely to exceed work_mem (if a batch does exceed work_mem, it
+ * must be spilled recursively).
+ *
+ * Note that it's possible for transition states to start small but then
+ * grow very large; for instance in the case of ARRAY_AGG. In such cases,
+ * it's still possible to significantly exceed work_mem.
+ *
* Transition / Combine function invocation:
*
* For performance reasons transition functions, including combine
@@ -229,15 +241,65 @@
#include "optimizer/optimizer.h"
#include "parser/parse_agg.h"
#include "parser/parse_coerce.h"
+#include "storage/buffile.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/datum.h"
+#include "utils/dynahash.h"
#include "utils/expandeddatum.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+/*
+ * Control how many partitions are created when spilling HashAgg to
+ * disk.
+ *
+ * HASH_PARTITION_FACTOR is multiplied by the estimated number of partitions
+ * needed such that each partition will fit in memory. The factor is set
+ * higher than one because there's not a high cost to having a few too many
+ * partitions, and it makes it less likely that a partition will need to be
+ * spilled recursively. Another benefit of having more, smaller partitions is
+ * that small hash tables may perform better than large ones due to memory
+ * caching effects.
+ *
+ * HASH_PARTITION_MEM is the approximate amount of work_mem we should reserve
+ * for the partitions themselves (i.e. buffering of the files backing the
+ * partitions). This is an estimate, because we choose the number of
+ * partitions at the time we need to spill, and because this algorithm
+ * shouldn't depend too directly on the internal memory needs of a BufFile.
+ */
+#define HASH_PARTITION_FACTOR 1.50
+#define HASH_MIN_PARTITIONS 4
+#define HASH_MAX_PARTITIONS 256
+#define HASH_PARTITION_MEM (HASH_MIN_PARTITIONS * BLCKSZ)
+
+/*
+ * Represents partitioned spill data for a single hashtable.
+ */
+typedef struct HashAggSpill
+{
+ int n_partitions; /* number of output partitions */
+ int partition_bits; /* number of bits for partition mask
+ log2(n_partitions) parent partition bits */
+ BufFile **partitions; /* output partition files */
+ int64 *ntuples; /* number of tuples in each partition */
+} HashAggSpill;
+
+/*
+ * Represents work to be done for one pass of hash aggregation. Initially,
+ * only the input fields are set. If spilled to disk, also set the spill data.
+ */
+typedef struct HashAggBatch
+{
+ BufFile *input_file; /* input partition */
+ int input_bits; /* number of bits for input partition mask */
+ int64 input_groups; /* estimated number of input groups */
+ int setno; /* grouping set */
+ HashAggSpill spill; /* spill output */
+} HashAggBatch;
+
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
static void initialize_phase(AggState *aggstate, int newphase);
static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -272,11 +334,27 @@ static TupleTableSlot *project_aggregates(AggState *aggstate);
static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
static void build_hash_table(AggState *aggstate);
-static TupleHashEntryData *lookup_hash_entry(AggState *aggstate);
+static void prepare_hash_slot(AggState *aggstate);
+static uint32 calculate_hash(AggState *aggstate);
+static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
+static bool agg_refill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
+static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
+static void hash_spill_init(HashAggSpill *spill, int input_bits,
+ uint64 input_tuples, double hashentrysize);
+static Size hash_spill_tuple(HashAggSpill *spill, int input_bits,
+ TupleTableSlot *slot, uint32 hash);
+static MinimalTuple hash_read_spilled(BufFile *file, uint32 *hashp);
+static HashAggBatch *hash_batch_new(BufFile *input_file, int setno,
+ int64 input_groups, int input_bits);
+static void hash_finish_initial_spills(AggState *aggstate);
+static void hash_spill_finish(AggState *aggstate, HashAggSpill *spill,
+ int setno, int input_bits);
+static void hash_reset_spill(HashAggSpill *spill);
+static void hash_reset_spills(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
AggState *aggstate, EState *estate,
@@ -1269,6 +1347,10 @@ build_hash_table(AggState *aggstate)
Assert(aggstate->aggstrategy == AGG_HASHED || aggstate->aggstrategy == AGG_MIXED);
+ /* TODO: work harder to find a good nGroups for each hash table. We don't
+ * want the hash table itself to fill up work_mem with no room for
+ * out-of-line transition values. Also, we need to consider that there are
+ * multiple hash tables for grouping sets. */
additionalsize = aggstate->numtrans * sizeof(AggStatePerGroupData);
for (i = 0; i < aggstate->num_hashes; ++i)
@@ -1294,6 +1376,24 @@ build_hash_table(AggState *aggstate)
tmpmem,
DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
}
+
+ aggstate->hash_mem_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_ngroups_current = 0;
+
+ /*
+ * Initialize the threshold at which we stop creating new hash entries and
+ * start spilling. If an empty hash table exceeds the limit, increase the
+ * limit to be the size of the empty hash table. This ensures that at
+ * least one entry can be added so that the algorithm can make progress.
+ */
+ if (hashagg_mem_overflow)
+ aggstate->hash_mem_limit = SIZE_MAX;
+ else
+ aggstate->hash_mem_limit = (work_mem * 1024L) - HASH_PARTITION_MEM;
+
+ if (aggstate->hash_mem_current > aggstate->hash_mem_limit)
+ aggstate->hash_mem_limit = aggstate->hash_mem_current;
}
/*
@@ -1454,23 +1554,13 @@ hash_agg_entry_size(int numAggs)
return entrysize;
}
-/*
- * Find or create a hashtable entry for the tuple group containing the current
- * tuple (already set in tmpcontext's outertuple slot), in the current grouping
- * set (which the caller must have selected - note that initialize_aggregate
- * depends on this).
- *
- * When called, CurrentMemoryContext should be the per-query context.
- */
-static TupleHashEntryData *
-lookup_hash_entry(AggState *aggstate)
+static void
+prepare_hash_slot(AggState *aggstate)
{
- TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
- TupleTableSlot *hashslot = perhash->hashslot;
- TupleHashEntryData *entry;
- bool isnew;
- int i;
+ TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ int i;
/* transfer just the needed columns into hashslot */
slot_getsomeattrs(inputslot, perhash->largestGrpColIdx);
@@ -1484,14 +1574,71 @@ lookup_hash_entry(AggState *aggstate)
hashslot->tts_isnull[i] = inputslot->tts_isnull[varNumber];
}
ExecStoreVirtualTuple(hashslot);
+}
+
+static uint32
+calculate_hash(AggState *aggstate)
+{
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleHashTable hashtable = perhash->hashtable;
+ MemoryContext oldContext;
+ uint32 hash;
+
+ /* set up data needed by hash and match functions */
+ hashtable->inputslot = perhash->hashslot;
+ hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ hashtable->cur_eq_func = hashtable->tab_eq_func;
+
+ /* Need to run the hash functions in short-lived context */
+ oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
+
+ hash = TupleHashTableHash(hashtable->hashtab, NULL);
+
+ MemoryContextSwitchTo(oldContext);
+
+ return hash;
+}
+
+/*
+ * Find or create a hashtable entry for the tuple group containing the current
+ * tuple (already set in tmpcontext's outertuple slot), in the current grouping
+ * set (which the caller must have selected - note that initialize_aggregate
+ * depends on this).
+ *
+ * When called, CurrentMemoryContext should be the per-query context.
+ */
+static AggStatePerGroup
+lookup_hash_entry(AggState *aggstate, uint32 hash)
+{
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ TupleHashEntryData *entry;
+ bool isnew = false;
+ bool *p_isnew;
+
+ /* if hash table memory limit is exceeded, don't create new entries */
+ p_isnew = (aggstate->hash_mem_current > aggstate->hash_mem_limit) ?
+ NULL : &isnew;
/* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntry(perhash->hashtable, hashslot, &isnew);
+ entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, p_isnew,
+ hash);
+
+ if (entry == NULL)
+ return NULL;
if (isnew)
{
- AggStatePerGroup pergroup;
- int transno;
+ AggStatePerGroup pergroup;
+ int transno;
+
+ aggstate->hash_ngroups_current++;
+
+ aggstate->hash_mem_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+
+ if (aggstate->hash_mem_current > aggstate->hash_mem_peak)
+ aggstate->hash_mem_peak = aggstate->hash_mem_current;
pergroup = (AggStatePerGroup)
MemoryContextAlloc(perhash->hashtable->tablecxt,
@@ -1511,7 +1658,7 @@ lookup_hash_entry(AggState *aggstate)
}
}
- return entry;
+ return entry->additional;
}
/*
@@ -1519,18 +1666,49 @@ lookup_hash_entry(AggState *aggstate)
* returning an array of pergroup pointers suitable for advance_aggregates.
*
* Be aware that lookup_hash_entry can reset the tmpcontext.
+ *
+ * Tuples for new groups are spilled to disk once the memory limit is exceeded.
*/
static void
lookup_hash_entries(AggState *aggstate)
{
- int numHashes = aggstate->num_hashes;
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
int setno;
- for (setno = 0; setno < numHashes; setno++)
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
{
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ uint32 hash;
+
select_current_set(aggstate, setno, true);
- pergroup[setno] = lookup_hash_entry(aggstate)->additional;
+ prepare_hash_slot(aggstate);
+ hash = calculate_hash(aggstate);
+ pergroup[setno] = lookup_hash_entry(aggstate, hash);
+
+ if (pergroup[setno] == NULL)
+ {
+ HashAggSpill *spill;
+ TupleTableSlot *slot = aggstate->tmpcontext->ecxt_outertuple;
+ double hashentrysize = 0;
+
+ /* average memory cost per entry */
+ if (aggstate->hash_ngroups_current > 0)
+ hashentrysize = (double)aggstate->hash_mem_current /
+ (double)aggstate->hash_ngroups_current;
+
+ if (aggstate->hash_spills == NULL)
+ aggstate->hash_spills = palloc0(
+ sizeof(HashAggSpill) * aggstate->num_hashes);
+ aggstate->hash_spilled = true;
+
+ spill = &aggstate->hash_spills[setno];
+
+ if (spill->partitions == NULL)
+ hash_spill_init(spill, 0, perhash->aggnode->numGroups,
+ hashentrysize);
+
+ aggstate->hash_disk_used += hash_spill_tuple(spill, 0, slot, hash);
+ }
}
}
@@ -1852,6 +2030,10 @@ agg_retrieve_direct(AggState *aggstate)
outerslot = fetch_input_tuple(aggstate);
if (TupIsNull(outerslot))
{
+ if (aggstate->aggstrategy == AGG_MIXED &&
+ aggstate->current_phase == 1)
+ hash_finish_initial_spills(aggstate);
+
/* no more outer-plan tuples available */
if (hasGroupingSets)
{
@@ -1955,6 +2137,8 @@ agg_fill_hash_table(AggState *aggstate)
ResetExprContext(aggstate->tmpcontext);
}
+ hash_finish_initial_spills(aggstate);
+
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
@@ -1962,11 +2146,149 @@ agg_fill_hash_table(AggState *aggstate)
&aggstate->perhash[0].hashiter);
}
+/*
+ * If any data was spilled during hash aggregation, reset the hash table and
+ * reprocess one batch of spilled data. After reprocessing a batch, the hash
+ * table will again contain data, ready to be consumed by
+ * agg_retrieve_hash_table_in_memory().
+ *
+ * Should only be called after all in memory hash table entries have been
+ * consumed.
+ *
+ * Return false when input is exhausted and there's no more work to be done;
+ * otherwise return true.
+ */
+static bool
+agg_refill_hash_table(AggState *aggstate)
+{
+ HashAggBatch *batch;
+ AggStatePerGroup *pergroup;
+
+ if (aggstate->hash_batches == NIL)
+ return false;
+
+ pergroup = aggstate->all_pergroups;
+ while(pergroup != aggstate->hash_pergroup) {
+ *pergroup = NULL;
+ pergroup++;
+ }
+
+ /* free memory */
+ ReScanExprContext(aggstate->hashcontext);
+ /* Rebuild an empty hash table */
+ build_hash_table(aggstate);
+
+ batch = linitial(aggstate->hash_batches);
+ aggstate->hash_batches = list_delete_first(aggstate->hash_batches);
+
+ Assert(aggstate->current_phase == 0);
+
+ /*
+ * TODO: what should be done here to set up for advance_aggregates?
+ */
+ if (aggstate->phase->aggstrategy == AGG_MIXED)
+ {
+ aggstate->current_phase = 1;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+ }
+
+ for (;;) {
+ TupleTableSlot *slot = aggstate->hash_spill_slot;
+ MinimalTuple tuple;
+ uint32 hash;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tuple = hash_read_spilled(batch->input_file, &hash);
+ if (tuple == NULL)
+ break;
+
+ /*
+ * TODO: Should we re-compile the expressions to use a minimal tuple
+ * slot so that we don't have to create the virtual tuple here? If we
+ * project the tuple before writing, then perhaps this is not
+ * important.
+ */
+ ExecForceStoreMinimalTuple(tuple, slot, true);
+ aggstate->tmpcontext->ecxt_outertuple = slot;
+
+ /* Find or build hashtable entries */
+ memset(aggstate->hash_pergroup, 0,
+ sizeof(AggStatePerGroup) * aggstate->num_hashes);
+ select_current_set(aggstate, batch->setno, true);
+ prepare_hash_slot(aggstate);
+ aggstate->hash_pergroup[batch->setno] = lookup_hash_entry(aggstate, hash);
+ if (aggstate->hash_pergroup[batch->setno] == NULL)
+ {
+ double hashentrysize = 0;
+
+ /* average memory cost per entry */
+ if (aggstate->hash_ngroups_current > 0)
+ hashentrysize = (double)aggstate->hash_mem_current /
+ (double)aggstate->hash_ngroups_current;
+
+ if (batch->spill.partitions == NULL)
+ hash_spill_init(&batch->spill, batch->input_bits,
+ batch->input_groups, hashentrysize);
+
+ aggstate->hash_disk_used += hash_spill_tuple(
+ &batch->spill, batch->input_bits, slot, hash);
+ }
+
+ /* Advance the aggregates (or combine functions) */
+ advance_aggregates(aggstate);
+
+ /*
+ * Reset per-input-tuple context after each tuple, but note that the
+ * hash lookups do this too
+ */
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ BufFileClose(batch->input_file);
+
+ aggstate->current_phase = 0;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+
+ hash_spill_finish(aggstate, &batch->spill, batch->setno,
+ batch->input_bits);
+
+ pfree(batch);
+
+ /* Initialize to walk the first hash table */
+ select_current_set(aggstate, 0, true);
+ ResetTupleHashIterator(aggstate->perhash[0].hashtable,
+ &aggstate->perhash[0].hashiter);
+
+ return true;
+}
+
/*
* ExecAgg for hashed case: retrieving groups from hash table
*/
static TupleTableSlot *
agg_retrieve_hash_table(AggState *aggstate)
+{
+ TupleTableSlot *result = NULL;
+
+ while (result == NULL)
+ {
+ result = agg_retrieve_hash_table_in_memory(aggstate);
+ if (result == NULL)
+ {
+ if (!agg_refill_hash_table(aggstate))
+ {
+ aggstate->agg_done = true;
+ break;
+ }
+ }
+ }
+
+ return result;
+}
+
+static TupleTableSlot *
+agg_retrieve_hash_table_in_memory(AggState *aggstate)
{
ExprContext *econtext;
AggStatePerAgg peragg;
@@ -1995,7 +2317,7 @@ agg_retrieve_hash_table(AggState *aggstate)
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
- while (!aggstate->agg_done)
+ for (;;)
{
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -2026,8 +2348,6 @@ agg_retrieve_hash_table(AggState *aggstate)
}
else
{
- /* No more hashtables, so done */
- aggstate->agg_done = true;
return NULL;
}
}
@@ -2084,6 +2404,322 @@ agg_retrieve_hash_table(AggState *aggstate)
return NULL;
}
+/*
+ * Determine the number of partitions to create when spilling.
+ */
+static int
+hash_spill_npartitions(uint64 input_tuples, double hashentrysize)
+{
+ Size mem_needed;
+ int partition_limit;
+ int npartitions;
+
+ /*
+ * Avoid creating so many partitions that the memory requirements of the
+ * open partition files (estimated at BLCKSZ for buffering) are greater
+ * than 1/4 of work_mem.
+ */
+ partition_limit = (work_mem * 1024L * 0.25) / BLCKSZ;
+
+ /* pessimistically estimate that each input tuple creates a new group */
+ mem_needed = HASH_PARTITION_FACTOR * input_tuples * hashentrysize;
+
+ /* make enough partitions so that each one is likely to fit in memory */
+ npartitions = 1 + (mem_needed / (work_mem * 1024L));
+
+ if (npartitions > partition_limit)
+ npartitions = partition_limit;
+
+ if (npartitions < HASH_MIN_PARTITIONS)
+ npartitions = HASH_MIN_PARTITIONS;
+ if (npartitions > HASH_MAX_PARTITIONS)
+ npartitions = HASH_MAX_PARTITIONS;
+
+ return npartitions;
+}
+
+/*
+ * hash_spill_init
+ *
+ * Called after we determined that spilling is necessary. Chooses the number
+ * of partitions to create, and initializes them.
+ */
+static void
+hash_spill_init(HashAggSpill *spill, int input_bits, uint64 input_tuples,
+ double hashentrysize)
+{
+ int npartitions;
+ int partition_bits;
+
+ npartitions = hash_spill_npartitions(input_tuples, hashentrysize);
+ partition_bits = my_log2(npartitions);
+
+ /* make sure that we don't exhaust the hash bits */
+ if (partition_bits + input_bits >= 32)
+ partition_bits = 32 - input_bits;
+
+ /* number of partitions will be a power of two */
+ npartitions = 1L << partition_bits;
+
+ spill->partition_bits = partition_bits;
+ spill->n_partitions = npartitions;
+ spill->partitions = palloc0(sizeof(BufFile *) * npartitions);
+ spill->ntuples = palloc0(sizeof(int64) * npartitions);
+}
+
+/*
+ * hash_spill_tuple
+ *
+ * Not enough memory to add tuple as new entry in hash table. Save for later
+ * in the appropriate partition.
+ */
+static Size
+hash_spill_tuple(HashAggSpill *spill, int input_bits, TupleTableSlot *slot,
+ uint32 hash)
+{
+ int partition;
+ MinimalTuple tuple;
+ BufFile *file;
+ int written;
+ int total_written = 0;
+ bool shouldFree;
+
+ Assert(spill->partitions != NULL);
+
+ /*
+ * TODO: should we project only needed attributes from the tuple before
+ * writing it?
+ */
+ tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+ if (spill->partition_bits == 0)
+ partition = 0;
+ else
+ partition = (hash << input_bits) >>
+ (32 - spill->partition_bits);
+
+ spill->ntuples[partition]++;
+
+ /*
+ * TODO: use logtape.c instead?
+ */
+ if (spill->partitions[partition] == NULL)
+ spill->partitions[partition] = BufFileCreateTemp(false);
+ file = spill->partitions[partition];
+
+
+ written = BufFileWrite(file, (void *) &hash, sizeof(uint32));
+ if (written != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+ total_written += written;
+
+ written = BufFileWrite(file, (void *) tuple, tuple->t_len);
+ if (written != tuple->t_len)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+ total_written += written;
+
+ if (shouldFree)
+ pfree(tuple);
+
+ return total_written;
+}
+
+/*
+ * hash_read_spilled
+ * Read the next tuple from a batch file. Return NULL if no more.
+ */
+static MinimalTuple
+hash_read_spilled(BufFile *file, uint32 *hashp)
+{
+ MinimalTuple tuple;
+ uint32 t_len;
+ size_t nread;
+ uint32 hash;
+
+ nread = BufFileRead(file, &hash, sizeof(uint32));
+ if (nread == 0)
+ return NULL;
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+ if (hashp != NULL)
+ *hashp = hash;
+
+ nread = BufFileRead(file, &t_len, sizeof(t_len));
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+
+ tuple = (MinimalTuple) palloc(t_len);
+ tuple->t_len = t_len;
+
+ nread = BufFileRead(file, (void *)((char *)tuple + sizeof(uint32)),
+ t_len - sizeof(uint32));
+ if (nread != t_len - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+
+ return tuple;
+}
+
+/*
+ * hash_batch_new
+ *
+ * Construct a HashAggBatch item, which represents one iteration of HashAgg to
+ * be done. Should be called in the aggregate's memory context.
+ */
+static HashAggBatch *
+hash_batch_new(BufFile *input_file, int setno, int64 input_groups,
+ int input_bits)
+{
+ HashAggBatch *batch = palloc0(sizeof(HashAggBatch));
+
+ batch->input_file = input_file;
+ batch->input_bits = input_bits;
+ batch->input_groups = input_groups;
+ batch->setno = setno;
+
+ /* batch->spill will be set only after spilling this batch */
+
+ return batch;
+}
+
+/*
+ * hash_finish_initial_spills
+ *
+ * After a HashAggBatch has been processed, it may have spilled tuples to
+ * disk. If so, turn the spilled partitions into new batches that must later
+ * be executed.
+ */
+static void
+hash_finish_initial_spills(AggState *aggstate)
+{
+ int setno;
+
+ if (aggstate->hash_spills == NULL)
+ return;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hash_spill_finish(aggstate, &aggstate->hash_spills[setno], setno, 0);
+
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+}
+
+/*
+ * hash_spill_finish
+ *
+ * Transform the spilled partitions into new batches to be processed later.
+ */
+static void
+hash_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno, int input_bits)
+{
+ int i;
+
+ if (spill->n_partitions == 0)
+ return; /* didn't spill */
+
+ for (i = 0; i < spill->n_partitions; i++)
+ {
+ BufFile *file = spill->partitions[i];
+ MemoryContext oldContext;
+ HashAggBatch *new_batch;
+ int64 input_ngroups;
+
+ /* partition is empty */
+ if (file == NULL)
+ continue;
+
+ /* rewind file for reading */
+ if (BufFileSeek(file, 0, 0L, SEEK_SET))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rewind HashAgg temporary file: %m")));
+
+ /*
+ * Estimate the number of input groups for this new work item as the
+ * total number of tuples in its input file. Although that's a worst
+ * case, it's not bad here for two reasons: (1) overestimating is
+ * better than underestimating; and (2) we've already scanned the
+ * relation once, so it's likely that we've already finalized many of
+ * the common values.
+ */
+ input_ngroups = spill->ntuples[i];
+
+ oldContext = MemoryContextSwitchTo(aggstate->ss.ps.state->es_query_cxt);
+ new_batch = hash_batch_new(file, setno, input_ngroups,
+ spill->partition_bits + input_bits);
+ aggstate->hash_batches = lappend(aggstate->hash_batches, new_batch);
+ aggstate->hash_batches_used++;
+ MemoryContextSwitchTo(oldContext);
+ }
+
+ pfree(spill->ntuples);
+ pfree(spill->partitions);
+}
+
+/*
+ * Clear a HashAggSpill, free its memory, and close its files.
+ */
+static void
+hash_reset_spill(HashAggSpill *spill)
+{
+ int i;
+ for (i = 0; i < spill->n_partitions; i++)
+ {
+ BufFile *file = spill->partitions[i];
+
+ if (file != NULL)
+ BufFileClose(file);
+ }
+ if (spill->ntuples != NULL)
+ pfree(spill->ntuples);
+ if (spill->partitions != NULL)
+ pfree(spill->partitions);
+}
+
+/*
+ * Find and reset all active HashAggSpills.
+ */
+static void
+hash_reset_spills(AggState *aggstate)
+{
+ ListCell *lc;
+
+ if (aggstate->hash_spills != NULL)
+ {
+ int setno;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hash_reset_spill(&aggstate->hash_spills[setno]);
+
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+ }
+
+ foreach(lc, aggstate->hash_batches)
+ {
+ HashAggBatch *batch = (HashAggBatch*) lfirst(lc);
+ if (batch->input_file != NULL)
+ {
+ BufFileClose(batch->input_file);
+ batch->input_file = NULL;
+ }
+ hash_reset_spill(&batch->spill);
+ pfree(batch);
+ }
+ list_free(aggstate->hash_batches);
+ aggstate->hash_batches = NIL;
+}
+
+
/* -----------------
* ExecInitAgg
*
@@ -2268,6 +2904,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->ss.ps.outeropsfixed = false;
}
+ if (use_hashing)
+ aggstate->hash_spill_slot = ExecInitExtraTupleSlot(estate, scanDesc,
+ &TTSOpsVirtual);
+
/*
* Initialize result type, slot and projection.
*/
@@ -3398,6 +4038,8 @@ ExecEndAgg(AggState *node)
if (node->sort_out)
tuplesort_end(node->sort_out);
+ hash_reset_spills(node);
+
for (transno = 0; transno < node->numtrans; transno++)
{
AggStatePerTrans pertrans = &node->pertrans[transno];
@@ -3453,12 +4095,13 @@ ExecReScanAgg(AggState *node)
return;
/*
- * If we do have the hash table, and the subplan does not have any
- * parameter changes, and none of our own parameter changes affect
- * input expressions of the aggregated functions, then we can just
- * rescan the existing hash table; no need to build it again.
+ * If we do have the hash table, and it never spilled, and the subplan
+ * does not have any parameter changes, and none of our own parameter
+ * changes affect input expressions of the aggregated functions, then
+ * we can just rescan the existing hash table; no need to build it
+ * again.
*/
- if (outerPlan->chgParam == NULL &&
+ if (outerPlan->chgParam == NULL && !node->hash_spilled &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
ResetTupleHashIterator(node->perhash[0].hashtable,
@@ -3515,6 +4158,17 @@ ExecReScanAgg(AggState *node)
*/
if (node->aggstrategy == AGG_HASHED || node->aggstrategy == AGG_MIXED)
{
+ hash_reset_spills(node);
+
+ node->hash_spilled = false;
+ node->hash_mem_current = 0;
+ node->hash_ngroups_current = 0;
+
+ /* reset stats */
+ node->hash_mem_peak = 0;
+ node->hash_disk_used = 0;
+ node->hash_batches_used = 0;
+
ReScanExprContext(node->hashcontext);
/* Rebuild an empty hash table */
build_hash_table(node);
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index a9d362100a8..f0f742eebf5 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -2093,12 +2093,14 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_allpergroupsp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_setoff,
v_transno;
LLVMValueRef v_notransvalue;
+ LLVMBasicBlockRef b_check_notransvalue;
LLVMBasicBlockRef b_init;
aggstate = op->d.agg_init_trans.aggstate;
@@ -2120,11 +2122,22 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_init_trans.setoff);
v_transno = l_int32_const(op->d.agg_init_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+ b_check_notransvalue = l_bb_before_v(
+ opblocks[i + 1], "op.%d.check_notransvalue", i);
+
+ LLVMBuildCondBr(b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[i + 1],
+ b_check_notransvalue);
+
+ LLVMPositionBuilderAtEnd(b, b_check_notransvalue);
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_notransvalue =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_NOTRANSVALUE,
@@ -2191,6 +2204,9 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_transnull;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
+
+ LLVMBasicBlockRef b_check_transnull;
int jumpnull = op->d.agg_strict_trans_check.jumpnull;
@@ -2210,11 +2226,22 @@ llvm_compile_expr(ExprState *state)
l_int32_const(op->d.agg_strict_trans_check.setoff);
v_transno =
l_int32_const(op->d.agg_strict_trans_check.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ b_check_transnull = l_bb_before_v(opblocks[i + 1],
+ "op.%d.check_transnull", i);
+ LLVMBuildCondBr(b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[jumpnull],
+ b_check_transnull);
+
+ LLVMPositionBuilderAtEnd(b, b_check_transnull);
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_transnull =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_TRANSVALUEISNULL,
@@ -2256,12 +2283,15 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_pertransp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_retval;
LLVMValueRef v_tmpcontext;
LLVMValueRef v_oldcontext;
+ LLVMBasicBlockRef b_advance_transval;
+
aggstate = op->d.agg_trans.aggstate;
pertrans = op->d.agg_trans.pertrans;
@@ -2283,10 +2313,22 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_trans.setoff);
v_transno = l_int32_const(op->d.agg_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ b_advance_transval = l_bb_before_v(opblocks[i + 1],
+ "op.%d.advance_transval", i);
+
+ LLVMBuildCondBr(b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[i + 1],
+ b_advance_transval);
+
+ LLVMPositionBuilderAtEnd(b, b_advance_transval);
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_fcinfo = l_ptr_const(fcinfo,
l_ptr(StructFunctionCallInfoData));
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c5f65934859..3f0d2899635 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
bool enable_hashagg = true;
+bool enable_hashagg_spill = true;
bool enable_nestloop = true;
bool enable_material = true;
bool enable_mergejoin = true;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7fe11b59a02..511f8861a8f 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4255,6 +4255,9 @@ consider_groupingsets_paths(PlannerInfo *root,
* gd->rollups is empty if we have only unsortable columns to work
* with. Override work_mem in that case; otherwise, we'll rely on the
* sorted-input case to generate usable mixed paths.
+ *
+ * TODO: think more about how to plan grouping sets when spilling hash
+ * tables is an option
*/
if (hashsize > work_mem * 1024L && gd->rollups)
return; /* nope, won't fit */
@@ -6527,7 +6530,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
* were unable to sort above, then we'd better generate a Path, so
* that we at least have one.
*/
- if (hashaggtablesize < work_mem * 1024L ||
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L ||
grouped_rel->pathlist == NIL)
{
/*
@@ -6560,7 +6564,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
agg_final_costs,
dNumGroups);
- if (hashaggtablesize < work_mem * 1024L)
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6829,7 +6834,7 @@ create_partial_grouping_paths(PlannerInfo *root,
* Tentatively produce a partial HashAgg Path, depending on if it
* looks as if the hash table will fit in work_mem.
*/
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_total_path != NULL)
{
add_path(partially_grouped_rel, (Path *)
@@ -6856,7 +6861,7 @@ create_partial_grouping_paths(PlannerInfo *root,
dNumPartialPartialGroups);
/* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_partial_path != NULL)
{
add_partial_path(partially_grouped_rel, (Path *)
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 3bf96de256d..b0cb1d7e6b2 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -120,6 +120,7 @@ bool enableFsync = true;
bool allowSystemTableMods = false;
int work_mem = 1024;
int maintenance_work_mem = 16384;
+bool hashagg_mem_overflow = false;
int max_parallel_maintenance_workers = 2;
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ba4edde71a3..d588198df55 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -957,6 +957,26 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_hashagg_spill", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of hashed aggregation plans that are expected to exceed work_mem."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_hashagg_spill,
+ true,
+ NULL, NULL, NULL
+ },
+ {
+ {"hashagg_mem_overflow", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables hashed aggregation to overflow work_mem at execution time."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &hashagg_mem_overflow,
+ false,
+ NULL, NULL, NULL
+ },
{
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of materialization."),
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 6298c7c8cad..84a71444264 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -140,10 +140,15 @@ extern TupleHashTable BuildTupleHashTableExt(PlanState *parent,
extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
bool *isnew);
+extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable,
+ TupleTableSlot *slot,
+ bool *isnew, uint32 hash);
extern TupleHashEntry FindTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
ExprState *eqcomp,
FmgrInfo *hashfunctions);
+extern uint32 TupleHashTableHash(struct tuplehash_hash *tb,
+ const MinimalTuple tuple);
extern void ResetTupleHashTable(TupleHashTable hashtable);
/*
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bc6e03fbc7e..321759ead51 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -244,6 +244,7 @@ extern bool enableFsync;
extern PGDLLIMPORT bool allowSystemTableMods;
extern PGDLLIMPORT int work_mem;
extern PGDLLIMPORT int maintenance_work_mem;
+extern PGDLLIMPORT bool hashagg_mem_overflow;
extern PGDLLIMPORT int max_parallel_maintenance_workers;
extern int VacuumCostPageHit;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6eb647290be..e7b12ed39b8 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2070,13 +2070,26 @@ typedef struct AggState
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
/* these fields are used in AGG_HASHED and AGG_MIXED modes: */
bool table_filled; /* hash table filled yet? */
- int num_hashes;
+ int num_hashes; /* number of hash tables active at once */
+ bool hash_spilled; /* any hash table ever spilled? */
+ struct HashAggSpill *hash_spills; /* HashAggSpill for each hash table,
+ exists only during first pass if spilled */
+ TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
+ Size hash_mem_limit; /* limit before spilling hash table */
+ Size hash_mem_peak; /* peak hash table memory usage */
+ uint64 hash_ngroups_current; /* number of tuples currently in
+ memory in all hash tables */
+ Size hash_mem_current; /* current hash table memory usage */
+ uint64 hash_disk_used; /* bytes of disk space used */
+ int hash_batches_used; /* batches used during entire execution */
+ List *hash_batches; /* hash batches remaining to be processed */
+
AggStatePerHash perhash; /* array of per-hashtable data */
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 34
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 44
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
@@ -2248,7 +2261,7 @@ typedef struct HashInstrumentation
int nbuckets_original; /* planned number of buckets */
int nbatch; /* number of batches at end of execution */
int nbatch_original; /* planned number of batches */
- size_t space_peak; /* speak memory usage in bytes */
+ size_t space_peak; /* peak memory usage in bytes */
} HashInstrumentation;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b3d0b4f6fbc..b72e2d08290 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -54,6 +54,7 @@ extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
extern PGDLLIMPORT bool enable_hashagg;
+extern PGDLLIMPORT bool enable_hashagg_spill;
extern PGDLLIMPORT bool enable_nestloop;
extern PGDLLIMPORT bool enable_material;
extern PGDLLIMPORT bool enable_mergejoin;
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index be4ddf86a43..8b64d15368e 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2331,3 +2331,95 @@ explain (costs off)
-> Seq Scan on onek
(8 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+set work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------------
+ GroupAggregate
+ Group Key: ((g % 100000))
+ -> Sort
+ Sort Key: ((g % 100000))
+ -> Function Scan on generate_series g
+(5 rows)
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+set jit_above_cost to default;
+create table agg_group_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_group_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+-- Produce results with hash aggregation
+set enable_hashagg = true;
+set enable_sort = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 100000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+set jit_above_cost to default;
+create table agg_hash_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_hash_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+set enable_sort = true;
+set work_mem to default;
+-- Compare group aggregation results to hash aggregation results
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index c1f802c88a7..767f60a96c7 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1633,4 +1633,127 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
| 1 | 2
(4 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+-- Produce results with hash aggregation.
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+set enable_sort = true;
+set work_mem to default;
+-- Compare results
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+ g100 | g10 | unnest | c | m
+------+-----+--------+---+---
+(0 rows)
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
-- end
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1de..11c6f50fbfa 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -148,6 +148,68 @@ SELECT count(*) FROM
4
(1 row)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+SET enable_hashagg=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------------
+ Unique
+ -> Sort
+ Sort Key: ((g % 1000))
+ -> Function Scan on generate_series g
+(4 rows)
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_hashagg=TRUE;
+-- Produce results with hash aggregation.
+SET enable_sort=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 1000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_sort=TRUE;
+SET work_mem TO DEFAULT;
+-- Compare results
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+(SELECT * FROM distinct_hash_2 EXCEPT SELECT * FROM distinct_group_2)
+ UNION ALL
+(SELECT * FROM distinct_group_2 EXCEPT SELECT * FROM distinct_hash_2);
+ text
+------
+(0 rows)
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb9057..c40bf6c16eb 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -75,6 +75,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
+ enable_hashagg_spill | on
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 17fb256aec5..bcd336c5812 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1017,3 +1017,91 @@ select v||'a', case when v||'a' = 'aa' then 1 else 0 end, count(*)
explain (costs off)
select 1 from tenk1
where (hundred, thousand) in (select twothousand, twothousand from onek);
+
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+set work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+set jit_above_cost to default;
+
+create table agg_group_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_group_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+-- Produce results with hash aggregation
+
+set enable_hashagg = true;
+set enable_sort = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+set jit_above_cost to default;
+
+create table agg_hash_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_hash_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare group aggregation results to hash aggregation results
+
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 95ac3fb52f6..bf8bce6ed31 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -441,4 +441,103 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
from unnest(array[1,1], array['a','b']) u(i,v)
group by rollup(i, v||'a') order by 1,3;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+-- Produce results with hash aggregation.
+
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare results
+
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
+
-- end
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449e..33102744ebf 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -45,6 +45,68 @@ SELECT count(*) FROM
SELECT count(*) FROM
(SELECT DISTINCT two, four, two FROM tenk1) ss;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+SET enable_hashagg=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_hashagg=TRUE;
+
+-- Produce results with hash aggregation.
+
+SET enable_sort=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_sort=TRUE;
+
+SET work_mem TO DEFAULT;
+
+-- Compare results
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+(SELECT * FROM distinct_hash_2 EXCEPT SELECT * FROM distinct_group_2)
+ UNION ALL
+(SELECT * FROM distinct_group_2 EXCEPT SELECT * FROM distinct_hash_2);
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
+
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
On Wed, Nov 27, 2019 at 02:58:04PM -0800, Jeff Davis wrote:
On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote:
Right now the patch always initializes 32 spill partitions. Have you given
any thought into how to intelligently pick an optimal number of
partitions yet?

Attached a new patch that addresses this.
1. Divide hash table memory used by the number of groups in the hash
table to get the average memory used per group.
2. Multiply by the number of groups spilled -- which I pessimistically
estimate as the number of tuples spilled -- to get the total amount of
memory that we'd like to have to process all spilled tuples at once.
Isn't the "number of tuples = number of groups" estimate likely to be
way too pessimistic? IIUC the consequence is that it pushes us to pick
more partitions than necessary, correct?
Could we instead track how many tuples we actually consumed for the
in-memory groups, and then use this information to improve the estimate
of number of groups? I mean, if we know we've consumed 1000 tuples which
created 100 groups, then we know there's ~1:10 ratio.
3. Divide the desired amount of memory by work_mem to get the number of
partitions we'd like to have such that each partition can be processed
in work_mem without spilling.
4. Apply a few sanity checks, fudge factors, and limits.

Using this runtime information should be substantially better than
using estimates and projections.

Additionally, I removed some branches from the common path. I think I
still have more work to do there.

I also rebased of course, and fixed a few other things.
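Restated in code, the four steps above amount to roughly the following
sketch. All of the names here are illustrative rather than taken from the
patch; the patch's actual partition-choosing function also applies
HASH_PARTITION_FACTOR and the min/max partition limits, and may differ in
detail.

#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of the partition-count heuristic described in steps 1-4 above.
 *
 *   mem_used        - bytes currently used by the in-memory hash table
 *   ngroups_in_mem  - groups currently held in the hash table
 *   ntuples_spilled - tuples spilled so far (pessimistically counted as
 *                     one group per tuple)
 *   work_mem_bytes  - per-node memory budget (always > 0)
 */
static int
choose_num_spill_partitions_sketch(size_t mem_used, uint64_t ngroups_in_mem,
                                   uint64_t ntuples_spilled,
                                   size_t work_mem_bytes)
{
    double      avg_group_mem;  /* step 1: average memory per in-memory group */
    double      mem_wanted;     /* step 2: memory to hold all spilled groups */
    double      npartitions;    /* step 3: partitions so each fits in work_mem */

    if (ngroups_in_mem == 0)
        ngroups_in_mem = 1;

    avg_group_mem = (double) mem_used / (double) ngroups_in_mem;
    mem_wanted = avg_group_mem * (double) ntuples_spilled;
    npartitions = mem_wanted / (double) work_mem_bytes + 1.0;

    /* step 4: clamp to sane limits (cf. HASH_MIN/MAX_PARTITIONS) */
    if (npartitions < 4)
        npartitions = 4;
    if (npartitions > 256)
        npartitions = 256;

    return (int) npartitions;
}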
A couple of comments based on eye-balling the patch:
1) Shouldn't the hashagg_mem_overflow use the other GUC naming, i.e.
maybe it should be enable_hashagg_mem_overflow or something similar?
2) I'm a bit puzzled by this code in ExecInterpExpr (there are multiple
such blocks, this is just an example)
aggstate = op->d.agg_init_trans.aggstate;
pergroup_allaggs = aggstate->all_pergroups[op->d.agg_init_trans.setoff];
pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
/* If transValue has not yet been initialized, do so now. */
if (pergroup_allaggs != NULL && pergroup->noTransValue)
{ ... }
How could the (pergroup_allaggs != NULL) protect against anything? Let's
assume the pointer really is NULL. Surely we'll get a segfault on the
preceding line which does dereference it
pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
Or am I missing anything?
3) execGrouping.c
A couple of functions would deserve a comment, explaining what it does.
- LookupTupleHashEntryHash
- prepare_hash_slot
- calculate_hash
And it's not clear to me why we should remove part of the comment before
TupleHashTableHash.
4) I'm not sure I agree with this reasoning that HASH_PARTITION_FACTOR
making the hash tables smaller is desirable - it may be, but if that was
generally the case we'd just use small hash tables all the time. It's a
bit annoying to give user the capability to set work_mem and then kinda
override that.
* ... Another benefit of having more, smaller partitions is that small
* hash tables may perform better than large ones due to memory caching
* effects.
5) Not sure what "directly" means in this context?
* partitions at the time we need to spill, and because this algorithm
* shouldn't depend too directly on the internal memory needs of a
* BufFile.
#define HASH_PARTITION_MEM (HASH_MIN_PARTITIONS * BLCKSZ)
Does that mean we don't want to link to PGAlignedBlock, or what?
6) I think we should have some protection against underflows in this
piece of code:
- this would probably deserve some protection against underflow if HASH_PARTITION_MEM gets too big
if (hashagg_mem_overflow)
aggstate->hash_mem_limit = SIZE_MAX;
else
aggstate->hash_mem_limit = (work_mem * 1024L) - HASH_PARTITION_MEM;
At the moment it's safe because work_mem is 64kB at least, and
HASH_PARTITION_MEM is 32kB (4 partitions, 8kB each). But if we happen to
bump HASH_MIN_PARTITIONS up, this can underflow.
7) Shouldn't lookup_hash_entry briefly explain why/how it handles the
memory limit?
8) The comment before lookup_hash_entries says:
...
* Return false if hash table has exceeded its memory limit.
..
But that's clearly bogus, because that's a void function.
9) Shouldn't the hash_finish_initial_spills calls in agg_retrieve_direct
have a comment, similar to the surrounding code? Might be an overkill,
not sure.
10) The comment for agg_refill_hash_table says
* Should only be called after all in memory hash table entries have been
* consumed.
Can we enforce that with an assert, somehow?
11) The hash_spill_npartitions naming seems a bit confusing, because it
seems to imply it's about the "spill" while in practice it just chooses
number of spill partitions. Maybe hash_choose_num_spill_partitions would
be better?
12) It's not clear to me why we need HASH_MAX_PARTITIONS? What's the
reasoning behind the current value (256)? Not wanting to pick too many
partitions? Comment?
if (npartitions > HASH_MAX_PARTITIONS)
npartitions = HASH_MAX_PARTITIONS;
13) As for this:
/* make sure that we don't exhaust the hash bits */
if (partition_bits + input_bits >= 32)
partition_bits = 32 - input_bits;
We already ran into this issue (exhausting bits in a hash value) in
hashjoin batching, we should be careful to use the same approach in both
places (not the same code, just general approach).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Nov 28, 2019 at 9:47 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
On Wed, Nov 27, 2019 at 02:58:04PM -0800, Jeff Davis wrote:
On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote:
Right now the patch always initializes 32 spill partitions. Have you given
any thought into how to intelligently pick an optimal number of
partitions yet?

Attached a new patch that addresses this.
1. Divide hash table memory used by the number of groups in the hash
table to get the average memory used per group.
2. Multiply by the number of groups spilled -- which I pessimistically
estimate as the number of tuples spilled -- to get the total amount of
memory that we'd like to have to process all spilled tuples at once.

Isn't the "number of tuples = number of groups" estimate likely to be
way too pessimistic? IIUC the consequence is that it pushes us to pick
more partitions than necessary, correct?
Could we instead track how many tuples we actually consumed for the
in-memory groups, and then use this information to improve the estimate
of number of groups? I mean, if we know we've consumed 1000 tuples which
created 100 groups, then we know there's ~1:10 ratio.
What would the cost be of having many small partitions? Some of the
spill files created may not be used if the estimate was pessimistic,
but that seems better than the alternative of re-spilling, since every
spill writes every tuple again.
Also, number of groups = number of tuples is only for re-spilling.
This is a little bit unclear from the variable naming.
It looks like the parameter input_tuples passed to hash_spill_init()
in lookup_hash_entries() is the number of groups estimated by planner.
However, when reloading a spill file, if we run out of memory and
re-spill, hash_spill_init() is passed batch->input_groups (which is
actually set from input_ngroups which is the number of tuples in the
spill file). So, input_tuples actually holds a group count, and
input_groups actually holds a tuple count. It may be helpful to rename these.
4) I'm not sure I agree with this reasoning that HASH_PARTITION_FACTOR
making the hash tables smaller is desirable - it may be, but if that was
generally the case we'd just use small hash tables all the time. It's a
bit annoying to give user the capability to set work_mem and then kinda
override that.

* ... Another benefit of having more, smaller partitions is that small
* hash tables may perform better than large ones due to memory caching
* effects.
So, it looks like the HASH_PARTITION_FACTOR is only used when
re-spilling. The initial hashtable will use work_mem.
It seems like the reason for using it when re-spilling is to be very
conservative to avoid more than one re-spill and make sure each spill
file fits in a hashtable in memory.
The comment does seem to point to some other reason, though...
11) The hash_spill_npartitions naming seems a bit confusing, because it
seems to imply it's about the "spill" while in practice it just chooses
number of spill partitions. Maybe hash_choose_num_spill_partitions would
be better?
Agreed that a name with "choose" or "calculate" as the verb would be
more clear.
12) It's not clear to me why we need HASH_MAX_PARTITIONS? What's the
reasoning behind the current value (256)? Not wanting to pick too many
partitions? Comment?

if (npartitions > HASH_MAX_PARTITIONS)
npartitions = HASH_MAX_PARTITIONS;
256 actually seems very large. hash_spill_npartitions() will be called
for every respill, so HASH_MAX_PARTITIONS is not the total number of
spill files permitted, but, actually, it is the number of respill
files in a given spill (a spill set). So if you made X partitions
initially and every partition re-spills, now you would have (at most)
X * 256 partitions.
If HASH_MAX_PARTITIONS is 256, wouldn't the metadata from the spill
files take up a lot of memory at that point?
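As a rough sense of scale for that concern, and assuming each open BufFile
keeps one BLCKSZ-sized (8 kB) buffer -- which appears to be where the
patch's HASH_PARTITION_MEM reservation comes from -- a fully fanned-out
spill set costs on the order of:

#include <stddef.h>

/* Illustrative arithmetic only; BLCKSZ is 8192 in a default build. */
#define BLCKSZ 8192
#define MAX_SPILL_PARTITIONS 256    /* HASH_MAX_PARTITIONS */

/* ~2 MB of buffer space if all partitions of one spill set stay open */
static const size_t spill_set_buffer_bytes =
    (size_t) MAX_SPILL_PARTITIONS * BLCKSZ;     /* 256 * 8 kB = 2 MB */

That buffering is on top of whatever the hash table itself is using.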
Melanie & Adam Lee
Thanks very much for a great review! I've attached a new patch.
There are some significant changes in the new version also:
In the non-spilling path, removed the extra nullcheck branch in the
compiled evaltrans expression. When the first tuple is spilled, the
branch becomes necessary, so I recompile the expression using a new
opcode that includes that branch.
I also changed the read-from-spill path to use a slot with
TTSOpsMinimalTuple (avoiding the need to make it into a virtual slot
right away), which means I need to recompile the evaltrans expression
for that case, as well.
I also improved the way we initialize the hash tables to use a better
estimate for the number of groups. And I made it only initialize one
hash table in the read-from-spill path.
With all of the changes I made (thanks to some suggestions from Andres)
the performance is looking pretty good. It's pretty easy to beat
Sort+Group when the group size is 10+. Even for average group size of
~1, HashAgg is getting really close to Sort in some cases.
There are still a few things to do, most notably costing. I also need
to project before spilling to avoid wasting disk. And I'm sure my
changes have created some more problems, so I have some significant
work to do on quality.
My answers to your questions inline:
On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote:
Could we instead track how many tuples we actually consumed for the
in-memory groups, and then use this information to improve the estimate
of number of groups? I mean, if we know we've consumed 1000 tuples which
created 100 groups, then we know there's ~1:10 ratio.
That would be a good estimate for an even distribution, but not
necessarily for a skewed distribution. I'm not opposed to it, but it's
generally my philosophy to overpartition as it seems there's not a big
downside.
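For reference, the ratio-based estimate being discussed would look roughly
like the sketch below. The names are hypothetical and this is not code from
the patch; as noted, such an estimate can undershoot badly for skewed
inputs.

#include <stdint.h>

/*
 * Estimate how many distinct groups the spilled tuples contain, using the
 * tuples-per-group ratio observed for the portion of the input that was
 * aggregated in memory.  E.g. 1000 tuples consumed and 100 groups created
 * gives a 10:1 ratio, so 5000 spilled tuples are estimated at ~500 groups.
 */
static uint64_t
estimate_spilled_groups_sketch(uint64_t tuples_consumed,
                               uint64_t groups_created,
                               uint64_t tuples_spilled)
{
    double      tuples_per_group;

    if (groups_created == 0)
        return tuples_spilled;  /* fall back to the pessimistic estimate */

    tuples_per_group = (double) tuples_consumed / (double) groups_created;

    return (uint64_t) ((double) tuples_spilled / tuples_per_group) + 1;
}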
A couple of comments based on eye-balling the patch:
1) Shouldn't the hashagg_mem_overflow use the other GUC naming, i.e.
maybe it should be enable_hashagg_mem_overflow or something similar?
The enable_* naming is for planner GUCs. hashagg_mem_overflow is an
execution-time GUC that disables spilling and overflows work_mem (that
is, it reverts to the old behavior).
assume the pointer really is NULL. Surely we'll get a segfault on the
preceding line which does dereference it

pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
Or am I missing anything?
That's not actually dereferencing anything, it's just doing a pointer
calculation. You are probably right that it's not a good thing to rely
on, or at least not quite as readable, so I changed the order to put
the NULL check first.
3) execGrouping.c
A couple of functions would deserve a comment, explaining what it
does.

- LookupTupleHashEntryHash
- prepare_hash_slot
- calculate_hash
Done, thank you.
And it's not clear to me why we should remove part of the comment
before TupleHashTableHash.
Trying to remember back to when I first did that, but IIRC the comment
was not updated from a previous change, and I was cleaning it up. I
will check over that again to be sure it's an improvement.
4) I'm not sure I agree with this reasoning that HASH_PARTITION_FACTOR
making the hash tables smaller is desirable - it may be, but if that was
generally the case we'd just use small hash tables all the time. It's a
bit annoying to give user the capability to set work_mem and then kinda
override that.
I think adding some kind of headroom is reasonable to avoid recursively
spilling, but perhaps it's not critical. I see this as a tuning
question more than anything else. I don't see it as "overriding"
work_mem, but I can see where you're coming from.
5) Not sure what "directly" means in this context?

* partitions at the time we need to spill, and because this algorithm
* shouldn't depend too directly on the internal memory needs of a
* BufFile.

#define HASH_PARTITION_MEM (HASH_MIN_PARTITIONS * BLCKSZ)

Does that mean we don't want to link to PGAlignedBlock, or what?
That's what I meant, yes, but I reworded the comment to not say that.
6) I think we should have some protection against underflows in this
piece of code:

- this would probably deserve some protection against underflow if
HASH_PARTITION_MEM gets too big

if (hashagg_mem_overflow)
    aggstate->hash_mem_limit = SIZE_MAX;
else
    aggstate->hash_mem_limit = (work_mem * 1024L) - HASH_PARTITION_MEM;

At the moment it's safe because work_mem is 64kB at least, and
HASH_PARTITION_MEM is 32kB (4 partitions, 8kB each). But if we happen to
bump HASH_MIN_PARTITIONS up, this can underflow.
Thank you, done.
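The kind of clamp being asked for might look like the following sketch; it
is illustrative only, and the guard actually added to the patch may differ.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HASH_PARTITION_MEM ((size_t) 4 * 8192)  /* HASH_MIN_PARTITIONS * BLCKSZ */

static size_t
hash_mem_limit_sketch(int work_mem_kb, bool mem_overflow)
{
    size_t      budget = (size_t) work_mem_kb * 1024;

    if (mem_overflow)
        return SIZE_MAX;        /* hashagg_mem_overflow: no limit */

    /* avoid wrapping around if the reservation ever exceeds the budget */
    if (budget <= HASH_PARTITION_MEM)
        return budget;          /* degenerate case: no room to reserve */

    return budget - HASH_PARTITION_MEM;
}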
7) Shouldn't lookup_hash_entry briefly explain why/how it handles the
memory limit?
Improved.
8) The comment before lookup_hash_entries says:

...
* Return false if hash table has exceeded its memory limit.
...

But that's clearly bogus, because that's a void function.
Thank you, improved comment.
9) Shouldn't the hash_finish_initial_spills calls in agg_retrieve_direct
have a comment, similar to the surrounding code? Might be an overkill,
not sure.
Sure, done.
10) The comment for agg_refill_hash_table says

* Should only be called after all in memory hash table entries have been
* consumed.

Can we enforce that with an assert, somehow?
It's a bit awkward. Simplehash doesn't expose the number of groups, and
we would also have to check each hash table. Not a bad idea to add an
interface to simplehash to make that work, though.
11) The hash_spill_npartitions naming seems a bit confusing, because it
seems to imply it's about the "spill" while in practice it just chooses
number of spill partitions. Maybe hash_choose_num_spill_partitions would
be better?
Done.
12) It's not clear to me why we need HASH_MAX_PARTITIONS? What's the
reasoning behind the current value (256)? Not wanting to pick too many
partitions? Comment?

if (npartitions > HASH_MAX_PARTITIONS)
    npartitions = HASH_MAX_PARTITIONS;
Added a comment. There's no deep reasoning there -- I just don't want
it to choose to create 5000 files and surprise a user.
13) As for this:

/* make sure that we don't exhaust the hash bits */
if (partition_bits + input_bits >= 32)
    partition_bits = 32 - input_bits;

We already ran into this issue (exhausting bits in a hash value) in
hashjoin batching, we should be careful to use the same approach in both
places (not the same code, just general approach).
Didn't investigate this yet, but will do.
Regards,
Jeff Davis
Attachments:
hashagg-20191204.diff (text/x-patch)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 53ac14490a1..10bfd7e1c3c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1751,6 +1751,23 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-hashagg-mem-overflow" xreflabel="hashagg_mem_overflow">
+ <term><varname>hashagg_mem_overflow</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>hashagg_mem_overflow</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ If hash aggregation exceeds <varname>work_mem</varname> at query
+ execution time, and <varname>hashagg_mem_overflow</varname> is set
+ to <literal>on</literal>, continue consuming more memory rather than
+ performing disk-based hash aggregation. The default
+ is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
@@ -4451,6 +4468,24 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-hashagg-spill" xreflabel="enable_hashagg_spill">
+ <term><varname>enable_hashagg_spill</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_hashagg_spill</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the memory usage is expected to
+ exceed <varname>work_mem</varname>. This only affects the planner
+ choice; actual behavior at execution time is dictated by
+ <xref linkend="guc-hashagg-mem-overflow"/>. The default
+ is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
<term><varname>enable_hashjoin</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 62fb3434a32..092a79ea14f 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -102,6 +102,7 @@ static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_hashagg_info(AggState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
static void show_instrumentation_count(const char *qlabel, int which,
@@ -1826,6 +1827,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Agg:
show_agg_keys(castNode(AggState, planstate), ancestors, es);
show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+ if (es->analyze)
+ show_hashagg_info((AggState *) planstate, es);
if (plan->qual)
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
@@ -2715,6 +2718,56 @@ show_hash_info(HashState *hashstate, ExplainState *es)
}
}
+/*
+ * If EXPLAIN ANALYZE, show information on hash aggregate memory usage and
+ * batches.
+ */
+static void
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+ long diskKb = (aggstate->hash_disk_used + 1023) / 1024;
+
+
+ Assert(IsA(aggstate, AggState));
+
+ if (agg->aggstrategy != AGG_HASHED &&
+ agg->aggstrategy != AGG_MIXED)
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(
+ es->str,
+ "Memory Usage: %ldkB",
+ memPeakKb);
+
+ if (aggstate->hash_batches_used > 0)
+ {
+ appendStringInfo(
+ es->str,
+ " Batches: %d Disk Usage:%ldkB",
+ aggstate->hash_batches_used, diskKb);
+ }
+
+ appendStringInfo(
+ es->str,
+ "\n");
+ }
+ else
+ {
+ ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
+ if (aggstate->hash_batches_used > 0)
+ {
+ ExplainPropertyInteger("HashAgg Batches", NULL,
+ aggstate->hash_batches_used, es);
+ ExplainPropertyInteger("Disk Usage", "kB", diskKb, es);
+ }
+ }
+}
+
/*
* If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
*/
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 7e486449eca..b6d80ebe14c 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -79,7 +79,8 @@ static void ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash);
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled);
/*
@@ -2927,7 +2928,7 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
*/
ExprState *
ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
- bool doSort, bool doHash)
+ bool doSort, bool doHash, bool spilled)
{
ExprState *state = makeNode(ExprState);
PlanState *parent = &aggstate->ss.ps;
@@ -3160,7 +3161,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (setno = 0; setno < processGroupingSets; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, false);
+ pertrans, transno, setno, setoff, false,
+ spilled);
setoff++;
}
}
@@ -3178,7 +3180,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (setno = 0; setno < numHashes; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, true);
+ pertrans, transno, setno, setoff, true,
+ spilled);
setoff++;
}
}
@@ -3226,7 +3229,8 @@ static void
ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash)
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled)
{
int adjust_init_jumpnull = -1;
int adjust_strict_jumpnull = -1;
@@ -3248,7 +3252,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
fcinfo->flinfo->fn_strict &&
pertrans->initValueIsNull)
{
- scratch->opcode = EEOP_AGG_INIT_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_INIT_TRANS_SPILLED : EEOP_AGG_INIT_TRANS;
scratch->d.agg_init_trans.aggstate = aggstate;
scratch->d.agg_init_trans.pertrans = pertrans;
scratch->d.agg_init_trans.setno = setno;
@@ -3265,7 +3270,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
if (pertrans->numSortCols == 0 &&
fcinfo->flinfo->fn_strict)
{
- scratch->opcode = EEOP_AGG_STRICT_TRANS_CHECK;
+ scratch->opcode = spilled ?
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED : EEOP_AGG_STRICT_TRANS_CHECK;
scratch->d.agg_strict_trans_check.aggstate = aggstate;
scratch->d.agg_strict_trans_check.setno = setno;
scratch->d.agg_strict_trans_check.setoff = setoff;
@@ -3283,9 +3289,11 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
/* invoke appropriate transition implementation */
if (pertrans->numSortCols == 0 && pertrans->transtypeByVal)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS_BYVAL;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED : EEOP_AGG_PLAIN_TRANS_BYVAL;
else if (pertrans->numSortCols == 0)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_SPILLED : EEOP_AGG_PLAIN_TRANS;
else if (pertrans->numInputs == 1)
scratch->opcode = EEOP_AGG_ORDERED_TRANS_DATUM;
else
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index dbed5978162..49fbf8e4a42 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -430,9 +430,13 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
&&CASE_EEOP_AGG_INIT_TRANS,
+ &&CASE_EEOP_AGG_INIT_TRANS_SPILLED,
&&CASE_EEOP_AGG_STRICT_TRANS_CHECK,
+ &&CASE_EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_SPILLED,
&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
&&CASE_EEOP_LAST
@@ -1625,6 +1629,36 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ aggstate = op->d.agg_init_trans.aggstate;
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_init_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
+
+ /* If transValue has not yet been initialized, do so now. */
+ if (pergroup->noTransValue)
+ {
+ AggStatePerTrans pertrans = op->d.agg_init_trans.pertrans;
+
+ aggstate->curaggcontext = op->d.agg_init_trans.aggcontext;
+ aggstate->current_set = op->d.agg_init_trans.setno;
+
+ ExecAggInitGroup(aggstate, pertrans, pergroup);
+
+ /* copied trans value from input, done this round */
+ EEO_JUMP(op->d.agg_init_trans.jumpnull);
+ }
+
+ EEO_NEXT();
+ }
/* check that a strict aggregate's input isn't NULL */
EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK)
@@ -1642,6 +1676,25 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ aggstate = op->d.agg_strict_trans_check.aggstate;
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_strict_trans_check.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_strict_trans_check.transno];
+
+ if (unlikely(pergroup->transValueIsNull))
+ EEO_JUMP(op->d.agg_strict_trans_check.jumpnull);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1691,6 +1744,52 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ aggstate = op->d.agg_trans.aggstate;
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ Assert(pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1756,6 +1855,67 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ aggstate = op->d.agg_trans.aggstate;
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ Assert(!pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
+ /*
+ * For pass-by-ref datatype, must copy the new value into
+ * aggcontext and free the prior transValue. But if transfn
+ * returned a pointer to its first input, we don't need to do
+ * anything. Also, if transfn returned a pointer to a R/W
+ * expanded object that is already a child of the aggcontext,
+ * assume we can adopt that value without copying it.
+ */
+ if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+ newVal = ExecAggTransReparent(aggstate, pertrans,
+ newVal, fcinfo->isnull,
+ pergroup->transValue,
+ pergroup->transValueIsNull);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
/* process single-column ordered aggregate datum */
EEO_CASE(EEOP_AGG_ORDERED_TRANS_DATUM)
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 7bc5e405bcc..d92037e8b4e 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -25,8 +25,9 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
-static uint32 TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple);
static int TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
+static TupleHashEntry LookupTupleHashEntry_internal(
+ TupleHashTable hashtable, TupleTableSlot *slot, bool *isnew, uint32 hash);
/*
* Define parameters for tuple hash table code generation. The interface is
@@ -284,6 +285,17 @@ ResetTupleHashTable(TupleHashTable hashtable)
tuplehash_reset(hashtable->hashtab);
}
+/*
+ * Destroy the hash table. Note that the tablecxt passed to
+ * BuildTupleHashTableExt() should also be reset, otherwise there will be
+ * leaks.
+ */
+void
+DestroyTupleHashTable(TupleHashTable hashtable)
+{
+ tuplehash_destroy(hashtable->hashtab);
+}
+
/*
* Find or create a hashtable entry for the tuple group containing the
* given tuple. The tuple must be the same type as the hashtable entries.
@@ -300,10 +312,9 @@ TupleHashEntry
LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
bool *isnew)
{
- TupleHashEntryData *entry;
- MemoryContext oldContext;
- bool found;
- MinimalTuple key;
+ TupleHashEntry entry;
+ MemoryContext oldContext;
+ uint32 hash;
/* Need to run the hash functions in short-lived context */
oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
@@ -313,32 +324,29 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
hashtable->cur_eq_func = hashtable->tab_eq_func;
- key = NULL; /* flag to reference inputslot */
+ hash = TupleHashTableHash(hashtable->hashtab, NULL);
+ entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
- if (isnew)
- {
- entry = tuplehash_insert(hashtable->hashtab, key, &found);
+ MemoryContextSwitchTo(oldContext);
- if (found)
- {
- /* found pre-existing entry */
- *isnew = false;
- }
- else
- {
- /* created new entry */
- *isnew = true;
- /* zero caller data */
- entry->additional = NULL;
- MemoryContextSwitchTo(hashtable->tablecxt);
- /* Copy the first tuple into the table context */
- entry->firstTuple = ExecCopySlotMinimalTuple(slot);
- }
- }
- else
- {
- entry = tuplehash_lookup(hashtable->hashtab, key);
- }
+ return entry;
+}
+
+/*
+ * A variant of LookupTupleHashEntry for callers that have already computed
+ * the hash value.
+ */
+TupleHashEntry
+LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot,
+ bool *isnew, uint32 hash)
+{
+ TupleHashEntry entry;
+ MemoryContext oldContext;
+
+ /* Need to run the hash functions in short-lived context */
+ oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
+
+ entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
MemoryContextSwitchTo(oldContext);
@@ -382,17 +390,12 @@ FindTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
/*
* Compute the hash value for a tuple
*
- * The passed-in key is a pointer to TupleHashEntryData. In an actual hash
- * table entry, the firstTuple field points to a tuple (in MinimalTuple
- * format). LookupTupleHashEntry sets up a dummy TupleHashEntryData with a
- * NULL firstTuple field --- that cues us to look at the inputslot instead.
- * This convention avoids the need to materialize virtual input tuples unless
- * they actually need to get copied into the table.
+ * If tuple is NULL, use the input slot instead.
*
* Also, the caller must select an appropriate memory context for running
* the hash functions. (dynahash.c doesn't change CurrentMemoryContext.)
*/
-static uint32
+uint32
TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
{
TupleHashTable hashtable = (TupleHashTable) tb->private_data;
@@ -413,9 +416,6 @@ TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
{
/*
* Process a tuple already stored in the table.
- *
- * (this case never actually occurs due to the way simplehash.h is
- * used, as the hash-value is stored in the entries)
*/
slot = hashtable->tableslot;
ExecStoreMinimalTuple(tuple, slot, false);
@@ -453,6 +453,54 @@ TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
return murmurhash32(hashkey);
}
+/*
+ * Does the work of LookupTupleHashEntry and LookupTupleHashEntryHash. Useful
+ * so that we can avoid switching the memory context multiple times for
+ * LookupTupleHashEntry.
+ */
+static TupleHashEntry
+LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot,
+ bool *isnew, uint32 hash)
+{
+ TupleHashEntryData *entry;
+ bool found;
+ MinimalTuple key;
+
+ /* set up data needed by hash and match functions */
+ hashtable->inputslot = slot;
+ hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ hashtable->cur_eq_func = hashtable->tab_eq_func;
+
+ key = NULL; /* flag to reference inputslot */
+
+ if (isnew)
+ {
+ entry = tuplehash_insert_hash(hashtable->hashtab, key, hash, &found);
+
+ if (found)
+ {
+ /* found pre-existing entry */
+ *isnew = false;
+ }
+ else
+ {
+ /* created new entry */
+ *isnew = true;
+ /* zero caller data */
+ entry->additional = NULL;
+ MemoryContextSwitchTo(hashtable->tablecxt);
+ /* Copy the first tuple into the table context */
+ entry->firstTuple = ExecCopySlotMinimalTuple(slot);
+ }
+ }
+ else
+ {
+ entry = tuplehash_lookup_hash(hashtable->hashtab, key, hash);
+ }
+
+ return entry;
+}
+
/*
* See whether two tuples (presumably of the same hash value) match
*
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 6ee24eab3d2..f509c8e8f55 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -194,6 +194,18 @@
* transition values. hashcontext is the single context created to support
* all hash tables.
*
+ * When the hash table memory exceeds work_mem, we advance the transition
+ * states only for groups already in the hash table. For tuples that would
+ * need to create new hash table entries (and initialize new transition
+ * states), we spill them to disk to be processed later. The tuples are
+ * spilled in a partitioned manner, so that subsequent batches are smaller
+ * and less likely to exceed work_mem (if a batch does exceed work_mem, it
+ * must be spilled recursively).
+ *
+ * Note that it's possible for transition states to start small but then
+ * grow very large; for instance in the case of ARRAY_AGG. In such cases,
+ * it's still possible to significantly exceed work_mem.
+ *
* Transition / Combine function invocation:
*
* For performance reasons transition functions, including combine
@@ -229,15 +241,70 @@
#include "optimizer/optimizer.h"
#include "parser/parse_agg.h"
#include "parser/parse_coerce.h"
+#include "storage/buffile.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/datum.h"
+#include "utils/dynahash.h"
#include "utils/expandeddatum.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+/*
+ * Control how many partitions are created when spilling HashAgg to
+ * disk.
+ *
+ * HASH_PARTITION_FACTOR is multiplied by the estimated number of partitions
+ * needed such that each partition will fit in memory. The factor is set
+ * higher than one because there's not a high cost to having a few too many
+ * partitions, and it makes it less likely that a partition will need to be
+ * spilled recursively. Another benefit of having more, smaller partitions is
+ * that small hash tables may perform better than large ones due to memory
+ * caching effects.
+ *
+ * HASH_PARTITION_MEM is the approximate amount of work_mem we should reserve
+ * for the partitions themselves (i.e. buffering of the files backing the
+ * partitions). This is sloppy, because we must reserve the memory before
+ * filling the hash table; but we choose the number of partitions at the time
+ * we need to spill.
+ *
+ * We also specify a min and max number of partitions per spill. Too few might
+ * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
+ * many will result in lots of memory wasted buffering the spill files (and
+ * possibly pushing hidden costs to the OS for managing more files).
+ */
+#define HASH_PARTITION_FACTOR 1.50
+#define HASH_MIN_PARTITIONS 4
+#define HASH_MAX_PARTITIONS 256
+#define HASH_PARTITION_MEM (HASH_MIN_PARTITIONS * BLCKSZ)
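+/* e.g. with the default 8kB BLCKSZ, HASH_PARTITION_MEM works out to 32kB */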
+
+/*
+ * Represents partitioned spill data for a single hashtable.
+ */
+typedef struct HashAggSpill
+{
+ int n_partitions; /* number of output partitions */
+ int partition_bits; /* number of bits in the partition mask;
+ equal to log2(n_partitions), not counting
+ any parent partition's bits */
+ BufFile **partitions; /* output partition files */
+ int64 *ntuples; /* number of tuples in each partition */
+} HashAggSpill;
+
+/*
+ * Represents work to be done for one pass of hash aggregation. Initially,
+ * only the input fields are set. If spilled to disk, also set the spill data.
+ */
+typedef struct HashAggBatch
+{
+ BufFile *input_file; /* input partition */
+ int input_bits; /* number of bits for input partition mask */
+ int64 input_groups; /* estimated number of input groups */
+ int setno; /* grouping set */
+ HashAggSpill spill; /* spill output */
+} HashAggBatch;
+
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
static void initialize_phase(AggState *aggstate, int newphase);
static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -271,12 +338,35 @@ static void finalize_aggregates(AggState *aggstate,
static TupleTableSlot *project_aggregates(AggState *aggstate);
static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
-static void build_hash_table(AggState *aggstate);
-static TupleHashEntryData *lookup_hash_entry(AggState *aggstate);
+static void build_hash_table(AggState *aggstate, int setno,
+ int64 ngroups_estimate);
+static void prepare_hash_slot(AggState *aggstate);
+static void hash_recompile_expressions(AggState *aggstate);
+static uint32 calculate_hash(AggState *aggstate);
+static long hash_choose_num_buckets(AggState *aggstate,
+ long estimated_nbuckets,
+ Size memory);
+static int hash_choose_num_spill_partitions(uint64 input_tuples,
+ double hashentrysize);
+static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
+static bool agg_refill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
+static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
+static void hash_spill_init(HashAggSpill *spill, int input_bits,
+ uint64 input_tuples, double hashentrysize);
+static Size hash_spill_tuple(HashAggSpill *spill, int input_bits,
+ TupleTableSlot *slot, uint32 hash);
+static MinimalTuple hash_read_spilled(BufFile *file, uint32 *hashp);
+static HashAggBatch *hash_batch_new(BufFile *input_file, int setno,
+ int64 input_groups, int input_bits);
+static void hash_finish_initial_spills(AggState *aggstate);
+static void hash_spill_finish(AggState *aggstate, HashAggSpill *spill,
+ int setno, int input_bits);
+static void hash_reset_spill(HashAggSpill *spill);
+static void hash_reset_spills(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
AggState *aggstate, EState *estate,
@@ -1254,18 +1344,20 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
* for each entry.
*
* We have a separate hashtable and associated perhash data structure for each
- * grouping set for which we're doing hashing.
+ * grouping set for which we're doing hashing. If setno is -1, build hash
+ * tables for all grouping sets. Otherwise, build only for the specified
+ * grouping set.
*
* The contents of the hash tables always live in the hashcontext's per-tuple
* memory context (there is only one of these for all tables together, since
* they are all reset at the same time).
*/
static void
-build_hash_table(AggState *aggstate)
+build_hash_table(AggState *aggstate, int setno, long ngroups_estimate)
{
- MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
- Size additionalsize;
- int i;
+ MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
+ Size additionalsize;
+ int i;
Assert(aggstate->aggstrategy == AGG_HASHED || aggstate->aggstrategy == AGG_MIXED);
@@ -1274,26 +1366,71 @@ build_hash_table(AggState *aggstate)
for (i = 0; i < aggstate->num_hashes; ++i)
{
AggStatePerHash perhash = &aggstate->perhash[i];
+ int64 ngroups;
+ long nbuckets;
+ Size memory;
Assert(perhash->aggnode->numGroups > 0);
if (perhash->hashtable)
- ResetTupleHashTable(perhash->hashtable);
- else
- perhash->hashtable = BuildTupleHashTableExt(&aggstate->ss.ps,
- perhash->hashslot->tts_tupleDescriptor,
- perhash->numCols,
- perhash->hashGrpColIdxHash,
- perhash->eqfuncoids,
- perhash->hashfunctions,
- perhash->aggnode->grpCollations,
- perhash->aggnode->numGroups,
- additionalsize,
- aggstate->ss.ps.state->es_query_cxt,
- aggstate->hashcontext->ecxt_per_tuple_memory,
- tmpmem,
- DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
+ DestroyTupleHashTable(perhash->hashtable);
+ perhash->hashtable = NULL;
+
+ /*
+ * If we are building a hash table for only a single grouping set,
+ * skip the others.
+ */
+ if (setno >= 0 && setno != i)
+ continue;
+
+ /*
+ * Use an estimate from execution time if we have it; otherwise fall
+ * back to the planner estimate.
+ */
+ ngroups = ngroups_estimate > 0 ?
+ ngroups_estimate : perhash->aggnode->numGroups;
+
+ /* divide memory by the number of hash tables we are initializing */
+ memory = (long)work_mem * 1024L /
+ (setno >= 0 ? 1 : aggstate->num_hashes);
+
+ /* choose reasonable number of buckets per hashtable */
+ nbuckets = hash_choose_num_buckets(aggstate, ngroups, memory);
+
+ perhash->hashtable = BuildTupleHashTableExt(&aggstate->ss.ps,
+ perhash->hashslot->tts_tupleDescriptor,
+ perhash->numCols,
+ perhash->hashGrpColIdxHash,
+ perhash->eqfuncoids,
+ perhash->hashfunctions,
+ perhash->aggnode->grpCollations,
+ nbuckets,
+ additionalsize,
+ aggstate->ss.ps.state->es_query_cxt,
+ aggstate->hashcontext->ecxt_per_tuple_memory,
+ tmpmem,
+ DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
}
+
+ aggstate->hash_mem_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_ngroups_current = 0;
+
+ /*
+ * Initialize the threshold at which we stop creating new hash entries and
+ * start spilling. If an empty hash table exceeds the limit, increase the
+ * limit to be the size of the empty hash table. This ensures that at
+ * least one entry can be added so that the algorithm can make progress.
+ */
+ if (hashagg_mem_overflow)
+ aggstate->hash_mem_limit = SIZE_MAX;
+ else if (work_mem * 1024L > HASH_PARTITION_MEM * 2)
+ aggstate->hash_mem_limit = (work_mem * 1024L) - HASH_PARTITION_MEM;
+ else
+ aggstate->hash_mem_limit = (work_mem * 1024L);
+
+ if (aggstate->hash_mem_current > aggstate->hash_mem_limit)
+ aggstate->hash_mem_limit = aggstate->hash_mem_current;
}
/*
@@ -1455,22 +1592,16 @@ hash_agg_entry_size(int numAggs)
}
/*
- * Find or create a hashtable entry for the tuple group containing the current
- * tuple (already set in tmpcontext's outertuple slot), in the current grouping
- * set (which the caller must have selected - note that initialize_aggregate
- * depends on this).
- *
- * When called, CurrentMemoryContext should be the per-query context.
+ * Extract the attributes that make up the grouping key into the
+ * hashslot. This is necessary to compute the hash of the grouping key.
*/
-static TupleHashEntryData *
-lookup_hash_entry(AggState *aggstate)
+static void
+prepare_hash_slot(AggState *aggstate)
{
- TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
- TupleTableSlot *hashslot = perhash->hashslot;
- TupleHashEntryData *entry;
- bool isnew;
- int i;
+ TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ int i;
/* transfer just the needed columns into hashslot */
slot_getsomeattrs(inputslot, perhash->largestGrpColIdx);
@@ -1484,14 +1615,169 @@ lookup_hash_entry(AggState *aggstate)
hashslot->tts_isnull[i] = inputslot->tts_isnull[varNumber];
}
ExecStoreVirtualTuple(hashslot);
+}
+
+/*
+ * Recompile the expressions for advancing aggregates while hashing. This is
+ * necessary for certain kinds of state changes that affect the resulting
+ * expression. For instance, changing aggstate->hash_spilled or
+ * aggstate->ss.ps.outerops requires recompilation.
+ */
+static void
+hash_recompile_expressions(AggState *aggstate)
+{
+ AggStatePerPhase phase;
+
+ Assert(aggstate->aggstrategy == AGG_HASHED ||
+ aggstate->aggstrategy == AGG_MIXED);
+
+ if (aggstate->aggstrategy == AGG_HASHED)
+ phase = &aggstate->phases[0];
+ else /* AGG_MIXED */
+ phase = &aggstate->phases[1];
+
+ phase->evaltrans = ExecBuildAggTrans(
+ aggstate, phase,
+ aggstate->aggstrategy == AGG_MIXED ? true : false, /* dosort */
+ true, /* dohash */
+ aggstate->hash_spilled /* spilled */);
+}
+
+/*
+ * Calculate the hash value for a tuple. It's useful to do this outside of the
+ * hash table code so that we can reuse saved hash values rather than
+ * recomputing them.
+ */
+static uint32
+calculate_hash(AggState *aggstate)
+{
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleHashTable hashtable = perhash->hashtable;
+ MemoryContext oldContext;
+ uint32 hash;
+
+ /* set up data needed by hash and match functions */
+ hashtable->inputslot = perhash->hashslot;
+ hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ hashtable->cur_eq_func = hashtable->tab_eq_func;
+
+ /* Need to run the hash functions in short-lived context */
+ oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
+
+ hash = TupleHashTableHash(hashtable->hashtab, NULL);
+
+ MemoryContextSwitchTo(oldContext);
+
+ return hash;
+}
+
+/*
+ * Choose a reasonable number of buckets for the initial hash table size.
+ */
+static long
+hash_choose_num_buckets(AggState *aggstate, long ngroups, Size memory)
+{
+ long max_nbuckets;
+ int log2_ngroups;
+ long nbuckets;
+
+ if (aggstate->hashentrysize == 0.0)
+ aggstate->hashentrysize = hash_agg_entry_size(aggstate->numtrans);
+
+ max_nbuckets = memory / aggstate->hashentrysize;
+
+ /*
+ * Find the lowest power of two that is at least ngroups, without
+ * exceeding max_nbuckets.
+ */
+ for (log2_ngroups = 1, nbuckets = 2;
+ nbuckets < ngroups && nbuckets < max_nbuckets;
+ log2_ngroups++, nbuckets <<= 1);
+
+ if (nbuckets > max_nbuckets && nbuckets > 2)
+ nbuckets >>= 1;
+
+ return nbuckets;
+}
+
+/*
+ * Determine the number of partitions to create when spilling.
+ */
+static int
+hash_choose_num_spill_partitions(uint64 input_tuples, double hashentrysize)
+{
+ Size mem_needed;
+ int partition_limit;
+ int npartitions;
+
+ /*
+ * Avoid creating so many partitions that the memory requirements of the
+ * open partition files (estimated at BLCKSZ for buffering) are greater
+ * than 1/4 of work_mem.
+ */
+ partition_limit = (work_mem * 1024L * 0.25) / BLCKSZ;
+
+ /* pessimistically estimate that each input tuple creates a new group */
+ mem_needed = HASH_PARTITION_FACTOR * input_tuples * hashentrysize;
+
+ /* make enough partitions so that each one is likely to fit in memory */
+ npartitions = 1 + (mem_needed / (work_mem * 1024L));
+
+ if (npartitions > partition_limit)
+ npartitions = partition_limit;
+
+ if (npartitions < HASH_MIN_PARTITIONS)
+ npartitions = HASH_MIN_PARTITIONS;
+ if (npartitions > HASH_MAX_PARTITIONS)
+ npartitions = HASH_MAX_PARTITIONS;
+
+ return npartitions;
+}
+
+/*
+ * Find or create a hashtable entry for the tuple group containing the current
+ * tuple (already set in tmpcontext's outertuple slot), in the current grouping
+ * set (which the caller must have selected - note that initialize_aggregate
+ * depends on this).
+ *
+ * When called, CurrentMemoryContext should be the per-query context.
+ *
+ * If the hash table is at the memory limit, then only find existing hashtable
+ * entries; don't create new ones. If a tuple's group is not already present
+ * in the hash table for the current grouping set, return NULL and the caller
+ * will spill it to disk.
+ */
+static AggStatePerGroup
+lookup_hash_entry(AggState *aggstate, uint32 hash)
+{
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ TupleHashEntryData *entry;
+ bool isnew = false;
+ bool *p_isnew;
+
+ /* if hash table memory limit is exceeded, don't create new entries */
+ p_isnew = (aggstate->hash_mem_current > aggstate->hash_mem_limit) ?
+ NULL : &isnew;
/* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntry(perhash->hashtable, hashslot, &isnew);
+ entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, p_isnew,
+ hash);
+
+ if (entry == NULL)
+ return NULL;
if (isnew)
{
- AggStatePerGroup pergroup;
- int transno;
+ AggStatePerGroup pergroup;
+ int transno;
+
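+ /* update the memory accounting used to decide when to start spilling */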
+ aggstate->hash_ngroups_current++;
+
+ aggstate->hash_mem_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+
+ if (aggstate->hash_mem_current > aggstate->hash_mem_peak)
+ aggstate->hash_mem_peak = aggstate->hash_mem_current;
pergroup = (AggStatePerGroup)
MemoryContextAlloc(perhash->hashtable->tablecxt,
@@ -1511,7 +1797,7 @@ lookup_hash_entry(AggState *aggstate)
}
}
- return entry;
+ return entry->additional;
}
/*
@@ -1519,18 +1805,64 @@ lookup_hash_entry(AggState *aggstate)
* returning an array of pergroup pointers suitable for advance_aggregates.
*
* Be aware that lookup_hash_entry can reset the tmpcontext.
+ *
+ * Some entries may be left NULL if we are at the memory limit. The same tuple
+ * will belong to a different group for each set, so it may match a group
+ * already in memory for one set but not for another. If we are at the memory
+ * limit and a tuple's group is not already in memory for a particular set,
+ * the tuple is spilled for that set.
+ *
+ * NB: It's possible to spill the same tuple for several different grouping
+ * sets. This may seem wasteful, but it's actually a trade-off: if we spill
+ * the tuple multiple times for multiple grouping sets, it can be partitioned
+ * for each grouping set, making the refilling of the hash table very
+ * efficient.
*/
static void
lookup_hash_entries(AggState *aggstate)
{
- int numHashes = aggstate->num_hashes;
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
int setno;
- for (setno = 0; setno < numHashes; setno++)
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
{
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ uint32 hash;
+
select_current_set(aggstate, setno, true);
- pergroup[setno] = lookup_hash_entry(aggstate)->additional;
+ prepare_hash_slot(aggstate);
+ hash = calculate_hash(aggstate);
+ pergroup[setno] = lookup_hash_entry(aggstate, hash);
+
+ /* check to see if we need to spill the tuple for this grouping set */
+ if (pergroup[setno] == NULL)
+ {
+ HashAggSpill *spill;
+ TupleTableSlot *slot = aggstate->tmpcontext->ecxt_outertuple;
+
+ /* update hashentrysize estimate based on contents */
+ Assert(aggstate->hash_ngroups_current > 0);
+ aggstate->hashentrysize = (double)aggstate->hash_mem_current /
+ (double)aggstate->hash_ngroups_current;
+
+ if (aggstate->hash_spills == NULL)
+ aggstate->hash_spills = palloc0(
+ sizeof(HashAggSpill) * aggstate->num_hashes);
+
+ if (!aggstate->hash_spilled)
+ {
+ aggstate->hash_spilled = true;
+ hash_recompile_expressions(aggstate);
+ }
+
+ spill = &aggstate->hash_spills[setno];
+
+ if (spill->partitions == NULL)
+ hash_spill_init(spill, 0, perhash->aggnode->numGroups,
+ aggstate->hashentrysize);
+
+ aggstate->hash_disk_used += hash_spill_tuple(spill, 0, slot, hash);
+ }
}
}
@@ -1853,6 +2185,12 @@ agg_retrieve_direct(AggState *aggstate)
if (TupIsNull(outerslot))
{
/* no more outer-plan tuples available */
+
+ /* if we built hash tables, finalize any spills */
+ if (aggstate->aggstrategy == AGG_MIXED &&
+ aggstate->current_phase == 1)
+ hash_finish_initial_spills(aggstate);
+
if (hasGroupingSets)
{
aggstate->input_done = true;
@@ -1955,6 +2293,9 @@ agg_fill_hash_table(AggState *aggstate)
ResetExprContext(aggstate->tmpcontext);
}
+ /* finalize spills, if any */
+ hash_finish_initial_spills(aggstate);
+
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
@@ -1962,11 +2303,161 @@ agg_fill_hash_table(AggState *aggstate)
&aggstate->perhash[0].hashiter);
}
+/*
+ * If any data was spilled during hash aggregation, reset the hash table and
+ * reprocess one batch of spilled data. After reprocessing a batch, the hash
+ * table will again contain data, ready to be consumed by
+ * agg_retrieve_hash_table_in_memory().
+ *
+ * Should only be called after all in-memory hash table entries have been
+ * consumed.
+ *
+ * Return false when input is exhausted and there's no more work to be done;
+ * otherwise return true.
+ */
+static bool
+agg_refill_hash_table(AggState *aggstate)
+{
+ HashAggBatch *batch;
+
+ if (aggstate->hash_batches == NIL)
+ return false;
+
+ /*
+ * Each spill file contains spilled data for only a single grouping
+ * set. We want to ignore all others, which is done by setting the other
+ * pergroups to NULL.
+ */
+ memset(aggstate->all_pergroups, 0,
+ sizeof(AggStatePerGroup) *
+ (aggstate->maxsets + aggstate->num_hashes));
+
+ batch = linitial(aggstate->hash_batches);
+ aggstate->hash_batches = list_delete_first(aggstate->hash_batches);
+
+ /*
+ * Free memory and rebuild a single hash table for this batch's grouping
+ * set.
+ */
+ ReScanExprContext(aggstate->hashcontext);
+ build_hash_table(aggstate, batch->setno, batch->input_groups);
+
+ Assert(aggstate->current_phase == 0);
+
+ if (aggstate->phase->aggstrategy == AGG_MIXED)
+ {
+ aggstate->current_phase = 1;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+ }
+
+ /*
+ * The first pass (agg_fill_hash_table) reads whatever kind of slot comes
+ * from the outer plan, and considers the slot fixed. But spilled tuples
+ * are always MinimalTuples, so if that's different from the outer plan we
+ * need to change it and recompile the aggregate expressions.
+ */
+ if (aggstate->ss.ps.outerops != &TTSOpsMinimalTuple)
+ {
+ aggstate->ss.ps.outerops = &TTSOpsMinimalTuple;
+ hash_recompile_expressions(aggstate);
+ }
+
+ for (;;)
+ {
+ TupleTableSlot *slot = aggstate->hash_spill_slot;
+ MinimalTuple tuple;
+ uint32 hash;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tuple = hash_read_spilled(batch->input_file, &hash);
+ if (tuple == NULL)
+ break;
+
+ ExecStoreMinimalTuple(tuple, slot, true);
+ aggstate->tmpcontext->ecxt_outertuple = slot;
+
+ select_current_set(aggstate, batch->setno, true);
+ prepare_hash_slot(aggstate);
+ aggstate->hash_pergroup[batch->setno] = lookup_hash_entry(aggstate, hash);
+
+ /* if there's no memory for a new group, spill */
+ if (aggstate->hash_pergroup[batch->setno] == NULL)
+ {
+ /* update hashentrysize estimate based on contents */
+ Assert(aggstate->hash_ngroups_current > 0);
+ aggstate->hashentrysize = (double)aggstate->hash_mem_current /
+ (double)aggstate->hash_ngroups_current;
+
+ if (batch->spill.partitions == NULL)
+ hash_spill_init(&batch->spill, batch->input_bits,
+ batch->input_groups, aggstate->hashentrysize);
+
+ aggstate->hash_disk_used += hash_spill_tuple(
+ &batch->spill, batch->input_bits, slot, hash);
+ }
+
+ /* Advance the aggregates (or combine functions) */
+ advance_aggregates(aggstate);
+
+ /*
+ * Reset per-input-tuple context after each tuple, but note that the
+ * hash lookups do this too
+ */
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ BufFileClose(batch->input_file);
+
+ aggstate->current_phase = 0;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+
+ hash_spill_finish(aggstate, &batch->spill, batch->setno,
+ batch->input_bits);
+
+ pfree(batch);
+
+ /* Initialize to walk the first hash table */
+ select_current_set(aggstate, 0, true);
+ ResetTupleHashIterator(aggstate->perhash[0].hashtable,
+ &aggstate->perhash[0].hashiter);
+
+ return true;
+}
+
/*
* ExecAgg for hashed case: retrieving groups from hash table
+ *
+ * After exhausting in-memory tuples, also try refilling the hash table using
+ * previously-spilled tuples. Only returns NULL after all in-memory and
+ * spilled tuples are exhausted.
*/
static TupleTableSlot *
agg_retrieve_hash_table(AggState *aggstate)
+{
+ TupleTableSlot *result = NULL;
+
+ while (result == NULL)
+ {
+ result = agg_retrieve_hash_table_in_memory(aggstate);
+ if (result == NULL)
+ {
+ if (!agg_refill_hash_table(aggstate))
+ {
+ aggstate->agg_done = true;
+ break;
+ }
+ }
+ }
+
+ return result;
+}
+
+/*
+ * Retrieve the groups from the in-memory hash tables without considering any
+ * spilled tuples.
+ */
+static TupleTableSlot *
+agg_retrieve_hash_table_in_memory(AggState *aggstate)
{
ExprContext *econtext;
AggStatePerAgg peragg;
@@ -1995,7 +2486,7 @@ agg_retrieve_hash_table(AggState *aggstate)
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
- while (!aggstate->agg_done)
+ for (;;)
{
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -2026,8 +2517,6 @@ agg_retrieve_hash_table(AggState *aggstate)
}
else
{
- /* No more hashtables, so done */
- aggstate->agg_done = true;
return NULL;
}
}
@@ -2084,6 +2573,283 @@ agg_retrieve_hash_table(AggState *aggstate)
return NULL;
}
+/*
+ * hash_spill_init
+ *
+ * Called after we have determined that spilling is necessary. Chooses the
+ * number of partitions to create, and initializes them.
+ */
+static void
+hash_spill_init(HashAggSpill *spill, int input_bits, uint64 input_tuples,
+ double hashentrysize)
+{
+ int npartitions;
+ int partition_bits;
+
+ npartitions = hash_choose_num_spill_partitions(input_tuples,
+ hashentrysize);
+ partition_bits = my_log2(npartitions);
+
+ /*
+ * Make sure that we don't exhaust the hash bits.
+ *
+ * TODO: be consistent with hashjoin batching.
+ */
+ if (partition_bits + input_bits >= 32)
+ partition_bits = 32 - input_bits;
+
+ /* number of partitions will be a power of two */
+ npartitions = 1L << partition_bits;
+
+ spill->partition_bits = partition_bits;
+ spill->n_partitions = npartitions;
+ spill->partitions = palloc0(sizeof(BufFile *) * npartitions);
+ spill->ntuples = palloc0(sizeof(int64) * npartitions);
+}
+
+/*
+ * hash_spill_tuple
+ *
+ * Not enough memory to add the tuple as a new entry in the hash table, so
+ * save it for later in the appropriate partition.
+ */
+static Size
+hash_spill_tuple(HashAggSpill *spill, int input_bits, TupleTableSlot *slot,
+ uint32 hash)
+{
+ int partition;
+ MinimalTuple tuple;
+ BufFile *file;
+ int written;
+ int total_written = 0;
+ bool shouldFree;
+
+ Assert(spill->partitions != NULL);
+
+ /* TODO: project only the needed attributes */
+ tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
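+ /*
+ * Choose the partition using the next-most-significant bits of the hash,
+ * after the input_bits already consumed by the parent batch(es).
+ */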
+ if (spill->partition_bits == 0)
+ partition = 0;
+ else
+ partition = (hash << input_bits) >>
+ (32 - spill->partition_bits);
+
+ spill->ntuples[partition]++;
+
+ if (spill->partitions[partition] == NULL)
+ spill->partitions[partition] = BufFileCreateTemp(false);
+ file = spill->partitions[partition];
+
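+ /*
+ * Write the hash value before the tuple so that it doesn't need to be
+ * recomputed when the spilled tuple is read back.
+ */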
+ written = BufFileWrite(file, (void *) &hash, sizeof(uint32));
+ if (written != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+ total_written += written;
+
+ written = BufFileWrite(file, (void *) tuple, tuple->t_len);
+ if (written != tuple->t_len)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+ total_written += written;
+
+ if (shouldFree)
+ pfree(tuple);
+
+ return total_written;
+}
+
+/*
+ * hash_read_spilled
+ *
+ * Read the next tuple from a spill file. Return NULL if there are no more.
+ */
+static MinimalTuple
+hash_read_spilled(BufFile *file, uint32 *hashp)
+{
+ MinimalTuple tuple;
+ uint32 t_len;
+ size_t nread;
+ uint32 hash;
+
+ nread = BufFileRead(file, &hash, sizeof(uint32));
+ if (nread == 0)
+ return NULL;
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+ if (hashp != NULL)
+ *hashp = hash;
+
+ nread = BufFileRead(file, &t_len, sizeof(t_len));
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+
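+ /*
+ * t_len is the first field of a MinimalTuple; having read it to size the
+ * allocation, read the remaining bytes of the tuple directly after it.
+ */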
+ tuple = (MinimalTuple) palloc(t_len);
+ tuple->t_len = t_len;
+
+ nread = BufFileRead(file, (void *)((char *)tuple + sizeof(uint32)),
+ t_len - sizeof(uint32));
+ if (nread != t_len - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+
+ return tuple;
+}
+
+/*
+ * hash_batch_new
+ *
+ * Construct a HashAggBatch item, which represents one iteration of HashAgg to
+ * be done. Should be called in the aggregate's memory context.
+ */
+static HashAggBatch *
+hash_batch_new(BufFile *input_file, int setno, int64 input_groups,
+ int input_bits)
+{
+ HashAggBatch *batch = palloc0(sizeof(HashAggBatch));
+
+ batch->input_file = input_file;
+ batch->input_bits = input_bits;
+ batch->input_groups = input_groups;
+ batch->setno = setno;
+
+ /* batch->spill will be set only after spilling this batch */
+
+ return batch;
+}
+
+/*
+ * hash_finish_initial_spills
+ *
+ * After the initial pass over the input, some tuples may have been spilled
+ * to disk. If so, turn the spilled partitions into new batches that must
+ * later be executed.
+ */
+static void
+hash_finish_initial_spills(AggState *aggstate)
+{
+ int setno;
+
+ if (aggstate->hash_spills == NULL)
+ return;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hash_spill_finish(aggstate, &aggstate->hash_spills[setno], setno, 0);
+
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+}
+
+/*
+ * hash_spill_finish
+ *
+ * Transform the partitions of a spill into new HashAggBatch entries, to be
+ * processed later as separate passes of hash aggregation.
+ */
+static void
+hash_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno, int input_bits)
+{
+ int i;
+
+ if (spill->n_partitions == 0)
+ return; /* didn't spill */
+
+ for (i = 0; i < spill->n_partitions; i++)
+ {
+ BufFile *file = spill->partitions[i];
+ MemoryContext oldContext;
+ HashAggBatch *new_batch;
+ int64 input_ngroups;
+
+ /* partition is empty */
+ if (file == NULL)
+ continue;
+
+ /* rewind file for reading */
+ if (BufFileSeek(file, 0, 0L, SEEK_SET))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rewind HashAgg temporary file: %m")));
+
+ /*
+ * Estimate the number of input groups for this new work item as the
+ * total number of tuples in its input file. Although that's a worst
+ * case, it's not bad here for two reasons: (1) overestimating is
+ * better than underestimating; and (2) we've already scanned the
+ * relation once, so it's likely that we've already finalized many of
+ * the common values.
+ */
+ input_ngroups = spill->ntuples[i];
+
+ oldContext = MemoryContextSwitchTo(aggstate->ss.ps.state->es_query_cxt);
+ new_batch = hash_batch_new(file, setno, input_ngroups,
+ spill->partition_bits + input_bits);
+ aggstate->hash_batches = lappend(aggstate->hash_batches, new_batch);
+ aggstate->hash_batches_used++;
+ MemoryContextSwitchTo(oldContext);
+ }
+
+ pfree(spill->ntuples);
+ pfree(spill->partitions);
+}
+
+/*
+ * Clear a HashAggSpill, free its memory, and close its files.
+ */
+static void
+hash_reset_spill(HashAggSpill *spill)
+{
+ int i;
+ for (i = 0; i < spill->n_partitions; i++)
+ {
+ BufFile *file = spill->partitions[i];
+
+ if (file != NULL)
+ BufFileClose(file);
+ }
+ if (spill->ntuples != NULL)
+ pfree(spill->ntuples);
+ if (spill->partitions != NULL)
+ pfree(spill->partitions);
+}
+
+/*
+ * Find and reset all active HashAggSpills.
+ */
+static void
+hash_reset_spills(AggState *aggstate)
+{
+ ListCell *lc;
+
+ if (aggstate->hash_spills != NULL)
+ {
+ int setno;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hash_reset_spill(&aggstate->hash_spills[setno]);
+
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+ }
+
+ foreach(lc, aggstate->hash_batches)
+ {
+ HashAggBatch *batch = (HashAggBatch*) lfirst(lc);
+ if (batch->input_file != NULL)
+ {
+ BufFileClose(batch->input_file);
+ batch->input_file = NULL;
+ }
+ hash_reset_spill(&batch->spill);
+ pfree(batch);
+ }
+ list_free(aggstate->hash_batches);
+ aggstate->hash_batches = NIL;
+}
+
+
/* -----------------
* ExecInitAgg
*
@@ -2268,6 +3034,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->ss.ps.outeropsfixed = false;
}
+ if (use_hashing)
+ aggstate->hash_spill_slot = ExecInitExtraTupleSlot(estate, scanDesc,
+ &TTSOpsMinimalTuple);
+
/*
* Initialize result type, slot and projection.
*/
@@ -2497,7 +3267,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->hash_pergroup = pergroups;
find_hash_columns(aggstate);
- build_hash_table(aggstate);
+ build_hash_table(aggstate, -1, 0);
aggstate->table_filled = false;
}
@@ -2903,7 +3673,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
else
Assert(false);
- phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash);
+ phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash, false);
}
@@ -3398,6 +4168,8 @@ ExecEndAgg(AggState *node)
if (node->sort_out)
tuplesort_end(node->sort_out);
+ hash_reset_spills(node);
+
for (transno = 0; transno < node->numtrans; transno++)
{
AggStatePerTrans pertrans = &node->pertrans[transno];
@@ -3453,12 +4225,13 @@ ExecReScanAgg(AggState *node)
return;
/*
- * If we do have the hash table, and the subplan does not have any
- * parameter changes, and none of our own parameter changes affect
- * input expressions of the aggregated functions, then we can just
- * rescan the existing hash table; no need to build it again.
+ * If we do have the hash table, and it never spilled, and the subplan
+ * does not have any parameter changes, and none of our own parameter
+ * changes affect input expressions of the aggregated functions, then
+ * we can just rescan the existing hash table; no need to build it
+ * again.
*/
- if (outerPlan->chgParam == NULL &&
+ if (outerPlan->chgParam == NULL && !node->hash_spilled &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
ResetTupleHashIterator(node->perhash[0].hashtable,
@@ -3515,9 +4288,20 @@ ExecReScanAgg(AggState *node)
*/
if (node->aggstrategy == AGG_HASHED || node->aggstrategy == AGG_MIXED)
{
+ hash_reset_spills(node);
+
+ node->hash_spilled = false;
+ node->hash_mem_current = 0;
+ node->hash_ngroups_current = 0;
+
+ /* reset stats */
+ node->hash_mem_peak = 0;
+ node->hash_disk_used = 0;
+ node->hash_batches_used = 0;
+
ReScanExprContext(node->hashcontext);
/* Rebuild an empty hash table */
- build_hash_table(node);
+ build_hash_table(node, -1, 0);
node->table_filled = false;
/* iterator will be reset when the table is filled */
}
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index a9d362100a8..fd29ce5d12c 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -2083,6 +2083,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_INIT_TRANS:
+ case EEOP_AGG_INIT_TRANS_SPILLED:
{
AggState *aggstate;
AggStatePerTrans pertrans;
@@ -2093,6 +2094,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_allpergroupsp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_setoff,
v_transno;
@@ -2120,11 +2122,32 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_init_trans.setoff);
v_transno = l_int32_const(op->d.agg_init_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_notransvalue = l_bb_before_v(
+ opblocks[i + 1], "op.%d.check_notransvalue", i);
+
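+ /*
+ * A NULL pergroup pointer means this tuple's group is not in
+ * memory for this grouping set (it was spilled, or belongs to
+ * another batch's set); skip initializing the transition value.
+ */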
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(
+ b, v_pergroup_allaggs, TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[i + 1],
+ b_check_notransvalue);
+
+ LLVMPositionBuilderAtEnd(b, b_check_notransvalue);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_notransvalue =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_NOTRANSVALUE,
@@ -2181,6 +2204,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_STRICT_TRANS_CHECK:
+ case EEOP_AGG_STRICT_TRANS_CHECK_SPILLED:
{
AggState *aggstate;
LLVMValueRef v_setoff,
@@ -2191,6 +2215,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_transnull;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
int jumpnull = op->d.agg_strict_trans_check.jumpnull;
@@ -2210,11 +2235,32 @@ llvm_compile_expr(ExprState *state)
l_int32_const(op->d.agg_strict_trans_check.setoff);
v_transno =
l_int32_const(op->d.agg_strict_trans_check.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_transnull = l_bb_before_v(
+ opblocks[i + 1], "op.%d.check_transnull", i);
+
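+ /*
+ * A NULL pergroup pointer means this tuple's group is not in
+ * memory for this grouping set; jump past the transition
+ * function call.
+ */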
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[jumpnull],
+ b_check_transnull);
+
+ LLVMPositionBuilderAtEnd(b, b_check_transnull);
+ }
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_transnull =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_TRANSVALUEISNULL,
@@ -2230,7 +2276,9 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_PLAIN_TRANS_BYVAL:
+ case EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED:
case EEOP_AGG_PLAIN_TRANS:
+ case EEOP_AGG_PLAIN_TRANS_SPILLED:
{
AggState *aggstate;
AggStatePerTrans pertrans;
@@ -2256,6 +2304,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_pertransp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_retval;
@@ -2283,10 +2332,33 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_trans.setoff);
v_transno = l_int32_const(op->d.agg_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED ||
+ opcode == EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_advance_transval = l_bb_before_v(
+ opblocks[i + 1], "op.%d.advance_transval", i);
+
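+ /*
+ * A NULL pergroup pointer means this tuple's group is not in
+ * memory for this grouping set; skip advancing the transition
+ * value.
+ */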
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[i + 1],
+ b_advance_transval);
+
+ LLVMPositionBuilderAtEnd(b, b_advance_transval);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_fcinfo = l_ptr_const(fcinfo,
l_ptr(StructFunctionCallInfoData));
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c5f65934859..3f0d2899635 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
bool enable_hashagg = true;
+bool enable_hashagg_spill = true;
bool enable_nestloop = true;
bool enable_material = true;
bool enable_mergejoin = true;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7fe11b59a02..511f8861a8f 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4255,6 +4255,9 @@ consider_groupingsets_paths(PlannerInfo *root,
* gd->rollups is empty if we have only unsortable columns to work
* with. Override work_mem in that case; otherwise, we'll rely on the
* sorted-input case to generate usable mixed paths.
+ *
+ * TODO: think more about how to plan grouping sets when spilling hash
+ * tables is an option
*/
if (hashsize > work_mem * 1024L && gd->rollups)
return; /* nope, won't fit */
@@ -6527,7 +6530,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
* were unable to sort above, then we'd better generate a Path, so
* that we at least have one.
*/
- if (hashaggtablesize < work_mem * 1024L ||
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L ||
grouped_rel->pathlist == NIL)
{
/*
@@ -6560,7 +6564,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
agg_final_costs,
dNumGroups);
- if (hashaggtablesize < work_mem * 1024L)
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6829,7 +6834,7 @@ create_partial_grouping_paths(PlannerInfo *root,
* Tentatively produce a partial HashAgg Path, depending on if it
* looks as if the hash table will fit in work_mem.
*/
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_total_path != NULL)
{
add_path(partially_grouped_rel, (Path *)
@@ -6856,7 +6861,7 @@ create_partial_grouping_paths(PlannerInfo *root,
dNumPartialPartialGroups);
/* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_partial_path != NULL)
{
add_partial_path(partially_grouped_rel, (Path *)
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 3bf96de256d..b0cb1d7e6b2 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -120,6 +120,7 @@ bool enableFsync = true;
bool allowSystemTableMods = false;
int work_mem = 1024;
int maintenance_work_mem = 16384;
+bool hashagg_mem_overflow = false;
int max_parallel_maintenance_workers = 2;
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ba74bf9f7dc..d2b66a7f46a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -957,6 +957,26 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_hashagg_spill", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of hashed aggregation plans that are expected to exceed work_mem."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_hashagg_spill,
+ true,
+ NULL, NULL, NULL
+ },
+ {
+ {"hashagg_mem_overflow", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables hashed aggregation to overflow work_mem at execution time."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &hashagg_mem_overflow,
+ false,
+ NULL, NULL, NULL
+ },
{
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of materialization."),
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index d21dbead0a2..e50a7ad6712 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -226,9 +226,13 @@ typedef enum ExprEvalOp
EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
EEOP_AGG_INIT_TRANS,
+ EEOP_AGG_INIT_TRANS_SPILLED,
EEOP_AGG_STRICT_TRANS_CHECK,
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
EEOP_AGG_PLAIN_TRANS_BYVAL,
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
EEOP_AGG_PLAIN_TRANS,
+ EEOP_AGG_PLAIN_TRANS_SPILLED,
EEOP_AGG_ORDERED_TRANS_DATUM,
EEOP_AGG_ORDERED_TRANS_TUPLE,
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 6298c7c8cad..e8d88f2ce26 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -140,11 +140,17 @@ extern TupleHashTable BuildTupleHashTableExt(PlanState *parent,
extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
bool *isnew);
+extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable,
+ TupleTableSlot *slot,
+ bool *isnew, uint32 hash);
extern TupleHashEntry FindTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
ExprState *eqcomp,
FmgrInfo *hashfunctions);
+extern uint32 TupleHashTableHash(struct tuplehash_hash *tb,
+ const MinimalTuple tuple);
extern void ResetTupleHashTable(TupleHashTable hashtable);
+extern void DestroyTupleHashTable(TupleHashTable hashtable);
/*
* prototypes from functions in execJunk.c
@@ -250,7 +256,7 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
- bool doSort, bool doHash);
+ bool doSort, bool doHash, bool spilled);
extern ExprState *ExecBuildGroupingEqual(TupleDesc ldesc, TupleDesc rdesc,
const TupleTableSlotOps *lops, const TupleTableSlotOps *rops,
int numCols,
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bc6e03fbc7e..321759ead51 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -244,6 +244,7 @@ extern bool enableFsync;
extern PGDLLIMPORT bool allowSystemTableMods;
extern PGDLLIMPORT int work_mem;
extern PGDLLIMPORT int maintenance_work_mem;
+extern PGDLLIMPORT bool hashagg_mem_overflow;
extern PGDLLIMPORT int max_parallel_maintenance_workers;
extern int VacuumCostPageHit;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6eb647290be..b9803a28bd2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2070,13 +2070,27 @@ typedef struct AggState
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
/* these fields are used in AGG_HASHED and AGG_MIXED modes: */
bool table_filled; /* hash table filled yet? */
- int num_hashes;
+ int num_hashes; /* number of hash tables active at once */
+ bool hash_spilled; /* any hash table ever spilled? */
+ double hashentrysize; /* estimate revised during execution */
+ struct HashAggSpill *hash_spills; /* HashAggSpill for each hash table,
+ exists only during first pass if spilled */
+ TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
+ Size hash_mem_limit; /* limit before spilling hash table */
+ Size hash_mem_peak; /* peak hash table memory usage */
+ uint64 hash_ngroups_current; /* number of tuples currently in
+ memory in all hash tables */
+ Size hash_mem_current; /* current hash table memory usage */
+ uint64 hash_disk_used; /* bytes of disk space used */
+ int hash_batches_used; /* batches used during entire execution */
+ List *hash_batches; /* hash batches remaining to be processed */
+
AggStatePerHash perhash; /* array of per-hashtable data */
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 34
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 45
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
@@ -2248,7 +2262,7 @@ typedef struct HashInstrumentation
int nbuckets_original; /* planned number of buckets */
int nbatch; /* number of batches at end of execution */
int nbatch_original; /* planned number of batches */
- size_t space_peak; /* speak memory usage in bytes */
+ size_t space_peak; /* peak memory usage in bytes */
} HashInstrumentation;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b3d0b4f6fbc..b72e2d08290 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -54,6 +54,7 @@ extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
extern PGDLLIMPORT bool enable_hashagg;
+extern PGDLLIMPORT bool enable_hashagg_spill;
extern PGDLLIMPORT bool enable_nestloop;
extern PGDLLIMPORT bool enable_material;
extern PGDLLIMPORT bool enable_mergejoin;
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index 0b097f96520..a9ddcce3d3a 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2331,3 +2331,95 @@ explain (costs off)
-> Seq Scan on onek
(8 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+set work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------------
+ GroupAggregate
+ Group Key: ((g % 100000))
+ -> Sort
+ Sort Key: ((g % 100000))
+ -> Function Scan on generate_series g
+(5 rows)
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+set jit_above_cost to default;
+create table agg_group_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_group_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+-- Produce results with hash aggregation
+set enable_hashagg = true;
+set enable_sort = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 100000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+set jit_above_cost to default;
+create table agg_hash_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_hash_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+set enable_sort = true;
+set work_mem to default;
+-- Compare group aggregation results to hash aggregation results
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index c1f802c88a7..767f60a96c7 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1633,4 +1633,127 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
| 1 | 2
(4 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+-- Produce results with hash aggregation.
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+set enable_sort = true;
+set work_mem to default;
+-- Compare results
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+ g100 | g10 | unnest | c | m
+------+-----+--------+---+---
+(0 rows)
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
-- end
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1de..11c6f50fbfa 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -148,6 +148,68 @@ SELECT count(*) FROM
4
(1 row)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+SET enable_hashagg=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------------
+ Unique
+ -> Sort
+ Sort Key: ((g % 1000))
+ -> Function Scan on generate_series g
+(4 rows)
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_hashagg=TRUE;
+-- Produce results with hash aggregation.
+SET enable_sort=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 1000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_sort=TRUE;
+SET work_mem TO DEFAULT;
+-- Compare results
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb9057..c40bf6c16eb 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -75,6 +75,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
+ enable_hashagg_spill | on
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 17fb256aec5..bcd336c5812 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1017,3 +1017,91 @@ select v||'a', case when v||'a' = 'aa' then 1 else 0 end, count(*)
explain (costs off)
select 1 from tenk1
where (hundred, thousand) in (select twothousand, twothousand from onek);
+
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+set work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+set jit_above_cost to default;
+
+create table agg_group_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_group_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+-- Produce results with hash aggregation
+
+set enable_hashagg = true;
+set enable_sort = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+set jit_above_cost to default;
+
+create table agg_hash_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_hash_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare group aggregation results to hash aggregation results
+
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 95ac3fb52f6..bf8bce6ed31 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -441,4 +441,103 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
from unnest(array[1,1], array['a','b']) u(i,v)
group by rollup(i, v||'a') order by 1,3;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+-- Produce results with hash aggregation.
+
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare results
+
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
+
-- end
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449e..33102744ebf 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -45,6 +45,68 @@ SELECT count(*) FROM
SELECT count(*) FROM
(SELECT DISTINCT two, four, two FROM tenk1) ss;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+SET enable_hashagg=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_hashagg=TRUE;
+
+-- Produce results with hash aggregation.
+
+SET enable_sort=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_sort=TRUE;
+
+SET work_mem TO DEFAULT;
+
+-- Compare results
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
+
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
On Wed, Dec 04, 2019 at 06:55:43PM -0800, Jeff Davis wrote:
Thanks very much for a great review! I've attached a new patch.
Hi,
About the `TODO: project needed attributes only` in your patch, when
would the input tuple contain columns not needed? It seems like anything
you can project has to be in the group or aggregates.
--
Melanie Plageman & Adam
On Wed, 2019-12-04 at 19:50 -0800, Adam Lee wrote:
On Wed, Dec 04, 2019 at 06:55:43PM -0800, Jeff Davis wrote:
Thanks very much for a great review! I've attached a new patch.
Hi,
About the `TODO: project needed attributes only` in your patch, when
would the input tuple contain columns not needed? It seems like
anything
you can project has to be in the group or aggregates.
If you have a table like:
CREATE TABLE foo(i int, j int, x int, y int, z int);
And do:
SELECT i, SUM(j) FROM foo GROUP BY i;
At least from a logical standpoint, you might expect that we project
only the attributes we need from foo before feeding them into the
HashAgg. But that's not quite how postgres works. Instead, it leaves
the tuples intact (which, in this case, means they have 5 attributes)
until after aggregation and lazily fetches whatever attributes are
referenced. Tuples are spilled from the input, at which time they still
have 5 attributes; so naively copying them is wasteful.
I'm not sure how often this laziness is really a win in practice,
especially after the expression evaluation has changed so much in
recent releases. So it might be better to just project all the
attributes eagerly, and then none of this would be a problem. If we
still wanted to be lazy about attribute fetching, that should still be
possible even if we did a kind of "logical" projection of the tuple so
that the useless attributes would not be relevant. Regardless, that's
outside the scope of the patch I'm currently working on.
What I'd like to do is copy just the attributes needed into a new
virtual slot, leave the unneeded ones NULL, and then write it out to
the tuplestore as a MinimalTuple. I just need to be sure to get the
right attributes.
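A rough sketch of that idea (illustrative only -- "needed_cols" stands
in for whatever bitmapset of group-key and aggregated columns we end
up computing; this is not the code in the patch):

    /* slot holds a virtual tuple coming straight from the input node */
    slot_getallattrs(slot);
    for (int att = 0; att < slot->tts_nvalid; att++)
    {
        /* bitmapset members are 1-based attribute numbers */
        if (!bms_is_member(att + 1, needed_cols))
            slot->tts_isnull[att] = true;   /* drop the unneeded value */
    }

    /* NULL attributes carry no data in the resulting MinimalTuple */
    tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);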
Regards,
Jeff Davis
On Wed, 2019-12-04 at 17:24 -0800, Melanie Plageman wrote:
It looks like the parameter input_tuples passed to hash_spill_init()
in lookup_hash_entries() is the number of groups estimated by the
planner.
However, when reloading a spill file, if we run out of memory and
re-spill, hash_spill_init() is passed batch->input_groups (which is
actually set from input_ngroups, i.e. the number of tuples in the
spill file). So input_tuples holds a group count and input_groups
holds a tuple count. It may be helpful to rename these.
You're right; this is confusing. I will clarify this in the next patch.
So, it looks like the HASH_PARTITION_FACTOR is only used when
re-spilling. The initial hashtable will use work_mem.
It seems like the reason for using it when re-spilling is to be very
conservative to avoid more than one re-spill and make sure each spill
file fits in a hashtable in memory.
It's used any time a spill happens, even the first spill. I'm flexible
on the use of HASH_PARTITION_FACTOR though... it seems not everyone
thinks it's a good idea. To me it's just a knob to tune and I tend to
think over-partitioning is the safer bet most of the time.
The comment does seem to point to some other reason, though...
I have observed some anomalies where smaller work_mem values (for
already-low values of work_mem) result in faster runtime. The only
explanation I have is caching effects.
256 actually seems very large. hash_spill_npartitions() will be
called for every respill, so HASH_MAX_PARTITIONS is not the total
number of spill files permitted but rather the number of respill
files in a given spill (a spill set). So if you made X partitions
initially and every partition re-spills, you would now have (at most)
X * 256 partitions.
Right. Though I'm not sure there's any theoretical max... given enough
input tuples, it will just keep getting deeper. If this is a serious
concern, maybe I should make it depth-first recursion by prepending new
work items rather than appending. That would still not bound the
theoretical max, but it would slow the growth.
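For illustration, the difference is just which end of the pending-batch
list a finished spill's batches go to (a sketch, not the patch's exact
code; hash_batches is the List of pending HashAggBatch work items):

    /* breadth-first (appending): process every batch at one level
     * before descending into any respilled sub-batches */
    aggstate->hash_batches = lappend(aggstate->hash_batches, new_batch);

    /* depth-first (prepending): drain a partition's sub-batches before
     * moving on to its siblings, which slows the growth of pending
     * work, though it still doesn't bound it */
    aggstate->hash_batches = lcons(new_batch, aggstate->hash_batches);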
If HASH_MAX_PARTITIONS is 256, wouldn't the metadata from the spill
files take up a lot of memory at that point?
Yes. Each file keeps a BLCKSZ buffer, plus some other metadata. And it
does create a file, so it's offloading some work to the OS to manage
that new file.
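(For a rough sense of scale: with the default 8kB BLCKSZ, 256 open
partition files would pin about 256 x 8kB = 2MB in buffers alone --
half of the default 4MB work_mem.)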
It's annoying to properly account for these costs because the memory
needs to be reserved at the time we are building the hash table, but we
don't know how many partitions we want until it comes time to spill.
And for that matter, we don't even know whether we will need to spill
or not.
There are two alternative approaches which sidestep this problem:
1. Reserve a fixed fraction of work_mem, say, 1/8 to make space for
however many partitions that memory can handle. We would still have a
min and max, but the logic for reserving the space would be easy and so
would choosing the number of partitions to create.
* Pro: simple
* Con: lose the ability to choose the number of partitions
2. Use logtape.c instead (suggestion from Heikki). Supporting more
logical tapes doesn't impose costs on the OS, and we can potentially
use a lot of logical tapes.
* Pro: can use lots of partitions without making lots of files
* Con: buffering still needs to happen somewhere, so we still need
memory for each logical tape. Also, we risk losing locality of read
access when reading the tapes, or perhaps confusing readahead.
Fundamentally, logtape.c was designed for sequential write, random
read; but we are going to do random write and sequential read.
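A minimal sketch of what alternative 1 might look like (names and the
1/8 fraction are illustrative, not code from the patch):

    /* reserve a fixed slice of work_mem for spill-file buffers */
    Size    partition_mem  = (work_mem * 1024L) / 8;
    Size    hash_mem_limit = (work_mem * 1024L) - partition_mem;

    /* at spill time, use however many partitions that slice can buffer */
    int     npartitions = partition_mem / BLCKSZ;

    npartitions = Max(npartitions, HASH_MIN_PARTITIONS);
    npartitions = Min(npartitions, HASH_MAX_PARTITIONS);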
Regards,
Jeff Davis
On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote:
And it's not clear to me why we should remove part of the comment
before
TupleHashTableHash.
It looks like 5dfc1981 changed the signature of TupleHashTableHash
without updating the comment, so it doesn't really make sense any more.
I just updated the comment as a part of my patch, but it's not related.
Andres, comments? Maybe we can just commit a fix for that comment and
take it out of my patch.
Regards,
Jeff Davis
On Thu, Dec 05, 2019 at 12:55:51PM -0800, Jeff Davis wrote:
On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote:
And it's not clear to me why we should remove part of the comment
before
TupleHashTableHash.
It looks like 5dfc1981 changed the signature of TupleHashTableHash
without updating the comment, so it doesn't really make sense any more.
I just updated the comment as a part of my patch, but it's not related.
Andres, comments? Maybe we can just commit a fix for that comment and
take it out of my patch.
+1 to push that as an independent fix
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-12-05 12:55:51 -0800, Jeff Davis wrote:
On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote:
And it's not clear to me why we should remove part of the comment
before
TupleHashTableHash.
It looks like 5dfc1981 changed the signature of TupleHashTableHash
without updating the comment, so it doesn't really make sense any more.
I just updated the comment as a part of my patch, but it's not related.
Andres, comments? Maybe we can just commit a fix for that comment and
take it out of my patch.
Fine with me!
- Andres
On Wed, Dec 04, 2019 at 10:57:51PM -0800, Jeff Davis wrote:
About the `TODO: project needed attributes only` in your patch, when
would the input tuple contain columns not needed? It seems like
anything
you can project has to be in the group or aggregates.
If you have a table like:
CREATE TABLE foo(i int, j int, x int, y int, z int);
And do:
SELECT i, SUM(j) FROM foo GROUP BY i;
At least from a logical standpoint, you might expect that we project
only the attributes we need from foo before feeding them into the
HashAgg. But that's not quite how postgres works. Instead, it leaves
the tuples intact (which, in this case, means they have 5 attributes)
until after aggregation and lazily fetches whatever attributes are
referenced. Tuples are spilled from the input, at which time they still
have 5 attributes; so naively copying them is wasteful.
I'm not sure how often this laziness is really a win in practice,
especially after the expression evaluation has changed so much in
recent releases. So it might be better to just project all the
attributes eagerly, and then none of this would be a problem. If we
still wanted to be lazy about attribute fetching, that should still be
possible even if we did a kind of "logical" projection of the tuple so
that the useless attributes would not be relevant. Regardless, that's
outside the scope of the patch I'm currently working on.
What I'd like to do is copy just the attributes needed into a new
virtual slot, leave the unneeded ones NULL, and then write it out to
the tuplestore as a MinimalTuple. I just need to be sure to get the
right attributes.
Regards,
Jeff Davis
Melanie and I tried this and have a patch that passes installcheck.
The way we verified it is by composing a wide table with long,
unnecessary text columns, then checking the size written on every
iteration.
Please check out the attachment; it's based on your 1204 version.
--
Adam Lee
Attachments:
spill_fewer_cols.patch (text/plain; charset=us-ascii)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index f509c8e8f5..fe4a520305 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1291,6 +1291,68 @@ project_aggregates(AggState *aggstate)
return NULL;
}
+static bool
+find_aggregated_cols_walker(Node *node, Bitmapset **colnos)
+{
+ if (node == NULL)
+ return false;
+
+ if (IsA(node, Var))
+ {
+ Var *var = (Var *) node;
+
+ *colnos = bms_add_member(*colnos, var->varattno);
+
+ return false;
+ }
+ return expression_tree_walker(node, find_aggregated_cols_walker,
+ (void *) colnos);
+}
+
+/*
+ * find_aggregated_cols
+ * Construct a bitmapset of the column numbers of aggregated Vars
+ * appearing in our targetlist and qual (HAVING clause)
+ */
+static Bitmapset *
+find_aggregated_cols(AggState *aggstate)
+{
+ Agg *node = (Agg *) aggstate->ss.ps.plan;
+ Bitmapset *colnos = NULL;
+ ListCell *temp;
+
+ /*
+ * We only want the columns used by aggregations in the targetlist or qual
+ */
+ if (node->plan.targetlist != NULL)
+ {
+ foreach(temp, (List *) node->plan.targetlist)
+ {
+ if (IsA(lfirst(temp), TargetEntry))
+ {
+ Node *node = (Node *)((TargetEntry *)lfirst(temp))->expr;
+ if (IsA(node, Aggref) || IsA(node, GroupingFunc))
+ find_aggregated_cols_walker(node, &colnos);
+ }
+ }
+ }
+
+ if (node->plan.qual != NULL)
+ {
+ foreach(temp, (List *) node->plan.qual)
+ {
+ if (IsA(lfirst(temp), TargetEntry))
+ {
+ Node *node = (Node *)((TargetEntry *)lfirst(temp))->expr;
+ if (IsA(node, Aggref) || IsA(node, GroupingFunc))
+ find_aggregated_cols_walker(node, &colnos);
+ }
+ }
+ }
+
+ return colnos;
+}
+
/*
* find_unaggregated_cols
* Construct a bitmapset of the column numbers of un-aggregated Vars
@@ -1520,6 +1582,23 @@ find_hash_columns(AggState *aggstate)
for (i = 0; i < perhash->numCols; i++)
colnos = bms_add_member(colnos, grpColIdx[i]);
+ /*
+ * Find the columns used by aggregations
+ *
+ * This is shared by the entire aggregation.
+ */
+ if (aggstate->aggregated_columns == NULL)
+ aggstate->aggregated_columns = find_aggregated_cols(aggstate);
+
+ /*
+ * The necessary columns to spill are either group keys or used by
+ * aggregations
+ *
+ * This is the convenient place to calculate the necessary columns to
+ * spill, because the group keys are different per hash.
+ */
+ perhash->necessarySpillCols = bms_union(colnos, aggstate->aggregated_columns);
+
/*
* First build mapping for columns directly hashed. These are the
* first, because they'll be accessed when computing hash values and
@@ -1861,6 +1940,23 @@ lookup_hash_entries(AggState *aggstate)
hash_spill_init(spill, 0, perhash->aggnode->numGroups,
aggstate->hashentrysize);
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ for (int ttsno = 0; ttsno < slot->tts_nvalid; ttsno++)
+ {
+ /*
+ * Null the column out if it's unnecessary; the minimal-tuple
+ * forming functions below will then omit its data.
+ *
+ * The slot must hold a virtual tuple here: this path is only taken
+ * on the first pass, when tuples come from the input node rather
+ * than from spill files.
+ *
+ * Note: ttsno is zero-indexed, column numbers are one-indexed.
+ */
+ if (!bms_is_member(ttsno+1, perhash->necessarySpillCols))
+ slot->tts_isnull[ttsno] = true;
+ }
+
aggstate->hash_disk_used += hash_spill_tuple(spill, 0, slot, hash);
}
}
@@ -2623,7 +2719,15 @@ hash_spill_tuple(HashAggSpill *spill, int input_bits, TupleTableSlot *slot,
Assert(spill->partitions != NULL);
- /*TODO: project needed attributes only */
+ /*
+ * ExecFetchSlotMinimalTuple() uses heap_form_minimal_tuple() if the
+ * slot holds a virtual tuple and tts_minimal_get_minimal_tuple() if
+ * it holds a minimal tuple, which is exactly what we want.
+ *
+ * When we spill tuples from the input, they are virtual tuples with
+ * some columns nulled out; when we re-spill tuples read back from
+ * spill files, they are minimal tuples that were already nulled out.
+ */
tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
if (spill->partition_bits == 0)
@@ -3072,6 +3176,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
*/
aggstate->phases = palloc0(numPhases * sizeof(AggStatePerPhaseData));
+ aggstate->aggregated_columns = NULL;
aggstate->num_hashes = numHashes;
if (numHashes)
{
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 68c9e5f540..3b61109b52 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -302,6 +302,7 @@ typedef struct AggStatePerHashData
AttrNumber *hashGrpColIdxInput; /* hash col indices in input slot */
AttrNumber *hashGrpColIdxHash; /* indices in hash table tuples */
Agg *aggnode; /* original Agg node, for numGroups etc. */
+ Bitmapset *necessarySpillCols; /* the necessary columns if spills */
} AggStatePerHashData;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b9803a28bd..0c034b5f67 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2084,6 +2084,7 @@ typedef struct AggState
uint64 hash_disk_used; /* bytes of disk space used */
int hash_batches_used; /* batches used during entire execution */
List *hash_batches; /* hash batches remaining to be processed */
+ Bitmapset *aggregated_columns; /* the columns used by aggregations */
AggStatePerHash perhash; /* array of per-hashtable data */
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote:
13) As for this:
/* make sure that we don't exhaust the hash bits */
if (partition_bits + input_bits >= 32)
partition_bits = 32 - input_bits;
We already ran into this issue (exhausting bits in a hash value) in
hashjoin batching, we should be careful to use the same approach in
both
places (not the same code, just general approach).
I assume you're talking about ExecHashIncreaseNumBatches(), and in
particular, commit 8442317b. But that's a 10-year-old commit, so
perhaps you're talking about something else?
It looks like that code in HJ is protecting against having a very large
number of batches, such that we can't allocate an array of pointers for
each batch. And it seems like the concern is more related to a planner
error causing such a large nbatch.
I don't quite see the analogous case in HashAgg. npartitions is already
constrained to a maximum of 256. And the batches are individually
allocated, held in a list, not an array.
It could perhaps use some defensive programming to make sure that we
don't run into problems if the max is set very high.
Can you clarify what you're looking for here?
Perhaps I can also add a comment saying that we can have less than
HASH_MIN_PARTITIONS when running out of bits.
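To make the clamp concrete, a worked example (a sketch; my_log2() is
the helper from dynahash, and the numbers are only for illustration):

    partition_bits = my_log2(npartitions);   /* e.g. 8 for 256 partitions */
    if (partition_bits + input_bits >= 32)   /* e.g. input_bits = 31      */
        partition_bits = 32 - input_bits;    /* -> only 1 bit remains     */
    npartitions = 1 << partition_bits;       /* 2, below HASH_MIN_PARTITIONS */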
Regards,
Jeff Davis
On Thu, Dec 12, 2019 at 06:10:50PM -0800, Jeff Davis wrote:
On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote:
13) As for this:
/* make sure that we don't exhaust the hash bits */
if (partition_bits + input_bits >= 32)
partition_bits = 32 - input_bits;
We already ran into this issue (exhausting bits in a hash value) in
hashjoin batching, we should be careful to use the same approach in
both
places (not the same code, just general approach).
I assume you're talking about ExecHashIncreaseNumBatches(), and in
particular, commit 8442317b. But that's a 10-year-old commit, so
perhaps you're talking about something else?
It looks like that code in HJ is protecting against having a very large
number of batches, such that we can't allocate an array of pointers for
each batch. And it seems like the concern is more related to a planner
error causing such a large nbatch.
I don't quite see the analogous case in HashAgg. npartitions is already
constrained to a maximum of 256. And the batches are individually
allocated, held in a list, not an array.
It could perhaps use some defensive programming to make sure that we
don't run into problems if the max is set very high.
Can you clarify what you're looking for here?
I'm talking about this recent discussion on pgsql-bugs:
/messages/by-id/CA+hUKGLyafKXBMFqZCSeYikPbdYURbwr+jP6TAy8sY-8LO0V+Q@mail.gmail.com
I.e. when the number of batches/partitions and buckets is high enough, we
may end up with very few bits in one of the parts.
Perhaps I can also add a comment saying that we can have less than
HASH_MIN_PARTITIONS when running out of bits.
Maybe.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Nov 27, 2019 at 02:58:04PM -0800, Jeff Davis wrote:
On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote:
Right now the patch always initializes 32 spill partitions. Have you
given
any thought into how to intelligently pick an optimal number of
partitions yet?
Attached a new patch that addresses this.
1. Divide hash table memory used by the number of groups in the hash
table to get the average memory used per group.
2. Multiply by the number of groups spilled -- which I pessimistically
estimate as the number of tuples spilled -- to get the total amount of
memory that we'd like to have to process all spilled tuples at once.
3. Divide the desired amount of memory by work_mem to get the number of
partitions we'd like to have such that each partition can be processed
in work_mem without spilling.
4. Apply a few sanity checks, fudge factors, and limits.
Using this runtime information should be substantially better than
using estimates and projections.
Additionally, I removed some branches from the common path. I think I
still have more work to do there.
I also rebased of course, and fixed a few other things.
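For reference, a sketch of roughly what those four steps amount to
(the patch's hash_choose_num_spill_partitions() may differ in detail;
the name and layout here are illustrative):

    static int
    choose_spill_partitions(uint64 spilled_tuples, double mem_per_group)
    {
        /* steps 1-2: memory wanted to hold all spilled groups at once,
         * pessimistically treating every spilled tuple as a new group */
        double  mem_wanted = mem_per_group * spilled_tuples;

        /* step 3: enough partitions that each fits in work_mem */
        int     npartitions = (int) (mem_wanted / (work_mem * 1024L));

        /* step 4: fudge factor and sanity limits */
        npartitions = (int) (npartitions * HASH_PARTITION_FACTOR);
        npartitions = Max(npartitions, HASH_MIN_PARTITIONS);
        npartitions = Min(npartitions, HASH_MAX_PARTITIONS);

        return npartitions;
    }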
I've done a bit more testing on this, after resolving a couple of minor
conflicts due to recent commits (rebased version attached).
In particular, I've made a comparison with different dataset sizes,
group sizes, GUC settings etc. The script and results from two different
machines are available here:
* https://bitbucket.org/tvondra/hashagg-tests/src/master/
The script essentially runs a simple grouping query with different
number of rows, groups, work_mem and parallelism settings. There's
nothing particularly magical about it.
I did run it both on master and patched code, allowing us to compare
results and assess impact of the patch. Overall, the changes are
expected and either neutral or beneficial, i.e. the timings are the same
or faster.
The number of cases that regressed is fairly small, but sometimes the
regressions are annoyingly large - up to 2x in some cases. Consider
this trivial example with 100M rows:
CREATE TABLE t AS
SELECT (100000000 * random())::int AS a
FROM generate_series(1,100000000) s(i);
On the master, the plan with default work_mem (i.e. 4MB) and
SET max_parallel_workers_per_gather = 8;
looks like this:
EXPLAIN SELECT * FROM (SELECT a, count(*) FROM t GROUP BY a OFFSET 1000000000) foo;
QUERY PLAN
----------------------------------------------------------------------------------------------------
Limit (cost=16037474.49..16037474.49 rows=1 width=12)
-> Finalize GroupAggregate (cost=2383745.73..16037474.49 rows=60001208 width=12)
Group Key: t.a
-> Gather Merge (cost=2383745.73..14937462.25 rows=100000032 width=12)
Workers Planned: 8
-> Partial GroupAggregate (cost=2382745.59..2601495.66 rows=12500004 width=12)
Group Key: t.a
-> Sort (cost=2382745.59..2413995.60 rows=12500004 width=4)
Sort Key: t.a
-> Parallel Seq Scan on t (cost=0.00..567478.04 rows=12500004 width=4)
(10 rows)
Which kinda makes sense - we can't do hash aggregate, because there are
100M distinct values, and that won't fit into 4MB of memory (and the
planner knows about that).
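(Back-of-the-envelope: at even a few dozen bytes of hash entry per
group, 100M distinct groups need on the order of gigabytes, roughly a
thousand times the 4MB limit.)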
And it completes in about 108381 ms, give or take. With the patch, the
plan changes like this:
EXPLAIN SELECT * FROM (SELECT a, count(*) FROM t GROUP BY a OFFSET 1000000000) foo;
QUERY PLAN
---------------------------------------------------------------------------
Limit (cost=2371037.74..2371037.74 rows=1 width=12)
-> HashAggregate (cost=1942478.48..2371037.74 rows=42855926 width=12)
Group Key: t.a
-> Seq Scan on t (cost=0.00..1442478.32 rows=100000032 width=4)
(4 rows)
i.e. it's way cheaper than the master plan and not parallel, but when
executed it takes much longer (about 147442 ms). After forcing a
parallel query (by setting parallel_setup_cost = 0) the plan changes to
a parallel one without a partial aggregate, and it's even slower.
The explain analyze for the non-parallel plan looks like this:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=2371037.74..2371037.74 rows=1 width=12) (actual time=160180.718..160180.718 rows=0 loops=1)
-> HashAggregate (cost=1942478.48..2371037.74 rows=42855926 width=12) (actual time=54462.728..157594.756 rows=63215980 loops=1)
Group Key: t.a
Memory Usage: 4096kB Batches: 8320 Disk Usage:4529172kB
-> Seq Scan on t (cost=0.00..1442478.32 rows=100000032 width=4) (actual time=0.014..12198.044 rows=100000000 loops=1)
Planning Time: 0.110 ms
Execution Time: 160183.517 ms
(7 rows)
So the cost is about 7x lower than for master, but the duration is much
higher. I don't know how much of this is preventable, but it seems there
might be something missing in the costing, because when I set work_mem to
1TB on the master, and I tweak the n_distinct estimates for the column
to be exactly the same on the two clusters, I get this:
master:
-------
SET work_mem = '1TB';
EXPLAIN SELECT * FROM (SELECT a, count(*) FROM t GROUP BY a OFFSET 1000000000) foo;
QUERY PLAN
---------------------------------------------------------------------------
Limit (cost=2574638.28..2574638.28 rows=1 width=12)
-> HashAggregate (cost=1942478.48..2574638.28 rows=63215980 width=12)
Group Key: t.a
-> Seq Scan on t (cost=0.00..1442478.32 rows=100000032 width=4)
(4 rows)
patched:
--------
EXPLAIN SELECT * FROM (SELECT a, count(*) FROM t GROUP BY a OFFSET 1000000000) foo;
QUERY PLAN
---------------------------------------------------------------------------
Limit (cost=2574638.28..2574638.28 rows=1 width=12)
-> HashAggregate (cost=1942478.48..2574638.28 rows=63215980 width=12)
Group Key: t.a
-> Seq Scan on t (cost=0.00..1442478.32 rows=100000032 width=4)
(4 rows)
That is, the cost is exactly the same, except that in the second case we
expect to do quite a bit of batching - there are 8320 batches (and we
know that, because on master we'd not use hash aggregate without the
work_mem tweak).
So I think we're not costing the batching properly / at all.
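Purely for illustration, a batching charge might look something like
cost_sort()'s external-merge disk term (spill_bytes and num_passes here
are hypothetical planner estimates, not existing variables, and this is
not a concrete proposal):

    double  spill_pages = ceil(spill_bytes / (double) BLCKSZ);

    /* each spilled tuple is written once and read back once per pass */
    run_cost += 2.0 * seq_page_cost * spill_pages * num_passes;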
A couple more comments:
1) IMHO we should rename hashagg_mem_overflow to enable_hashagg_overflow
or something like that. I think that describes the GUC purpose better
(and it's more consistent with enable_hashagg_spill).
2) show_hashagg_info
I think there's a missing space after ":" here:
" Batches: %d Disk Usage:%ldkB",
and maybe we should use just "Disk:", like we do for sort:
-> Sort (actual time=662.136..911.558 rows=1000000 loops=1)
Sort Key: t2.a
Sort Method: external merge Disk: 13800kB
3) I'm not quite sure what to think about the JIT recompile we do for
EEOP_AGG_INIT_TRANS_SPILLED etc. I'm no llvm/jit expert, but do we do
that for some other existing cases?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
hashagg-20191210.diff (text/plain; charset=us-ascii)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 53ac14490a..10bfd7e1c3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1751,6 +1751,23 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-hashagg-mem-overflow" xreflabel="hashagg_mem_overflow">
+ <term><varname>hashagg_mem_overflow</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>hashagg_mem_overflow</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ If hash aggregation exceeds <varname>work_mem</varname> at query
+ execution time, and <varname>hashagg_mem_overflow</varname> is set
+ to <literal>on</literal>, continue consuming more memory rather than
+ performing disk-based hash aggregation. The default
+ is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
@@ -4451,6 +4468,24 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-hashagg-spill" xreflabel="enable_hashagg_spill">
+ <term><varname>enable_hashagg_spill</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_hashagg_spill</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the memory usage is expected to
+ exceed <varname>work_mem</varname>. This only affects the planner
+ choice; actual behavior at execution time is dictated by
+ <xref linkend="guc-hashagg-mem-overflow"/>. The default
+ is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
<term><varname>enable_hashjoin</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 62fb3434a3..092a79ea14 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -102,6 +102,7 @@ static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_hashagg_info(AggState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
static void show_instrumentation_count(const char *qlabel, int which,
@@ -1826,6 +1827,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Agg:
show_agg_keys(castNode(AggState, planstate), ancestors, es);
show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+ if (es->analyze)
+ show_hashagg_info((AggState *) planstate, es);
if (plan->qual)
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
@@ -2715,6 +2718,56 @@ show_hash_info(HashState *hashstate, ExplainState *es)
}
}
+/*
+ * If EXPLAIN ANALYZE, show information on hash aggregate memory usage and
+ * batches.
+ */
+static void
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+ long diskKb = (aggstate->hash_disk_used + 1023) / 1024;
+
+
+ Assert(IsA(aggstate, AggState));
+
+ if (agg->aggstrategy != AGG_HASHED &&
+ agg->aggstrategy != AGG_MIXED)
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(
+ es->str,
+ "Memory Usage: %ldkB",
+ memPeakKb);
+
+ if (aggstate->hash_batches_used > 0)
+ {
+ appendStringInfo(
+ es->str,
+ " Batches: %d Disk Usage:%ldkB",
+ aggstate->hash_batches_used, diskKb);
+ }
+
+ appendStringInfo(
+ es->str,
+ "\n");
+ }
+ else
+ {
+ ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
+ if (aggstate->hash_batches_used > 0)
+ {
+ ExplainPropertyInteger("HashAgg Batches", NULL,
+ aggstate->hash_batches_used, es);
+ ExplainPropertyInteger("Disk Usage", "kB", diskKb, es);
+ }
+ }
+}
+
/*
* If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
*/
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 7e486449ec..b6d80ebe14 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -79,7 +79,8 @@ static void ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash);
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled);
/*
@@ -2927,7 +2928,7 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
*/
ExprState *
ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
- bool doSort, bool doHash)
+ bool doSort, bool doHash, bool spilled)
{
ExprState *state = makeNode(ExprState);
PlanState *parent = &aggstate->ss.ps;
@@ -3160,7 +3161,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (setno = 0; setno < processGroupingSets; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, false);
+ pertrans, transno, setno, setoff, false,
+ spilled);
setoff++;
}
}
@@ -3178,7 +3180,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (setno = 0; setno < numHashes; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, true);
+ pertrans, transno, setno, setoff, true,
+ spilled);
setoff++;
}
}
@@ -3226,7 +3229,8 @@ static void
ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash)
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled)
{
int adjust_init_jumpnull = -1;
int adjust_strict_jumpnull = -1;
@@ -3248,7 +3252,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
fcinfo->flinfo->fn_strict &&
pertrans->initValueIsNull)
{
- scratch->opcode = EEOP_AGG_INIT_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_INIT_TRANS_SPILLED : EEOP_AGG_INIT_TRANS;
scratch->d.agg_init_trans.aggstate = aggstate;
scratch->d.agg_init_trans.pertrans = pertrans;
scratch->d.agg_init_trans.setno = setno;
@@ -3265,7 +3270,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
if (pertrans->numSortCols == 0 &&
fcinfo->flinfo->fn_strict)
{
- scratch->opcode = EEOP_AGG_STRICT_TRANS_CHECK;
+ scratch->opcode = spilled ?
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED : EEOP_AGG_STRICT_TRANS_CHECK;
scratch->d.agg_strict_trans_check.aggstate = aggstate;
scratch->d.agg_strict_trans_check.setno = setno;
scratch->d.agg_strict_trans_check.setoff = setoff;
@@ -3283,9 +3289,11 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
/* invoke appropriate transition implementation */
if (pertrans->numSortCols == 0 && pertrans->transtypeByVal)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS_BYVAL;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED : EEOP_AGG_PLAIN_TRANS_BYVAL;
else if (pertrans->numSortCols == 0)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_SPILLED : EEOP_AGG_PLAIN_TRANS;
else if (pertrans->numInputs == 1)
scratch->opcode = EEOP_AGG_ORDERED_TRANS_DATUM;
else
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index dbed597816..49fbf8e4a4 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -430,9 +430,13 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
&&CASE_EEOP_AGG_INIT_TRANS,
+ &&CASE_EEOP_AGG_INIT_TRANS_SPILLED,
&&CASE_EEOP_AGG_STRICT_TRANS_CHECK,
+ &&CASE_EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_SPILLED,
&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
&&CASE_EEOP_LAST
@@ -1625,6 +1629,36 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ aggstate = op->d.agg_init_trans.aggstate;
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_init_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
+
+ /* If transValue has not yet been initialized, do so now. */
+ if (pergroup->noTransValue)
+ {
+ AggStatePerTrans pertrans = op->d.agg_init_trans.pertrans;
+
+ aggstate->curaggcontext = op->d.agg_init_trans.aggcontext;
+ aggstate->current_set = op->d.agg_init_trans.setno;
+
+ ExecAggInitGroup(aggstate, pertrans, pergroup);
+
+ /* copied trans value from input, done this round */
+ EEO_JUMP(op->d.agg_init_trans.jumpnull);
+ }
+
+ EEO_NEXT();
+ }
/* check that a strict aggregate's input isn't NULL */
EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK)
@@ -1642,6 +1676,25 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ aggstate = op->d.agg_strict_trans_check.aggstate;
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_strict_trans_check.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_strict_trans_check.transno];
+
+ if (unlikely(pergroup->transValueIsNull))
+ EEO_JUMP(op->d.agg_strict_trans_check.jumpnull);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1691,6 +1744,52 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ aggstate = op->d.agg_trans.aggstate;
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ Assert(pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1756,6 +1855,67 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ aggstate = op->d.agg_trans.aggstate;
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ Assert(!pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
+ /*
+ * For pass-by-ref datatype, must copy the new value into
+ * aggcontext and free the prior transValue. But if transfn
+ * returned a pointer to its first input, we don't need to do
+ * anything. Also, if transfn returned a pointer to a R/W
+ * expanded object that is already a child of the aggcontext,
+ * assume we can adopt that value without copying it.
+ */
+ if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+ newVal = ExecAggTransReparent(aggstate, pertrans,
+ newVal, fcinfo->isnull,
+ pergroup->transValue,
+ pergroup->transValueIsNull);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
/* process single-column ordered aggregate datum */
EEO_CASE(EEOP_AGG_ORDERED_TRANS_DATUM)
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index e361143094..36f32f0cf9 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -25,8 +25,9 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
-static uint32 TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple);
static int TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
+static TupleHashEntry LookupTupleHashEntry_internal(
+ TupleHashTable hashtable, TupleTableSlot *slot, bool *isnew, uint32 hash);
/*
* Define parameters for tuple hash table code generation. The interface is
@@ -284,6 +285,17 @@ ResetTupleHashTable(TupleHashTable hashtable)
tuplehash_reset(hashtable->hashtab);
}
+/*
+ * Destroy the hash table. Note that the tablecxt passed to
+ * BuildTupleHashTableExt() should also be reset, otherwise there will be
+ * leaks.
+ */
+void
+DestroyTupleHashTable(TupleHashTable hashtable)
+{
+ tuplehash_destroy(hashtable->hashtab);
+}
+
/*
* Find or create a hashtable entry for the tuple group containing the
* given tuple. The tuple must be the same type as the hashtable entries.
@@ -300,10 +312,9 @@ TupleHashEntry
LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
bool *isnew)
{
- TupleHashEntryData *entry;
- MemoryContext oldContext;
- bool found;
- MinimalTuple key;
+ TupleHashEntry entry;
+ MemoryContext oldContext;
+ uint32 hash;
/* Need to run the hash functions in short-lived context */
oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
@@ -313,32 +324,29 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
hashtable->cur_eq_func = hashtable->tab_eq_func;
- key = NULL; /* flag to reference inputslot */
+ hash = TupleHashTableHash(hashtable->hashtab, NULL);
+ entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
- if (isnew)
- {
- entry = tuplehash_insert(hashtable->hashtab, key, &found);
+ MemoryContextSwitchTo(oldContext);
- if (found)
- {
- /* found pre-existing entry */
- *isnew = false;
- }
- else
- {
- /* created new entry */
- *isnew = true;
- /* zero caller data */
- entry->additional = NULL;
- MemoryContextSwitchTo(hashtable->tablecxt);
- /* Copy the first tuple into the table context */
- entry->firstTuple = ExecCopySlotMinimalTuple(slot);
- }
- }
- else
- {
- entry = tuplehash_lookup(hashtable->hashtab, key);
- }
+ return entry;
+}
+
+/*
+ * A variant of LookupTupleHashEntry for callers that have already computed
+ * the hash value.
+ */
+TupleHashEntry
+LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot,
+ bool *isnew, uint32 hash)
+{
+ TupleHashEntry entry;
+ MemoryContext oldContext;
+
+ /* Need to run the hash functions in short-lived context */
+ oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
+
+ entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
MemoryContextSwitchTo(oldContext);
@@ -386,10 +394,12 @@ FindTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
* need to materialize virtual input tuples unless they actually need to get
* copied into the table.
*
+ * If tuple is NULL, use the input slot instead.
+ *
* Also, the caller must select an appropriate memory context for running
* the hash functions. (dynahash.c doesn't change CurrentMemoryContext.)
*/
-static uint32
+uint32
TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
{
TupleHashTable hashtable = (TupleHashTable) tb->private_data;
@@ -410,9 +420,6 @@ TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
{
/*
* Process a tuple already stored in the table.
- *
- * (this case never actually occurs due to the way simplehash.h is
- * used, as the hash-value is stored in the entries)
*/
slot = hashtable->tableslot;
ExecStoreMinimalTuple(tuple, slot, false);
@@ -450,6 +457,54 @@ TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
return murmurhash32(hashkey);
}
+/*
+ * Does the work of LookupTupleHashEntry and LookupTupleHashEntryHash. Useful
+ * so that we can avoid switching the memory context multiple times for
+ * LookupTupleHashEntry.
+ */
+static TupleHashEntry
+LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot,
+ bool *isnew, uint32 hash)
+{
+ TupleHashEntryData *entry;
+ bool found;
+ MinimalTuple key;
+
+ /* set up data needed by hash and match functions */
+ hashtable->inputslot = slot;
+ hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ hashtable->cur_eq_func = hashtable->tab_eq_func;
+
+ key = NULL; /* flag to reference inputslot */
+
+ if (isnew)
+ {
+ entry = tuplehash_insert_hash(hashtable->hashtab, key, hash, &found);
+
+ if (found)
+ {
+ /* found pre-existing entry */
+ *isnew = false;
+ }
+ else
+ {
+ /* created new entry */
+ *isnew = true;
+ /* zero caller data */
+ entry->additional = NULL;
+ MemoryContextSwitchTo(hashtable->tablecxt);
+ /* Copy the first tuple into the table context */
+ entry->firstTuple = ExecCopySlotMinimalTuple(slot);
+ }
+ }
+ else
+ {
+ entry = tuplehash_lookup_hash(hashtable->hashtab, key, hash);
+ }
+
+ return entry;
+}
+
/*
* See whether two tuples (presumably of the same hash value) match
*/
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 6ee24eab3d..f509c8e8f5 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -194,6 +194,18 @@
* transition values. hashcontext is the single context created to support
* all hash tables.
*
+ * When the hash table memory exceeds work_mem, we advance the transition
+ * states only for groups already in the hash table. For tuples that would
+ * need to create a new hash table entries (and initialize new transition
+ * states), we spill them to disk to be processed later. The tuples are
+ * spilled in a partitioned manner, so that subsequent batches are smaller
+ * and less likely to exceed work_mem (if a batch does exceed work_mem, it
+ * must be spilled recursively).
+ *
+ * Note that it's possible for transition states to start small but then
+ * grow very large; for instance in the case of ARRAY_AGG. In such cases,
+ * it's still possible to significantly exceed work_mem.
+ *
* Transition / Combine function invocation:
*
* For performance reasons transition functions, including combine
@@ -229,15 +241,70 @@
#include "optimizer/optimizer.h"
#include "parser/parse_agg.h"
#include "parser/parse_coerce.h"
+#include "storage/buffile.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/datum.h"
+#include "utils/dynahash.h"
#include "utils/expandeddatum.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+/*
+ * Control how many partitions are created when spilling HashAgg to
+ * disk.
+ *
+ * HASH_PARTITION_FACTOR is multiplied by the estimated number of partitions
+ * needed such that each partition will fit in memory. The factor is set
+ * higher than one because there's not a high cost to having a few too many
+ * partitions, and it makes it less likely that a partition will need to be
+ * spilled recursively. Another benefit of having more, smaller partitions is
+ * that small hash tables may perform better than large ones due to memory
+ * caching effects.
+ *
+ * HASH_PARTITION_MEM is the approximate amount of work_mem we should reserve
+ * for the partitions themselves (i.e. buffering of the files backing the
+ * partitions). This is sloppy, because we must reserve the memory before
+ * filling the hash table; but we choose the number of partitions at the time
+ * we need to spill.
+ *
+ * We also specify a min and max number of partitions per spill. Too few might
+ * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
+ * many will result in lots of memory wasted buffering the spill files (and
+ * possibly pushing hidden costs to the OS for managing more files).
+ */
+#define HASH_PARTITION_FACTOR 1.50
+#define HASH_MIN_PARTITIONS 4
+#define HASH_MAX_PARTITIONS 256
+#define HASH_PARTITION_MEM (HASH_MIN_PARTITIONS * BLCKSZ)
+
+/*
+ * Represents partitioned spill data for a single hashtable.
+ */
+typedef struct HashAggSpill
+{
+ int n_partitions; /* number of output partitions */
+ int partition_bits; /* number of bits for partition mask
+ log2(n_partitions) parent partition bits */
+ BufFile **partitions; /* output partition files */
+ int64 *ntuples; /* number of tuples in each partition */
+} HashAggSpill;
+
+/*
+ * Represents work to be done for one pass of hash aggregation. Initially,
+ * only the input fields are set. If spilled to disk, also set the spill data.
+ */
+typedef struct HashAggBatch
+{
+ BufFile *input_file; /* input partition */
+ int input_bits; /* number of bits for input partition mask */
+ int64 input_groups; /* estimated number of input groups */
+ int setno; /* grouping set */
+ HashAggSpill spill; /* spill output */
+} HashAggBatch;
+
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
static void initialize_phase(AggState *aggstate, int newphase);
static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -271,12 +338,35 @@ static void finalize_aggregates(AggState *aggstate,
static TupleTableSlot *project_aggregates(AggState *aggstate);
static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
-static void build_hash_table(AggState *aggstate);
-static TupleHashEntryData *lookup_hash_entry(AggState *aggstate);
+static void build_hash_table(AggState *aggstate, int setno,
+ int64 ngroups_estimate);
+static void prepare_hash_slot(AggState *aggstate);
+static void hash_recompile_expressions(AggState *aggstate);
+static uint32 calculate_hash(AggState *aggstate);
+static long hash_choose_num_buckets(AggState *aggstate,
+ long estimated_nbuckets,
+ Size memory);
+static int hash_choose_num_spill_partitions(uint64 input_tuples,
+ double hashentrysize);
+static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
+static bool agg_refill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
+static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
+static void hash_spill_init(HashAggSpill *spill, int input_bits,
+ uint64 input_tuples, double hashentrysize);
+static Size hash_spill_tuple(HashAggSpill *spill, int input_bits,
+ TupleTableSlot *slot, uint32 hash);
+static MinimalTuple hash_read_spilled(BufFile *file, uint32 *hashp);
+static HashAggBatch *hash_batch_new(BufFile *input_file, int setno,
+ int64 input_groups, int input_bits);
+static void hash_finish_initial_spills(AggState *aggstate);
+static void hash_spill_finish(AggState *aggstate, HashAggSpill *spill,
+ int setno, int input_bits);
+static void hash_reset_spill(HashAggSpill *spill);
+static void hash_reset_spills(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
AggState *aggstate, EState *estate,
@@ -1254,18 +1344,20 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
* for each entry.
*
* We have a separate hashtable and associated perhash data structure for each
- * grouping set for which we're doing hashing.
+ * grouping set for which we're doing hashing. If setno is -1, build hash
+ * tables for all grouping sets. Otherwise, build only for the specified
+ * grouping set.
*
* The contents of the hash tables always live in the hashcontext's per-tuple
* memory context (there is only one of these for all tables together, since
* they are all reset at the same time).
*/
static void
-build_hash_table(AggState *aggstate)
+build_hash_table(AggState *aggstate, int setno, int64 ngroups_estimate)
{
- MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
- Size additionalsize;
- int i;
+ MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
+ Size additionalsize;
+ int i;
Assert(aggstate->aggstrategy == AGG_HASHED || aggstate->aggstrategy == AGG_MIXED);
@@ -1274,26 +1366,71 @@ build_hash_table(AggState *aggstate)
for (i = 0; i < aggstate->num_hashes; ++i)
{
AggStatePerHash perhash = &aggstate->perhash[i];
+ int64 ngroups;
+ long nbuckets;
+ Size memory;
Assert(perhash->aggnode->numGroups > 0);
if (perhash->hashtable)
- ResetTupleHashTable(perhash->hashtable);
- else
- perhash->hashtable = BuildTupleHashTableExt(&aggstate->ss.ps,
- perhash->hashslot->tts_tupleDescriptor,
- perhash->numCols,
- perhash->hashGrpColIdxHash,
- perhash->eqfuncoids,
- perhash->hashfunctions,
- perhash->aggnode->grpCollations,
- perhash->aggnode->numGroups,
- additionalsize,
- aggstate->ss.ps.state->es_query_cxt,
- aggstate->hashcontext->ecxt_per_tuple_memory,
- tmpmem,
- DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
+ DestroyTupleHashTable(perhash->hashtable);
+ perhash->hashtable = NULL;
+
+ /*
+ * If we are building a hash table for only a single grouping set,
+ * skip the others.
+ */
+ if (setno >= 0 && setno != i)
+ continue;
+
+ /*
+ * Use an estimate from execution time if we have it; otherwise fall
+ * back to the planner estimate.
+ */
+ ngroups = ngroups_estimate > 0 ?
+ ngroups_estimate : perhash->aggnode->numGroups;
+
+ /* divide memory by the number of hash tables we are initializing */
+ memory = (long)work_mem * 1024L /
+ (setno >= 0 ? 1 : aggstate->num_hashes);
+
+ /* choose reasonable number of buckets per hashtable */
+ nbuckets = hash_choose_num_buckets(aggstate, ngroups, memory);
+
+ perhash->hashtable = BuildTupleHashTableExt(&aggstate->ss.ps,
+ perhash->hashslot->tts_tupleDescriptor,
+ perhash->numCols,
+ perhash->hashGrpColIdxHash,
+ perhash->eqfuncoids,
+ perhash->hashfunctions,
+ perhash->aggnode->grpCollations,
+ nbuckets,
+ additionalsize,
+ aggstate->ss.ps.state->es_query_cxt,
+ aggstate->hashcontext->ecxt_per_tuple_memory,
+ tmpmem,
+ DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
}
+
+ aggstate->hash_mem_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_ngroups_current = 0;
+
+ /*
+ * Initialize the threshold at which we stop creating new hash entries and
+ * start spilling. If an empty hash table exceeds the limit, increase the
+ * limit to be the size of the empty hash table. This ensures that at
+ * least one entry can be added so that the algorithm can make progress.
+ */
+ if (hashagg_mem_overflow)
+ aggstate->hash_mem_limit = SIZE_MAX;
+ else if (work_mem * 1024L > HASH_PARTITION_MEM * 2)
+ aggstate->hash_mem_limit = (work_mem * 1024L) - HASH_PARTITION_MEM;
+ else
+ aggstate->hash_mem_limit = (work_mem * 1024L);
+
+ if (aggstate->hash_mem_current > aggstate->hash_mem_limit)
+ aggstate->hash_mem_limit = aggstate->hash_mem_current;
}
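As a worked example of the limit calculation above (illustration only; the
4MB work_mem is an assumption and BLCKSZ is taken as the default 8192):

    /* illustration only: hash_mem_limit for work_mem = 4MB, BLCKSZ = 8192 */
    Size partition_mem = 4 * 8192;              /* HASH_PARTITION_MEM = 32kB */
    Size limit = 4096 * 1024L - partition_mem;  /* 4161536 bytes */

    /*
     * work_mem (4MB) exceeds 2 * HASH_PARTITION_MEM, so the partition buffer
     * space is reserved up front; if the empty hash tables were already
     * larger than this limit, the limit would instead be raised to their
     * size so that at least one entry can still be inserted.
     */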
/*
@@ -1455,22 +1592,16 @@ hash_agg_entry_size(int numAggs)
}
/*
- * Find or create a hashtable entry for the tuple group containing the current
- * tuple (already set in tmpcontext's outertuple slot), in the current grouping
- * set (which the caller must have selected - note that initialize_aggregate
- * depends on this).
- *
- * When called, CurrentMemoryContext should be the per-query context.
+ * Extract the attributes that make up the grouping key into the
+ * hashslot. This is necessary to compute the hash of the grouping key.
*/
-static TupleHashEntryData *
-lookup_hash_entry(AggState *aggstate)
+static void
+prepare_hash_slot(AggState *aggstate)
{
- TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
- TupleTableSlot *hashslot = perhash->hashslot;
- TupleHashEntryData *entry;
- bool isnew;
- int i;
+ TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ int i;
/* transfer just the needed columns into hashslot */
slot_getsomeattrs(inputslot, perhash->largestGrpColIdx);
@@ -1484,14 +1615,169 @@ lookup_hash_entry(AggState *aggstate)
hashslot->tts_isnull[i] = inputslot->tts_isnull[varNumber];
}
ExecStoreVirtualTuple(hashslot);
+}
+
+/*
+ * Recompile the expressions for advancing aggregates while hashing. This is
+ * necessary for certain kinds of state changes that affect the resulting
+ * expression. For instance, changing aggstate->hash_spilled or
+ * aggstate->ss.ps.outerops requires recompilation.
+ */
+static void
+hash_recompile_expressions(AggState *aggstate)
+{
+ AggStatePerPhase phase;
+
+ Assert(aggstate->aggstrategy == AGG_HASHED ||
+ aggstate->aggstrategy == AGG_MIXED);
+
+ if (aggstate->aggstrategy == AGG_HASHED)
+ phase = &aggstate->phases[0];
+ else /* AGG_MIXED */
+ phase = &aggstate->phases[1];
+
+ phase->evaltrans = ExecBuildAggTrans(
+ aggstate, phase,
+ aggstate->aggstrategy == AGG_MIXED, /* dosort */
+ true, /* dohash */
+ aggstate->hash_spilled /* spilled */);
+}
+
+/*
+ * Calculate the hash value for a tuple. It's useful to do this outside of the
+ * hash table so that we can reuse saved hash values rather than recomputing
+ * them.
+ */
+static uint32
+calculate_hash(AggState *aggstate)
+{
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleHashTable hashtable = perhash->hashtable;
+ MemoryContext oldContext;
+ uint32 hash;
+
+ /* set up data needed by hash and match functions */
+ hashtable->inputslot = perhash->hashslot;
+ hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ hashtable->cur_eq_func = hashtable->tab_eq_func;
+
+ /* Need to run the hash functions in short-lived context */
+ oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
+
+ hash = TupleHashTableHash(hashtable->hashtab, NULL);
+
+ MemoryContextSwitchTo(oldContext);
+
+ return hash;
+}
+
+/*
+ * Choose a reasonable number of buckets for the initial hash table size.
+ */
+static long
+hash_choose_num_buckets(AggState *aggstate, long ngroups, Size memory)
+{
+ long max_nbuckets;
+ int log2_ngroups;
+ long nbuckets;
+
+ if (aggstate->hashentrysize == 0.0)
+ aggstate->hashentrysize = hash_agg_entry_size(aggstate->numtrans);
+
+ max_nbuckets = memory / aggstate->hashentrysize;
+
+ /*
+ * Find the lowest power of two that is at least ngroups, without exceeding
+ * max_nbuckets.
+ */
+ for (log2_ngroups = 1, nbuckets = 2;
+ nbuckets < ngroups && nbuckets < max_nbuckets;
+ log2_ngroups++, nbuckets <<= 1);
+
+ if (nbuckets > max_nbuckets && nbuckets > 2)
+ nbuckets >>= 1;
+
+ return nbuckets;
+}
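For example (illustration only; the 200-byte entry size and 1MB memory budget
are assumptions), hash_choose_num_buckets(aggstate, 1000, 1048576) would
proceed as:

    /* max_nbuckets = 1048576 / 200 = 5242                          */
    /* doubling from 2: 2, 4, ..., 512, 1024  (1024 >= 1000, stop)  */
    /* 1024 <= max_nbuckets, so the result is 1024 buckets          */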
+
+/*
+ * Determine the number of partitions to create when spilling.
+ */
+static int
+hash_choose_num_spill_partitions(uint64 input_tuples, double hashentrysize)
+{
+ Size mem_needed;
+ int partition_limit;
+ int npartitions;
+
+ /*
+ * Avoid creating so many partitions that the memory requirements of the
+ * open partition files (estimated at BLCKSZ for buffering) are greater
+ * than 1/4 of work_mem.
+ */
+ partition_limit = (work_mem * 1024L * 0.25) / BLCKSZ;
+
+ /* pessimistically estimate that each input tuple creates a new group */
+ mem_needed = HASH_PARTITION_FACTOR * input_tuples * hashentrysize;
+
+ /* make enough partitions so that each one is likely to fit in memory */
+ npartitions = 1 + (mem_needed / (work_mem * 1024L));
+
+ if (npartitions > partition_limit)
+ npartitions = partition_limit;
+
+ if (npartitions < HASH_MIN_PARTITIONS)
+ npartitions = HASH_MIN_PARTITIONS;
+ if (npartitions > HASH_MAX_PARTITIONS)
+ npartitions = HASH_MAX_PARTITIONS;
+
+ return npartitions;
+}
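A worked example of the partition count (illustration only; the tuple count,
entry size, and work_mem are assumptions):

    /*
     * input_tuples = 1000000, hashentrysize = 200,
     * work_mem = 4MB (4194304 bytes), BLCKSZ = 8192
     *
     *   partition_limit = (4194304 * 0.25) / 8192  = 128
     *   mem_needed      = 1.5 * 1000000 * 200      = 300000000
     *   npartitions     = 1 + 300000000 / 4194304  = 72
     *
     * 72 is within [HASH_MIN_PARTITIONS, HASH_MAX_PARTITIONS] and under
     * partition_limit, so 72 is returned; hash_spill_init later rounds it
     * up to a power of two (128).
     */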
+
+/*
+ * Find or create a hashtable entry for the tuple group containing the current
+ * tuple (already set in tmpcontext's outertuple slot), in the current grouping
+ * set (which the caller must have selected - note that initialize_aggregate
+ * depends on this).
+ *
+ * When called, CurrentMemoryContext should be the per-query context.
+ *
+ * If the hash table is at the memory limit, then only find existing hashtable
+ * entries; don't create new ones. If a tuple's group is not already present
+ * in the hash table for the current grouping set, return NULL and the caller
+ * will spill it to disk.
+ */
+static AggStatePerGroup
+lookup_hash_entry(AggState *aggstate, uint32 hash)
+{
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ TupleHashEntryData *entry;
+ bool isnew = false;
+ bool *p_isnew;
+
+ /* if hash table memory limit is exceeded, don't create new entries */
+ p_isnew = (aggstate->hash_mem_current > aggstate->hash_mem_limit) ?
+ NULL : &isnew;
/* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntry(perhash->hashtable, hashslot, &isnew);
+ entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, p_isnew,
+ hash);
+
+ if (entry == NULL)
+ return NULL;
if (isnew)
{
- AggStatePerGroup pergroup;
- int transno;
+ AggStatePerGroup pergroup;
+ int transno;
+
+ aggstate->hash_ngroups_current++;
+
+ aggstate->hash_mem_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+
+ if (aggstate->hash_mem_current > aggstate->hash_mem_peak)
+ aggstate->hash_mem_peak = aggstate->hash_mem_current;
pergroup = (AggStatePerGroup)
MemoryContextAlloc(perhash->hashtable->tablecxt,
@@ -1511,7 +1797,7 @@ lookup_hash_entry(AggState *aggstate)
}
}
- return entry;
+ return entry->additional;
}
/*
@@ -1519,18 +1805,64 @@ lookup_hash_entry(AggState *aggstate)
* returning an array of pergroup pointers suitable for advance_aggregates.
*
* Be aware that lookup_hash_entry can reset the tmpcontext.
+ *
+ * Some entries may be left NULL if we are at the memory limit. The same tuple
+ * belongs to a different group for each grouping set, so it may match a group
+ * already in memory for one set while matching no in-memory group for
+ * another. If we are at the memory limit and a tuple doesn't match an
+ * in-memory group for a particular set, it is spilled for that set.
+ *
+ * NB: It's possible to spill the same tuple for several different grouping
+ * sets. This may seem wasteful, but it's actually a trade-off: if we spill
+ * the tuple multiple times for multiple grouping sets, it can be partitioned
+ * for each grouping set, making the refilling of the hash table very
+ * efficient.
*/
static void
lookup_hash_entries(AggState *aggstate)
{
- int numHashes = aggstate->num_hashes;
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
int setno;
- for (setno = 0; setno < numHashes; setno++)
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
{
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ uint32 hash;
+
select_current_set(aggstate, setno, true);
- pergroup[setno] = lookup_hash_entry(aggstate)->additional;
+ prepare_hash_slot(aggstate);
+ hash = calculate_hash(aggstate);
+ pergroup[setno] = lookup_hash_entry(aggstate, hash);
+
+ /* check to see if we need to spill the tuple for this grouping set */
+ if (pergroup[setno] == NULL)
+ {
+ HashAggSpill *spill;
+ TupleTableSlot *slot = aggstate->tmpcontext->ecxt_outertuple;
+
+ /* update hashentrysize estimate based on contents */
+ Assert(aggstate->hash_ngroups_current > 0);
+ aggstate->hashentrysize = (double)aggstate->hash_mem_current /
+ (double)aggstate->hash_ngroups_current;
+
+ if (aggstate->hash_spills == NULL)
+ aggstate->hash_spills = palloc0(
+ sizeof(HashAggSpill) * aggstate->num_hashes);
+
+ if (!aggstate->hash_spilled)
+ {
+ aggstate->hash_spilled = true;
+ hash_recompile_expressions(aggstate);
+ }
+
+ spill = &aggstate->hash_spills[setno];
+
+ if (spill->partitions == NULL)
+ hash_spill_init(spill, 0, perhash->aggnode->numGroups,
+ aggstate->hashentrysize);
+
+ aggstate->hash_disk_used += hash_spill_tuple(spill, 0, slot, hash);
+ }
}
}
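To illustrate the per-set behavior described above (illustration only; the
grouping keys and values are made up):

    /*
     * GROUP BY GROUPING SETS ((a), (b)), incoming tuple (a=1, b=7),
     * hash tables at the memory limit
     *
     *   set 0 (a): group a=1 already in memory -> advance transition state
     *   set 1 (b): group b=7 not in memory     -> spill the tuple into
     *                                             hash_spills[1]
     *
     * the tuple is written once per set that had to spill it, so each batch
     * file later contains exactly the tuples its grouping set still needs.
     */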
@@ -1853,6 +2185,12 @@ agg_retrieve_direct(AggState *aggstate)
if (TupIsNull(outerslot))
{
/* no more outer-plan tuples available */
+
+ /* if we built hash tables, finalize any spills */
+ if (aggstate->aggstrategy == AGG_MIXED &&
+ aggstate->current_phase == 1)
+ hash_finish_initial_spills(aggstate);
+
if (hasGroupingSets)
{
aggstate->input_done = true;
@@ -1955,6 +2293,9 @@ agg_fill_hash_table(AggState *aggstate)
ResetExprContext(aggstate->tmpcontext);
}
+ /* finalize spills, if any */
+ hash_finish_initial_spills(aggstate);
+
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
@@ -1962,11 +2303,161 @@ agg_fill_hash_table(AggState *aggstate)
&aggstate->perhash[0].hashiter);
}
+/*
+ * If any data was spilled during hash aggregation, reset the hash table and
+ * reprocess one batch of spilled data. After reprocessing a batch, the hash
+ * table will again contain data, ready to be consumed by
+ * agg_retrieve_hash_table_in_memory().
+ *
+ * Should only be called after all in-memory hash table entries have been
+ * consumed.
+ *
+ * Return false when input is exhausted and there's no more work to be done;
+ * otherwise return true.
+ */
+static bool
+agg_refill_hash_table(AggState *aggstate)
+{
+ HashAggBatch *batch;
+
+ if (aggstate->hash_batches == NIL)
+ return false;
+
+ /*
+ * Each spill file contains spilled data for only a single grouping
+ * set. We want to ignore all others, which is done by setting the other
+ * pergroups to NULL.
+ */
+ memset(aggstate->all_pergroups, 0,
+ sizeof(AggStatePerGroup) *
+ (aggstate->maxsets + aggstate->num_hashes));
+
+ batch = linitial(aggstate->hash_batches);
+ aggstate->hash_batches = list_delete_first(aggstate->hash_batches);
+
+ /*
+ * Free memory and rebuild a single hash table for this batch's grouping
+ * set.
+ */
+ ReScanExprContext(aggstate->hashcontext);
+ build_hash_table(aggstate, batch->setno, batch->input_groups);
+
+ Assert(aggstate->current_phase == 0);
+
+ if (aggstate->phase->aggstrategy == AGG_MIXED)
+ {
+ aggstate->current_phase = 1;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+ }
+
+ /*
+ * The first pass (agg_fill_hash_table) reads whatever kind of slot comes
+ * from the outer plan, and considers the slot fixed. But spilled tuples
+ * are always MinimalTuples, so if that's different from the outer plan we
+ * need to change it and recompile the aggregate expressions.
+ */
+ if (aggstate->ss.ps.outerops != &TTSOpsMinimalTuple)
+ {
+ aggstate->ss.ps.outerops = &TTSOpsMinimalTuple;
+ hash_recompile_expressions(aggstate);
+ }
+
+ for (;;)
+ {
+ TupleTableSlot *slot = aggstate->hash_spill_slot;
+ MinimalTuple tuple;
+ uint32 hash;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tuple = hash_read_spilled(batch->input_file, &hash);
+ if (tuple == NULL)
+ break;
+
+ ExecStoreMinimalTuple(tuple, slot, true);
+ aggstate->tmpcontext->ecxt_outertuple = slot;
+
+ select_current_set(aggstate, batch->setno, true);
+ prepare_hash_slot(aggstate);
+ aggstate->hash_pergroup[batch->setno] = lookup_hash_entry(aggstate, hash);
+
+ /* if there's no memory for a new group, spill */
+ if (aggstate->hash_pergroup[batch->setno] == NULL)
+ {
+ /* update hashentrysize estimate based on contents */
+ Assert(aggstate->hash_ngroups_current > 0);
+ aggstate->hashentrysize = (double)aggstate->hash_mem_current /
+ (double)aggstate->hash_ngroups_current;
+
+ if (batch->spill.partitions == NULL)
+ hash_spill_init(&batch->spill, batch->input_bits,
+ batch->input_groups, aggstate->hashentrysize);
+
+ aggstate->hash_disk_used += hash_spill_tuple(
+ &batch->spill, batch->input_bits, slot, hash);
+ }
+
+ /* Advance the aggregates (or combine functions) */
+ advance_aggregates(aggstate);
+
+ /*
+ * Reset per-input-tuple context after each tuple, but note that the
+ * hash lookups do this too
+ */
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ BufFileClose(batch->input_file);
+
+ aggstate->current_phase = 0;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+
+ hash_spill_finish(aggstate, &batch->spill, batch->setno,
+ batch->input_bits);
+
+ pfree(batch);
+
+ /* Initialize to walk the first hash table */
+ select_current_set(aggstate, 0, true);
+ ResetTupleHashIterator(aggstate->perhash[0].hashtable,
+ &aggstate->perhash[0].hashiter);
+
+ return true;
+}
+
/*
* ExecAgg for hashed case: retrieving groups from hash table
+ *
+ * After exhausting in-memory tuples, also try refilling the hash table using
+ * previously-spilled tuples. Only returns NULL after all in-memory and
+ * spilled tuples are exhausted.
*/
static TupleTableSlot *
agg_retrieve_hash_table(AggState *aggstate)
+{
+ TupleTableSlot *result = NULL;
+
+ while (result == NULL)
+ {
+ result = agg_retrieve_hash_table_in_memory(aggstate);
+ if (result == NULL)
+ {
+ if (!agg_refill_hash_table(aggstate))
+ {
+ aggstate->agg_done = true;
+ break;
+ }
+ }
+ }
+
+ return result;
+}
+
+/*
+ * Retrieve the groups from the in-memory hash tables without considering any
+ * spilled tuples.
+ */
+static TupleTableSlot *
+agg_retrieve_hash_table_in_memory(AggState *aggstate)
{
ExprContext *econtext;
AggStatePerAgg peragg;
@@ -1995,7 +2486,7 @@ agg_retrieve_hash_table(AggState *aggstate)
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
- while (!aggstate->agg_done)
+ for (;;)
{
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -2026,8 +2517,6 @@ agg_retrieve_hash_table(AggState *aggstate)
}
else
{
- /* No more hashtables, so done */
- aggstate->agg_done = true;
return NULL;
}
}
@@ -2084,6 +2573,283 @@ agg_retrieve_hash_table(AggState *aggstate)
return NULL;
}
+/*
+ * hash_spill_init
+ *
+ * Called after we determined that spilling is necessary. Chooses the number
+ * of partitions to create, and initializes them.
+ */
+static void
+hash_spill_init(HashAggSpill *spill, int input_bits, uint64 input_tuples,
+ double hashentrysize)
+{
+ int npartitions;
+ int partition_bits;
+
+ npartitions = hash_choose_num_spill_partitions(input_tuples,
+ hashentrysize);
+ partition_bits = my_log2(npartitions);
+
+ /*
+ * Make sure we don't exhaust the 32 available hash bits.
+ * TODO: be consistent with hashjoin batching.
+ */
+ if (partition_bits + input_bits >= 32)
+ partition_bits = 32 - input_bits;
+
+ /* number of partitions will be a power of two */
+ npartitions = 1L << partition_bits;
+
+ spill->partition_bits = partition_bits;
+ spill->n_partitions = npartitions;
+ spill->partitions = palloc0(sizeof(BufFile *) * npartitions);
+ spill->ntuples = palloc0(sizeof(int64) * npartitions);
+}
+
+/*
+ * hash_spill_tuple
+ *
+ * Not enough memory to add tuple as new entry in hash table. Save for later
+ * in the appropriate partition.
+ */
+static Size
+hash_spill_tuple(HashAggSpill *spill, int input_bits, TupleTableSlot *slot,
+ uint32 hash)
+{
+ int partition;
+ MinimalTuple tuple;
+ BufFile *file;
+ int written;
+ int total_written = 0;
+ bool shouldFree;
+
+ Assert(spill->partitions != NULL);
+
+ /* TODO: project only the needed attributes */
+ tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+ if (spill->partition_bits == 0)
+ partition = 0;
+ else
+ partition = (hash << input_bits) >>
+ (32 - spill->partition_bits);
+
+ spill->ntuples[partition]++;
+
+ if (spill->partitions[partition] == NULL)
+ spill->partitions[partition] = BufFileCreateTemp(false);
+ file = spill->partitions[partition];
+
+ written = BufFileWrite(file, (void *) &hash, sizeof(uint32));
+ if (written != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+ total_written += written;
+
+ written = BufFileWrite(file, (void *) tuple, tuple->t_len);
+ if (written != tuple->t_len)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+ total_written += written;
+
+ if (shouldFree)
+ pfree(tuple);
+
+ return total_written;
+}
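To show how the shift above consumes hash bits across levels of spilling
(illustration only; six partition bits per level is an assumption):

    /* illustration only: recursive consumption of the 32-bit hash */
    uint32 hash = 0xABCD1234;
    int    p0 = (hash << 0) >> (32 - 6);   /* first spill: top 6 bits */
    int    p1 = (hash << 6) >> (32 - 6);   /* re-spill of that partition:
                                              the next 6 bits */

    /*
     * Each level shifts past the bits its parents already consumed
     * (input_bits), so a partition that spills again subdivides on fresh
     * bits instead of sending every tuple back into a single partition.
     */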
+
+/*
+ * hash_read_spilled
+ * Read the next tuple from a spill file. Return NULL if there are no more.
+ */
+static MinimalTuple
+hash_read_spilled(BufFile *file, uint32 *hashp)
+{
+ MinimalTuple tuple;
+ uint32 t_len;
+ size_t nread;
+ uint32 hash;
+
+ nread = BufFileRead(file, &hash, sizeof(uint32));
+ if (nread == 0)
+ return NULL;
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+ if (hashp != NULL)
+ *hashp = hash;
+
+ nread = BufFileRead(file, &t_len, sizeof(t_len));
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+
+ tuple = (MinimalTuple) palloc(t_len);
+ tuple->t_len = t_len;
+
+ nread = BufFileRead(file, (void *)((char *)tuple + sizeof(uint32)),
+ t_len - sizeof(uint32));
+ if (nread != t_len - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+
+ return tuple;
+}
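For reference, the record format written by hash_spill_tuple and read back
here is simply:

    /*
     *   +--------------+---------------------------------------------+
     *   | uint32 hash  | MinimalTuple, t_len bytes (t_len included)   |
     *   +--------------+---------------------------------------------+
     *
     * the reader fetches the hash, then t_len, then the remaining
     * t_len - sizeof(uint32) bytes of the tuple body.
     */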
+
+/*
+ * hash_batch_new
+ *
+ * Construct a HashAggBatch item, which represents one iteration of HashAgg to
+ * be done. Should be called in the aggregate's memory context.
+ */
+static HashAggBatch *
+hash_batch_new(BufFile *input_file, int setno, int64 input_groups,
+ int input_bits)
+{
+ HashAggBatch *batch = palloc0(sizeof(HashAggBatch));
+
+ batch->input_file = input_file;
+ batch->input_bits = input_bits;
+ batch->input_groups = input_groups;
+ batch->setno = setno;
+
+ /* batch->spill will be set only after spilling this batch */
+
+ return batch;
+}
+
+/*
+ * hash_finish_initial_spills
+ *
+ * After the initial pass over the input has completed, some tuples may have
+ * been spilled to disk. If so, turn each grouping set's spilled partitions
+ * into new batches that must be processed later.
+ */
+static void
+hash_finish_initial_spills(AggState *aggstate)
+{
+ int setno;
+
+ if (aggstate->hash_spills == NULL)
+ return;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hash_spill_finish(aggstate, &aggstate->hash_spills[setno], setno, 0);
+
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+}
+
+/*
+ * hash_spill_finish
+ *
+ * Transform the spilled partitions of a completed HashAggSpill into new
+ * batches (HashAggBatch) to be processed later, then free the spill's
+ * partition arrays.
+ */
+static void
+hash_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno, int input_bits)
+{
+ int i;
+
+ if (spill->n_partitions == 0)
+ return; /* didn't spill */
+
+ for (i = 0; i < spill->n_partitions; i++)
+ {
+ BufFile *file = spill->partitions[i];
+ MemoryContext oldContext;
+ HashAggBatch *new_batch;
+ int64 input_ngroups;
+
+ /* partition is empty */
+ if (file == NULL)
+ continue;
+
+ /* rewind file for reading */
+ if (BufFileSeek(file, 0, 0L, SEEK_SET))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rewind HashAgg temporary file: %m")));
+
+ /*
+ * Estimate the number of input groups for this new work item as the
+ * total number of tuples in its input file. Although that's a worst
+ * case, it's not bad here for two reasons: (1) overestimating is
+ * better than underestimating; and (2) we've already scanned the
+ * relation once, so it's likely that we've already finalized many of
+ * the common values.
+ */
+ input_ngroups = spill->ntuples[i];
+
+ oldContext = MemoryContextSwitchTo(aggstate->ss.ps.state->es_query_cxt);
+ new_batch = hash_batch_new(file, setno, input_ngroups,
+ spill->partition_bits + input_bits);
+ aggstate->hash_batches = lappend(aggstate->hash_batches, new_batch);
+ aggstate->hash_batches_used++;
+ MemoryContextSwitchTo(oldContext);
+ }
+
+ pfree(spill->ntuples);
+ pfree(spill->partitions);
+}
+
+/*
+ * Clear a HashAggSpill, free its memory, and close its files.
+ */
+static void
+hash_reset_spill(HashAggSpill *spill)
+{
+ int i;
+ for (i = 0; i < spill->n_partitions; i++)
+ {
+ BufFile *file = spill->partitions[i];
+
+ if (file != NULL)
+ BufFileClose(file);
+ }
+ if (spill->ntuples != NULL)
+ pfree(spill->ntuples);
+ if (spill->partitions != NULL)
+ pfree(spill->partitions);
+}
+
+/*
+ * Find and reset all active HashAggSpills.
+ */
+static void
+hash_reset_spills(AggState *aggstate)
+{
+ ListCell *lc;
+
+ if (aggstate->hash_spills != NULL)
+ {
+ int setno;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hash_reset_spill(&aggstate->hash_spills[setno]);
+
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+ }
+
+ foreach(lc, aggstate->hash_batches)
+ {
+ HashAggBatch *batch = (HashAggBatch*) lfirst(lc);
+ if (batch->input_file != NULL)
+ {
+ BufFileClose(batch->input_file);
+ batch->input_file = NULL;
+ }
+ hash_reset_spill(&batch->spill);
+ pfree(batch);
+ }
+ list_free(aggstate->hash_batches);
+ aggstate->hash_batches = NIL;
+}
+
+
/* -----------------
* ExecInitAgg
*
@@ -2268,6 +3034,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->ss.ps.outeropsfixed = false;
}
+ if (use_hashing)
+ aggstate->hash_spill_slot = ExecInitExtraTupleSlot(estate, scanDesc,
+ &TTSOpsMinimalTuple);
+
/*
* Initialize result type, slot and projection.
*/
@@ -2497,7 +3267,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->hash_pergroup = pergroups;
find_hash_columns(aggstate);
- build_hash_table(aggstate);
+ build_hash_table(aggstate, -1, 0);
aggstate->table_filled = false;
}
@@ -2903,7 +3673,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
else
Assert(false);
- phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash);
+ phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash, false);
}
@@ -3398,6 +4168,8 @@ ExecEndAgg(AggState *node)
if (node->sort_out)
tuplesort_end(node->sort_out);
+ hash_reset_spills(node);
+
for (transno = 0; transno < node->numtrans; transno++)
{
AggStatePerTrans pertrans = &node->pertrans[transno];
@@ -3453,12 +4225,13 @@ ExecReScanAgg(AggState *node)
return;
/*
- * If we do have the hash table, and the subplan does not have any
- * parameter changes, and none of our own parameter changes affect
- * input expressions of the aggregated functions, then we can just
- * rescan the existing hash table; no need to build it again.
+ * If we do have the hash table, and it never spilled, and the subplan
+ * does not have any parameter changes, and none of our own parameter
+ * changes affect input expressions of the aggregated functions, then
+ * we can just rescan the existing hash table; no need to build it
+ * again.
*/
- if (outerPlan->chgParam == NULL &&
+ if (outerPlan->chgParam == NULL && !node->hash_spilled &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
ResetTupleHashIterator(node->perhash[0].hashtable,
@@ -3515,9 +4288,20 @@ ExecReScanAgg(AggState *node)
*/
if (node->aggstrategy == AGG_HASHED || node->aggstrategy == AGG_MIXED)
{
+ hash_reset_spills(node);
+
+ node->hash_spilled = false;
+ node->hash_mem_current = 0;
+ node->hash_ngroups_current = 0;
+
+ /* reset stats */
+ node->hash_mem_peak = 0;
+ node->hash_disk_used = 0;
+ node->hash_batches_used = 0;
+
ReScanExprContext(node->hashcontext);
/* Rebuild an empty hash table */
- build_hash_table(node);
+ build_hash_table(node, -1, 0);
node->table_filled = false;
/* iterator will be reset when the table is filled */
}
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index a9d362100a..fd29ce5d12 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -2083,6 +2083,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_INIT_TRANS:
+ case EEOP_AGG_INIT_TRANS_SPILLED:
{
AggState *aggstate;
AggStatePerTrans pertrans;
@@ -2093,6 +2094,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_allpergroupsp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_setoff,
v_transno;
@@ -2120,11 +2122,32 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_init_trans.setoff);
v_transno = l_int32_const(op->d.agg_init_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_notransvalue = l_bb_before_v(
+ opblocks[i + 1], "op.%d.check_notransvalue", i);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(
+ b, v_pergroup_allaggs, TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[i + 1],
+ b_check_notransvalue);
+
+ LLVMPositionBuilderAtEnd(b, b_check_notransvalue);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_notransvalue =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_NOTRANSVALUE,
@@ -2181,6 +2204,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_STRICT_TRANS_CHECK:
+ case EEOP_AGG_STRICT_TRANS_CHECK_SPILLED:
{
AggState *aggstate;
LLVMValueRef v_setoff,
@@ -2191,6 +2215,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_transnull;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
int jumpnull = op->d.agg_strict_trans_check.jumpnull;
@@ -2210,11 +2235,32 @@ llvm_compile_expr(ExprState *state)
l_int32_const(op->d.agg_strict_trans_check.setoff);
v_transno =
l_int32_const(op->d.agg_strict_trans_check.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_transnull = l_bb_before_v(
+ opblocks[i + 1], "op.%d.check_transnull", i);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[jumpnull],
+ b_check_transnull);
+
+ LLVMPositionBuilderAtEnd(b, b_check_transnull);
+ }
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_transnull =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_TRANSVALUEISNULL,
@@ -2230,7 +2276,9 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_PLAIN_TRANS_BYVAL:
+ case EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED:
case EEOP_AGG_PLAIN_TRANS:
+ case EEOP_AGG_PLAIN_TRANS_SPILLED:
{
AggState *aggstate;
AggStatePerTrans pertrans;
@@ -2256,6 +2304,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_pertransp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_retval;
@@ -2283,10 +2332,33 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_trans.setoff);
v_transno = l_int32_const(op->d.agg_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED ||
+ opcode == EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_advance_transval = l_bb_before_v(
+ opblocks[i + 1], "op.%d.advance_transval", i);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[i + 1],
+ b_advance_transval);
+
+ LLVMPositionBuilderAtEnd(b, b_advance_transval);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_fcinfo = l_ptr_const(fcinfo,
l_ptr(StructFunctionCallInfoData));
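The three opcode variants above all emit the same guard. In interpreted form
the added check is roughly the following (a sketch only, not the actual
execExprInterp.c code):

    /* sketch: what the *_SPILLED opcodes add before the normal logic */
    AggStatePerGroup pergroup_allaggs = aggstate->all_pergroups[setoff];

    if (pergroup_allaggs == NULL)
    {
        /*
         * The tuple was spilled for this grouping set: skip the transition
         * (or, for the strict-trans check, jump to jumpnull).
         */
    }
    else
    {
        AggStatePerGroup pergroup = &pergroup_allaggs[transno];

        /* ... proceed exactly as the non-spilled opcode would ... */
    }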
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c5f6593485..3f0d289963 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
bool enable_hashagg = true;
+bool enable_hashagg_spill = true;
bool enable_nestloop = true;
bool enable_material = true;
bool enable_mergejoin = true;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7fe11b59a0..511f8861a8 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4255,6 +4255,9 @@ consider_groupingsets_paths(PlannerInfo *root,
* gd->rollups is empty if we have only unsortable columns to work
* with. Override work_mem in that case; otherwise, we'll rely on the
* sorted-input case to generate usable mixed paths.
+ *
+ * TODO: think more about how to plan grouping sets when spilling hash
+ * tables is an option
*/
if (hashsize > work_mem * 1024L && gd->rollups)
return; /* nope, won't fit */
@@ -6527,7 +6530,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
* were unable to sort above, then we'd better generate a Path, so
* that we at least have one.
*/
- if (hashaggtablesize < work_mem * 1024L ||
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L ||
grouped_rel->pathlist == NIL)
{
/*
@@ -6560,7 +6564,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
agg_final_costs,
dNumGroups);
- if (hashaggtablesize < work_mem * 1024L)
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6829,7 +6834,7 @@ create_partial_grouping_paths(PlannerInfo *root,
* Tentatively produce a partial HashAgg Path, depending on if it
* looks as if the hash table will fit in work_mem.
*/
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_total_path != NULL)
{
add_path(partially_grouped_rel, (Path *)
@@ -6856,7 +6861,7 @@ create_partial_grouping_paths(PlannerInfo *root,
dNumPartialPartialGroups);
/* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_partial_path != NULL)
{
add_partial_path(partially_grouped_rel, (Path *)
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 3bf96de256..b0cb1d7e6b 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -120,6 +120,7 @@ bool enableFsync = true;
bool allowSystemTableMods = false;
int work_mem = 1024;
int maintenance_work_mem = 16384;
+bool hashagg_mem_overflow = false;
int max_parallel_maintenance_workers = 2;
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ba74bf9f7d..d2b66a7f46 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -957,6 +957,26 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_hashagg_spill", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of hashed aggregation plans that are expected to exceed work_mem."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_hashagg_spill,
+ true,
+ NULL, NULL, NULL
+ },
+ {
+ {"hashagg_mem_overflow", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables hashed aggregation to overflow work_mem at execution time."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &hashagg_mem_overflow,
+ false,
+ NULL, NULL, NULL
+ },
{
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of materialization."),
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index d21dbead0a..e50a7ad671 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -226,9 +226,13 @@ typedef enum ExprEvalOp
EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
EEOP_AGG_INIT_TRANS,
+ EEOP_AGG_INIT_TRANS_SPILLED,
EEOP_AGG_STRICT_TRANS_CHECK,
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
EEOP_AGG_PLAIN_TRANS_BYVAL,
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
EEOP_AGG_PLAIN_TRANS,
+ EEOP_AGG_PLAIN_TRANS_SPILLED,
EEOP_AGG_ORDERED_TRANS_DATUM,
EEOP_AGG_ORDERED_TRANS_TUPLE,
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 6298c7c8ca..e8d88f2ce2 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -140,11 +140,17 @@ extern TupleHashTable BuildTupleHashTableExt(PlanState *parent,
extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
bool *isnew);
+extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable,
+ TupleTableSlot *slot,
+ bool *isnew, uint32 hash);
extern TupleHashEntry FindTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
ExprState *eqcomp,
FmgrInfo *hashfunctions);
+extern uint32 TupleHashTableHash(struct tuplehash_hash *tb,
+ const MinimalTuple tuple);
extern void ResetTupleHashTable(TupleHashTable hashtable);
+extern void DestroyTupleHashTable(TupleHashTable hashtable);
/*
* prototypes from functions in execJunk.c
@@ -250,7 +256,7 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
- bool doSort, bool doHash);
+ bool doSort, bool doHash, bool spilled);
extern ExprState *ExecBuildGroupingEqual(TupleDesc ldesc, TupleDesc rdesc,
const TupleTableSlotOps *lops, const TupleTableSlotOps *rops,
int numCols,
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bc6e03fbc7..321759ead5 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -244,6 +244,7 @@ extern bool enableFsync;
extern PGDLLIMPORT bool allowSystemTableMods;
extern PGDLLIMPORT int work_mem;
extern PGDLLIMPORT int maintenance_work_mem;
+extern PGDLLIMPORT bool hashagg_mem_overflow;
extern PGDLLIMPORT int max_parallel_maintenance_workers;
extern int VacuumCostPageHit;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 692438d6df..b9803a28bd 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2070,13 +2070,27 @@ typedef struct AggState
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
/* these fields are used in AGG_HASHED and AGG_MIXED modes: */
bool table_filled; /* hash table filled yet? */
- int num_hashes;
+ int num_hashes; /* number of hash tables active at once */
+ bool hash_spilled; /* any hash table ever spilled? */
+ double hashentrysize; /* estimate revised during execution */
+ struct HashAggSpill *hash_spills; /* HashAggSpill for each hash table,
+ exists only during first pass if spilled */
+ TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
+ Size hash_mem_limit; /* limit before spilling hash table */
+ Size hash_mem_peak; /* peak hash table memory usage */
+ uint64 hash_ngroups_current; /* number of groups currently in
+ memory in all hash tables */
+ Size hash_mem_current; /* current hash table memory usage */
+ uint64 hash_disk_used; /* bytes of disk space used */
+ int hash_batches_used; /* batches used during entire execution */
+ List *hash_batches; /* hash batches remaining to be processed */
+
AggStatePerHash perhash; /* array of per-hashtable data */
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 34
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 45
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b3d0b4f6fb..b72e2d0829 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -54,6 +54,7 @@ extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
extern PGDLLIMPORT bool enable_hashagg;
+extern PGDLLIMPORT bool enable_hashagg_spill;
extern PGDLLIMPORT bool enable_nestloop;
extern PGDLLIMPORT bool enable_material;
extern PGDLLIMPORT bool enable_mergejoin;
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index 0b097f9652..a9ddcce3d3 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2331,3 +2331,95 @@ explain (costs off)
-> Seq Scan on onek
(8 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+set work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------------
+ GroupAggregate
+ Group Key: ((g % 100000))
+ -> Sort
+ Sort Key: ((g % 100000))
+ -> Function Scan on generate_series g
+(5 rows)
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+set jit_above_cost to default;
+create table agg_group_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_group_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+-- Produce results with hash aggregation
+set enable_hashagg = true;
+set enable_sort = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 100000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+set jit_above_cost to default;
+create table agg_hash_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_hash_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+set enable_sort = true;
+set work_mem to default;
+-- Compare group aggregation results to hash aggregation results
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index c1f802c88a..767f60a96c 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1633,4 +1633,127 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
| 1 | 2
(4 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+-- Produce results with hash aggregation.
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+set enable_sort = true;
+set work_mem to default;
+-- Compare results
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+ g100 | g10 | unnest | c | m
+------+-----+--------+---+---
+(0 rows)
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
-- end
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..11c6f50fbf 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -148,6 +148,68 @@ SELECT count(*) FROM
4
(1 row)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+SET enable_hashagg=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------------
+ Unique
+ -> Sort
+ Sort Key: ((g % 1000))
+ -> Function Scan on generate_series g
+(4 rows)
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_hashagg=TRUE;
+-- Produce results with hash aggregation.
+SET enable_sort=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 1000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_sort=TRUE;
+SET work_mem TO DEFAULT;
+-- Compare results
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+(SELECT * FROM distinct_hash_2 EXCEPT SELECT * FROM distinct_group_2)
+ UNION ALL
+(SELECT * FROM distinct_group_2 EXCEPT SELECT * FROM distinct_hash_2);
+ text 
+------
+(0 rows)
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..c40bf6c16e 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -75,6 +75,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
+ enable_hashagg_spill | on
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 17fb256aec..bcd336c581 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1017,3 +1017,91 @@ select v||'a', case when v||'a' = 'aa' then 1 else 0 end, count(*)
explain (costs off)
select 1 from tenk1
where (hundred, thousand) in (select twothousand, twothousand from onek);
+
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+set work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+set jit_above_cost to default;
+
+create table agg_group_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_group_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+-- Produce results with hash aggregation
+
+set enable_hashagg = true;
+set enable_sort = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+set jit_above_cost to default;
+
+create table agg_hash_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_hash_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare group aggregation results to hash aggregation results
+
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 95ac3fb52f..bf8bce6ed3 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -441,4 +441,103 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
from unnest(array[1,1], array['a','b']) u(i,v)
group by rollup(i, v||'a') order by 1,3;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+-- Produce results with hash aggregation.
+
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare results
+
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
+
-- end
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..33102744eb 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -45,6 +45,68 @@ SELECT count(*) FROM
SELECT count(*) FROM
(SELECT DISTINCT two, four, two FROM tenk1) ss;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+SET enable_hashagg=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_hashagg=TRUE;
+
+-- Produce results with hash aggregation.
+
+SET enable_sort=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_sort=TRUE;
+
+SET work_mem TO DEFAULT;
+
+-- Compare results
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+(SELECT * FROM distinct_hash_2 EXCEPT SELECT * FROM distinct_group_2)
+ UNION ALL
+(SELECT * FROM distinct_group_2 EXCEPT SELECT * FROM distinct_hash_2);
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
+
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
On Sat, Dec 14, 2019 at 06:32:25PM +0100, Tomas Vondra wrote:
I've done a bit more testing on this, after resolving a couple of minor
conflicts due to recent commits (rebased version attached).

In particular, I've made a comparison with different dataset sizes, group
sizes, GUC settings etc. The script and results from two different
machines are available here:

The script essentially runs a simple grouping query with different
number of rows, groups, work_mem and parallelism settings. There's
nothing particularly magical about it.
Nice!
I did run it both on master and patched code, allowing us to compare
results and assess the impact of the patch. Overall, the changes are as
expected and either neutral or beneficial, i.e. the timings are the same
or faster.

The number of cases that regressed is fairly small, but sometimes the
regressions are annoyingly large - up to 2x in some cases. Consider for
example this trivial example with 100M rows:
I suppose this is because the patch has no costing changes yet. I hacked
up a quick spill penalty for hash agg, just some value based on
(groups_in_hashtable * num_of_input_tuples)/num_groups_from_planner, and
with that it would not choose hash aggregation in this case.

However, that penalty is not quite right, because compared to the external
sort algorithm, hash aggregation can respill recursively, which involves
even more I/O, especially with a very large number of groups but very few
tuples per group, like the test you ran. Getting the costing right will be
a challenge.
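
For concreteness, here is a minimal, purely illustrative sketch of such a
penalty. The function name, its parameters, and the cost_per_spilled_tuple
factor are assumptions for illustration, not planner code from the patch,
and it deliberately ignores the cost of recursive respilling described
above:

static double
hashagg_spill_penalty(double groups_in_hashtable,
                      double num_of_input_tuples,
                      double num_groups_from_planner,
                      double cost_per_spilled_tuple)
{
    double  spilled_fraction;

    /* everything is expected to fit in memory: no penalty */
    if (num_groups_from_planner <= groups_in_hashtable)
        return 0.0;

    /* fraction of groups that will not fit; their tuples go to disk */
    spilled_fraction = 1.0 - groups_in_hashtable / num_groups_from_planner;

    /* charge I/O for writing those tuples out and reading them back */
    return spilled_fraction * num_of_input_tuples * cost_per_spilled_tuple;
}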
BTW, Jeff, Greenplum has a test for hash agg spill. I modified it a little
to check how many batches a query uses; it's attached, in case it helps.
--
Adam Lee
Attachments:
On Tue, 2019-12-10 at 13:34 -0800, Adam Lee wrote:
Melanie and I tried this and have a patch that passes installcheck. The
way we verify it is by composing a wide table with long, unnecessary text
columns, then checking the size it writes on every iteration.

Please check out the attachment; it's based on your 1204 version.
Thank you. Attached a new patch that incorporates your projection work.
A few comments:
* You are only nulling out up to tts_nvalid, which means that you can
still end up storing more on disk if the wide column comes at the end
of the table and hasn't been deserialized yet. I fixed this by copying
needed attributes to the hash_spill_slot and making it virtual.
* aggregated_columns does not need to be a member of AggState; nor does
it need to be computed inside of the perhash loop. Aside: if adding a
field to AggState is necessary, you need to bump the field numbers of
later fields that are labeled for JIT use, otherwise it will break JIT.
* I used an array rather than a bitmapset. It makes it easier to find
the highest column (to do a slot_getsomeattrs), and it might be a
little more efficient for wide tables with mostly useless columns. (See
the excerpt after this list.)
* Style nitpick: don't mix code and declarations
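
For reference, the relevant part of lookup_hash_entries() in the attached
patch looks roughly like this; the needed-column array is sorted, so its
last element is the highest attribute number to deserialize, and only
those columns are copied into the (virtual) spill slot:

/* deserialize only up to the highest needed attribute */
if (perhash->numNeededColsInput > 0)
{
    AttrNumber  maxNeededAttr =
        perhash->allNeededColsInput[perhash->numNeededColsInput - 1];

    slot_getsomeattrs(inputslot, maxNeededAttr);
}

/* copy only the needed columns; the rest stay NULL in the spill slot */
for (idx = 0; idx < perhash->numNeededColsInput; idx++)
{
    AttrNumber  att = perhash->allNeededColsInput[idx];

    spillslot->tts_values[att - 1] = inputslot->tts_values[att - 1];
    spillslot->tts_isnull[att - 1] = inputslot->tts_isnull[att - 1];
}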
The updated patch also saves the transitionSpace calculation in the Agg
node for better hash table size estimation. This helps choose an initial
number of buckets for the hash table, and also cap the number of groups
we permit in the hash table when we expect the per-group transition
states to grow.
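
As a rough sketch of the arithmetic (illustrative names only; in the patch
the per-group size corresponds to hash_agg_entry_size(numtrans) plus the
saved transitionSpace, and the real logic lives in ExecInitAgg and
build_hash_table):

/*
 * Illustrative sketch, not the patch's code: derive a cap on the number
 * of in-memory groups from work_mem and the per-group footprint.
 */
static long
choose_group_limit(long work_mem_bytes,  /* work_mem converted to bytes */
                   long per_group_bytes, /* hash entry + transition state */
                   long partition_mem)   /* reserved for spill-file buffers */
{
    long    mem_limit = work_mem_bytes;

    /* leave room for spill partition buffers when we can afford it */
    if (mem_limit > partition_mem * 2)
        mem_limit -= partition_mem;

    /* stop creating new groups (and start spilling) beyond this count */
    return mem_limit / per_group_bytes;
}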
Regards,
Jeff Davis
Attachments:
hashagg-20191220.patch (text/x-patch)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d1c90282f9..89ced3cd978 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1751,6 +1751,23 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-hashagg-mem-overflow" xreflabel="hashagg_mem_overflow">
+ <term><varname>hashagg_mem_overflow</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>hashagg_mem_overflow</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ If hash aggregation exceeds <varname>work_mem</varname> at query
+ execution time, and <varname>hashagg_mem_overflow</varname> is set
+ to <literal>on</literal>, continue consuming more memory rather than
+ performing disk-based hash aggregation. The default
+ is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
@@ -4451,6 +4468,24 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-hashagg-spill" xreflabel="enable_hashagg_spill">
+ <term><varname>enable_hashagg_spill</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_hashagg_spill</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the memory usage is expected to
+ exceed <varname>work_mem</varname>. This only affects the planner
+ choice; actual behavior at execution time is dictated by
+ <xref linkend="guc-hashagg-mem-overflow"/>. The default
+ is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
<term><varname>enable_hashjoin</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 949fefa23ae..c2fb7a088a2 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -102,6 +102,7 @@ static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_hashagg_info(AggState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
static void show_instrumentation_count(const char *qlabel, int which,
@@ -1844,6 +1845,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Agg:
show_agg_keys(castNode(AggState, planstate), ancestors, es);
show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+ if (es->analyze)
+ show_hashagg_info((AggState *) planstate, es);
if (plan->qual)
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
@@ -2742,6 +2745,56 @@ show_hash_info(HashState *hashstate, ExplainState *es)
}
}
+/*
+ * If EXPLAIN ANALYZE, show information on hash aggregate memory usage and
+ * batches.
+ */
+static void
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+ long diskKb = (aggstate->hash_disk_used + 1023) / 1024;
+
+
+ Assert(IsA(aggstate, AggState));
+
+ if (agg->aggstrategy != AGG_HASHED &&
+ agg->aggstrategy != AGG_MIXED)
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(
+ es->str,
+ "Memory Usage: %ldkB",
+ memPeakKb);
+
+ if (aggstate->hash_batches_used > 0)
+ {
+ appendStringInfo(
+ es->str,
+ " Batches: %d Disk: %ldkB",
+ aggstate->hash_batches_used, diskKb);
+ }
+
+ appendStringInfo(
+ es->str,
+ "\n");
+ }
+ else
+ {
+ ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
+ if (aggstate->hash_batches_used > 0)
+ {
+ ExplainPropertyInteger("HashAgg Batches", NULL,
+ aggstate->hash_batches_used, es);
+ ExplainPropertyInteger("Disk Usage", "kB", diskKb, es);
+ }
+ }
+}
+
/*
* If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
*/
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 8da2e2dcbba..fb3e81764ad 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -79,7 +79,8 @@ static void ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash);
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled);
/*
@@ -2927,7 +2928,7 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
*/
ExprState *
ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
- bool doSort, bool doHash)
+ bool doSort, bool doHash, bool spilled)
{
ExprState *state = makeNode(ExprState);
PlanState *parent = &aggstate->ss.ps;
@@ -3160,7 +3161,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (setno = 0; setno < processGroupingSets; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, false);
+ pertrans, transno, setno, setoff, false,
+ spilled);
setoff++;
}
}
@@ -3178,7 +3180,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (setno = 0; setno < numHashes; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, true);
+ pertrans, transno, setno, setoff, true,
+ spilled);
setoff++;
}
}
@@ -3226,7 +3229,8 @@ static void
ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash)
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled)
{
int adjust_init_jumpnull = -1;
int adjust_strict_jumpnull = -1;
@@ -3248,7 +3252,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
fcinfo->flinfo->fn_strict &&
pertrans->initValueIsNull)
{
- scratch->opcode = EEOP_AGG_INIT_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_INIT_TRANS_SPILLED : EEOP_AGG_INIT_TRANS;
scratch->d.agg_init_trans.aggstate = aggstate;
scratch->d.agg_init_trans.pertrans = pertrans;
scratch->d.agg_init_trans.setno = setno;
@@ -3265,7 +3270,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
if (pertrans->numSortCols == 0 &&
fcinfo->flinfo->fn_strict)
{
- scratch->opcode = EEOP_AGG_STRICT_TRANS_CHECK;
+ scratch->opcode = spilled ?
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED : EEOP_AGG_STRICT_TRANS_CHECK;
scratch->d.agg_strict_trans_check.aggstate = aggstate;
scratch->d.agg_strict_trans_check.setno = setno;
scratch->d.agg_strict_trans_check.setoff = setoff;
@@ -3283,9 +3289,11 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
/* invoke appropriate transition implementation */
if (pertrans->numSortCols == 0 && pertrans->transtypeByVal)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS_BYVAL;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED : EEOP_AGG_PLAIN_TRANS_BYVAL;
else if (pertrans->numSortCols == 0)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_SPILLED : EEOP_AGG_PLAIN_TRANS;
else if (pertrans->numInputs == 1)
scratch->opcode = EEOP_AGG_ORDERED_TRANS_DATUM;
else
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index dbed5978162..49fbf8e4a42 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -430,9 +430,13 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
&&CASE_EEOP_AGG_INIT_TRANS,
+ &&CASE_EEOP_AGG_INIT_TRANS_SPILLED,
&&CASE_EEOP_AGG_STRICT_TRANS_CHECK,
+ &&CASE_EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_SPILLED,
&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
&&CASE_EEOP_LAST
@@ -1625,6 +1629,36 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ aggstate = op->d.agg_init_trans.aggstate;
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_init_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
+
+ /* If transValue has not yet been initialized, do so now. */
+ if (pergroup->noTransValue)
+ {
+ AggStatePerTrans pertrans = op->d.agg_init_trans.pertrans;
+
+ aggstate->curaggcontext = op->d.agg_init_trans.aggcontext;
+ aggstate->current_set = op->d.agg_init_trans.setno;
+
+ ExecAggInitGroup(aggstate, pertrans, pergroup);
+
+ /* copied trans value from input, done this round */
+ EEO_JUMP(op->d.agg_init_trans.jumpnull);
+ }
+
+ EEO_NEXT();
+ }
/* check that a strict aggregate's input isn't NULL */
EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK)
@@ -1642,6 +1676,25 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ aggstate = op->d.agg_strict_trans_check.aggstate;
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_strict_trans_check.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_strict_trans_check.transno];
+
+ if (unlikely(pergroup->transValueIsNull))
+ EEO_JUMP(op->d.agg_strict_trans_check.jumpnull);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1691,6 +1744,52 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ aggstate = op->d.agg_trans.aggstate;
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ Assert(pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1756,6 +1855,67 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ aggstate = op->d.agg_trans.aggstate;
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ Assert(!pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
+ /*
+ * For pass-by-ref datatype, must copy the new value into
+ * aggcontext and free the prior transValue. But if transfn
+ * returned a pointer to its first input, we don't need to do
+ * anything. Also, if transfn returned a pointer to a R/W
+ * expanded object that is already a child of the aggcontext,
+ * assume we can adopt that value without copying it.
+ */
+ if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+ newVal = ExecAggTransReparent(aggstate, pertrans,
+ newVal, fcinfo->isnull,
+ pergroup->transValue,
+ pergroup->transValueIsNull);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
/* process single-column ordered aggregate datum */
EEO_CASE(EEOP_AGG_ORDERED_TRANS_DATUM)
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index e361143094c..02dba3eac18 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -25,8 +25,9 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
-static uint32 TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple);
static int TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
+static TupleHashEntry LookupTupleHashEntry_internal(
+ TupleHashTable hashtable, TupleTableSlot *slot, bool *isnew, uint32 hash);
/*
* Define parameters for tuple hash table code generation. The interface is
@@ -284,6 +285,17 @@ ResetTupleHashTable(TupleHashTable hashtable)
tuplehash_reset(hashtable->hashtab);
}
+/*
+ * Destroy the hash table. Note that the tablecxt passed to
+ * BuildTupleHashTableExt() should also be reset, otherwise there will be
+ * leaks.
+ */
+void
+DestroyTupleHashTable(TupleHashTable hashtable)
+{
+ tuplehash_destroy(hashtable->hashtab);
+}
+
/*
* Find or create a hashtable entry for the tuple group containing the
* given tuple. The tuple must be the same type as the hashtable entries.
@@ -300,10 +312,9 @@ TupleHashEntry
LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
bool *isnew)
{
- TupleHashEntryData *entry;
- MemoryContext oldContext;
- bool found;
- MinimalTuple key;
+ TupleHashEntry entry;
+ MemoryContext oldContext;
+ uint32 hash;
/* Need to run the hash functions in short-lived context */
oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
@@ -313,32 +324,29 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
hashtable->cur_eq_func = hashtable->tab_eq_func;
- key = NULL; /* flag to reference inputslot */
+ hash = TupleHashTableHash(hashtable->hashtab, NULL);
+ entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
- if (isnew)
- {
- entry = tuplehash_insert(hashtable->hashtab, key, &found);
+ MemoryContextSwitchTo(oldContext);
- if (found)
- {
- /* found pre-existing entry */
- *isnew = false;
- }
- else
- {
- /* created new entry */
- *isnew = true;
- /* zero caller data */
- entry->additional = NULL;
- MemoryContextSwitchTo(hashtable->tablecxt);
- /* Copy the first tuple into the table context */
- entry->firstTuple = ExecCopySlotMinimalTuple(slot);
- }
- }
- else
- {
- entry = tuplehash_lookup(hashtable->hashtab, key);
- }
+ return entry;
+}
+
+/*
+ * A variant of LookupTupleHashEntry for callers that have already computed
+ * the hash value.
+ */
+TupleHashEntry
+LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot,
+ bool *isnew, uint32 hash)
+{
+ TupleHashEntry entry;
+ MemoryContext oldContext;
+
+ /* Need to run the hash functions in short-lived context */
+ oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
+
+ entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
MemoryContextSwitchTo(oldContext);
@@ -389,7 +397,7 @@ FindTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
* Also, the caller must select an appropriate memory context for running
* the hash functions. (dynahash.c doesn't change CurrentMemoryContext.)
*/
-static uint32
+uint32
TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
{
TupleHashTable hashtable = (TupleHashTable) tb->private_data;
@@ -450,6 +458,54 @@ TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
return murmurhash32(hashkey);
}
+/*
+ * Does the work of LookupTupleHashEntry and LookupTupleHashEntryHash. Useful
+ * so that we can avoid switching the memory context multiple times for
+ * LookupTupleHashEntry.
+ */
+static TupleHashEntry
+LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot,
+ bool *isnew, uint32 hash)
+{
+ TupleHashEntryData *entry;
+ bool found;
+ MinimalTuple key;
+
+ /* set up data needed by hash and match functions */
+ hashtable->inputslot = slot;
+ hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ hashtable->cur_eq_func = hashtable->tab_eq_func;
+
+ key = NULL; /* flag to reference inputslot */
+
+ if (isnew)
+ {
+ entry = tuplehash_insert_hash(hashtable->hashtab, key, hash, &found);
+
+ if (found)
+ {
+ /* found pre-existing entry */
+ *isnew = false;
+ }
+ else
+ {
+ /* created new entry */
+ *isnew = true;
+ /* zero caller data */
+ entry->additional = NULL;
+ MemoryContextSwitchTo(hashtable->tablecxt);
+ /* Copy the first tuple into the table context */
+ entry->firstTuple = ExecCopySlotMinimalTuple(slot);
+ }
+ }
+ else
+ {
+ entry = tuplehash_lookup_hash(hashtable->hashtab, key, hash);
+ }
+
+ return entry;
+}
+
/*
* See whether two tuples (presumably of the same hash value) match
*/
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 6ee24eab3d2..f1989b10eac 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -194,6 +194,18 @@
* transition values. hashcontext is the single context created to support
* all hash tables.
*
+ * When the hash table memory exceeds work_mem, we advance the transition
+ * states only for groups already in the hash table. For tuples that would
+ * need to create new hash table entries (and initialize new transition
+ * states), we spill them to disk to be processed later. The tuples are
+ * spilled in a partitioned manner, so that subsequent batches are smaller
+ * and less likely to exceed work_mem (if a batch does exceed work_mem, it
+ * must be spilled recursively).
+ *
+ * Note that it's possible for transition states to start small but then
+ * grow very large; for instance in the case of ARRAY_AGG. In such cases,
+ * it's still possible to significantly exceed work_mem.
+ *
* Transition / Combine function invocation:
*
* For performance reasons transition functions, including combine
@@ -229,15 +241,70 @@
#include "optimizer/optimizer.h"
#include "parser/parse_agg.h"
#include "parser/parse_coerce.h"
+#include "storage/buffile.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/datum.h"
+#include "utils/dynahash.h"
#include "utils/expandeddatum.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+/*
+ * Control how many partitions are created when spilling HashAgg to
+ * disk.
+ *
+ * HASH_PARTITION_FACTOR is multiplied by the estimated number of partitions
+ * needed such that each partition will fit in memory. The factor is set
+ * higher than one because there's not a high cost to having a few too many
+ * partitions, and it makes it less likely that a partition will need to be
+ * spilled recursively. Another benefit of having more, smaller partitions is
+ * that small hash tables may perform better than large ones due to memory
+ * caching effects.
+ *
+ * HASH_PARTITION_MEM is the approximate amount of work_mem we should reserve
+ * for the partitions themselves (i.e. buffering of the files backing the
+ * partitions). This is sloppy, because we must reserve the memory before
+ * filling the hash table; but we choose the number of partitions at the time
+ * we need to spill.
+ *
+ * We also specify a min and max number of partitions per spill. Too few might
+ * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
+ * many will result in lots of memory wasted buffering the spill files (and
+ * possibly pushing hidden costs to the OS for managing more files).
+ */
+#define HASH_PARTITION_FACTOR 1.50
+#define HASH_MIN_PARTITIONS 4
+#define HASH_MAX_PARTITIONS 256
+#define HASH_PARTITION_MEM (HASH_MIN_PARTITIONS * BLCKSZ)
+
+/*
+ * Represents partitioned spill data for a single hashtable.
+ */
+typedef struct HashAggSpill
+{
+ int n_partitions; /* number of output partitions */
+ int partition_bits; /* log2(n_partitions); number of hash bits used
+ for the partition mask, after the parent's partition bits */
+ BufFile **partitions; /* output partition files */
+ int64 *ntuples; /* number of tuples in each partition */
+} HashAggSpill;
+
+/*
+ * Represents work to be done for one pass of hash aggregation. Initially,
+ * only the input fields are set. If spilled to disk, also set the spill data.
+ */
+typedef struct HashAggBatch
+{
+ BufFile *input_file; /* input partition */
+ int input_bits; /* number of bits for input partition mask */
+ int64 input_tuples; /* number of tuples in this batch */
+ int setno; /* grouping set */
+ HashAggSpill spill; /* spill output */
+} HashAggBatch;
+
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
static void initialize_phase(AggState *aggstate, int newphase);
static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -271,12 +338,35 @@ static void finalize_aggregates(AggState *aggstate,
static TupleTableSlot *project_aggregates(AggState *aggstate);
static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
-static void build_hash_table(AggState *aggstate);
-static TupleHashEntryData *lookup_hash_entry(AggState *aggstate);
+static void build_hash_table(AggState *aggstate, int setno,
+ int64 ngroups_estimate);
+static void prepare_hash_slot(AggState *aggstate);
+static void hash_recompile_expressions(AggState *aggstate);
+static uint32 calculate_hash(AggState *aggstate);
+static long hash_choose_num_buckets(AggState *aggstate,
+ long estimated_nbuckets,
+ Size memory);
+static int hash_choose_num_spill_partitions(uint64 input_groups,
+ double hashentrysize);
+static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
+static bool agg_refill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
+static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
+static void hash_spill_init(HashAggSpill *spill, int input_bits,
+ uint64 input_tuples, double hashentrysize);
+static Size hash_spill_tuple(HashAggSpill *spill, int input_bits,
+ TupleTableSlot *slot, uint32 hash);
+static MinimalTuple hash_read_spilled(BufFile *file, uint32 *hashp);
+static HashAggBatch *hash_batch_new(BufFile *input_file, int setno,
+ int64 input_tuples, int input_bits);
+static void hash_finish_initial_spills(AggState *aggstate);
+static void hash_spill_finish(AggState *aggstate, HashAggSpill *spill,
+ int setno, int input_bits);
+static void hash_reset_spill(HashAggSpill *spill);
+static void hash_reset_spills(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
AggState *aggstate, EState *estate,
@@ -1201,6 +1291,68 @@ project_aggregates(AggState *aggstate)
return NULL;
}
+static bool
+find_aggregated_cols_walker(Node *node, Bitmapset **colnos)
+{
+ if (node == NULL)
+ return false;
+
+ if (IsA(node, Var))
+ {
+ Var *var = (Var *) node;
+
+ *colnos = bms_add_member(*colnos, var->varattno);
+
+ return false;
+ }
+ return expression_tree_walker(node, find_aggregated_cols_walker,
+ (void *) colnos);
+}
+
+/*
+ * find_aggregated_cols
+ * Construct a bitmapset of the column numbers of aggregated Vars
+ * appearing in our targetlist and qual (HAVING clause)
+ */
+static Bitmapset *
+find_aggregated_cols(AggState *aggstate)
+{
+ Agg *node = (Agg *) aggstate->ss.ps.plan;
+ Bitmapset *colnos = NULL;
+ ListCell *temp;
+
+ /*
+ * We only want the columns used by aggregations in the targetlist or qual
+ */
+ if (node->plan.targetlist != NULL)
+ {
+ foreach(temp, (List *) node->plan.targetlist)
+ {
+ if (IsA(lfirst(temp), TargetEntry))
+ {
+ Node *node = (Node *)((TargetEntry *)lfirst(temp))->expr;
+ if (IsA(node, Aggref) || IsA(node, GroupingFunc))
+ find_aggregated_cols_walker(node, &colnos);
+ }
+ }
+ }
+
+ if (node->plan.qual != NULL)
+ {
+ foreach(temp, (List *) node->plan.qual)
+ {
+ if (IsA(lfirst(temp), TargetEntry))
+ {
+ Node *node = (Node *)((TargetEntry *)lfirst(temp))->expr;
+ if (IsA(node, Aggref) || IsA(node, GroupingFunc))
+ find_aggregated_cols_walker(node, &colnos);
+ }
+ }
+ }
+
+ return colnos;
+}
+
/*
* find_unaggregated_cols
* Construct a bitmapset of the column numbers of un-aggregated Vars
@@ -1254,46 +1406,80 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
* for each entry.
*
* We have a separate hashtable and associated perhash data structure for each
- * grouping set for which we're doing hashing.
+ * grouping set for which we're doing hashing. If setno is -1, build hash
+ * tables for all grouping sets. Otherwise, build only for the specified
+ * grouping set.
*
* The contents of the hash tables always live in the hashcontext's per-tuple
* memory context (there is only one of these for all tables together, since
* they are all reset at the same time).
*/
static void
-build_hash_table(AggState *aggstate)
+build_hash_table(AggState *aggstate, int setno, long ngroups_estimate)
{
- MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
- Size additionalsize;
- int i;
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
+ Size additionalsize;
+ int i;
Assert(aggstate->aggstrategy == AGG_HASHED || aggstate->aggstrategy == AGG_MIXED);
- additionalsize = aggstate->numtrans * sizeof(AggStatePerGroupData);
+ additionalsize = aggstate->numtrans * sizeof(AggStatePerGroupData) +
+ agg->transSpace;
for (i = 0; i < aggstate->num_hashes; ++i)
{
AggStatePerHash perhash = &aggstate->perhash[i];
+ int64 ngroups;
+ long nbuckets;
+ Size memory;
Assert(perhash->aggnode->numGroups > 0);
if (perhash->hashtable)
- ResetTupleHashTable(perhash->hashtable);
- else
- perhash->hashtable = BuildTupleHashTableExt(&aggstate->ss.ps,
- perhash->hashslot->tts_tupleDescriptor,
- perhash->numCols,
- perhash->hashGrpColIdxHash,
- perhash->eqfuncoids,
- perhash->hashfunctions,
- perhash->aggnode->grpCollations,
- perhash->aggnode->numGroups,
- additionalsize,
- aggstate->ss.ps.state->es_query_cxt,
- aggstate->hashcontext->ecxt_per_tuple_memory,
- tmpmem,
- DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
+ DestroyTupleHashTable(perhash->hashtable);
+ perhash->hashtable = NULL;
+
+ /*
+ * If we are building a hash table for only a single grouping set,
+ * skip the others.
+ */
+ if (setno >= 0 && setno != i)
+ continue;
+
+ /*
+ * Use an estimate from execution time if we have it; otherwise fall
+ * back to the planner estimate.
+ */
+ ngroups = ngroups_estimate > 0 ?
+ ngroups_estimate : perhash->aggnode->numGroups;
+
+ /* divide memory by the number of hash tables we are initializing */
+ memory = (long)work_mem * 1024L /
+ (setno >= 0 ? 1 : aggstate->num_hashes);
+
+ /* choose reasonable number of buckets per hashtable */
+ nbuckets = hash_choose_num_buckets(aggstate, ngroups, memory);
+
+ perhash->hashtable = BuildTupleHashTableExt(&aggstate->ss.ps,
+ perhash->hashslot->tts_tupleDescriptor,
+ perhash->numCols,
+ perhash->hashGrpColIdxHash,
+ perhash->eqfuncoids,
+ perhash->hashfunctions,
+ perhash->aggnode->grpCollations,
+ nbuckets,
+ additionalsize,
+ aggstate->ss.ps.state->es_query_cxt,
+ aggstate->hashcontext->ecxt_per_tuple_memory,
+ tmpmem,
+ DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
}
+
+ aggstate->hash_mem_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_ngroups_current = 0;
+ aggstate->hash_no_new_groups = false;
}
/*
@@ -1325,6 +1511,7 @@ static void
find_hash_columns(AggState *aggstate)
{
Bitmapset *base_colnos;
+ Bitmapset *aggregated_colnos;
List *outerTlist = outerPlanState(aggstate)->plan->targetlist;
int numHashes = aggstate->num_hashes;
EState *estate = aggstate->ss.ps.state;
@@ -1332,11 +1519,13 @@ find_hash_columns(AggState *aggstate)
/* Find Vars that will be needed in tlist and qual */
base_colnos = find_unaggregated_cols(aggstate);
+ aggregated_colnos = find_aggregated_cols(aggstate);
for (j = 0; j < numHashes; ++j)
{
AggStatePerHash perhash = &aggstate->perhash[j];
Bitmapset *colnos = bms_copy(base_colnos);
+ Bitmapset *allNeededColsInput;
AttrNumber *grpColIdx = perhash->aggnode->grpColIdx;
List *hashTlist = NIL;
TupleDesc hashDesc;
@@ -1383,6 +1572,19 @@ find_hash_columns(AggState *aggstate)
for (i = 0; i < perhash->numCols; i++)
colnos = bms_add_member(colnos, grpColIdx[i]);
+ /*
+ * Track the necessary columns from the input. This is important for
+ * spilling tuples so that we don't waste disk space with unneeded
+ * columns.
+ */
+ allNeededColsInput = bms_union(colnos, aggregated_colnos);
+ perhash->numNeededColsInput = 0;
+ perhash->allNeededColsInput = palloc(
+ bms_num_members(allNeededColsInput) * sizeof(AttrNumber));
+
+ while ((i = bms_first_member(allNeededColsInput)) >= 0)
+ perhash->allNeededColsInput[perhash->numNeededColsInput++] = i;
+
/*
* First build mapping for columns directly hashed. These are the
* first, because they'll be accessed when computing hash values and
@@ -1455,22 +1657,16 @@ hash_agg_entry_size(int numAggs)
}
/*
- * Find or create a hashtable entry for the tuple group containing the current
- * tuple (already set in tmpcontext's outertuple slot), in the current grouping
- * set (which the caller must have selected - note that initialize_aggregate
- * depends on this).
- *
- * When called, CurrentMemoryContext should be the per-query context.
+ * Extract the attributes that make up the grouping key into the
+ * hashslot. This is necessary to compute the hash of the grouping key.
*/
-static TupleHashEntryData *
-lookup_hash_entry(AggState *aggstate)
+static void
+prepare_hash_slot(AggState *aggstate)
{
- TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
- TupleTableSlot *hashslot = perhash->hashslot;
- TupleHashEntryData *entry;
- bool isnew;
- int i;
+ TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ int i;
/* transfer just the needed columns into hashslot */
slot_getsomeattrs(inputslot, perhash->largestGrpColIdx);
@@ -1484,14 +1680,185 @@ lookup_hash_entry(AggState *aggstate)
hashslot->tts_isnull[i] = inputslot->tts_isnull[varNumber];
}
ExecStoreVirtualTuple(hashslot);
+}
+
+/*
+ * Recompile the expressions for advancing aggregates while hashing. This is
+ * necessary for certain kinds of state changes that affect the resulting
+ * expression. For instance, changing aggstate->hash_spilled or
+ * aggstate->ss.ps.outerops require recompilation.
+ */
+static void
+hash_recompile_expressions(AggState *aggstate)
+{
+ AggStatePerPhase phase;
+
+ Assert(aggstate->aggstrategy == AGG_HASHED ||
+ aggstate->aggstrategy == AGG_MIXED);
+
+ if (aggstate->aggstrategy == AGG_HASHED)
+ phase = &aggstate->phases[0];
+ else /* AGG_MIXED */
+ phase = &aggstate->phases[1];
+
+ phase->evaltrans = ExecBuildAggTrans(
+ aggstate, phase,
+ aggstate->aggstrategy == AGG_MIXED ? true : false, /* dosort */
+ true, /* dohash */
+ aggstate->hash_spilled /* spilled */);
+}
+
+/*
+ * Calculate the hash value for a tuple. It's useful to do this outside of the
+ * hash table so that we can reuse saved hash values rather than recomputing.
+ */
+static uint32
+calculate_hash(AggState *aggstate)
+{
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleHashTable hashtable = perhash->hashtable;
+ MemoryContext oldContext;
+ uint32 hash;
+
+ /* set up data needed by hash and match functions */
+ hashtable->inputslot = perhash->hashslot;
+ hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ hashtable->cur_eq_func = hashtable->tab_eq_func;
+
+ /* Need to run the hash functions in short-lived context */
+ oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
+
+ hash = TupleHashTableHash(hashtable->hashtab, NULL);
+
+ MemoryContextSwitchTo(oldContext);
+
+ return hash;
+}
+
+/*
+ * Choose a reasonable number of buckets for the initial hash table size.
+ */
+static long
+hash_choose_num_buckets(AggState *aggstate, long ngroups, Size memory)
+{
+ long max_nbuckets;
+ int log2_ngroups;
+ long nbuckets;
+
+ max_nbuckets = memory / aggstate->hashentrysize;
+
+ /*
+ * Lowest power of two greater than ngroups, without exceeding
+ * max_nbuckets.
+ */
+ for (log2_ngroups = 1, nbuckets = 2;
+ nbuckets < ngroups && nbuckets < max_nbuckets;
+ log2_ngroups++, nbuckets <<= 1);
+
+ if (nbuckets > max_nbuckets && nbuckets > 2)
+ nbuckets >>= 1;
+
+ return nbuckets;
+}
+
+/*
+ * Determine the number of partitions to create when spilling.
+ */
+static int
+hash_choose_num_spill_partitions(uint64 input_groups, double hashentrysize)
+{
+ Size mem_needed;
+ int partition_limit;
+ int npartitions;
+
+ /*
+ * Avoid creating so many partitions that the memory requirements of the
+ * open partition files (estimated at BLCKSZ for buffering) are greater
+ * than 1/4 of work_mem.
+ */
+ partition_limit = (work_mem * 1024L * 0.25) / BLCKSZ;
+
+ /* pessimistically estimate that each input tuple creates a new group */
+ mem_needed = HASH_PARTITION_FACTOR * input_groups * hashentrysize;
+
+ /* make enough partitions so that each one is likely to fit in memory */
+ npartitions = 1 + (mem_needed / (work_mem * 1024L));
+
+ if (npartitions > partition_limit)
+ npartitions = partition_limit;
+
+ if (npartitions < HASH_MIN_PARTITIONS)
+ npartitions = HASH_MIN_PARTITIONS;
+ if (npartitions > HASH_MAX_PARTITIONS)
+ npartitions = HASH_MAX_PARTITIONS;
+
+ return npartitions;
+}
+
+/*
+ * Find or create a hashtable entry for the tuple group containing the current
+ * tuple (already set in tmpcontext's outertuple slot), in the current grouping
+ * set (which the caller must have selected - note that initialize_aggregate
+ * depends on this).
+ *
+ * When called, CurrentMemoryContext should be the per-query context.
+ *
+ * If the hash table is at the memory limit, then only find existing hashtable
+ * entries; don't create new ones. If a tuple's group is not already present
+ * in the hash table for the current grouping set, return NULL and the caller
+ * will spill it to disk.
+ */
+static AggStatePerGroup
+lookup_hash_entry(AggState *aggstate, uint32 hash)
+{
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ TupleHashEntryData *entry;
+ bool isnew = false;
+ bool *p_isnew;
+
+ /* if hash table already spilled, don't create new entries */
+ p_isnew = aggstate->hash_no_new_groups ? NULL : &isnew;
/* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntry(perhash->hashtable, hashslot, &isnew);
+ entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, p_isnew,
+ hash);
+
+ if (entry == NULL)
+ return NULL;
if (isnew)
{
- AggStatePerGroup pergroup;
- int transno;
+ AggStatePerGroup pergroup;
+ int transno;
+
+ aggstate->hash_ngroups_current++;
+
+ aggstate->hash_mem_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+
+ if (aggstate->hash_mem_current > aggstate->hash_mem_peak)
+ aggstate->hash_mem_peak = aggstate->hash_mem_current;
+
+ /*
+ * Check whether we need to spill. For small values of work_mem, the
+ * empty hash tables might exceed it; so don't spill unless there's at
+ * least one group in the hash table.
+ */
+ if (aggstate->hash_ngroups_current > 0 &&
+ (aggstate->hash_mem_current > aggstate->hash_mem_limit ||
+ aggstate->hash_ngroups_current > aggstate->hash_ngroups_limit))
+ {
+ aggstate->hash_no_new_groups = true;
+ if (!aggstate->hash_spilled)
+ {
+ aggstate->hash_spilled = true;
+ aggstate->hash_spills = palloc0(
+ sizeof(HashAggSpill) * aggstate->num_hashes);
+
+ hash_recompile_expressions(aggstate);
+ }
+ }
pergroup = (AggStatePerGroup)
MemoryContextAlloc(perhash->hashtable->tablecxt,
@@ -1511,7 +1878,7 @@ lookup_hash_entry(AggState *aggstate)
}
}
- return entry;
+ return entry->additional;
}
/*
@@ -1519,18 +1886,74 @@ lookup_hash_entry(AggState *aggstate)
* returning an array of pergroup pointers suitable for advance_aggregates.
*
* Be aware that lookup_hash_entry can reset the tmpcontext.
+ *
+ * Some entries may be left NULL if we have reached the limit and have begun
+ * to spill. The same tuple will belong to different groups for each set, so
+ * may match a group already in memory for one set and match a group not in
+ * memory for another set. If we have begun to spill and a tuple doesn't match
+ * a group in memory for a particular set, it will be spilled.
+ *
+ * NB: It's possible to spill the same tuple for several different grouping
+ * sets. This may seem wasteful, but it's actually a trade-off: if we spill
+ * the tuple multiple times for multiple grouping sets, it can be partitioned
+ * for each grouping set, making the refilling of the hash table very
+ * efficient.
*/
static void
lookup_hash_entries(AggState *aggstate)
{
- int numHashes = aggstate->num_hashes;
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
int setno;
- for (setno = 0; setno < numHashes; setno++)
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
{
+ uint32 hash;
+
select_current_set(aggstate, setno, true);
- pergroup[setno] = lookup_hash_entry(aggstate)->additional;
+ prepare_hash_slot(aggstate);
+ hash = calculate_hash(aggstate);
+ pergroup[setno] = lookup_hash_entry(aggstate, hash);
+
+ /* check to see if we need to spill the tuple for this grouping set */
+ if (pergroup[setno] == NULL)
+ {
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
+ TupleTableSlot *spillslot = aggstate->hash_spill_slot;
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ int idx;
+
+ if (spill->partitions == NULL)
+ hash_spill_init(spill, 0, perhash->aggnode->numGroups,
+ aggstate->hashentrysize);
+
+ /*
+ * Copy only necessary attributes to spill slot before writing to
+ * disk.
+ */
+ ExecClearTuple(spillslot);
+ memset(spillslot->tts_isnull, true,
+ spillslot->tts_tupleDescriptor->natts);
+
+ /* deserialize needed attributes */
+ if (perhash->numNeededColsInput > 0)
+ {
+ int maxNeededAttrIdx = perhash->numNeededColsInput - 1;
+ AttrNumber maxNeededAttr =
+ perhash->allNeededColsInput[maxNeededAttrIdx];
+ slot_getsomeattrs(inputslot, maxNeededAttr);
+ }
+
+ for (idx = 0; idx < perhash->numNeededColsInput; idx++)
+ {
+ AttrNumber att = perhash->allNeededColsInput[idx];
+ spillslot->tts_values[att-1] = inputslot->tts_values[att-1];
+ spillslot->tts_isnull[att-1] = inputslot->tts_isnull[att-1];
+ }
+
+ ExecStoreVirtualTuple(spillslot);
+ aggstate->hash_disk_used += hash_spill_tuple(spill, 0, spillslot, hash);
+ }
}
}
@@ -1853,6 +2276,12 @@ agg_retrieve_direct(AggState *aggstate)
if (TupIsNull(outerslot))
{
/* no more outer-plan tuples available */
+
+ /* if we built hash tables, finalize any spills */
+ if (aggstate->aggstrategy == AGG_MIXED &&
+ aggstate->current_phase == 1)
+ hash_finish_initial_spills(aggstate);
+
if (hasGroupingSets)
{
aggstate->input_done = true;
@@ -1955,6 +2384,9 @@ agg_fill_hash_table(AggState *aggstate)
ResetExprContext(aggstate->tmpcontext);
}
+ /* finalize spills, if any */
+ hash_finish_initial_spills(aggstate);
+
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
@@ -1962,11 +2394,175 @@ agg_fill_hash_table(AggState *aggstate)
&aggstate->perhash[0].hashiter);
}
+/*
+ * If any data was spilled during hash aggregation, reset the hash table and
+ * reprocess one batch of spilled data. After reprocessing a batch, the hash
+ * table will again contain data, ready to be consumed by
+ * agg_retrieve_hash_table_in_memory().
+ *
+ * Should only be called after all in memory hash table entries have been
+ * consumed.
+ *
+ * Return false when input is exhausted and there's no more work to be done;
+ * otherwise return true.
+ */
+static bool
+agg_refill_hash_table(AggState *aggstate)
+{
+ HashAggBatch *batch;
+
+ if (aggstate->hash_batches == NIL)
+ return false;
+
+ /*
+ * Each spill file contains spilled data for only a single grouping
+ * set. We want to ignore all others, which is done by setting the other
+ * pergroups to NULL.
+ */
+ memset(aggstate->all_pergroups, 0,
+ sizeof(AggStatePerGroup) *
+ (aggstate->maxsets + aggstate->num_hashes));
+
+ batch = linitial(aggstate->hash_batches);
+ aggstate->hash_batches = list_delete_first(aggstate->hash_batches);
+
+ /*
+ * Free memory and rebuild a single hash table for this batch's grouping
+ * set.
+ */
+ ReScanExprContext(aggstate->hashcontext);
+
+ /* estimate the number of groups to be the number of input tuples */
+ build_hash_table(aggstate, batch->setno, batch->input_tuples);
+
+ Assert(aggstate->current_phase == 0);
+
+ if (aggstate->phase->aggstrategy == AGG_MIXED)
+ {
+ aggstate->current_phase = 1;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+ }
+
+ /*
+ * The first pass (agg_fill_hash_table) reads whatever kind of slot comes
+ * from the outer plan, and considers the slot fixed. But spilled tuples
+ * are always MinimalTuples, so if that's different from the outer plan we
+ * need to change it and recompile the aggregate expressions.
+ */
+ if (aggstate->ss.ps.outerops != &TTSOpsMinimalTuple)
+ {
+ aggstate->ss.ps.outerops = &TTSOpsMinimalTuple;
+ hash_recompile_expressions(aggstate);
+ }
+
+ for (;;) {
+ TupleTableSlot *slot = aggstate->hash_spill_slot;
+ MinimalTuple tuple;
+ uint32 hash;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tuple = hash_read_spilled(batch->input_file, &hash);
+ if (tuple == NULL)
+ break;
+
+ ExecStoreMinimalTuple(tuple, slot, true);
+ aggstate->tmpcontext->ecxt_outertuple = slot;
+
+ select_current_set(aggstate, batch->setno, true);
+ prepare_hash_slot(aggstate);
+ aggstate->hash_pergroup[batch->setno] = lookup_hash_entry(aggstate, hash);
+
+ /* if there's no memory for a new group, spill */
+ if (aggstate->hash_pergroup[batch->setno] == NULL)
+ {
+ if (batch->spill.partitions == NULL)
+ {
+ /*
+ * Estimate the number of groups for this batch as the total
+ * number of tuples in its input file. Although that's a worst
+ * case, it's not bad here for two reasons: (1) overestimating
+ * is better than underestimating; and (2) we've already
+ * scanned the relation once, so it's likely that we've
+ * already finalized many of the common values.
+ */
+ hash_spill_init(&batch->spill, batch->input_bits,
+ batch->input_tuples, aggstate->hashentrysize);
+ }
+
+ aggstate->hash_disk_used += hash_spill_tuple(
+ &batch->spill, batch->input_bits, slot, hash);
+ }
+
+ /* Advance the aggregates (or combine functions) */
+ advance_aggregates(aggstate);
+
+ /*
+ * Reset per-input-tuple context after each tuple, but note that the
+ * hash lookups do this too
+ */
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ BufFileClose(batch->input_file);
+
+ aggstate->current_phase = 0;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+
+ /* update hashentrysize estimate based on contents */
+ if (aggstate->hash_ngroups_current > 0)
+ {
+ aggstate->hashentrysize = (double)aggstate->hash_mem_current /
+ (double)aggstate->hash_ngroups_current;
+ }
+
+ hash_spill_finish(aggstate, &batch->spill, batch->setno,
+ batch->input_bits);
+
+ pfree(batch);
+
+ /* Initialize to walk the first hash table */
+ select_current_set(aggstate, 0, true);
+ ResetTupleHashIterator(aggstate->perhash[0].hashtable,
+ &aggstate->perhash[0].hashiter);
+
+ return true;
+}
+
/*
* ExecAgg for hashed case: retrieving groups from hash table
+ *
+ * After exhausting in-memory tuples, also try refilling the hash table using
+ * previously-spilled tuples. Only returns NULL after all in-memory and
+ * spilled tuples are exhausted.
*/
static TupleTableSlot *
agg_retrieve_hash_table(AggState *aggstate)
+{
+ TupleTableSlot *result = NULL;
+
+ while (result == NULL)
+ {
+ result = agg_retrieve_hash_table_in_memory(aggstate);
+ if (result == NULL)
+ {
+ if (!agg_refill_hash_table(aggstate))
+ {
+ aggstate->agg_done = true;
+ break;
+ }
+ }
+ }
+
+ return result;
+}
+
+/*
+ * Retrieve the groups from the in-memory hash tables without considering any
+ * spilled tuples.
+ */
+static TupleTableSlot *
+agg_retrieve_hash_table_in_memory(AggState *aggstate)
{
ExprContext *econtext;
AggStatePerAgg peragg;
@@ -1995,7 +2591,7 @@ agg_retrieve_hash_table(AggState *aggstate)
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
- while (!aggstate->agg_done)
+ for (;;)
{
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -2026,8 +2622,6 @@ agg_retrieve_hash_table(AggState *aggstate)
}
else
{
- /* No more hashtables, so done */
- aggstate->agg_done = true;
return NULL;
}
}
@@ -2084,6 +2678,281 @@ agg_retrieve_hash_table(AggState *aggstate)
return NULL;
}
+/*
+ * hash_spill_init
+ *
+ * Called after we determined that spilling is necessary. Chooses the number
+ * of partitions to create, and initializes them.
+ */
+static void
+hash_spill_init(HashAggSpill *spill, int input_bits, uint64 input_groups,
+ double hashentrysize)
+{
+ int npartitions;
+ int partition_bits;
+
+ npartitions = hash_choose_num_spill_partitions(input_groups,
+ hashentrysize);
+ partition_bits = my_log2(npartitions);
+
+ /* make sure that we don't exhaust the hash bits */
+ if (partition_bits + input_bits >= 32)
+ partition_bits = 32 - input_bits;
+
+ /* number of partitions will be a power of two */
+ npartitions = 1L << partition_bits;
+
+ spill->partition_bits = partition_bits;
+ spill->n_partitions = npartitions;
+ spill->partitions = palloc0(sizeof(BufFile *) * npartitions);
+ spill->ntuples = palloc0(sizeof(int64) * npartitions);
+}
+
+/*
+ * hash_spill_tuple
+ *
+ * No room for new groups in the hash table. Save for later in the appropriate
+ * partition spill file.
+ */
+static Size
+hash_spill_tuple(HashAggSpill *spill, int input_bits, TupleTableSlot *slot,
+ uint32 hash)
+{
+ int partition;
+ MinimalTuple tuple;
+ BufFile *file;
+ int written;
+ int total_written = 0;
+ bool shouldFree;
+
+ Assert(spill->partitions != NULL);
+
+ /*
+ * When spilling tuples from the input, the slot will be virtual
+ * (containing only the needed attributes and the rest as NULL), and we
+ * need to materialize the minimal tuple. When spilling tuples
+ * recursively, the slot will hold a minimal tuple already.
+ */
+ tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+ if (spill->partition_bits == 0)
+ partition = 0;
+ else
+ partition = (hash << input_bits) >>
+ (32 - spill->partition_bits);
+
+ spill->ntuples[partition]++;
+
+ if (spill->partitions[partition] == NULL)
+ spill->partitions[partition] = BufFileCreateTemp(false);
+ file = spill->partitions[partition];
+
+ written = BufFileWrite(file, (void *) &hash, sizeof(uint32));
+ if (written != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+ total_written += written;
+
+ written = BufFileWrite(file, (void *) tuple, tuple->t_len);
+ if (written != tuple->t_len)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+ total_written += written;
+
+ if (shouldFree)
+ pfree(tuple);
+
+ return total_written;
+}
+
+/*
+ * read_spilled_tuple
+ * read the next tuple from a batch file. Return NULL if no more.
+ */
+static MinimalTuple
+hash_read_spilled(BufFile *file, uint32 *hashp)
+{
+ MinimalTuple tuple;
+ uint32 t_len;
+ size_t nread;
+ uint32 hash;
+
+ nread = BufFileRead(file, &hash, sizeof(uint32));
+ if (nread == 0)
+ return NULL;
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+ if (hashp != NULL)
+ *hashp = hash;
+
+ nread = BufFileRead(file, &t_len, sizeof(t_len));
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+
+ tuple = (MinimalTuple) palloc(t_len);
+ tuple->t_len = t_len;
+
+ nread = BufFileRead(file, (void *)((char *)tuple + sizeof(uint32)),
+ t_len - sizeof(uint32));
+ if (nread != t_len - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+
+ return tuple;
+}
+
+/*
+ * new_hashagg_batch
+ *
+ * Construct a HashAggBatch item, which represents one iteration of HashAgg to
+ * be done. Should be called in the aggregate's memory context.
+ */
+static HashAggBatch *
+hash_batch_new(BufFile *input_file, int setno, int64 input_tuples,
+ int input_bits)
+{
+ HashAggBatch *batch = palloc0(sizeof(HashAggBatch));
+
+ batch->input_file = input_file;
+ batch->input_bits = input_bits;
+ batch->input_tuples = input_tuples;
+ batch->setno = setno;
+
+ /* batch->spill will be set only after spilling this batch */
+
+ return batch;
+}
+
+/*
+ * hash_finish_initial_spills
+ *
+ * After a HashAggBatch has been processed, it may have spilled tuples to
+ * disk. If so, turn the spilled partitions into new batches that must later
+ * be executed.
+ */
+static void
+hash_finish_initial_spills(AggState *aggstate)
+{
+ int setno;
+
+ if (aggstate->hash_spills == NULL)
+ return;
+
+ /* update hashentrysize estimate based on contents */
+ Assert(aggstate->hash_ngroups_current > 0);
+ aggstate->hashentrysize = (double)aggstate->hash_mem_current /
+ (double)aggstate->hash_ngroups_current;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hash_spill_finish(aggstate, &aggstate->hash_spills[setno], setno, 0);
+
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+}
+
+/*
+ * hash_spill_finish
+ *
+ * Transform spill files into new batches.
+ */
+static void
+hash_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno, int input_bits)
+{
+ int i;
+
+ if (spill->n_partitions == 0)
+ return; /* didn't spill */
+
+ for (i = 0; i < spill->n_partitions; i++)
+ {
+ BufFile *file = spill->partitions[i];
+ MemoryContext oldContext;
+ HashAggBatch *new_batch;
+
+ /* partition is empty */
+ if (file == NULL)
+ continue;
+
+ /* rewind file for reading */
+ if (BufFileSeek(file, 0, 0L, SEEK_SET))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rewind HashAgg temporary file: %m")));
+
+ oldContext = MemoryContextSwitchTo(aggstate->ss.ps.state->es_query_cxt);
+ new_batch = hash_batch_new(file, setno, spill->ntuples[i],
+ spill->partition_bits + input_bits);
+ aggstate->hash_batches = lappend(aggstate->hash_batches, new_batch);
+ aggstate->hash_batches_used++;
+ MemoryContextSwitchTo(oldContext);
+ }
+
+ pfree(spill->ntuples);
+ pfree(spill->partitions);
+}
+
+/*
+ * Clear a HashAggSpill, free its memory, and close its files.
+ */
+static void
+hash_reset_spill(HashAggSpill *spill)
+{
+ int i;
+ for (i = 0; i < spill->n_partitions; i++)
+ {
+ BufFile *file = spill->partitions[i];
+
+ if (file != NULL)
+ BufFileClose(file);
+ }
+ if (spill->ntuples != NULL)
+ pfree(spill->ntuples);
+ if (spill->partitions != NULL)
+ pfree(spill->partitions);
+}
+
+/*
+ * Find and reset all active HashAggSpills.
+ */
+static void
+hash_reset_spills(AggState *aggstate)
+{
+ ListCell *lc;
+
+ if (aggstate->hash_spills != NULL)
+ {
+ int setno;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hash_reset_spill(&aggstate->hash_spills[setno]);
+
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+ }
+
+ foreach(lc, aggstate->hash_batches)
+ {
+ HashAggBatch *batch = (HashAggBatch*) lfirst(lc);
+ if (batch->input_file != NULL)
+ {
+ BufFileClose(batch->input_file);
+ batch->input_file = NULL;
+ }
+ hash_reset_spill(&batch->spill);
+ pfree(batch);
+ }
+ list_free(aggstate->hash_batches);
+ aggstate->hash_batches = NIL;
+}
+
+
/* -----------------
* ExecInitAgg
*
@@ -2268,6 +3137,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->ss.ps.outeropsfixed = false;
}
+ if (use_hashing)
+ aggstate->hash_spill_slot = ExecInitExtraTupleSlot(estate, scanDesc,
+ &TTSOpsMinimalTuple);
+
/*
* Initialize result type, slot and projection.
*/
@@ -2496,8 +3369,36 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
/* this is an array of pointers, not structures */
aggstate->hash_pergroup = pergroups;
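+ /*
+ * Per-group memory estimate: fixed hash entry overhead plus the
+ * planner's estimate of the transition state size. This initial value
+ * is revised at execution time once real groups have been built (see
+ * hash_finish_initial_spills).
+ */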
+ aggstate->hashentrysize =
+ hash_agg_entry_size(aggstate->numtrans) +
+ node->transSpace;
+
+ /*
+ * Initialize the thresholds at which we stop creating new hash entries
+ * and start spilling.
+ */
+ if (hashagg_mem_overflow)
+ aggstate->hash_mem_limit = SIZE_MAX;
+ else if (work_mem * 1024L > HASH_PARTITION_MEM * 2)
+ aggstate->hash_mem_limit = work_mem * 1024L - HASH_PARTITION_MEM;
+ else
+ aggstate->hash_mem_limit = work_mem * 1024L;
+
+ /*
+ * Set a separate limit on the maximum number of groups to
+ * create. This is important for aggregates where the initial state
+ * size is small, but aggtransspace is large.
+ */
+ if (hashagg_mem_overflow)
+ aggstate->hash_ngroups_limit = LONG_MAX;
+ else if (aggstate->hash_mem_limit > aggstate->hashentrysize)
+ aggstate->hash_ngroups_limit =
+ aggstate->hash_mem_limit / aggstate->hashentrysize;
+ else
+ aggstate->hash_ngroups_limit = 1;
+
find_hash_columns(aggstate);
- build_hash_table(aggstate);
+ build_hash_table(aggstate, -1, 0);
aggstate->table_filled = false;
}
@@ -2903,7 +3804,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
else
Assert(false);
- phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash);
+ phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash, false);
}
@@ -3398,6 +4299,8 @@ ExecEndAgg(AggState *node)
if (node->sort_out)
tuplesort_end(node->sort_out);
+ hash_reset_spills(node);
+
for (transno = 0; transno < node->numtrans; transno++)
{
AggStatePerTrans pertrans = &node->pertrans[transno];
@@ -3453,12 +4356,13 @@ ExecReScanAgg(AggState *node)
return;
/*
- * If we do have the hash table, and the subplan does not have any
- * parameter changes, and none of our own parameter changes affect
- * input expressions of the aggregated functions, then we can just
- * rescan the existing hash table; no need to build it again.
+ * If we do have the hash table, and it never spilled, and the subplan
+ * does not have any parameter changes, and none of our own parameter
+ * changes affect input expressions of the aggregated functions, then
+ * we can just rescan the existing hash table; no need to build it
+ * again.
*/
- if (outerPlan->chgParam == NULL &&
+ if (outerPlan->chgParam == NULL && !node->hash_spilled &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
ResetTupleHashIterator(node->perhash[0].hashtable,
@@ -3515,9 +4419,21 @@ ExecReScanAgg(AggState *node)
*/
if (node->aggstrategy == AGG_HASHED || node->aggstrategy == AGG_MIXED)
{
+ hash_reset_spills(node);
+
+ node->hash_spilled = false;
+ node->hash_no_new_groups = false;
+ node->hash_mem_current = 0;
+ node->hash_ngroups_current = 0;
+
+ /* reset stats */
+ node->hash_mem_peak = 0;
+ node->hash_disk_used = 0;
+ node->hash_batches_used = 0;
+
ReScanExprContext(node->hashcontext);
/* Rebuild an empty hash table */
- build_hash_table(node);
+ build_hash_table(node, -1, 0);
node->table_filled = false;
/* iterator will be reset when the table is filled */
}
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index ffd887c71aa..93517d03819 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -2082,6 +2082,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_INIT_TRANS:
+ case EEOP_AGG_INIT_TRANS_SPILLED:
{
AggState *aggstate;
AggStatePerTrans pertrans;
@@ -2092,6 +2093,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_allpergroupsp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_setoff,
v_transno;
@@ -2119,11 +2121,32 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_init_trans.setoff);
v_transno = l_int32_const(op->d.agg_init_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_notransvalue = l_bb_before_v(
+ opblocks[i + 1], "op.%d.check_notransvalue", i);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(
+ b, v_pergroup_allaggs, TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[i + 1],
+ b_check_notransvalue);
+
+ LLVMPositionBuilderAtEnd(b, b_check_notransvalue);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_notransvalue =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_NOTRANSVALUE,
@@ -2180,6 +2203,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_STRICT_TRANS_CHECK:
+ case EEOP_AGG_STRICT_TRANS_CHECK_SPILLED:
{
AggState *aggstate;
LLVMValueRef v_setoff,
@@ -2190,6 +2214,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_transnull;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
int jumpnull = op->d.agg_strict_trans_check.jumpnull;
@@ -2209,11 +2234,32 @@ llvm_compile_expr(ExprState *state)
l_int32_const(op->d.agg_strict_trans_check.setoff);
v_transno =
l_int32_const(op->d.agg_strict_trans_check.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_transnull = l_bb_before_v(
+ opblocks[i + 1], "op.%d.check_transnull", i);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[jumpnull],
+ b_check_transnull);
+
+ LLVMPositionBuilderAtEnd(b, b_check_transnull);
+ }
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_transnull =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_TRANSVALUEISNULL,
@@ -2229,7 +2275,9 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_PLAIN_TRANS_BYVAL:
+ case EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED:
case EEOP_AGG_PLAIN_TRANS:
+ case EEOP_AGG_PLAIN_TRANS_SPILLED:
{
AggState *aggstate;
AggStatePerTrans pertrans;
@@ -2255,6 +2303,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_pertransp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_retval;
@@ -2282,10 +2331,33 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_trans.setoff);
v_transno = l_int32_const(op->d.agg_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED ||
+ opcode == EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_advance_transval = l_bb_before_v(
+ opblocks[i + 1], "op.%d.advance_transval", i);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[i + 1],
+ b_advance_transval);
+
+ LLVMPositionBuilderAtEnd(b, b_advance_transval);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_fcinfo = l_ptr_const(fcinfo,
l_ptr(StructFunctionCallInfoData));
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c5f65934859..3f0d2899635 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
bool enable_hashagg = true;
+bool enable_hashagg_spill = true;
bool enable_nestloop = true;
bool enable_material = true;
bool enable_mergejoin = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 8c8b4f8ed69..f93150d4199 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1644,6 +1644,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
NIL,
NIL,
best_path->path.rows,
+ 0,
subplan);
}
else
@@ -2096,6 +2097,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
NIL,
NIL,
best_path->numGroups,
+ best_path->transSpace,
subplan);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -2257,6 +2259,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
NIL,
rollup->numGroups,
+ best_path->transSpace,
sort_plan);
/*
@@ -2295,6 +2298,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
chain,
rollup->numGroups,
+ best_path->transSpace,
subplan);
/* Copy cost data from Path to Plan */
@@ -6195,7 +6199,7 @@ make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree)
+ double dNumGroups, int32 transSpace, Plan *lefttree)
{
Agg *node = makeNode(Agg);
Plan *plan = &node->plan;
@@ -6211,6 +6215,7 @@ make_agg(List *tlist, List *qual,
node->grpOperators = grpOperators;
node->grpCollations = grpCollations;
node->numGroups = numGroups;
+ node->transSpace = transSpace;
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
node->groupingSets = groupingSets;
node->chain = chain;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index cb54b15507b..b6172fb426a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4261,6 +4261,9 @@ consider_groupingsets_paths(PlannerInfo *root,
* gd->rollups is empty if we have only unsortable columns to work
* with. Override work_mem in that case; otherwise, we'll rely on the
* sorted-input case to generate usable mixed paths.
+ *
+ * TODO: think more about how to plan grouping sets when spilling hash
+ * tables is an option
*/
if (hashsize > work_mem * 1024L && gd->rollups)
return; /* nope, won't fit */
@@ -6533,7 +6536,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
* were unable to sort above, then we'd better generate a Path, so
* that we at least have one.
*/
- if (hashaggtablesize < work_mem * 1024L ||
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L ||
grouped_rel->pathlist == NIL)
{
/*
@@ -6566,7 +6570,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
agg_final_costs,
dNumGroups);
- if (hashaggtablesize < work_mem * 1024L)
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6835,7 +6840,7 @@ create_partial_grouping_paths(PlannerInfo *root,
* Tentatively produce a partial HashAgg Path, depending on if it
* looks as if the hash table will fit in work_mem.
*/
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_total_path != NULL)
{
add_path(partially_grouped_rel, (Path *)
@@ -6862,7 +6867,7 @@ create_partial_grouping_paths(PlannerInfo *root,
dNumPartialPartialGroups);
/* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_partial_path != NULL)
{
add_partial_path(partially_grouped_rel, (Path *)
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 60c93ee7c59..7f5fc6ebb50 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2949,6 +2949,7 @@ create_agg_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->aggsplit = aggsplit;
pathnode->numGroups = numGroups;
+ pathnode->transSpace = aggcosts ? aggcosts->transitionSpace : 0;
pathnode->groupClause = groupClause;
pathnode->qual = qual;
@@ -3036,6 +3037,7 @@ create_groupingsets_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->rollups = rollups;
pathnode->qual = having_qual;
+ pathnode->transSpace = agg_costs ? agg_costs->transitionSpace : 0;
Assert(rollups != NIL);
Assert(aggstrategy != AGG_PLAIN || list_length(rollups) == 1);
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 3a091022e24..752f09e3a35 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -120,6 +120,7 @@ bool enableFsync = true;
bool allowSystemTableMods = false;
int work_mem = 1024;
int maintenance_work_mem = 16384;
+bool hashagg_mem_overflow = false;
int max_parallel_maintenance_workers = 2;
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8d951ce404c..467f42944d7 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -958,6 +958,26 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_hashagg_spill", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of hashed aggregation plans that are expected to exceed work_mem."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_hashagg_spill,
+ true,
+ NULL, NULL, NULL
+ },
+ {
+ {"hashagg_mem_overflow", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables hashed aggregation to overflow work_mem at execution time."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &hashagg_mem_overflow,
+ false,
+ NULL, NULL, NULL
+ },
{
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of materialization."),
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index d21dbead0a2..e50a7ad6712 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -226,9 +226,13 @@ typedef enum ExprEvalOp
EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
EEOP_AGG_INIT_TRANS,
+ EEOP_AGG_INIT_TRANS_SPILLED,
EEOP_AGG_STRICT_TRANS_CHECK,
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
EEOP_AGG_PLAIN_TRANS_BYVAL,
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
EEOP_AGG_PLAIN_TRANS,
+ EEOP_AGG_PLAIN_TRANS_SPILLED,
EEOP_AGG_ORDERED_TRANS_DATUM,
EEOP_AGG_ORDERED_TRANS_TUPLE,
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 6298c7c8cad..e8d88f2ce26 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -140,11 +140,17 @@ extern TupleHashTable BuildTupleHashTableExt(PlanState *parent,
extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
bool *isnew);
+extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable,
+ TupleTableSlot *slot,
+ bool *isnew, uint32 hash);
extern TupleHashEntry FindTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
ExprState *eqcomp,
FmgrInfo *hashfunctions);
+extern uint32 TupleHashTableHash(struct tuplehash_hash *tb,
+ const MinimalTuple tuple);
extern void ResetTupleHashTable(TupleHashTable hashtable);
+extern void DestroyTupleHashTable(TupleHashTable hashtable);
/*
* prototypes from functions in execJunk.c
@@ -250,7 +256,7 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
- bool doSort, bool doHash);
+ bool doSort, bool doHash, bool spilled);
extern ExprState *ExecBuildGroupingEqual(TupleDesc ldesc, TupleDesc rdesc,
const TupleTableSlotOps *lops, const TupleTableSlotOps *rops,
int numCols,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 68c9e5f5400..e58180e937a 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -302,6 +302,8 @@ typedef struct AggStatePerHashData
AttrNumber *hashGrpColIdxInput; /* hash col indices in input slot */
AttrNumber *hashGrpColIdxHash; /* indices in hash table tuples */
Agg *aggnode; /* original Agg node, for numGroups etc. */
+ int numNeededColsInput; /* number of columns needed from input */
+ AttrNumber *allNeededColsInput; /* all columns needed from input */
} AggStatePerHashData;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index ed80f1d6681..77a87cded44 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -244,6 +244,7 @@ extern bool enableFsync;
extern PGDLLIMPORT bool allowSystemTableMods;
extern PGDLLIMPORT int work_mem;
extern PGDLLIMPORT int maintenance_work_mem;
+extern PGDLLIMPORT bool hashagg_mem_overflow;
extern PGDLLIMPORT int max_parallel_maintenance_workers;
extern int VacuumCostPageHit;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0c2a77aaf8d..8d4a36a3538 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2066,13 +2066,30 @@ typedef struct AggState
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
/* these fields are used in AGG_HASHED and AGG_MIXED modes: */
bool table_filled; /* hash table filled yet? */
- int num_hashes;
+ int num_hashes; /* number of hash tables active at once */
+ bool hash_spilled; /* any hash table ever spilled? */
+ double hashentrysize; /* estimate revised during execution */
+ struct HashAggSpill *hash_spills; /* HashAggSpill for each hash table,
+ exists only during first pass if spilled */
+ TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
+ bool hash_no_new_groups; /* we hit a limit during the current batch
+ and we must not create new groups */
+ Size hash_mem_current; /* current hash table memory usage */
+ Size hash_mem_limit; /* limit before spilling hash table */
+ Size hash_mem_peak; /* peak hash table memory usage */
+ long hash_ngroups_current; /* number of groups currently in
+ memory in all hash tables */
+ long hash_ngroups_limit; /* limit before spilling hash table */
+ uint64 hash_disk_used; /* bytes of disk space used */
+ int hash_batches_used; /* batches used during entire execution */
+ List *hash_batches; /* hash batches remaining to be processed */
+
AggStatePerHash perhash; /* array of per-hashtable data */
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 34
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 47
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 31b631cfe0f..f8557404703 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1663,6 +1663,7 @@ typedef struct AggPath
AggStrategy aggstrategy; /* basic strategy, see nodes.h */
AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
double numGroups; /* estimated number of groups in input */
+ int32 transSpace; /* estimated transition state size */
List *groupClause; /* a list of SortGroupClause's */
List *qual; /* quals (HAVING quals), if any */
} AggPath;
@@ -1700,6 +1701,7 @@ typedef struct GroupingSetsPath
AggStrategy aggstrategy; /* basic strategy */
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
+ int32 transSpace; /* estimated transition state size */
} GroupingSetsPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 477b4da192c..360a4801f59 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -813,6 +813,7 @@ typedef struct Agg
Oid *grpOperators; /* equality operators to compare with */
Oid *grpCollations;
long numGroups; /* estimated number of groups in input */
+ int32 transSpace; /* estimated transition state size */
Bitmapset *aggParams; /* IDs of Params used in Aggref inputs */
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
List *groupingSets; /* grouping sets to use */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b3d0b4f6fbc..b72e2d08290 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -54,6 +54,7 @@ extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
extern PGDLLIMPORT bool enable_hashagg;
+extern PGDLLIMPORT bool enable_hashagg_spill;
extern PGDLLIMPORT bool enable_nestloop;
extern PGDLLIMPORT bool enable_material;
extern PGDLLIMPORT bool enable_mergejoin;
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index e7aaddd50d6..41e4b4a336d 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -55,7 +55,7 @@ extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree);
+ double dNumGroups, int32 transSpace, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
/*
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index d091ae4c6e4..92e5dbad77e 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2331,3 +2331,95 @@ explain (costs off)
-> Seq Scan on onek
(8 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+set work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------------
+ GroupAggregate
+ Group Key: ((g % 100000))
+ -> Sort
+ Sort Key: ((g % 100000))
+ -> Function Scan on generate_series g
+(5 rows)
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+set jit_above_cost to default;
+create table agg_group_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_group_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+-- Produce results with hash aggregation
+set enable_hashagg = true;
+set enable_sort = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 100000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+set jit_above_cost to default;
+create table agg_hash_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_hash_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+set enable_sort = true;
+set work_mem to default;
+-- Compare group aggregation results to hash aggregation results
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index c1f802c88a7..767f60a96c7 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1633,4 +1633,127 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
| 1 | 2
(4 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+-- Produce results with hash aggregation.
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+set enable_sort = true;
+set work_mem to default;
+-- Compare results
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+ g100 | g10 | unnest | c | m
+------+-----+--------+---+---
+(0 rows)
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
-- end
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1de..11c6f50fbfa 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -148,6 +148,68 @@ SELECT count(*) FROM
4
(1 row)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+SET enable_hashagg=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------------
+ Unique
+ -> Sort
+ Sort Key: ((g % 1000))
+ -> Function Scan on generate_series g
+(4 rows)
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_hashagg=TRUE;
+-- Produce results with hash aggregation.
+SET enable_sort=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 1000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_sort=TRUE;
+SET work_mem TO DEFAULT;
+-- Compare results
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+(SELECT * FROM distinct_hash_2 EXCEPT SELECT * FROM distinct_group_2)
+ UNION ALL
+(SELECT * FROM distinct_group_2 EXCEPT SELECT * FROM distinct_hash_2);
+ ?column?
+----------
+(0 rows)
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb9057..c40bf6c16eb 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -75,6 +75,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
+ enable_hashagg_spill | on
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 17fb256aec5..bcd336c5812 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1017,3 +1017,91 @@ select v||'a', case when v||'a' = 'aa' then 1 else 0 end, count(*)
explain (costs off)
select 1 from tenk1
where (hundred, thousand) in (select twothousand, twothousand from onek);
+
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+set work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+set jit_above_cost to default;
+
+create table agg_group_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_group_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+-- Produce results with hash aggregation
+
+set enable_hashagg = true;
+set enable_sort = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+set jit_above_cost to default;
+
+create table agg_hash_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_hash_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare group aggregation results to hash aggregation results
+
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 95ac3fb52f6..bf8bce6ed31 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -441,4 +441,103 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
from unnest(array[1,1], array['a','b']) u(i,v)
group by rollup(i, v||'a') order by 1,3;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+-- Produce results with hash aggregation.
+
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare results
+
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
+
-- end
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449e..33102744ebf 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -45,6 +45,68 @@ SELECT count(*) FROM
SELECT count(*) FROM
(SELECT DISTINCT two, four, two FROM tenk1) ss;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+SET enable_hashagg=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_hashagg=TRUE;
+
+-- Produce results with hash aggregation.
+
+SET enable_sort=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_sort=TRUE;
+
+SET work_mem TO DEFAULT;
+
+-- Compare results
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+(SELECT * FROM distinct_hash_2 EXCEPT SELECT * FROM distinct_group_2)
+ UNION ALL
+(SELECT * FROM distinct_group_2 EXCEPT SELECT * FROM distinct_hash_2);
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
+
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
On Sat, 2019-12-14 at 18:32 +0100, Tomas Vondra wrote:
> So I think we're not costing the batching properly / at all.
Thank you for all of the testing! I think the results are good: even
for cases where HashAgg is the wrong choice, it's not too bad. You're
right that costing is not done, and when it is, I think it will avoid
these bad choices most of the time.
> A couple more comments:
>
> 1) IMHO we should rename hashagg_mem_overflow to enable_hashagg_overflow
> or something like that. I think that describes the GUC purpose better
> (and it's more consistent with enable_hashagg_spill).
The other enable_* GUCs are all planner GUCs, so I named this one
differently to stand out as an executor GUC.
> 2) show_hashagg_info
>
> I think there's a missing space after ":" here:
>
>     " Batches: %d Disk Usage:%ldkB",
>
> and maybe we should use just "Disk:", like we do for sort.
Done, thank you.
> 3) I'm not quite sure what to think about the JIT recompile we do for
> EEOP_AGG_INIT_TRANS_SPILLED etc. I'm no llvm/jit expert, but do we do
> that for some other existing cases?
Andres asked for that explicitly to avoid branches in the non-spilling
code path (or at least branches that are likely to be mispredicted).
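For reference, the guard that the *_SPILLED opcode variants add (shown
here in its interpreted form, trimmed down from the attached patch) is
just:

    pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
    if (pergroup_allaggs == NULL)
        EEO_NEXT();    /* this group's state was spilled; skip the transition */

The plain opcodes omit that NULL test entirely; only after the first
spill is the transition expression rebuilt -- and re-JITed -- with the
spilled variants.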
Regards,
Jeff Davis
On Sat, 2019-12-14 at 18:32 +0100, Tomas Vondra wrote:
> So I think we're not costing the batching properly / at all.
Hi,
I've attached a new patch that adds some basic costing for disk during
hashagg.
The accuracy is unfortunately not great, especially at smaller work_mem
sizes and smaller entry sizes. The biggest discrepancy seems to be that
the estimated average size of a hash table entry is significantly
smaller than the actual average size. I'm not sure how big a problem
this inaccuracy is, or how it compares to sort (it's hard to compare
directly because sort works with theoretical memory usage while hashagg
looks at actual allocated memory).
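To give a sense of the kind of estimate involved, here is a crude,
self-contained sketch of a disk-cost model for a spilled hash
aggregate. The function name, parameters, and formula are hypothetical,
made up for illustration; the actual costing code in the attached patch
differs in detail.

    #include <math.h>

    /*
     * Illustrative only: rough extra I/O cost when a hash aggregate
     * spills.  Assumes spilled tuples are written once and read once.
     */
    static double
    hashagg_spill_cost_sketch(double input_tuples,   /* tuples fed to the agg */
                              double tuple_width,    /* avg spilled tuple bytes */
                              double ngroups,        /* estimated distinct groups */
                              double hashentrysize,  /* est. bytes per hash entry */
                              double work_mem_bytes,
                              double seq_page_cost,
                              double block_size)
    {
        double  mem_needed = ngroups * hashentrysize;
        double  spill_fraction;
        double  spill_bytes;
        double  pages;

        if (mem_needed <= work_mem_bytes)
            return 0.0;         /* everything fits: no spill, no extra cost */

        /* fraction of groups that cannot be kept in memory on the first pass */
        spill_fraction = 1.0 - work_mem_bytes / mem_needed;

        /* spilled input tuples are written once and read back once */
        spill_bytes = 2.0 * input_tuples * spill_fraction * tuple_width;
        pages = ceil(spill_bytes / block_size);

        return pages * seq_page_cost;
    }

The main source of error in any model like this is hashentrysize: as
noted above, the estimated per-entry size tends to be smaller than what
the hash table actually allocates.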
Costing was the last major TODO, so I'm considering this feature
complete, though it still needs some work on quality.
Regards,
Jeff Davis
Attachments:
hashagg-20191227.patch (text/x-patch; charset=UTF-8)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 949fefa23ae..c2fb7a088a2 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -102,6 +102,7 @@ static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_hashagg_info(AggState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
static void show_instrumentation_count(const char *qlabel, int which,
@@ -1844,6 +1845,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Agg:
show_agg_keys(castNode(AggState, planstate), ancestors, es);
show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+ if (es->analyze)
+ show_hashagg_info((AggState *) planstate, es);
if (plan->qual)
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
@@ -2742,6 +2745,56 @@ show_hash_info(HashState *hashstate, ExplainState *es)
}
}
+/*
+ * If EXPLAIN ANALYZE, show information on hash aggregate memory usage and
+ * batches.
+ */
+static void
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+ long diskKb = (aggstate->hash_disk_used + 1023) / 1024;
+
+
+ Assert(IsA(aggstate, AggState));
+
+ if (agg->aggstrategy != AGG_HASHED &&
+ agg->aggstrategy != AGG_MIXED)
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(
+ es->str,
+ "Memory Usage: %ldkB",
+ memPeakKb);
+
+ if (aggstate->hash_batches_used > 0)
+ {
+ appendStringInfo(
+ es->str,
+ " Batches: %d Disk: %ldkB",
+ aggstate->hash_batches_used, diskKb);
+ }
+
+ appendStringInfo(
+ es->str,
+ "\n");
+ }
+ else
+ {
+ ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
+ if (aggstate->hash_batches_used > 0)
+ {
+ ExplainPropertyInteger("HashAgg Batches", NULL,
+ aggstate->hash_batches_used, es);
+ ExplainPropertyInteger("Disk Usage", "kB", diskKb, es);
+ }
+ }
+}
+
/*
* If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
*/
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 8da2e2dcbba..fb3e81764ad 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -79,7 +79,8 @@ static void ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash);
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled);
/*
@@ -2927,7 +2928,7 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
*/
ExprState *
ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
- bool doSort, bool doHash)
+ bool doSort, bool doHash, bool spilled)
{
ExprState *state = makeNode(ExprState);
PlanState *parent = &aggstate->ss.ps;
@@ -3160,7 +3161,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (setno = 0; setno < processGroupingSets; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, false);
+ pertrans, transno, setno, setoff, false,
+ spilled);
setoff++;
}
}
@@ -3178,7 +3180,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (setno = 0; setno < numHashes; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, true);
+ pertrans, transno, setno, setoff, true,
+ spilled);
setoff++;
}
}
@@ -3226,7 +3229,8 @@ static void
ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash)
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled)
{
int adjust_init_jumpnull = -1;
int adjust_strict_jumpnull = -1;
@@ -3248,7 +3252,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
fcinfo->flinfo->fn_strict &&
pertrans->initValueIsNull)
{
- scratch->opcode = EEOP_AGG_INIT_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_INIT_TRANS_SPILLED : EEOP_AGG_INIT_TRANS;
scratch->d.agg_init_trans.aggstate = aggstate;
scratch->d.agg_init_trans.pertrans = pertrans;
scratch->d.agg_init_trans.setno = setno;
@@ -3265,7 +3270,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
if (pertrans->numSortCols == 0 &&
fcinfo->flinfo->fn_strict)
{
- scratch->opcode = EEOP_AGG_STRICT_TRANS_CHECK;
+ scratch->opcode = spilled ?
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED : EEOP_AGG_STRICT_TRANS_CHECK;
scratch->d.agg_strict_trans_check.aggstate = aggstate;
scratch->d.agg_strict_trans_check.setno = setno;
scratch->d.agg_strict_trans_check.setoff = setoff;
@@ -3283,9 +3289,11 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
/* invoke appropriate transition implementation */
if (pertrans->numSortCols == 0 && pertrans->transtypeByVal)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS_BYVAL;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED : EEOP_AGG_PLAIN_TRANS_BYVAL;
else if (pertrans->numSortCols == 0)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_SPILLED : EEOP_AGG_PLAIN_TRANS;
else if (pertrans->numInputs == 1)
scratch->opcode = EEOP_AGG_ORDERED_TRANS_DATUM;
else
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 034970648f3..11ba8c09542 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -430,9 +430,13 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
&&CASE_EEOP_AGG_INIT_TRANS,
+ &&CASE_EEOP_AGG_INIT_TRANS_SPILLED,
&&CASE_EEOP_AGG_STRICT_TRANS_CHECK,
+ &&CASE_EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_SPILLED,
&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
&&CASE_EEOP_LAST
@@ -1625,6 +1629,36 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ aggstate = op->d.agg_init_trans.aggstate;
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_init_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
+
+ /* If transValue has not yet been initialized, do so now. */
+ if (pergroup->noTransValue)
+ {
+ AggStatePerTrans pertrans = op->d.agg_init_trans.pertrans;
+
+ aggstate->curaggcontext = op->d.agg_init_trans.aggcontext;
+ aggstate->current_set = op->d.agg_init_trans.setno;
+
+ ExecAggInitGroup(aggstate, pertrans, pergroup);
+
+ /* copied trans value from input, done this round */
+ EEO_JUMP(op->d.agg_init_trans.jumpnull);
+ }
+
+ EEO_NEXT();
+ }
/* check that a strict aggregate's input isn't NULL */
EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK)
@@ -1642,6 +1676,25 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ aggstate = op->d.agg_strict_trans_check.aggstate;
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_strict_trans_check.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_strict_trans_check.transno];
+
+ if (unlikely(pergroup->transValueIsNull))
+ EEO_JUMP(op->d.agg_strict_trans_check.jumpnull);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1691,6 +1744,52 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ aggstate = op->d.agg_trans.aggstate;
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ Assert(pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1756,6 +1855,67 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ aggstate = op->d.agg_trans.aggstate;
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ Assert(!pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
+ /*
+ * For pass-by-ref datatype, must copy the new value into
+ * aggcontext and free the prior transValue. But if transfn
+ * returned a pointer to its first input, we don't need to do
+ * anything. Also, if transfn returned a pointer to a R/W
+ * expanded object that is already a child of the aggcontext,
+ * assume we can adopt that value without copying it.
+ */
+ if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+ newVal = ExecAggTransReparent(aggstate, pertrans,
+ newVal, fcinfo->isnull,
+ pergroup->transValue,
+ pergroup->transValueIsNull);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
/* process single-column ordered aggregate datum */
EEO_CASE(EEOP_AGG_ORDERED_TRANS_DATUM)
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index e361143094c..02dba3eac18 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -25,8 +25,9 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
-static uint32 TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple);
static int TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
+static TupleHashEntry LookupTupleHashEntry_internal(
+ TupleHashTable hashtable, TupleTableSlot *slot, bool *isnew, uint32 hash);
/*
* Define parameters for tuple hash table code generation. The interface is
@@ -284,6 +285,17 @@ ResetTupleHashTable(TupleHashTable hashtable)
tuplehash_reset(hashtable->hashtab);
}
+/*
+ * Destroy the hash table. Note that the tablecxt passed to
+ * BuildTupleHashTableExt() should also be reset, otherwise there will be
+ * leaks.
+ */
+void
+DestroyTupleHashTable(TupleHashTable hashtable)
+{
+ tuplehash_destroy(hashtable->hashtab);
+}
+
/*
* Find or create a hashtable entry for the tuple group containing the
* given tuple. The tuple must be the same type as the hashtable entries.
@@ -300,10 +312,9 @@ TupleHashEntry
LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
bool *isnew)
{
- TupleHashEntryData *entry;
- MemoryContext oldContext;
- bool found;
- MinimalTuple key;
+ TupleHashEntry entry;
+ MemoryContext oldContext;
+ uint32 hash;
/* Need to run the hash functions in short-lived context */
oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
@@ -313,32 +324,29 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
hashtable->cur_eq_func = hashtable->tab_eq_func;
- key = NULL; /* flag to reference inputslot */
+ hash = TupleHashTableHash(hashtable->hashtab, NULL);
+ entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
- if (isnew)
- {
- entry = tuplehash_insert(hashtable->hashtab, key, &found);
+ MemoryContextSwitchTo(oldContext);
- if (found)
- {
- /* found pre-existing entry */
- *isnew = false;
- }
- else
- {
- /* created new entry */
- *isnew = true;
- /* zero caller data */
- entry->additional = NULL;
- MemoryContextSwitchTo(hashtable->tablecxt);
- /* Copy the first tuple into the table context */
- entry->firstTuple = ExecCopySlotMinimalTuple(slot);
- }
- }
- else
- {
- entry = tuplehash_lookup(hashtable->hashtab, key);
- }
+ return entry;
+}
+
+/*
+ * A variant of LookupTupleHashEntry for callers that have already computed
+ * the hash value.
+ */
+TupleHashEntry
+LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot,
+ bool *isnew, uint32 hash)
+{
+ TupleHashEntry entry;
+ MemoryContext oldContext;
+
+ /* Need to run the hash functions in short-lived context */
+ oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
+
+ entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
MemoryContextSwitchTo(oldContext);
@@ -389,7 +397,7 @@ FindTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
* Also, the caller must select an appropriate memory context for running
* the hash functions. (dynahash.c doesn't change CurrentMemoryContext.)
*/
-static uint32
+uint32
TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
{
TupleHashTable hashtable = (TupleHashTable) tb->private_data;
@@ -450,6 +458,54 @@ TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
return murmurhash32(hashkey);
}
+/*
+ * Does the work of LookupTupleHashEntry and LookupTupleHashEntryHash. Useful
+ * so that we can avoid switching the memory context multiple times for
+ * LookupTupleHashEntry.
+ */
+static TupleHashEntry
+LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot,
+ bool *isnew, uint32 hash)
+{
+ TupleHashEntryData *entry;
+ bool found;
+ MinimalTuple key;
+
+ /* set up data needed by hash and match functions */
+ hashtable->inputslot = slot;
+ hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ hashtable->cur_eq_func = hashtable->tab_eq_func;
+
+ key = NULL; /* flag to reference inputslot */
+
+ if (isnew)
+ {
+ entry = tuplehash_insert_hash(hashtable->hashtab, key, hash, &found);
+
+ if (found)
+ {
+ /* found pre-existing entry */
+ *isnew = false;
+ }
+ else
+ {
+ /* created new entry */
+ *isnew = true;
+ /* zero caller data */
+ entry->additional = NULL;
+ MemoryContextSwitchTo(hashtable->tablecxt);
+ /* Copy the first tuple into the table context */
+ entry->firstTuple = ExecCopySlotMinimalTuple(slot);
+ }
+ }
+ else
+ {
+ entry = tuplehash_lookup_hash(hashtable->hashtab, key, hash);
+ }
+
+ return entry;
+}
+
/*
* See whether two tuples (presumably of the same hash value) match
*/
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 6ee24eab3d2..bc5ab981e1d 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -194,6 +194,18 @@
* transition values. hashcontext is the single context created to support
* all hash tables.
*
+ * When the hash table memory exceeds work_mem, we advance the transition
+ * states only for groups already in the hash table. Tuples that would
+ * require creating a new hash table entry (and initializing a new
+ * transition state) are spilled to disk to be processed later. The tuples are
+ * spilled in a partitioned manner, so that subsequent batches are smaller
+ * and less likely to exceed work_mem (if a batch does exceed work_mem, it
+ * must be spilled recursively).
+ *
+ * Note that it's possible for transition states to start small but then
+ * grow very large; for instance in the case of ARRAY_AGG. In such cases,
+ * it's still possible to significantly exceed work_mem.
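+ *
+ * For example (assuming a reasonably uniform hash distribution), if a
+ * spilled batch turns out to contain roughly 10x work_mem worth of groups
+ * and is split into 8 partitions, each child batch is about 1.25x work_mem
+ * and needs only one further level of recursion.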
+ *
* Transition / Combine function invocation:
*
* For performance reasons transition functions, including combine
@@ -229,15 +241,69 @@
#include "optimizer/optimizer.h"
#include "parser/parse_agg.h"
#include "parser/parse_coerce.h"
+#include "storage/buffile.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/datum.h"
+#include "utils/dynahash.h"
#include "utils/expandeddatum.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+/*
+ * Control how many partitions are created when spilling HashAgg to
+ * disk.
+ *
+ * HASHAGG_PARTITION_FACTOR is multiplied by the estimated number of
+ * partitions needed such that each partition will fit in memory. The factor
+ * is set higher than one because there's not a high cost to having a few too
+ * many partitions, and it makes it less likely that a partition will need to
+ * be spilled recursively. Another benefit of having more, smaller partitions
+ * is that small hash tables may perform better than large ones due to memory
+ * caching effects.
+ *
+ * HASHAGG_PARTITION_MEM is the approximate amount of work_mem we should
+ * reserve for the partitions themselves (i.e. buffering of the files backing
+ * the partitions). This is sloppy, because we must reserve the memory before
+ * filling the hash table; but we choose the number of partitions at the time
+ * we need to spill.
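+ * (With the default 8kB BLCKSZ and the minimum partition count below, the
+ * reservation comes to 32kB.)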
+ *
+ * We also specify a min and max number of partitions per spill. Too few might
+ * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
+ * many will result in lots of memory wasted buffering the spill files (and
+ * possibly pushing hidden costs to the OS for managing more files).
+ */
+#define HASHAGG_PARTITION_FACTOR 1.50
+#define HASHAGG_MIN_PARTITIONS 4
+#define HASHAGG_PARTITION_MEM (HASHAGG_MIN_PARTITIONS * BLCKSZ)
+
+/*
+ * Represents partitioned spill data for a single hashtable.
+ */
+typedef struct HashAggSpill
+{
+ int n_partitions; /* number of output partitions */
+ int partition_bits; /* number of bits in the partition mask,
+ i.e. log2(n_partitions) */
+ BufFile **partitions; /* output partition files */
+ int64 *ntuples; /* number of tuples in each partition */
+} HashAggSpill;
+
+/*
+ * Represents work to be done for one pass of hash aggregation. Initially,
+ * only the input fields are set. If spilled to disk, also set the spill data.
+ */
+typedef struct HashAggBatch
+{
+ BufFile *input_file; /* input partition */
+ int input_bits; /* number of bits for input partition mask */
+ int64 input_tuples; /* number of tuples in this batch */
+ int setno; /* grouping set */
+ HashAggSpill spill; /* spill output */
+} HashAggBatch;
+
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
static void initialize_phase(AggState *aggstate, int newphase);
static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -271,12 +337,35 @@ static void finalize_aggregates(AggState *aggstate,
static TupleTableSlot *project_aggregates(AggState *aggstate);
static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
-static void build_hash_table(AggState *aggstate);
-static TupleHashEntryData *lookup_hash_entry(AggState *aggstate);
+static void build_hash_table(AggState *aggstate, int setno,
+ int64 ngroups_estimate);
+static void prepare_hash_slot(AggState *aggstate);
+static void hash_recompile_expressions(AggState *aggstate);
+static uint32 calculate_hash(AggState *aggstate);
+static long hash_choose_num_buckets(AggState *aggstate,
+ long estimated_nbuckets,
+ Size memory);
+static int hash_choose_num_spill_partitions(uint64 input_groups,
+ double hashentrysize);
+static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
+static bool agg_refill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
+static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
+static void hash_spill_init(HashAggSpill *spill, int input_bits,
+ uint64 input_tuples, double hashentrysize);
+static Size hash_spill_tuple(HashAggSpill *spill, int input_bits,
+ TupleTableSlot *slot, uint32 hash);
+static MinimalTuple hash_read_spilled(BufFile *file, uint32 *hashp);
+static HashAggBatch *hash_batch_new(BufFile *input_file, int setno,
+ int64 input_tuples, int input_bits);
+static void hash_finish_initial_spills(AggState *aggstate);
+static void hash_spill_finish(AggState *aggstate, HashAggSpill *spill,
+ int setno, int input_bits);
+static void hash_reset_spill(HashAggSpill *spill);
+static void hash_reset_spills(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
AggState *aggstate, EState *estate,
@@ -1201,6 +1290,68 @@ project_aggregates(AggState *aggstate)
return NULL;
}
+static bool
+find_aggregated_cols_walker(Node *node, Bitmapset **colnos)
+{
+ if (node == NULL)
+ return false;
+
+ if (IsA(node, Var))
+ {
+ Var *var = (Var *) node;
+
+ *colnos = bms_add_member(*colnos, var->varattno);
+
+ return false;
+ }
+ return expression_tree_walker(node, find_aggregated_cols_walker,
+ (void *) colnos);
+}
+
+/*
+ * find_aggregated_cols
+ * Construct a bitmapset of the column numbers of aggregated Vars
+ * appearing in our targetlist and qual (HAVING clause)
+ */
+static Bitmapset *
+find_aggregated_cols(AggState *aggstate)
+{
+ Agg *node = (Agg *) aggstate->ss.ps.plan;
+ Bitmapset *colnos = NULL;
+ ListCell *temp;
+
+ /*
+ * We only want the columns used by aggregations in the targetlist or qual
+ */
+ if (node->plan.targetlist != NULL)
+ {
+ foreach(temp, (List *) node->plan.targetlist)
+ {
+ if (IsA(lfirst(temp), TargetEntry))
+ {
+ Node *node = (Node *)((TargetEntry *)lfirst(temp))->expr;
+ if (IsA(node, Aggref) || IsA(node, GroupingFunc))
+ find_aggregated_cols_walker(node, &colnos);
+ }
+ }
+ }
+
+ if (node->plan.qual != NULL)
+ {
+ foreach(temp, (List *) node->plan.qual)
+ {
+ /* the qual is a list of bare expressions, not TargetEntries */
+ Node *qualexpr = (Node *) lfirst(temp);
+
+ find_aggregated_cols_walker(qualexpr, &colnos);
+ }
+ }
+
+ return colnos;
+}
+
/*
* find_unaggregated_cols
* Construct a bitmapset of the column numbers of un-aggregated Vars
@@ -1254,46 +1405,84 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
* for each entry.
*
* We have a separate hashtable and associated perhash data structure for each
- * grouping set for which we're doing hashing.
+ * grouping set for which we're doing hashing. If setno is -1, build hash
+ * tables for all grouping sets. Otherwise, build only for the specified
+ * grouping set.
*
* The contents of the hash tables always live in the hashcontext's per-tuple
* memory context (there is only one of these for all tables together, since
* they are all reset at the same time).
*/
static void
-build_hash_table(AggState *aggstate)
+build_hash_table(AggState *aggstate, int setno, int64 ngroups_estimate)
{
- MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
- Size additionalsize;
- int i;
+ MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
+ Size additionalsize;
+ int i;
Assert(aggstate->aggstrategy == AGG_HASHED || aggstate->aggstrategy == AGG_MIXED);
+ /*
+ * Used to make sure initial hash table allocation does not exceed
+ * work_mem. Note that the estimate does not include space for
+ * pass-by-reference transition data values, nor for the representative
+ * tuple of each group.
+ */
additionalsize = aggstate->numtrans * sizeof(AggStatePerGroupData);
for (i = 0; i < aggstate->num_hashes; ++i)
{
AggStatePerHash perhash = &aggstate->perhash[i];
+ int64 ngroups;
+ long nbuckets;
+ Size memory;
Assert(perhash->aggnode->numGroups > 0);
if (perhash->hashtable)
- ResetTupleHashTable(perhash->hashtable);
- else
- perhash->hashtable = BuildTupleHashTableExt(&aggstate->ss.ps,
- perhash->hashslot->tts_tupleDescriptor,
- perhash->numCols,
- perhash->hashGrpColIdxHash,
- perhash->eqfuncoids,
- perhash->hashfunctions,
- perhash->aggnode->grpCollations,
- perhash->aggnode->numGroups,
- additionalsize,
- aggstate->ss.ps.state->es_query_cxt,
- aggstate->hashcontext->ecxt_per_tuple_memory,
- tmpmem,
- DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
+ DestroyTupleHashTable(perhash->hashtable);
+ perhash->hashtable = NULL;
+
+ /*
+ * If we are building a hash table for only a single grouping set,
+ * skip the others.
+ */
+ if (setno >= 0 && setno != i)
+ continue;
+
+ /*
+ * Use an estimate from execution time if we have it; otherwise fall
+ * back to the planner estimate.
+ */
+ ngroups = ngroups_estimate > 0 ?
+ ngroups_estimate : perhash->aggnode->numGroups;
+
+ /* divide memory by the number of hash tables we are initializing */
+ memory = (long)work_mem * 1024L /
+ (setno >= 0 ? 1 : aggstate->num_hashes);
+
+ /* choose reasonable number of buckets per hashtable */
+ nbuckets = hash_choose_num_buckets(aggstate, ngroups, memory);
+
+ perhash->hashtable = BuildTupleHashTableExt(&aggstate->ss.ps,
+ perhash->hashslot->tts_tupleDescriptor,
+ perhash->numCols,
+ perhash->hashGrpColIdxHash,
+ perhash->eqfuncoids,
+ perhash->hashfunctions,
+ perhash->aggnode->grpCollations,
+ nbuckets,
+ additionalsize,
+ aggstate->ss.ps.state->es_query_cxt,
+ aggstate->hashcontext->ecxt_per_tuple_memory,
+ tmpmem,
+ DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
}
+
+ aggstate->hash_mem_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_ngroups_current = 0;
+ aggstate->hash_no_new_groups = false;
}
/*
@@ -1325,6 +1514,7 @@ static void
find_hash_columns(AggState *aggstate)
{
Bitmapset *base_colnos;
+ Bitmapset *aggregated_colnos;
List *outerTlist = outerPlanState(aggstate)->plan->targetlist;
int numHashes = aggstate->num_hashes;
EState *estate = aggstate->ss.ps.state;
@@ -1332,11 +1522,13 @@ find_hash_columns(AggState *aggstate)
/* Find Vars that will be needed in tlist and qual */
base_colnos = find_unaggregated_cols(aggstate);
+ aggregated_colnos = find_aggregated_cols(aggstate);
for (j = 0; j < numHashes; ++j)
{
AggStatePerHash perhash = &aggstate->perhash[j];
Bitmapset *colnos = bms_copy(base_colnos);
+ Bitmapset *allNeededColsInput;
AttrNumber *grpColIdx = perhash->aggnode->grpColIdx;
List *hashTlist = NIL;
TupleDesc hashDesc;
@@ -1383,6 +1575,19 @@ find_hash_columns(AggState *aggstate)
for (i = 0; i < perhash->numCols; i++)
colnos = bms_add_member(colnos, grpColIdx[i]);
+ /*
+ * Track the necessary columns from the input. This is important for
+ * spilling tuples so that we don't waste disk space with unneeded
+ * columns.
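+ * For example, given SELECT a, sum(b) FROM t GROUP BY a over a wide
+ * table t, only columns a and b need to be written to the spill files.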
+ */
+ allNeededColsInput = bms_union(colnos, aggregated_colnos);
+ perhash->numNeededColsInput = 0;
+ perhash->allNeededColsInput = palloc(
+ bms_num_members(allNeededColsInput) * sizeof(AttrNumber));
+
+ while ((i = bms_first_member(allNeededColsInput)) >= 0)
+ perhash->allNeededColsInput[perhash->numNeededColsInput++] = i;
+
/*
* First build mapping for columns directly hashed. These are the
* first, because they'll be accessed when computing hash values and
@@ -1435,42 +1640,31 @@ find_hash_columns(AggState *aggstate)
/*
* Estimate per-hash-table-entry overhead for the planner.
- *
- * Note that the estimate does not include space for pass-by-reference
- * transition data values, nor for the representative tuple of each group.
- * Nor does this account of the target fill-factor and growth policy of the
- * hash table.
*/
Size
-hash_agg_entry_size(int numAggs)
+hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
{
- Size entrysize;
-
- /* This must match build_hash_table */
- entrysize = sizeof(TupleHashEntryData) +
- numAggs * sizeof(AggStatePerGroupData);
- entrysize = MAXALIGN(entrysize);
-
- return entrysize;
+ return
+ /* key */
+ MAXALIGN(SizeofMinimalTupleHeader) +
+ MAXALIGN(tupleWidth) +
+ /* data */
+ MAXALIGN(sizeof(TupleHashEntryData) +
+ numAggs * sizeof(AggStatePerGroupData)) +
+ transitionSpace;
}
/*
- * Find or create a hashtable entry for the tuple group containing the current
- * tuple (already set in tmpcontext's outertuple slot), in the current grouping
- * set (which the caller must have selected - note that initialize_aggregate
- * depends on this).
- *
- * When called, CurrentMemoryContext should be the per-query context.
+ * Extract the attributes that make up the grouping key into the
+ * hashslot. This is necessary to compute the hash of the grouping key.
*/
-static TupleHashEntryData *
-lookup_hash_entry(AggState *aggstate)
+static void
+prepare_hash_slot(AggState *aggstate)
{
- TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
- TupleTableSlot *hashslot = perhash->hashslot;
- TupleHashEntryData *entry;
- bool isnew;
- int i;
+ TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ int i;
/* transfer just the needed columns into hashslot */
slot_getsomeattrs(inputslot, perhash->largestGrpColIdx);
@@ -1484,14 +1678,185 @@ lookup_hash_entry(AggState *aggstate)
hashslot->tts_isnull[i] = inputslot->tts_isnull[varNumber];
}
ExecStoreVirtualTuple(hashslot);
+}
+
+/*
+ * Recompile the expressions for advancing aggregates while hashing. This is
+ * necessary for certain kinds of state changes that affect the resulting
+ * expression. For instance, changing aggstate->hash_spilled or
+ * aggstate->ss.ps.outerops requires recompilation.
+ */
+static void
+hash_recompile_expressions(AggState *aggstate)
+{
+ AggStatePerPhase phase;
+
+ Assert(aggstate->aggstrategy == AGG_HASHED ||
+ aggstate->aggstrategy == AGG_MIXED);
+
+ if (aggstate->aggstrategy == AGG_HASHED)
+ phase = &aggstate->phases[0];
+ else /* AGG_MIXED */
+ phase = &aggstate->phases[1];
+
+ phase->evaltrans = ExecBuildAggTrans(
+ aggstate, phase,
+ aggstate->aggstrategy == AGG_MIXED ? true : false, /* dosort */
+ true, /* dohash */
+ aggstate->hash_spilled /* spilled */);
+}
+
+/*
+ * Calculate the hash value for a tuple. It's useful to do this outside of the
+ * hash table so that we can reuse saved hash values rather than recomputing.
+ */
+static uint32
+calculate_hash(AggState *aggstate)
+{
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleHashTable hashtable = perhash->hashtable;
+ MemoryContext oldContext;
+ uint32 hash;
+
+ /* set up data needed by hash and match functions */
+ hashtable->inputslot = perhash->hashslot;
+ hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ hashtable->cur_eq_func = hashtable->tab_eq_func;
+
+ /* Need to run the hash functions in short-lived context */
+ oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
+
+ hash = TupleHashTableHash(hashtable->hashtab, NULL);
+
+ MemoryContextSwitchTo(oldContext);
+
+ return hash;
+}
+
+/*
+ * Choose a reasonable number of buckets for the initial hash table size.
+ */
+static long
+hash_choose_num_buckets(AggState *aggstate, long ngroups, Size memory)
+{
+ long max_nbuckets;
+ int log2_ngroups;
+ long nbuckets;
+
+ max_nbuckets = memory / aggstate->hashentrysize;
+
+ /*
+ * Lowest power of two greater than ngroups, without exceeding
+ * max_nbuckets.
+ */
+ for (log2_ngroups = 1, nbuckets = 2;
+ nbuckets < ngroups && nbuckets < max_nbuckets;
+ log2_ngroups++, nbuckets <<= 1);
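+ /* e.g. ngroups = 1000 yields nbuckets = 1024, memory permitting */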
+
+ if (nbuckets > max_nbuckets && nbuckets > 2)
+ nbuckets >>= 1;
+
+ return nbuckets;
+}
+
+/*
+ * Determine the number of partitions to create when spilling.
+ */
+static int
+hash_choose_num_spill_partitions(uint64 input_groups, double hashentrysize)
+{
+ Size mem_needed;
+ int partition_limit;
+ int npartitions;
+
+ /*
+ * Avoid creating so many partitions that the memory requirements of the
+ * open partition files (estimated at BLCKSZ for buffering) are greater
+ * than 1/4 of work_mem.
+ */
+ partition_limit = (work_mem * 1024L * 0.25) / BLCKSZ;
+
+ /* pessimistically estimate that each input tuple creates a new group */
+ mem_needed = HASHAGG_PARTITION_FACTOR * input_groups * hashentrysize;
+
+ /* make enough partitions so that each one is likely to fit in memory */
+ npartitions = 1 + (mem_needed / (work_mem * 1024L));
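+ /*
+ * For example (illustrative numbers): with work_mem = 4MB, 100,000
+ * estimated input groups and hashentrysize = 120 bytes, mem_needed is
+ * about 17MB, giving 1 + (17MB / 4MB) = 5 partitions before the limits
+ * below are applied.
+ */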
+
+ if (npartitions > partition_limit)
+ npartitions = partition_limit;
+
+ if (npartitions < HASHAGG_MIN_PARTITIONS)
+ npartitions = HASHAGG_MIN_PARTITIONS;
+ if (npartitions > HASHAGG_MAX_PARTITIONS)
+ npartitions = HASHAGG_MAX_PARTITIONS;
+
+ return npartitions;
+}
+
+/*
+ * Find or create a hashtable entry for the tuple group containing the current
+ * tuple (already set in tmpcontext's outertuple slot), in the current grouping
+ * set (which the caller must have selected - note that initialize_aggregate
+ * depends on this).
+ *
+ * When called, CurrentMemoryContext should be the per-query context.
+ *
+ * If the hash table is at the memory limit, then only find existing hashtable
+ * entries; don't create new ones. If a tuple's group is not already present
+ * in the hash table for the current grouping set, return NULL and the caller
+ * will spill it to disk.
+ */
+static AggStatePerGroup
+lookup_hash_entry(AggState *aggstate, uint32 hash)
+{
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ TupleHashEntryData *entry;
+ bool isnew = false;
+ bool *p_isnew;
+
+ /* if hash table already spilled, don't create new entries */
+ p_isnew = aggstate->hash_no_new_groups ? NULL : &isnew;
/* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntry(perhash->hashtable, hashslot, &isnew);
+ entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, p_isnew,
+ hash);
+
+ if (entry == NULL)
+ return NULL;
if (isnew)
{
- AggStatePerGroup pergroup;
- int transno;
+ AggStatePerGroup pergroup;
+ int transno;
+
+ aggstate->hash_ngroups_current++;
+
+ aggstate->hash_mem_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+
+ if (aggstate->hash_mem_current > aggstate->hash_mem_peak)
+ aggstate->hash_mem_peak = aggstate->hash_mem_current;
+
+ /*
+ * Check whether we need to spill. For small values of work_mem, the
+ * empty hash tables might exceed it; so don't spill unless there's at
+ * least one group in the hash table.
+ */
+ if (aggstate->hash_ngroups_current > 0 &&
+ (aggstate->hash_mem_current > aggstate->hash_mem_limit ||
+ aggstate->hash_ngroups_current > aggstate->hash_ngroups_limit))
+ {
+ aggstate->hash_no_new_groups = true;
+ if (!aggstate->hash_spilled)
+ {
+ aggstate->hash_spilled = true;
+ aggstate->hash_spills = palloc0(
+ sizeof(HashAggSpill) * aggstate->num_hashes);
+
+ hash_recompile_expressions(aggstate);
+ }
+ }
pergroup = (AggStatePerGroup)
MemoryContextAlloc(perhash->hashtable->tablecxt,
@@ -1511,7 +1876,7 @@ lookup_hash_entry(AggState *aggstate)
}
}
- return entry;
+ return entry->additional;
}
/*
@@ -1519,18 +1884,74 @@ lookup_hash_entry(AggState *aggstate)
* returning an array of pergroup pointers suitable for advance_aggregates.
*
* Be aware that lookup_hash_entry can reset the tmpcontext.
+ *
+ * Some entries may be left NULL if we have reached the limit and have begun
+ * to spill. The same tuple will belong to a different group for each set, so
+ * it may match a group already in memory for one set and a group not in
+ * memory for another. If we have begun to spill and a tuple doesn't match
+ * a group in memory for a particular set, it will be spilled.
+ *
+ * NB: It's possible to spill the same tuple for several different grouping
+ * sets. This may seem wasteful, but it's actually a trade-off: if we spill
+ * the tuple multiple times for multiple grouping sets, it can be partitioned
+ * for each grouping set, making the refilling of the hash table very
+ * efficient.
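+ *
+ * For example, with GROUPING SETS ((a), (b)), a tuple may find its group
+ * for (a) already in memory while its group for (b) does not fit; in that
+ * case the tuple is written only to the spill partitions for set (b),
+ * partitioned by (b)'s hash value.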
*/
static void
lookup_hash_entries(AggState *aggstate)
{
- int numHashes = aggstate->num_hashes;
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
int setno;
- for (setno = 0; setno < numHashes; setno++)
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
{
+ uint32 hash;
+
select_current_set(aggstate, setno, true);
- pergroup[setno] = lookup_hash_entry(aggstate)->additional;
+ prepare_hash_slot(aggstate);
+ hash = calculate_hash(aggstate);
+ pergroup[setno] = lookup_hash_entry(aggstate, hash);
+
+ /* check to see if we need to spill the tuple for this grouping set */
+ if (pergroup[setno] == NULL)
+ {
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
+ TupleTableSlot *spillslot = aggstate->hash_spill_slot;
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ int idx;
+
+ if (spill->partitions == NULL)
+ hash_spill_init(spill, 0, perhash->aggnode->numGroups,
+ aggstate->hashentrysize);
+
+ /*
+ * Copy only necessary attributes to spill slot before writing to
+ * disk.
+ */
+ ExecClearTuple(spillslot);
+ memset(spillslot->tts_isnull, true,
+ spillslot->tts_tupleDescriptor->natts);
+
+ /* deserialize needed attributes */
+ if (perhash->numNeededColsInput > 0)
+ {
+ int maxNeededAttrIdx = perhash->numNeededColsInput - 1;
+ AttrNumber maxNeededAttr =
+ perhash->allNeededColsInput[maxNeededAttrIdx];
+ slot_getsomeattrs(inputslot, maxNeededAttr);
+ }
+
+ for (idx = 0; idx < perhash->numNeededColsInput; idx++)
+ {
+ AttrNumber att = perhash->allNeededColsInput[idx];
+ spillslot->tts_values[att-1] = inputslot->tts_values[att-1];
+ spillslot->tts_isnull[att-1] = inputslot->tts_isnull[att-1];
+ }
+
+ ExecStoreVirtualTuple(spillslot);
+ aggstate->hash_disk_used += hash_spill_tuple(spill, 0, spillslot, hash);
+ }
}
}
@@ -1853,6 +2274,12 @@ agg_retrieve_direct(AggState *aggstate)
if (TupIsNull(outerslot))
{
/* no more outer-plan tuples available */
+
+ /* if we built hash tables, finalize any spills */
+ if (aggstate->aggstrategy == AGG_MIXED &&
+ aggstate->current_phase == 1)
+ hash_finish_initial_spills(aggstate);
+
if (hasGroupingSets)
{
aggstate->input_done = true;
@@ -1955,6 +2382,9 @@ agg_fill_hash_table(AggState *aggstate)
ResetExprContext(aggstate->tmpcontext);
}
+ /* finalize spills, if any */
+ hash_finish_initial_spills(aggstate);
+
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
@@ -1962,11 +2392,175 @@ agg_fill_hash_table(AggState *aggstate)
&aggstate->perhash[0].hashiter);
}
+/*
+ * If any data was spilled during hash aggregation, reset the hash table and
+ * reprocess one batch of spilled data. After reprocessing a batch, the hash
+ * table will again contain data, ready to be consumed by
+ * agg_retrieve_hash_table_in_memory().
+ *
+ * Should only be called after all in-memory hash table entries have been
+ * consumed.
+ *
+ * Return false when input is exhausted and there's no more work to be done;
+ * otherwise return true.
+ */
+static bool
+agg_refill_hash_table(AggState *aggstate)
+{
+ HashAggBatch *batch;
+
+ if (aggstate->hash_batches == NIL)
+ return false;
+
+ /*
+ * Each spill file contains spilled data for only a single grouping
+ * set. We want to ignore all others, which is done by setting the other
+ * pergroups to NULL.
+ */
+ memset(aggstate->all_pergroups, 0,
+ sizeof(AggStatePerGroup) *
+ (aggstate->maxsets + aggstate->num_hashes));
+
+ batch = linitial(aggstate->hash_batches);
+ aggstate->hash_batches = list_delete_first(aggstate->hash_batches);
+
+ /*
+ * Free memory and rebuild a single hash table for this batch's grouping
+ * set.
+ */
+ ReScanExprContext(aggstate->hashcontext);
+
+ /* estimate the number of groups to be the number of input tuples */
+ build_hash_table(aggstate, batch->setno, batch->input_tuples);
+
+ Assert(aggstate->current_phase == 0);
+
+ if (aggstate->phase->aggstrategy == AGG_MIXED)
+ {
+ aggstate->current_phase = 1;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+ }
+
+ /*
+ * The first pass (agg_fill_hash_table) reads whatever kind of slot comes
+ * from the outer plan, and considers the slot fixed. But spilled tuples
+ * are always MinimalTuples, so if that's different from the outer plan we
+ * need to change it and recompile the aggregate expressions.
+ */
+ if (aggstate->ss.ps.outerops != &TTSOpsMinimalTuple)
+ {
+ aggstate->ss.ps.outerops = &TTSOpsMinimalTuple;
+ hash_recompile_expressions(aggstate);
+ }
+
+ for (;;)
+ {
+ TupleTableSlot *slot = aggstate->hash_spill_slot;
+ MinimalTuple tuple;
+ uint32 hash;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tuple = hash_read_spilled(batch->input_file, &hash);
+ if (tuple == NULL)
+ break;
+
+ ExecStoreMinimalTuple(tuple, slot, true);
+ aggstate->tmpcontext->ecxt_outertuple = slot;
+
+ select_current_set(aggstate, batch->setno, true);
+ prepare_hash_slot(aggstate);
+ aggstate->hash_pergroup[batch->setno] = lookup_hash_entry(aggstate, hash);
+
+ /* if there's no memory for a new group, spill */
+ if (aggstate->hash_pergroup[batch->setno] == NULL)
+ {
+ if (batch->spill.partitions == NULL)
+ {
+ /*
+ * Estimate the number of groups for this batch as the total
+ * number of tuples in its input file. Although that's a worst
+ * case, it's not bad here for two reasons: (1) overestimating
+ * is better than underestimating; and (2) we've already
+ * scanned the relation once, so it's likely that we've
+ * already finalized many of the common values.
+ */
+ hash_spill_init(&batch->spill, batch->input_bits,
+ batch->input_tuples, aggstate->hashentrysize);
+ }
+
+ aggstate->hash_disk_used += hash_spill_tuple(
+ &batch->spill, batch->input_bits, slot, hash);
+ }
+
+ /* Advance the aggregates (or combine functions) */
+ advance_aggregates(aggstate);
+
+ /*
+ * Reset per-input-tuple context after each tuple, but note that the
+ * hash lookups do this too
+ */
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ BufFileClose(batch->input_file);
+
+ aggstate->current_phase = 0;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+
+ /* update hashentrysize estimate based on contents */
+ if (aggstate->hash_ngroups_current > 0)
+ {
+ aggstate->hashentrysize = (double)aggstate->hash_mem_current /
+ (double)aggstate->hash_ngroups_current;
+ }
+
+ hash_spill_finish(aggstate, &batch->spill, batch->setno,
+ batch->input_bits);
+
+ pfree(batch);
+
+ /* Initialize to walk the first hash table */
+ select_current_set(aggstate, 0, true);
+ ResetTupleHashIterator(aggstate->perhash[0].hashtable,
+ &aggstate->perhash[0].hashiter);
+
+ return true;
+}
+
/*
* ExecAgg for hashed case: retrieving groups from hash table
+ *
+ * After exhausting in-memory tuples, also try refilling the hash table using
+ * previously-spilled tuples. Only returns NULL after all in-memory and
+ * spilled tuples are exhausted.
*/
static TupleTableSlot *
agg_retrieve_hash_table(AggState *aggstate)
+{
+ TupleTableSlot *result = NULL;
+
+ while (result == NULL)
+ {
+ result = agg_retrieve_hash_table_in_memory(aggstate);
+ if (result == NULL)
+ {
+ if (!agg_refill_hash_table(aggstate))
+ {
+ aggstate->agg_done = true;
+ break;
+ }
+ }
+ }
+
+ return result;
+}
+
+/*
+ * Retrieve the groups from the in-memory hash tables without considering any
+ * spilled tuples.
+ */
+static TupleTableSlot *
+agg_retrieve_hash_table_in_memory(AggState *aggstate)
{
ExprContext *econtext;
AggStatePerAgg peragg;
@@ -1995,7 +2589,7 @@ agg_retrieve_hash_table(AggState *aggstate)
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
- while (!aggstate->agg_done)
+ for (;;)
{
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -2026,8 +2620,6 @@ agg_retrieve_hash_table(AggState *aggstate)
}
else
{
- /* No more hashtables, so done */
- aggstate->agg_done = true;
return NULL;
}
}
@@ -2084,6 +2676,281 @@ agg_retrieve_hash_table(AggState *aggstate)
return NULL;
}
+/*
+ * hash_spill_init
+ *
+ * Called after we determined that spilling is necessary. Chooses the number
+ * of partitions to create, and initializes them.
+ */
+static void
+hash_spill_init(HashAggSpill *spill, int input_bits, uint64 input_groups,
+ double hashentrysize)
+{
+ int npartitions;
+ int partition_bits;
+
+ npartitions = hash_choose_num_spill_partitions(input_groups,
+ hashentrysize);
+ partition_bits = my_log2(npartitions);
+
+ /* make sure that we don't exhaust the hash bits */
+ if (partition_bits + input_bits >= 32)
+ partition_bits = 32 - input_bits;
+
+ /* number of partitions will be a power of two */
+ npartitions = 1L << partition_bits;
+
+ spill->partition_bits = partition_bits;
+ spill->n_partitions = npartitions;
+ spill->partitions = palloc0(sizeof(BufFile *) * npartitions);
+ spill->ntuples = palloc0(sizeof(int64) * npartitions);
+}
+
+/*
+ * hash_spill_tuple
+ *
+ * No room for new groups in the hash table. Save for later in the appropriate
+ * partition spill file.
+ */
+static Size
+hash_spill_tuple(HashAggSpill *spill, int input_bits, TupleTableSlot *slot,
+ uint32 hash)
+{
+ int partition;
+ MinimalTuple tuple;
+ BufFile *file;
+ int written;
+ int total_written = 0;
+ bool shouldFree;
+
+ Assert(spill->partitions != NULL);
+
+ /*
+ * When spilling tuples from the input, the slot will be virtual
+ * (containing only the needed attributes and the rest as NULL), and we
+ * need to materialize the minimal tuple. When spilling tuples
+ * recursively, the slot will hold a minimal tuple already.
+ */
+ tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
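+ /*
+ * Choose the partition using hash bits that have not been used yet: the
+ * top input_bits were already consumed by ancestor spills, so shift them
+ * out and take the next partition_bits. For example, with input_bits = 4
+ * and partition_bits = 6, the 5th through 10th most significant bits of
+ * the hash select one of 64 partitions.
+ */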
+ if (spill->partition_bits == 0)
+ partition = 0;
+ else
+ partition = (hash << input_bits) >>
+ (32 - spill->partition_bits);
+
+ spill->ntuples[partition]++;
+
+ if (spill->partitions[partition] == NULL)
+ spill->partitions[partition] = BufFileCreateTemp(false);
+ file = spill->partitions[partition];
+
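+ /*
+ * Each spilled record is the 32-bit hash value followed by the
+ * MinimalTuple; the tuple's own t_len field doubles as the length header
+ * when it is read back in hash_read_spilled().
+ */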
+ written = BufFileWrite(file, (void *) &hash, sizeof(uint32));
+ if (written != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+ total_written += written;
+
+ written = BufFileWrite(file, (void *) tuple, tuple->t_len);
+ if (written != tuple->t_len)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+ total_written += written;
+
+ if (shouldFree)
+ pfree(tuple);
+
+ return total_written;
+}
+
+/*
+ * hash_read_spilled
+ * Read the next tuple from a batch file. Returns NULL if no more.
+ */
+static MinimalTuple
+hash_read_spilled(BufFile *file, uint32 *hashp)
+{
+ MinimalTuple tuple;
+ uint32 t_len;
+ size_t nread;
+ uint32 hash;
+
+ nread = BufFileRead(file, &hash, sizeof(uint32));
+ if (nread == 0)
+ return NULL;
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+ if (hashp != NULL)
+ *hashp = hash;
+
+ nread = BufFileRead(file, &t_len, sizeof(t_len));
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+
+ tuple = (MinimalTuple) palloc(t_len);
+ tuple->t_len = t_len;
+
+ nread = BufFileRead(file, (void *)((char *)tuple + sizeof(uint32)),
+ t_len - sizeof(uint32));
+ if (nread != t_len - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+
+ return tuple;
+}
+
+/*
+ * hash_batch_new
+ *
+ * Construct a HashAggBatch item, which represents one iteration of HashAgg to
+ * be done. Should be called in the aggregate's memory context.
+ */
+static HashAggBatch *
+hash_batch_new(BufFile *input_file, int setno, int64 input_tuples,
+ int input_bits)
+{
+ HashAggBatch *batch = palloc0(sizeof(HashAggBatch));
+
+ batch->input_file = input_file;
+ batch->input_bits = input_bits;
+ batch->input_tuples = input_tuples;
+ batch->setno = setno;
+
+ /* batch->spill will be set only after spilling this batch */
+
+ return batch;
+}
+
+/*
+ * hash_finish_initial_spills
+ *
+ * After the initial pass over the input, hash aggregation may have spilled
+ * tuples to disk. If so, turn each spilled partition into a new batch that
+ * must later be executed.
+ */
+static void
+hash_finish_initial_spills(AggState *aggstate)
+{
+ int setno;
+
+ if (aggstate->hash_spills == NULL)
+ return;
+
+ /* update hashentrysize estimate based on contents */
+ Assert(aggstate->hash_ngroups_current > 0);
+ aggstate->hashentrysize = (double)aggstate->hash_mem_current /
+ (double)aggstate->hash_ngroups_current;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hash_spill_finish(aggstate, &aggstate->hash_spills[setno], setno, 0);
+
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+}
+
+/*
+ * hash_spill_finish
+ *
+ * Transform spill files into new batches.
+ */
+static void
+hash_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno, int input_bits)
+{
+ int i;
+
+ if (spill->n_partitions == 0)
+ return; /* didn't spill */
+
+ for (i = 0; i < spill->n_partitions; i++)
+ {
+ BufFile *file = spill->partitions[i];
+ MemoryContext oldContext;
+ HashAggBatch *new_batch;
+
+ /* partition is empty */
+ if (file == NULL)
+ continue;
+
+ /* rewind file for reading */
+ if (BufFileSeek(file, 0, 0L, SEEK_SET))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rewind HashAgg temporary file: %m")));
+
+ oldContext = MemoryContextSwitchTo(aggstate->ss.ps.state->es_query_cxt);
+ new_batch = hash_batch_new(file, setno, spill->ntuples[i],
+ spill->partition_bits + input_bits);
+ aggstate->hash_batches = lappend(aggstate->hash_batches, new_batch);
+ aggstate->hash_batches_used++;
+ MemoryContextSwitchTo(oldContext);
+ }
+
+ pfree(spill->ntuples);
+ pfree(spill->partitions);
+}
+
+/*
+ * Clear a HashAggSpill, free its memory, and close its files.
+ */
+static void
+hash_reset_spill(HashAggSpill *spill)
+{
+ int i;
+ for (i = 0; i < spill->n_partitions; i++)
+ {
+ BufFile *file = spill->partitions[i];
+
+ if (file != NULL)
+ BufFileClose(file);
+ }
+ if (spill->ntuples != NULL)
+ pfree(spill->ntuples);
+ if (spill->partitions != NULL)
+ pfree(spill->partitions);
+}
+
+/*
+ * Find and reset all active HashAggSpills.
+ */
+static void
+hash_reset_spills(AggState *aggstate)
+{
+ ListCell *lc;
+
+ if (aggstate->hash_spills != NULL)
+ {
+ int setno;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hash_reset_spill(&aggstate->hash_spills[setno]);
+
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+ }
+
+ foreach(lc, aggstate->hash_batches)
+ {
+ HashAggBatch *batch = (HashAggBatch*) lfirst(lc);
+ if (batch->input_file != NULL)
+ {
+ BufFileClose(batch->input_file);
+ batch->input_file = NULL;
+ }
+ hash_reset_spill(&batch->spill);
+ pfree(batch);
+ }
+ list_free(aggstate->hash_batches);
+ aggstate->hash_batches = NIL;
+}
+
+
/* -----------------
* ExecInitAgg
*
@@ -2268,6 +3135,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->ss.ps.outeropsfixed = false;
}
+ if (use_hashing)
+ aggstate->hash_spill_slot = ExecInitExtraTupleSlot(estate, scanDesc,
+ &TTSOpsMinimalTuple);
+
/*
* Initialize result type, slot and projection.
*/
@@ -2493,11 +3364,41 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
*/
if (use_hashing)
{
+ Plan *outerplan = outerPlan(node);
+
/* this is an array of pointers, not structures */
aggstate->hash_pergroup = pergroups;
+ aggstate->hashentrysize = hash_agg_entry_size(
+ aggstate->numtrans, outerplan->plan_width, node->transitionSpace);
+
+ /*
+ * Initialize the thresholds at which we stop creating new hash entries
+ * and start spilling.
+ */
+ if (hashagg_mem_overflow)
+ aggstate->hash_mem_limit = SIZE_MAX;
+ else if (work_mem * 1024L > HASHAGG_PARTITION_MEM * 2)
+ aggstate->hash_mem_limit =
+ work_mem * 1024L - HASHAGG_PARTITION_MEM;
+ else
+ aggstate->hash_mem_limit = work_mem * 1024L;
+
+ /*
+ * Set a separate limit on the maximum number of groups to
+ * create. This is important for aggregates where the initial state
+ * size is small, but aggtransspace is large.
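+ * For example, ARRAY_AGG states start small but are expected to grow
+ * substantially, so capping the number of groups based on the estimated
+ * per-entry size kicks in before the memory check would.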
+ */
+ if (hashagg_mem_overflow)
+ aggstate->hash_ngroups_limit = LONG_MAX;
+ else if (aggstate->hash_mem_limit > aggstate->hashentrysize)
+ aggstate->hash_ngroups_limit =
+ aggstate->hash_mem_limit / aggstate->hashentrysize;
+ else
+ aggstate->hash_ngroups_limit = 1;
+
find_hash_columns(aggstate);
- build_hash_table(aggstate);
+ build_hash_table(aggstate, -1, 0);
aggstate->table_filled = false;
}
@@ -2903,7 +3804,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
else
Assert(false);
- phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash);
+ phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash, false);
}
@@ -3398,6 +4299,8 @@ ExecEndAgg(AggState *node)
if (node->sort_out)
tuplesort_end(node->sort_out);
+ hash_reset_spills(node);
+
for (transno = 0; transno < node->numtrans; transno++)
{
AggStatePerTrans pertrans = &node->pertrans[transno];
@@ -3453,12 +4356,13 @@ ExecReScanAgg(AggState *node)
return;
/*
- * If we do have the hash table, and the subplan does not have any
- * parameter changes, and none of our own parameter changes affect
- * input expressions of the aggregated functions, then we can just
- * rescan the existing hash table; no need to build it again.
+ * If we do have the hash table, and it never spilled, and the subplan
+ * does not have any parameter changes, and none of our own parameter
+ * changes affect input expressions of the aggregated functions, then
+ * we can just rescan the existing hash table; no need to build it
+ * again.
*/
- if (outerPlan->chgParam == NULL &&
+ if (outerPlan->chgParam == NULL && !node->hash_spilled &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
ResetTupleHashIterator(node->perhash[0].hashtable,
@@ -3515,9 +4419,21 @@ ExecReScanAgg(AggState *node)
*/
if (node->aggstrategy == AGG_HASHED || node->aggstrategy == AGG_MIXED)
{
+ hash_reset_spills(node);
+
+ node->hash_spilled = false;
+ node->hash_no_new_groups = false;
+ node->hash_mem_current = 0;
+ node->hash_ngroups_current = 0;
+
+ /* reset stats */
+ node->hash_mem_peak = 0;
+ node->hash_disk_used = 0;
+ node->hash_batches_used = 0;
+
ReScanExprContext(node->hashcontext);
/* Rebuild an empty hash table */
- build_hash_table(node);
+ build_hash_table(node, -1, 0);
node->table_filled = false;
/* iterator will be reset when the table is filled */
}
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index ffd887c71aa..93517d03819 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -2082,6 +2082,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_INIT_TRANS:
+ case EEOP_AGG_INIT_TRANS_SPILLED:
{
AggState *aggstate;
AggStatePerTrans pertrans;
@@ -2092,6 +2093,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_allpergroupsp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_setoff,
v_transno;
@@ -2119,11 +2121,32 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_init_trans.setoff);
v_transno = l_int32_const(op->d.agg_init_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_notransvalue = l_bb_before_v(
+ opblocks[i + 1], "op.%d.check_notransvalue", i);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(
+ b, v_pergroup_allaggs, TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[i + 1],
+ b_check_notransvalue);
+
+ LLVMPositionBuilderAtEnd(b, b_check_notransvalue);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_notransvalue =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_NOTRANSVALUE,
@@ -2180,6 +2203,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_STRICT_TRANS_CHECK:
+ case EEOP_AGG_STRICT_TRANS_CHECK_SPILLED:
{
AggState *aggstate;
LLVMValueRef v_setoff,
@@ -2190,6 +2214,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_transnull;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
int jumpnull = op->d.agg_strict_trans_check.jumpnull;
@@ -2209,11 +2234,32 @@ llvm_compile_expr(ExprState *state)
l_int32_const(op->d.agg_strict_trans_check.setoff);
v_transno =
l_int32_const(op->d.agg_strict_trans_check.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_transnull = l_bb_before_v(
+ opblocks[i + 1], "op.%d.check_transnull", i);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[jumpnull],
+ b_check_transnull);
+
+ LLVMPositionBuilderAtEnd(b, b_check_transnull);
+ }
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_transnull =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_TRANSVALUEISNULL,
@@ -2229,7 +2275,9 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_PLAIN_TRANS_BYVAL:
+ case EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED:
case EEOP_AGG_PLAIN_TRANS:
+ case EEOP_AGG_PLAIN_TRANS_SPILLED:
{
AggState *aggstate;
AggStatePerTrans pertrans;
@@ -2255,6 +2303,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_pertransp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_retval;
@@ -2282,10 +2331,33 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_trans.setoff);
v_transno = l_int32_const(op->d.agg_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED ||
+ opcode == EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_advance_transval = l_bb_before_v(
+ opblocks[i + 1], "op.%d.advance_transval", i);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[i + 1],
+ b_advance_transval);
+
+ LLVMPositionBuilderAtEnd(b, b_advance_transval);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_fcinfo = l_ptr_const(fcinfo,
l_ptr(StructFunctionCallInfoData));
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 3f0d2899635..1b3ea3321c6 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -77,6 +77,7 @@
#include "access/htup_details.h"
#include "access/tsmapi.h"
#include "executor/executor.h"
+#include "executor/nodeAgg.h"
#include "executor/nodeHash.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -2154,7 +2155,7 @@ cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples)
+ double input_tuples, double input_width)
{
double output_tuples;
Cost startup_cost;
@@ -2220,20 +2221,69 @@ cost_agg(Path *path, PlannerInfo *root,
total_cost += aggcosts->finalCost.per_tuple * numGroups;
total_cost += cpu_tuple_cost * numGroups;
output_tuples = numGroups;
+
+ /*
+ * We don't need to compute the disk costs of hash aggregation here,
+ * because the planner does not choose hash aggregation for grouping
+ * sets that it doesn't expect to fit in memory.
+ */
}
else
{
+ double hashentrysize = hash_agg_entry_size(
+ aggcosts->numAggs, input_width, aggcosts->transitionSpace);
+ double nbatches =
+ (numGroups * hashentrysize) / (work_mem * 1024L);
+ double pages_written = 0.0;
+ double pages_read = 0.0;
+
/* must be AGG_HASHED */
startup_cost = input_total_cost;
if (!enable_hashagg)
startup_cost += disable_cost;
startup_cost += aggcosts->transCost.startup;
startup_cost += aggcosts->transCost.per_tuple * input_tuples;
+ /* cost of computing hash value */
startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;
startup_cost += aggcosts->finalCost.startup;
+
+ /*
+ * Add the disk costs of hash aggregation that spills to disk.
+ *
+ * Groups that go into the hash table stay in memory until finalized,
+ * so spilling and reprocessing tuples doesn't incur additional
+ * invocations of transCost or finalCost. Furthermore, the computed
+ * hash value is stored with the spilled tuples, so we don't incur
+ * extra invocations of the hash function.
+ *
+ * The disk cost depends on the depth of recursion: each level requires
+ * one additional write and subsequent read of each tuple. Writes are
+ * random and reads are sequential, so overall we assume half of the
+ * page accesses are random and half sequential.
+ *
+ * Hash Agg begins returning tuples after the first batch is
+ * complete. Accrue writes (spilled tuples) to startup_cost and reads
+ * only to total_cost. This is not perfect; it penalizes startup_cost
+ * in the case of recursive spills. Also, transCost is entirely
+ * counted in startup_cost; but some of that cost could be counted
+ * only against total_cost.
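+ *
+ * For example, if the estimated hash table size is about 10x work_mem,
+ * nbatches is roughly 10; as long as we can create at least that many
+ * partitions, depth comes out to 1 and each input page is written once
+ * and read back once beyond the original scan.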
+ */
+ if (!hashagg_mem_overflow && nbatches > 1.0)
+ {
+ double depth;
+ double pages;
+
+ pages = relation_byte_size(input_tuples, input_width) / BLCKSZ;
+ depth = ceil( log(nbatches - 1) / log(HASHAGG_MAX_PARTITIONS) );
+ pages_written = pages_read = pages * depth;
+ startup_cost += pages_written * random_page_cost;
+ }
+
total_cost = startup_cost;
total_cost += aggcosts->finalCost.per_tuple * numGroups;
+ /* cost of retrieving from hash table */
total_cost += cpu_tuple_cost * numGroups;
+ total_cost += pages_read * seq_page_cost;
output_tuples = numGroups;
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 8c8b4f8ed69..465b933f2ec 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1644,6 +1644,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
NIL,
NIL,
best_path->path.rows,
+ 0,
subplan);
}
else
@@ -2096,6 +2097,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
NIL,
NIL,
best_path->numGroups,
+ best_path->transitionSpace,
subplan);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -2257,6 +2259,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
NIL,
rollup->numGroups,
+ best_path->transitionSpace,
sort_plan);
/*
@@ -2295,6 +2298,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
chain,
rollup->numGroups,
+ best_path->transitionSpace,
subplan);
/* Copy cost data from Path to Plan */
@@ -6194,8 +6198,8 @@ Agg *
make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree)
+ List *groupingSets, List *chain, double dNumGroups,
+ int32 transitionSpace, Plan *lefttree)
{
Agg *node = makeNode(Agg);
Plan *plan = &node->plan;
@@ -6211,6 +6215,7 @@ make_agg(List *tlist, List *qual,
node->grpOperators = grpOperators;
node->grpCollations = grpCollations;
node->numGroups = numGroups;
+ node->transitionSpace = transitionSpace;
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
node->groupingSets = groupingSets;
node->chain = chain;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index cb54b15507b..a5686d822b3 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4867,13 +4867,8 @@ create_distinct_paths(PlannerInfo *root,
allow_hash = false; /* policy-based decision not to hash */
else
{
- Size hashentrysize;
-
- /* Estimate per-hash-entry space at tuple width... */
- hashentrysize = MAXALIGN(cheapest_input_path->pathtarget->width) +
- MAXALIGN(SizeofMinimalTupleHeader);
- /* plus the per-hash-entry overhead */
- hashentrysize += hash_agg_entry_size(0);
+ Size hashentrysize = hash_agg_entry_size(
+ 0, cheapest_input_path->pathtarget->width, 0);
/* Allow hashing only if hashtable is predicted to fit in work_mem */
allow_hash = (hashentrysize * numDistinctRows <= work_mem * 1024L);
@@ -6533,7 +6528,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
* were unable to sort above, then we'd better generate a Path, so
* that we at least have one.
*/
- if (hashaggtablesize < work_mem * 1024L ||
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L ||
grouped_rel->pathlist == NIL)
{
/*
@@ -6566,7 +6562,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
agg_final_costs,
dNumGroups);
- if (hashaggtablesize < work_mem * 1024L)
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6835,7 +6832,7 @@ create_partial_grouping_paths(PlannerInfo *root,
* Tentatively produce a partial HashAgg Path, depending on if it
* looks as if the hash table will fit in work_mem.
*/
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_total_path != NULL)
{
add_path(partially_grouped_rel, (Path *)
@@ -6862,7 +6859,7 @@ create_partial_grouping_paths(PlannerInfo *root,
dNumPartialPartialGroups);
/* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_partial_path != NULL)
{
add_partial_path(partially_grouped_rel, (Path *)
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index b01c9bbae7d..5f8fc50f8d3 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1072,7 +1072,7 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
numGroupCols, dNumGroups,
NIL,
input_path->startup_cost, input_path->total_cost,
- input_path->rows);
+ input_path->rows, input_path->pathtarget->width);
/*
* Now for the sorted case. Note that the input is *always* unsorted,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 60c93ee7c59..1cb4fed1f81 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1704,7 +1704,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
NIL,
subpath->startup_cost,
subpath->total_cost,
- rel->rows);
+ rel->rows,
+ subpath->pathtarget->width);
}
if (sjinfo->semi_can_btree && sjinfo->semi_can_hash)
@@ -2949,6 +2950,7 @@ create_agg_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->aggsplit = aggsplit;
pathnode->numGroups = numGroups;
+ pathnode->transitionSpace = aggcosts ? aggcosts->transitionSpace : 0;
pathnode->groupClause = groupClause;
pathnode->qual = qual;
@@ -2957,7 +2959,7 @@ create_agg_path(PlannerInfo *root,
list_length(groupClause), numGroups,
qual,
subpath->startup_cost, subpath->total_cost,
- subpath->rows);
+ subpath->rows, subpath->pathtarget->width);
/* add tlist eval cost for each output row */
pathnode->path.startup_cost += target->cost.startup;
@@ -3036,6 +3038,7 @@ create_groupingsets_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->rollups = rollups;
pathnode->qual = having_qual;
+ pathnode->transitionSpace = agg_costs ? agg_costs->transitionSpace : 0;
Assert(rollups != NIL);
Assert(aggstrategy != AGG_PLAIN || list_length(rollups) == 1);
@@ -3067,7 +3070,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
subpath->startup_cost,
subpath->total_cost,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
is_first = false;
if (!rollup->is_hashed)
is_first_sort = false;
@@ -3090,7 +3094,8 @@ create_groupingsets_path(PlannerInfo *root,
rollup->numGroups,
having_qual,
0.0, 0.0,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
if (!rollup->is_hashed)
is_first_sort = false;
}
@@ -3115,7 +3120,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
sort_path.startup_cost,
sort_path.total_cost,
- sort_path.rows);
+ sort_path.rows,
+ subpath->pathtarget->width);
}
pathnode->path.total_cost += agg_path.total_cost;
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index ff02b5aafab..45c715385c7 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3526,16 +3526,8 @@ double
estimate_hashagg_tablesize(Path *path, const AggClauseCosts *agg_costs,
double dNumGroups)
{
- Size hashentrysize;
-
- /* Estimate per-hash-entry space at tuple width... */
- hashentrysize = MAXALIGN(path->pathtarget->width) +
- MAXALIGN(SizeofMinimalTupleHeader);
-
- /* plus space for pass-by-ref transition values... */
- hashentrysize += agg_costs->transitionSpace;
- /* plus the per-hash-entry overhead */
- hashentrysize += hash_agg_entry_size(agg_costs->numAggs);
+ Size hashentrysize = hash_agg_entry_size(
+ agg_costs->numAggs, path->pathtarget->width, agg_costs->transitionSpace);
/*
* Note that this disregards the effect of fill-factor and growth policy
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index d21dbead0a2..e50a7ad6712 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -226,9 +226,13 @@ typedef enum ExprEvalOp
EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
EEOP_AGG_INIT_TRANS,
+ EEOP_AGG_INIT_TRANS_SPILLED,
EEOP_AGG_STRICT_TRANS_CHECK,
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
EEOP_AGG_PLAIN_TRANS_BYVAL,
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
EEOP_AGG_PLAIN_TRANS,
+ EEOP_AGG_PLAIN_TRANS_SPILLED,
EEOP_AGG_ORDERED_TRANS_DATUM,
EEOP_AGG_ORDERED_TRANS_TUPLE,
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 6298c7c8cad..e8d88f2ce26 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -140,11 +140,17 @@ extern TupleHashTable BuildTupleHashTableExt(PlanState *parent,
extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
bool *isnew);
+extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable,
+ TupleTableSlot *slot,
+ bool *isnew, uint32 hash);
extern TupleHashEntry FindTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
ExprState *eqcomp,
FmgrInfo *hashfunctions);
+extern uint32 TupleHashTableHash(struct tuplehash_hash *tb,
+ const MinimalTuple tuple);
extern void ResetTupleHashTable(TupleHashTable hashtable);
+extern void DestroyTupleHashTable(TupleHashTable hashtable);
/*
* prototypes from functions in execJunk.c
@@ -250,7 +256,7 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
- bool doSort, bool doHash);
+ bool doSort, bool doHash, bool spilled);
extern ExprState *ExecBuildGroupingEqual(TupleDesc ldesc, TupleDesc rdesc,
const TupleTableSlotOps *lops, const TupleTableSlotOps *rops,
int numCols,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 68c9e5f5400..29bbb9b0d09 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -302,13 +302,17 @@ typedef struct AggStatePerHashData
AttrNumber *hashGrpColIdxInput; /* hash col indices in input slot */
AttrNumber *hashGrpColIdxHash; /* indices in hash table tuples */
Agg *aggnode; /* original Agg node, for numGroups etc. */
+ int numNeededColsInput; /* number of columns needed from input */
+ AttrNumber *allNeededColsInput; /* all columns needed from input */
} AggStatePerHashData;
+#define HASHAGG_MAX_PARTITIONS 256
extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags);
extern void ExecEndAgg(AggState *node);
extern void ExecReScanAgg(AggState *node);
-extern Size hash_agg_entry_size(int numAggs);
+extern Size hash_agg_entry_size(int numAggs, Size tupleWidth,
+ Size transitionSpace);
#endif /* NODEAGG_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0c2a77aaf8d..8d4a36a3538 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2066,13 +2066,30 @@ typedef struct AggState
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
/* these fields are used in AGG_HASHED and AGG_MIXED modes: */
bool table_filled; /* hash table filled yet? */
- int num_hashes;
+ int num_hashes; /* number of hash tables active at once */
+ bool hash_spilled; /* any hash table ever spilled? */
+ double hashentrysize; /* estimate revised during execution */
+ struct HashAggSpill *hash_spills; /* HashAggSpill for each hash table,
+ exists only during first pass if spilled */
+ TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
+ bool hash_no_new_groups; /* we hit a limit during the current batch
+ and we must not create new groups */
+ Size hash_mem_current; /* current hash table memory usage */
+ Size hash_mem_limit; /* limit before spilling hash table */
+ Size hash_mem_peak; /* peak hash table memory usage */
+ long hash_ngroups_current; /* number of groups currently in
+ memory in all hash tables */
+ long hash_ngroups_limit; /* limit before spilling hash table */
+ uint64 hash_disk_used; /* bytes of disk space used */
+ int hash_batches_used; /* batches used during entire execution */
+ List *hash_batches; /* hash batches remaining to be processed */
+
AggStatePerHash perhash; /* array of per-hashtable data */
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 34
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 47
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 31b631cfe0f..625a8aecb77 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1663,6 +1663,7 @@ typedef struct AggPath
AggStrategy aggstrategy; /* basic strategy, see nodes.h */
AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
double numGroups; /* estimated number of groups in input */
+ int32 transitionSpace; /* estimated transition state size */
List *groupClause; /* a list of SortGroupClause's */
List *qual; /* quals (HAVING quals), if any */
} AggPath;
@@ -1700,6 +1701,7 @@ typedef struct GroupingSetsPath
AggStrategy aggstrategy; /* basic strategy */
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
+ int32 transitionSpace; /* estimated transition state size */
} GroupingSetsPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 477b4da192c..ea3e0a643ec 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -813,6 +813,7 @@ typedef struct Agg
Oid *grpOperators; /* equality operators to compare with */
Oid *grpCollations;
long numGroups; /* estimated number of groups in input */
+ int32 transitionSpace; /* estimated transition state size */
Bitmapset *aggParams; /* IDs of Params used in Aggref inputs */
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
List *groupingSets; /* grouping sets to use */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b72e2d08290..fa6ad5e5857 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -115,7 +115,7 @@ extern void cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples);
+ double input_tuples, double input_width);
extern void cost_windowagg(Path *path, PlannerInfo *root,
List *windowFuncs, int numPartCols, int numOrderCols,
Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index e7aaddd50d6..e20a66404ba 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -54,8 +54,8 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree);
+ List *groupingSets, List *chain, double dNumGroups,
+ int32 transitionSpace, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
/*
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index d091ae4c6e4..92e5dbad77e 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2331,3 +2331,95 @@ explain (costs off)
-> Seq Scan on onek
(8 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+set work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------------
+ GroupAggregate
+ Group Key: ((g % 100000))
+ -> Sort
+ Sort Key: ((g % 100000))
+ -> Function Scan on generate_series g
+(5 rows)
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+set jit_above_cost to default;
+create table agg_group_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_group_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+-- Produce results with hash aggregation
+set enable_hashagg = true;
+set enable_sort = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 100000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+set jit_above_cost to default;
+create table agg_hash_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_hash_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+set enable_sort = true;
+set work_mem to default;
+-- Compare group aggregation results to hash aggregation results
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index c1f802c88a7..767f60a96c7 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1633,4 +1633,127 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
| 1 | 2
(4 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+-- Produce results with hash aggregation.
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+set enable_sort = true;
+set work_mem to default;
+-- Compare results
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+ g100 | g10 | unnest | c | m
+------+-----+--------+---+---
+(0 rows)
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
-- end
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1de..11c6f50fbfa 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -148,6 +148,68 @@ SELECT count(*) FROM
4
(1 row)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+SET enable_hashagg=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------------
+ Unique
+ -> Sort
+ Sort Key: ((g % 1000))
+ -> Function Scan on generate_series g
+(4 rows)
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_hashagg=TRUE;
+-- Produce results with hash aggregation.
+SET enable_sort=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 1000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_sort=TRUE;
+SET work_mem TO DEFAULT;
+-- Compare results
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 17fb256aec5..bcd336c5812 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1017,3 +1017,91 @@ select v||'a', case when v||'a' = 'aa' then 1 else 0 end, count(*)
explain (costs off)
select 1 from tenk1
where (hundred, thousand) in (select twothousand, twothousand from onek);
+
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+set work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+set jit_above_cost to default;
+
+create table agg_group_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_group_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+-- Produce results with hash aggregation
+
+set enable_hashagg = true;
+set enable_sort = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+set jit_above_cost to default;
+
+create table agg_hash_2 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_hash_3 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare group aggregation results to hash aggregation results
+
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 95ac3fb52f6..bf8bce6ed31 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -441,4 +441,103 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
from unnest(array[1,1], array['a','b']) u(i,v)
group by rollup(i, v||'a') order by 1,3;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+-- Produce results with hash aggregation.
+
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare results
+
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
+
-- end
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449e..33102744ebf 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -45,6 +45,68 @@ SELECT count(*) FROM
SELECT count(*) FROM
(SELECT DISTINCT two, four, two FROM tenk1) ss;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+SET enable_hashagg=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_hashagg=TRUE;
+
+-- Produce results with hash aggregation.
+
+SET enable_sort=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_sort=TRUE;
+
+SET work_mem TO DEFAULT;
+
+-- Compare results
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
+
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
Hi, Jeff
I tried to use the logical tape APIs for hash agg spilling, based on
your 1220 version.
It turns out it doesn't make much of a performance difference with the
default 8K block size (which might be my patch's problem), but it saves a
lot of disk space (though not I/O) because I force the respilling to reuse
the same LogicalTapeSet.
Logtape APIs with default block size 8K:
```
postgres=# EXPLAIN ANALYZE SELECT avg(g) FROM generate_series(0,5000000) g GROUP BY g;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=75000.02..75002.52 rows=200 width=36) (actual time=7701.706..24473.002 rows=5000001 loops=1)
Group Key: g
Memory Usage: 4096kB Batches: 516 Disk: 116921kB
-> Function Scan on generate_series g (cost=0.00..50000.01 rows=5000001 width=4) (actual time=1611.829..3253.150 rows=5000001 loops=1)
Planning Time: 0.194 ms
Execution Time: 25129.239 ms
(6 rows)
```
Bare BufFile APIs:
```
postgres=# EXPLAIN ANALYZE SELECT avg(g) FROM generate_series(0,5000000) g GROUP BY g;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=75000.02..75002.52 rows=200 width=36) (actual time=7339.835..24472.466 rows=5000001 loops=1)
Group Key: g
Memory Usage: 4096kB Batches: 516 Disk: 232773kB
-> Function Scan on generate_series g (cost=0.00..50000.01 rows=5000001 width=4) (actual time=1580.057..3128.749 rows=5000001 loops=1)
Planning Time: 0.769 ms
Execution Time: 26696.502 ms
(6 rows)
```
Even so, I'm not sure which API is better, because we should avoid
respilling as much as we can in the planner, and hash join uses the bare
BufFile API.
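To make the idea concrete, the write path for a spilled tuple ends up
looking roughly like this (a simplified sketch only; the helper name is
mine, and the HashAggSpill fields are the ones from the diff below):
```c
/*
 * Simplified sketch: write one spilled tuple to its partition's tape in
 * the shared LogicalTapeSet, rather than to a per-partition BufFile.
 * Because every partition -- and any later respill, via
 * LogicalTapeSetExtend() -- lives in the same tape set, freed blocks can
 * be recycled instead of each temporary file keeping its own space.
 */
static void
hash_spill_tuple_to_tape(HashAggSpill *spill, int partition,
                         MinimalTuple tuple, uint32 hash)
{
    int tapenum = spill->partitions[partition]; /* tape number, not a BufFile */

    LogicalTapeWrite(spill->lts, tapenum, (void *) &hash, sizeof(uint32));
    LogicalTapeWrite(spill->lts, tapenum, (void *) tuple, tuple->t_len);
    spill->ntuples[partition]++;
}
```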
Attached is my hacky and probably-not-robust diff for your reference.
--
Adam Lee
Attachments:
hashagg_logtape.diff
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index f1989b10ea..8c743d7561 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -247,6 +247,7 @@
#include "utils/datum.h"
#include "utils/dynahash.h"
#include "utils/expandeddatum.h"
+#include "utils/logtape.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/syscache.h"
@@ -288,8 +289,9 @@ typedef struct HashAggSpill
int n_partitions; /* number of output partitions */
int partition_bits; /* number of bits for partition mask
log2(n_partitions) parent partition bits */
- BufFile **partitions; /* output partition files */
+ int *partitions; /* output logtape numbers */
int64 *ntuples; /* number of tuples in each partition */
+ LogicalTapeSet *lts;
} HashAggSpill;
/*
@@ -298,11 +300,12 @@ typedef struct HashAggSpill
*/
typedef struct HashAggBatch
{
- BufFile *input_file; /* input partition */
+ int input_tape; /* input partition */
int input_bits; /* number of bits for input partition mask */
int64 input_tuples; /* number of tuples in this batch */
int setno; /* grouping set */
HashAggSpill spill; /* spill output */
+ LogicalTapeSet *lts;
} HashAggBatch;
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
@@ -359,9 +362,8 @@ static void hash_spill_init(HashAggSpill *spill, int input_bits,
uint64 input_tuples, double hashentrysize);
static Size hash_spill_tuple(HashAggSpill *spill, int input_bits,
TupleTableSlot *slot, uint32 hash);
-static MinimalTuple hash_read_spilled(BufFile *file, uint32 *hashp);
-static HashAggBatch *hash_batch_new(BufFile *input_file, int setno,
- int64 input_tuples, int input_bits);
+static MinimalTuple hash_read_spilled(LogicalTapeSet *lts, int tapenum, uint32 *hashp);
+static HashAggBatch *hash_batch_new(LogicalTapeSet *lts, int tapenum, int setno, int64 input_tuples, int input_bits);
static void hash_finish_initial_spills(AggState *aggstate);
static void hash_spill_finish(AggState *aggstate, HashAggSpill *spill,
int setno, int input_bits);
@@ -2462,7 +2464,7 @@ agg_refill_hash_table(AggState *aggstate)
CHECK_FOR_INTERRUPTS();
- tuple = hash_read_spilled(batch->input_file, &hash);
+ tuple = hash_read_spilled(batch->lts, batch->input_tape, &hash);
if (tuple == NULL)
break;
@@ -2490,8 +2492,8 @@ agg_refill_hash_table(AggState *aggstate)
batch->input_tuples, aggstate->hashentrysize);
}
- aggstate->hash_disk_used += hash_spill_tuple(
- &batch->spill, batch->input_bits, slot, hash);
+ //aggstate->hash_disk_used +=
+ hash_spill_tuple(&batch->spill, batch->input_bits, slot, hash);
}
/* Advance the aggregates (or combine functions) */
@@ -2504,8 +2506,6 @@ agg_refill_hash_table(AggState *aggstate)
ResetExprContext(aggstate->tmpcontext);
}
- BufFileClose(batch->input_file);
-
aggstate->current_phase = 0;
aggstate->phase = &aggstate->phases[aggstate->current_phase];
@@ -2690,6 +2690,9 @@ hash_spill_init(HashAggSpill *spill, int input_bits, uint64 input_groups,
{
int npartitions;
int partition_bits;
+ int i;
+ int j;
+ int old_npartitions;
npartitions = hash_choose_num_spill_partitions(input_groups,
hashentrysize);
@@ -2702,10 +2705,33 @@ hash_spill_init(HashAggSpill *spill, int input_bits, uint64 input_groups,
/* number of partitions will be a power of two */
npartitions = 1L << partition_bits;
- spill->partition_bits = partition_bits;
- spill->n_partitions = npartitions;
- spill->partitions = palloc0(sizeof(BufFile *) * npartitions);
- spill->ntuples = palloc0(sizeof(int64) * npartitions);
+ if (spill->lts == NULL)
+ {
+ spill->partition_bits = partition_bits;
+ spill->n_partitions = npartitions;
+ spill->partitions = palloc0(sizeof(int) * npartitions);
+ for (i = 0; i < spill->n_partitions; ++i)
+ {
+ spill->partitions[i] = i;
+ }
+ spill->ntuples = palloc0(sizeof(int64) * spill->n_partitions);
+ spill->lts = LogicalTapeSetCreate(npartitions, NULL, NULL, 0); // TODO: worker is 0?
+ }
+ else // respill
+ {
+ old_npartitions = LogicalTapeGetNTapes(spill->lts);
+ spill->partition_bits = my_log2(npartitions);
+ spill->n_partitions = (1L << spill->partition_bits);
+ spill->partitions = palloc0(sizeof(int) * npartitions);
+ j = old_npartitions;
+ for (i = 0; i < spill->n_partitions; ++i)
+ {
+ spill->partitions[i] = j;
+ j++;
+ }
+ spill->ntuples = palloc0(sizeof(int64) * spill->n_partitions);
+ spill->lts = LogicalTapeSetExtend(spill->lts, spill->n_partitions);
+ }
}
/*
@@ -2720,8 +2746,6 @@ hash_spill_tuple(HashAggSpill *spill, int input_bits, TupleTableSlot *slot,
{
int partition;
MinimalTuple tuple;
- BufFile *file;
- int written;
int total_written = 0;
bool shouldFree;
@@ -2743,23 +2767,11 @@ hash_spill_tuple(HashAggSpill *spill, int input_bits, TupleTableSlot *slot,
spill->ntuples[partition]++;
- if (spill->partitions[partition] == NULL)
- spill->partitions[partition] = BufFileCreateTemp(false);
- file = spill->partitions[partition];
-
- written = BufFileWrite(file, (void *) &hash, sizeof(uint32));
- if (written != sizeof(uint32))
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not write to HashAgg temporary file: %m")));
- total_written += written;
+ LogicalTapeWrite(spill->lts, spill->partitions[partition], (void *) &hash, sizeof(uint32));
+ total_written += sizeof(uint32);
- written = BufFileWrite(file, (void *) tuple, tuple->t_len);
- if (written != tuple->t_len)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not write to HashAgg temporary file: %m")));
- total_written += written;
+ LogicalTapeWrite(spill->lts, spill->partitions[partition], (void *) tuple, tuple->t_len);
+ total_written += tuple->t_len;
if (shouldFree)
pfree(tuple);
@@ -2772,38 +2784,37 @@ hash_spill_tuple(HashAggSpill *spill, int input_bits, TupleTableSlot *slot,
* read the next tuple from a batch file. Return NULL if no more.
*/
static MinimalTuple
-hash_read_spilled(BufFile *file, uint32 *hashp)
+hash_read_spilled(LogicalTapeSet *lts, int tapenum, uint32 *hashp)
{
MinimalTuple tuple;
uint32 t_len;
size_t nread;
uint32 hash;
- nread = BufFileRead(file, &hash, sizeof(uint32));
+ nread = LogicalTapeRead(lts, tapenum, &hash, sizeof(uint32));
if (nread == 0)
return NULL;
if (nread != sizeof(uint32))
ereport(ERROR,
(errcode_for_file_access(),
- errmsg("could not read from HashAgg temporary file: %m")));
+ errmsg("could not read the hash from HashAgg spilled tape: %m")));
if (hashp != NULL)
*hashp = hash;
- nread = BufFileRead(file, &t_len, sizeof(t_len));
+ nread = LogicalTapeRead(lts, tapenum, &t_len, sizeof(t_len));
if (nread != sizeof(uint32))
ereport(ERROR,
(errcode_for_file_access(),
- errmsg("could not read from HashAgg temporary file: %m")));
+ errmsg("could not read the t_len from HashAgg spilled tape: %m")));
tuple = (MinimalTuple) palloc(t_len);
tuple->t_len = t_len;
- nread = BufFileRead(file, (void *)((char *)tuple + sizeof(uint32)),
- t_len - sizeof(uint32));
+ nread = LogicalTapeRead(lts, tapenum, (void *)((char *)tuple + sizeof(uint32)), t_len - sizeof(uint32));
if (nread != t_len - sizeof(uint32))
ereport(ERROR,
(errcode_for_file_access(),
- errmsg("could not read from HashAgg temporary file: %m")));
+ errmsg("could not read the data from HashAgg spilled tape: %m")));
return tuple;
}
@@ -2815,15 +2826,17 @@ hash_read_spilled(BufFile *file, uint32 *hashp)
* be done. Should be called in the aggregate's memory context.
*/
static HashAggBatch *
-hash_batch_new(BufFile *input_file, int setno, int64 input_tuples,
+hash_batch_new(LogicalTapeSet *lts, int tapenum, int setno, int64 input_tuples,
int input_bits)
{
HashAggBatch *batch = palloc0(sizeof(HashAggBatch));
- batch->input_file = input_file;
+ batch->input_tape = tapenum;
batch->input_bits = input_bits;
batch->input_tuples = input_tuples;
batch->setno = setno;
+ batch->lts = lts;
+ batch->spill.lts = lts; // share same logical tape set
/* batch->spill will be set only after spilling this batch */
@@ -2860,7 +2873,7 @@ hash_finish_initial_spills(AggState *aggstate)
/*
* hash_spill_finish
*
- * Transform spill files into new batches.
+ * Transform spill files into new batches. // XXX so the partitions are empty and ready to be reused
*/
static void
hash_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno, int input_bits)
@@ -2872,28 +2885,20 @@ hash_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno, int input_
for (i = 0; i < spill->n_partitions; i++)
{
- BufFile *file = spill->partitions[i];
MemoryContext oldContext;
HashAggBatch *new_batch;
- /* partition is empty */
- if (file == NULL)
- continue;
-
- /* rewind file for reading */
- if (BufFileSeek(file, 0, 0L, SEEK_SET))
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not rewind HashAgg temporary file: %m")));
-
oldContext = MemoryContextSwitchTo(aggstate->ss.ps.state->es_query_cxt);
- new_batch = hash_batch_new(file, setno, spill->ntuples[i],
- spill->partition_bits + input_bits);
+ LogicalTapeRewindForRead(spill->lts, spill->partitions[i], 0);
+ new_batch = hash_batch_new(spill->lts, spill->partitions[i], setno, spill->ntuples[i],
+ spill->partition_bits + input_bits);
aggstate->hash_batches = lappend(aggstate->hash_batches, new_batch);
aggstate->hash_batches_used++;
MemoryContextSwitchTo(oldContext);
}
+ if (!list_member_ptr(aggstate->lts_list, spill->lts))
+ aggstate->lts_list = lappend(aggstate->lts_list, spill->lts);
pfree(spill->ntuples);
pfree(spill->partitions);
}
@@ -2904,13 +2909,10 @@ hash_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno, int input_
static void
hash_reset_spill(HashAggSpill *spill)
{
- int i;
- for (i = 0; i < spill->n_partitions; i++)
+ if (spill->lts != NULL)
{
- BufFile *file = spill->partitions[i];
-
- if (file != NULL)
- BufFileClose(file);
+ LogicalTapeSetClose(spill->lts);
+ spill->lts = NULL;
}
if (spill->ntuples != NULL)
pfree(spill->ntuples);
@@ -2940,16 +2942,19 @@ hash_reset_spills(AggState *aggstate)
foreach(lc, aggstate->hash_batches)
{
HashAggBatch *batch = (HashAggBatch*) lfirst(lc);
- if (batch->input_file != NULL)
- {
- BufFileClose(batch->input_file);
- batch->input_file = NULL;
- }
hash_reset_spill(&batch->spill);
pfree(batch);
}
list_free(aggstate->hash_batches);
aggstate->hash_batches = NIL;
+
+ foreach(lc, aggstate->lts_list)
+ {
+ LogicalTapeSet *lts = (LogicalTapeSet *) lfirst(lc);
+ LogicalTapeSetClose(lts);
+ }
+ list_free(aggstate->lts_list);
+ aggstate->lts_list = NIL;
}
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 8985b9e095..677b992743 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -202,7 +202,7 @@ struct LogicalTapeSet
/* The array of logical tapes. */
int nTapes; /* # of logical tapes in set */
- LogicalTape tapes[FLEXIBLE_ARRAY_MEMBER]; /* has nTapes nentries */
+ LogicalTape *tapes; /* has nTapes nentries */
};
static void ltsWriteBlock(LogicalTapeSet *lts, long blocknum, void *buffer);
@@ -518,8 +518,8 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
* Create top-level struct including per-tape LogicalTape structs.
*/
Assert(ntapes > 0);
- lts = (LogicalTapeSet *) palloc(offsetof(LogicalTapeSet, tapes) +
- ntapes * sizeof(LogicalTape));
+ lts = (LogicalTapeSet *) palloc0(sizeof(LogicalTapeSet));
+ lts->tapes = (LogicalTape *)palloc0(ntapes * sizeof(LogicalTape));
lts->nBlocksAllocated = 0L;
lts->nBlocksWritten = 0L;
lts->nHoleBlocks = 0L;
@@ -577,6 +577,45 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
return lts;
}
+LogicalTapeSet *
+LogicalTapeSetExtend(LogicalTapeSet *lts, int ntoextend)
+{
+ LogicalTape *lt;
+ int i;
+
+ /*
+ * Create top-level struct including per-tape LogicalTape structs.
+ */
+ Assert(ntoextend > 0);
+ lts->tapes = (LogicalTape *) repalloc(lts->tapes, (lts->nTapes + ntoextend) * sizeof(LogicalTape));
+ lts->nTapes = lts->nTapes + ntoextend;
+
+ /*
+ * Initialize per-tape structs. Note we allocate the I/O buffer and the
+ * first block for a tape only when it is first actually written to. This
+ * avoids wasting memory space when we overestimate the number of tapes needed.
+ */
+ for (i = lts->nTapes - ntoextend; i < lts->nTapes; i++)
+ {
+ lt = <s->tapes[i];
+ lt->writing = true;
+ lt->frozen = false;
+ lt->dirty = false;
+ lt->firstBlockNumber = -1L;
+ lt->curBlockNumber = -1L;
+ lt->nextBlockNumber = -1L;
+ lt->offsetBlockNumber = 0L;
+ lt->buffer = NULL;
+ lt->buffer_size = 0;
+ /* palloc() larger than MaxAllocSize would fail */
+ lt->max_size = MaxAllocSize;
+ lt->pos = 0;
+ lt->nbytes = 0;
+ }
+
+ return lts;
+}
+
/*
* Close a logical tape set and release all resources.
*/
@@ -1083,3 +1122,9 @@ LogicalTapeSetBlocks(LogicalTapeSet *lts)
{
return lts->nBlocksAllocated - lts->nHoleBlocks;
}
+
+int
+LogicalTapeGetNTapes(LogicalTapeSet *lts)
+{
+ return lts->nTapes;
+}
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 8d4a36a353..d45473101c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2083,6 +2083,7 @@ typedef struct AggState
uint64 hash_disk_used; /* bytes of disk space used */
int hash_batches_used; /* batches used during entire execution */
List *hash_batches; /* hash batches remaining to be processed */
+ List *lts_list;
AggStatePerHash perhash; /* array of per-hashtable data */
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
diff --git a/src/include/utils/logtape.h b/src/include/utils/logtape.h
index 081b03880a..c2f5c72665 100644
--- a/src/include/utils/logtape.h
+++ b/src/include/utils/logtape.h
@@ -56,6 +56,7 @@ typedef struct TapeShare
extern LogicalTapeSet *LogicalTapeSetCreate(int ntapes, TapeShare *shared,
SharedFileSet *fileset, int worker);
+extern LogicalTapeSet * LogicalTapeSetExtend(LogicalTapeSet *lts, int ntoextend);
extern void LogicalTapeSetClose(LogicalTapeSet *lts);
extern void LogicalTapeSetForgetFreeSpace(LogicalTapeSet *lts);
extern size_t LogicalTapeRead(LogicalTapeSet *lts, int tapenum,
@@ -74,5 +75,6 @@ extern void LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
extern void LogicalTapeTell(LogicalTapeSet *lts, int tapenum,
long *blocknum, int *offset);
extern long LogicalTapeSetBlocks(LogicalTapeSet *lts);
+extern int LogicalTapeGetNTapes(LogicalTapeSet *lts);
#endif /* LOGTAPE_H */
On 28/12/2019 01:35, Jeff Davis wrote:
I've attached a new patch that adds some basic costing for disk during
hashagg.
This patch (hashagg-20191227.patch) doesn't compile:
nodeAgg.c:3379:7: error: ‘hashagg_mem_overflow’ undeclared (first use in
this function)
if (hashagg_mem_overflow)
^~~~~~~~~~~~~~~~~~~~
Looks like the new GUCs got lost somewhere between
hashagg-20191220.patch and hashagg-20191227.patch.
/*
* find_aggregated_cols
* Construct a bitmapset of the column numbers of aggregated Vars
* appearing in our targetlist and qual (HAVING clause)
*/
static Bitmapset *
find_aggregated_cols(AggState *aggstate)
{
Agg *node = (Agg *) aggstate->ss.ps.plan;
Bitmapset *colnos = NULL;
ListCell *temp;

/*
* We only want the columns used by aggregations in the targetlist or qual
*/
if (node->plan.targetlist != NULL)
{
foreach(temp, (List *) node->plan.targetlist)
{
if (IsA(lfirst(temp), TargetEntry))
{
Node *node = (Node *)((TargetEntry *)lfirst(temp))->expr;
if (IsA(node, Aggref) || IsA(node, GroupingFunc))
find_aggregated_cols_walker(node, &colnos);
}
}
}
This makes the assumption that all Aggrefs or GroupingFuncs are at the
top of the TargetEntry. That's not true, e.g.:
select 0+sum(a) from foo group by b;
I think find_aggregated_cols() and find_unaggregated_cols() should be
merged into one function that scans the targetlist once, and returns two
Bitmapsets. They're always used together, anyway.
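Something along these lines, perhaps (just a rough sketch; the struct and
function names here are made up for illustration, not taken from the patch):
```c
/*
 * Rough sketch of a single walk that collects both aggregated and
 * unaggregated column numbers at once.  Names are hypothetical.
 */
typedef struct FindColsContext
{
    bool       is_aggref;       /* currently inside an Aggref/GroupingFunc? */
    Bitmapset *aggregated;      /* column numbers referenced inside them */
    Bitmapset *unaggregated;    /* column numbers referenced elsewhere */
} FindColsContext;

static bool
find_cols_walker(Node *node, FindColsContext *context)
{
    if (node == NULL)
        return false;
    if (IsA(node, Var))
    {
        Var *var = (Var *) node;

        if (context->is_aggref)
            context->aggregated = bms_add_member(context->aggregated,
                                                 var->varattno);
        else
            context->unaggregated = bms_add_member(context->unaggregated,
                                                   var->varattno);
        return false;
    }
    if (IsA(node, Aggref) || IsA(node, GroupingFunc))
    {
        /* any Var below here counts as aggregated, however deeply nested */
        bool        save_is_aggref = context->is_aggref;

        context->is_aggref = true;
        expression_tree_walker(node, find_cols_walker, (void *) context);
        context->is_aggref = save_is_aggref;
        return false;
    }
    return expression_tree_walker(node, find_cols_walker, (void *) context);
}
```
The caller would walk the targetlist and the qual once with is_aggref set
to false and get both Bitmapsets back, which also handles cases like
0+sum(a).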
- Heikki
On Wed, 2020-01-08 at 12:38 +0200, Heikki Linnakangas wrote:
This makes the assumption that all Aggrefs or GroupingFuncs are at the
top of the TargetEntry. That's not true, e.g.:
select 0+sum(a) from foo group by b;
I think find_aggregated_cols() and find_unaggregated_cols() should be
merged into one function that scans the targetlist once, and returns two
Bitmapsets. They're always used together, anyway.
I cut the projection out for now, because there's some work in that
area in another thread[1]. If that work doesn't pan out, I can
reintroduce the projection logic to this one.
New patch attached.
It now uses logtape.c (thanks Adam for prototyping this work) instead
of buffile.c. This gives better control over the number of files and
the memory consumed for buffers, and reduces waste. It requires two
changes to logtape.c though:
* add API to extend the number of tapes (see the sketch after this list)
* lazily allocate buffers for reading (buffers for writing were
already allocated lazily) so that the total number of buffers needed at
any time is bounded
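As an illustration of the first change, using the LogicalTapeSetExtend()
and LogicalTapeGetNTapes() prototypes from Adam's earlier diff (the exact
signatures in the attached patch may differ), a respill can add tapes to
the existing set instead of opening new files:
```c
/*
 * Illustrative only: grow the shared LogicalTapeSet when a batch needs
 * to respill, assigning the newly-added tape numbers to this batch's
 * partitions.  Function name is hypothetical.
 */
static void
hash_spill_init_respill(HashAggSpill *spill, LogicalTapeSet *lts,
                        int npartitions)
{
    int first_tape = LogicalTapeGetNTapes(lts); /* existing tapes stay valid */
    int i;

    spill->lts = LogicalTapeSetExtend(lts, npartitions);
    spill->n_partitions = npartitions;
    spill->partitions = palloc0(sizeof(int) * npartitions);
    spill->ntuples = palloc0(sizeof(int64) * npartitions);
    for (i = 0; i < npartitions; i++)
        spill->partitions[i] = first_tape + i;  /* tape numbers for this batch */
}
```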
Unfortunately, I'm seeing some bad behavior (at least in some cases)
with logtape.c, where it's spending a lot of time qsorting the list of
free blocks. Adam, did you also see this during your perf tests? It
seems to be worst with lower work_mem settings and a large number of
input groups (perhaps there are just too many small tapes?).
It also has some pretty major refactoring that hopefully makes it
simpler to understand and reason about; I hope I didn't introduce too
many bugs or regressions.
A list of other changes:
* added test that involves rescan
* tweaked some details and tunables so that I think memory usage
tracking and reporting (EXPLAIN ANALYZE) is better, especially for
smaller work_mem
* simplified quite a few function signatures
Regards,
Jeff Davis
[1]: /messages/by-id/CAAKRu_Yj=Q_ZxiGX+pgstNWMbUJApEJX-imvAEwryCk5SLUebg@mail.gmail.com
Attachments:
hashagg-20200124.patch
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e07dc01e802..fde53579709 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1751,6 +1751,23 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-hashagg-mem-overflow" xreflabel="hashagg_mem_overflow">
+ <term><varname>hashagg_mem_overflow</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>hashagg_mem_overflow</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ If hash aggregation exceeds <varname>work_mem</varname> at query
+ execution time, and <varname>hashagg_mem_overflow</varname> is set
+ to <literal>on</literal>, continue consuming more memory rather than
+ performing disk-based hash aggregation. The default
+ is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
@@ -4471,6 +4488,24 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-hashagg-spill" xreflabel="enable_hashagg_spill">
+ <term><varname>enable_hashagg_spill</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_hashagg_spill</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the memory usage is expected to
+ exceed <varname>work_mem</varname>. This only affects the planner
+ choice; actual behavior at execution time is dictated by
+ <xref linkend="guc-hashagg-mem-overflow"/>. The default
+ is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
<term><varname>enable_hashjoin</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d189b8d573a..d3ce5511826 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -102,6 +102,7 @@ static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_hashagg_info(AggState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
static void show_instrumentation_count(const char *qlabel, int which,
@@ -1843,6 +1844,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Agg:
show_agg_keys(castNode(AggState, planstate), ancestors, es);
show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+ if (es->analyze)
+ show_hashagg_info((AggState *) planstate, es);
if (plan->qual)
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
@@ -2741,6 +2744,56 @@ show_hash_info(HashState *hashstate, ExplainState *es)
}
}
+/*
+ * If EXPLAIN ANALYZE, show information on hash aggregate memory usage and
+ * batches.
+ */
+static void
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+ long diskKb = (aggstate->hash_disk_used + 1023) / 1024;
+
+
+ Assert(IsA(aggstate, AggState));
+
+ if (agg->aggstrategy != AGG_HASHED &&
+ agg->aggstrategy != AGG_MIXED)
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(
+ es->str,
+ "Memory Usage: %ldkB",
+ memPeakKb);
+
+ if (aggstate->hash_batches_used > 0)
+ {
+ appendStringInfo(
+ es->str,
+ " Batches: %d Disk: %ldkB",
+ aggstate->hash_batches_used, diskKb);
+ }
+
+ appendStringInfo(
+ es->str,
+ "\n");
+ }
+ else
+ {
+ ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
+ if (aggstate->hash_batches_used > 0)
+ {
+ ExplainPropertyInteger("HashAgg Batches", NULL,
+ aggstate->hash_batches_used, es);
+ ExplainPropertyInteger("Disk Usage", "kB", diskKb, es);
+ }
+ }
+}
+
/*
* If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
*/
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 8619246c8e0..6f64a2abd2f 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -79,7 +79,8 @@ static void ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash);
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled);
/*
@@ -2927,7 +2928,7 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
*/
ExprState *
ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
- bool doSort, bool doHash)
+ bool doSort, bool doHash, bool spilled)
{
ExprState *state = makeNode(ExprState);
PlanState *parent = &aggstate->ss.ps;
@@ -3160,7 +3161,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (setno = 0; setno < processGroupingSets; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, false);
+ pertrans, transno, setno, setoff, false,
+ spilled);
setoff++;
}
}
@@ -3178,7 +3180,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (setno = 0; setno < numHashes; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, true);
+ pertrans, transno, setno, setoff, true,
+ spilled);
setoff++;
}
}
@@ -3226,7 +3229,8 @@ static void
ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash)
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled)
{
int adjust_init_jumpnull = -1;
int adjust_strict_jumpnull = -1;
@@ -3248,7 +3252,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
fcinfo->flinfo->fn_strict &&
pertrans->initValueIsNull)
{
- scratch->opcode = EEOP_AGG_INIT_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_INIT_TRANS_SPILLED : EEOP_AGG_INIT_TRANS;
scratch->d.agg_init_trans.aggstate = aggstate;
scratch->d.agg_init_trans.pertrans = pertrans;
scratch->d.agg_init_trans.setno = setno;
@@ -3265,7 +3270,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
if (pertrans->numSortCols == 0 &&
fcinfo->flinfo->fn_strict)
{
- scratch->opcode = EEOP_AGG_STRICT_TRANS_CHECK;
+ scratch->opcode = spilled ?
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED : EEOP_AGG_STRICT_TRANS_CHECK;
scratch->d.agg_strict_trans_check.aggstate = aggstate;
scratch->d.agg_strict_trans_check.setno = setno;
scratch->d.agg_strict_trans_check.setoff = setoff;
@@ -3283,9 +3289,11 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
/* invoke appropriate transition implementation */
if (pertrans->numSortCols == 0 && pertrans->transtypeByVal)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS_BYVAL;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED : EEOP_AGG_PLAIN_TRANS_BYVAL;
else if (pertrans->numSortCols == 0)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_SPILLED : EEOP_AGG_PLAIN_TRANS;
else if (pertrans->numInputs == 1)
scratch->opcode = EEOP_AGG_ORDERED_TRANS_DATUM;
else
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index f901baf1ed3..e03be8bb6b7 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -430,9 +430,13 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
&&CASE_EEOP_AGG_INIT_TRANS,
+ &&CASE_EEOP_AGG_INIT_TRANS_SPILLED,
&&CASE_EEOP_AGG_STRICT_TRANS_CHECK,
+ &&CASE_EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_SPILLED,
&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
&&CASE_EEOP_LAST
@@ -1625,6 +1629,36 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ aggstate = op->d.agg_init_trans.aggstate;
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_init_trans.setoff];
+
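+ /*
+ * A NULL entry in all_pergroups means that this tuple's group for this
+ * grouping set is not in memory: either the set is not part of the
+ * batch currently being processed, or the group has been spilled. In
+ * that case there is nothing to initialize or advance here.
+ */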
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
+
+ /* If transValue has not yet been initialized, do so now. */
+ if (pergroup->noTransValue)
+ {
+ AggStatePerTrans pertrans = op->d.agg_init_trans.pertrans;
+
+ aggstate->curaggcontext = op->d.agg_init_trans.aggcontext;
+ aggstate->current_set = op->d.agg_init_trans.setno;
+
+ ExecAggInitGroup(aggstate, pertrans, pergroup);
+
+ /* copied trans value from input, done this round */
+ EEO_JUMP(op->d.agg_init_trans.jumpnull);
+ }
+
+ EEO_NEXT();
+ }
/* check that a strict aggregate's input isn't NULL */
EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK)
@@ -1642,6 +1676,25 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ aggstate = op->d.agg_strict_trans_check.aggstate;
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_strict_trans_check.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_strict_trans_check.transno];
+
+ if (unlikely(pergroup->transValueIsNull))
+ EEO_JUMP(op->d.agg_strict_trans_check.jumpnull);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1691,6 +1744,52 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ aggstate = op->d.agg_trans.aggstate;
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ Assert(pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1735,6 +1834,67 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
newVal = FunctionCallInvoke(fcinfo);
+ /*
+ * For pass-by-ref datatype, must copy the new value into
+ * aggcontext and free the prior transValue. But if transfn
+ * returned a pointer to its first input, we don't need to do
+ * anything. Also, if transfn returned a pointer to a R/W
+ * expanded object that is already a child of the aggcontext,
+ * assume we can adopt that value without copying it.
+ */
+ if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+ newVal = ExecAggTransReparent(aggstate, pertrans,
+ newVal, fcinfo->isnull,
+ pergroup->transValue,
+ pergroup->transValueIsNull);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ aggstate = op->d.agg_trans.aggstate;
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ Assert(!pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
/*
* For pass-by-ref datatype, must copy the new value into
* aggcontext and free the prior transValue. But if transfn
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 3603c58b63e..b3907e06ed0 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -25,8 +25,9 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
-static uint32 TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple);
static int TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
+static TupleHashEntry LookupTupleHashEntry_internal(
+ TupleHashTable hashtable, TupleTableSlot *slot, bool *isnew, uint32 hash);
/*
* Define parameters for tuple hash table code generation. The interface is
@@ -284,6 +285,17 @@ ResetTupleHashTable(TupleHashTable hashtable)
tuplehash_reset(hashtable->hashtab);
}
+/*
+ * Destroy the hash table. Note that the tablecxt passed to
+ * BuildTupleHashTableExt() should also be reset, otherwise there will be
+ * leaks.
+ */
+void
+DestroyTupleHashTable(TupleHashTable hashtable)
+{
+ tuplehash_destroy(hashtable->hashtab);
+}
+
/*
* Find or create a hashtable entry for the tuple group containing the
* given tuple. The tuple must be the same type as the hashtable entries.
@@ -300,10 +312,9 @@ TupleHashEntry
LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
bool *isnew)
{
- TupleHashEntryData *entry;
- MemoryContext oldContext;
- bool found;
- MinimalTuple key;
+ TupleHashEntry entry;
+ MemoryContext oldContext;
+ uint32 hash;
/* Need to run the hash functions in short-lived context */
oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
@@ -313,32 +324,29 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
hashtable->cur_eq_func = hashtable->tab_eq_func;
- key = NULL; /* flag to reference inputslot */
+ hash = TupleHashTableHash(hashtable->hashtab, NULL);
+ entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
- if (isnew)
- {
- entry = tuplehash_insert(hashtable->hashtab, key, &found);
+ MemoryContextSwitchTo(oldContext);
- if (found)
- {
- /* found pre-existing entry */
- *isnew = false;
- }
- else
- {
- /* created new entry */
- *isnew = true;
- /* zero caller data */
- entry->additional = NULL;
- MemoryContextSwitchTo(hashtable->tablecxt);
- /* Copy the first tuple into the table context */
- entry->firstTuple = ExecCopySlotMinimalTuple(slot);
- }
- }
- else
- {
- entry = tuplehash_lookup(hashtable->hashtab, key);
- }
+ return entry;
+}
+
+/*
+ * A variant of LookupTupleHashEntry for callers that have already computed
+ * the hash value.
+ */
+TupleHashEntry
+LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot,
+ bool *isnew, uint32 hash)
+{
+ TupleHashEntry entry;
+ MemoryContext oldContext;
+
+ /* Need to run the hash functions in short-lived context */
+ oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
+
+ entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
MemoryContextSwitchTo(oldContext);
@@ -389,7 +397,7 @@ FindTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
* Also, the caller must select an appropriate memory context for running
* the hash functions. (dynahash.c doesn't change CurrentMemoryContext.)
*/
-static uint32
+uint32
TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
{
TupleHashTable hashtable = (TupleHashTable) tb->private_data;
@@ -450,6 +458,54 @@ TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
return murmurhash32(hashkey);
}
+/*
+ * Does the work of LookupTupleHashEntry and LookupTupleHashEntryHash. Useful
+ * so that we can avoid switching the memory context multiple times for
+ * LookupTupleHashEntry.
+ */
+static TupleHashEntry
+LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot,
+ bool *isnew, uint32 hash)
+{
+ TupleHashEntryData *entry;
+ bool found;
+ MinimalTuple key;
+
+ /* set up data needed by hash and match functions */
+ hashtable->inputslot = slot;
+ hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ hashtable->cur_eq_func = hashtable->tab_eq_func;
+
+ key = NULL; /* flag to reference inputslot */
+
+ if (isnew)
+ {
+ entry = tuplehash_insert_hash(hashtable->hashtab, key, hash, &found);
+
+ if (found)
+ {
+ /* found pre-existing entry */
+ *isnew = false;
+ }
+ else
+ {
+ /* created new entry */
+ *isnew = true;
+ /* zero caller data */
+ entry->additional = NULL;
+ MemoryContextSwitchTo(hashtable->tablecxt);
+ /* Copy the first tuple into the table context */
+ entry->firstTuple = ExecCopySlotMinimalTuple(slot);
+ }
+ }
+ else
+ {
+ entry = tuplehash_lookup_hash(hashtable->hashtab, key, hash);
+ }
+
+ return entry;
+}
+
/*
* See whether two tuples (presumably of the same hash value) match
*/
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 7b8cb91f04d..887e5b99e64 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -194,6 +194,24 @@
* transition values. hashcontext is the single context created to support
* all hash tables.
*
+ * Spilling To Disk
+ *
+ * When the hash table memory exceeds work_mem, we advance the transition
+ * states only for groups already in the hash table. For tuples that would
+ * need to create new hash table entries (and initialize new transition
+ * states), we spill them to disk to be processed later. The tuples are
+ * spilled in a partitioned manner, so that subsequent batches are smaller
+ * and less likely to exceed work_mem (if a batch does exceed work_mem, it
+ * must be spilled recursively).
+ *
+ * Spilled data is written to logical tapes. These provide better control
+ * over memory usage, disk space, and the number of files than if we were
+ * to use a BufFile for each spill.
+ *
+ * Note that it's possible for transition states to start small but then
+ * grow very large; for instance in the case of ARRAY_AGG. In such cases,
+ * it's still possible to significantly exceed work_mem.
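+ *
+ * For example, if the input contains ten million distinct keys but only one
+ * million groups fit within work_mem, the first million or so groups
+ * encountered stay in the hash table and continue to be advanced in place;
+ * tuples belonging to any other group are written out, tagged with their
+ * hash value, to a partition tape selected by the high bits of that hash.
+ * Once the input is exhausted, the in-memory groups are finalized and
+ * emitted, and each partition tape is processed as a new, smaller batch.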
+ *
* Transition / Combine function invocation:
*
* For performance reasons transition functions, including combine
@@ -233,12 +251,98 @@
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/datum.h"
+#include "utils/dynahash.h"
#include "utils/expandeddatum.h"
+#include "utils/logtape.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+/*
+ * Control how many partitions are created when spilling HashAgg to
+ * disk.
+ *
+ * HASHAGG_PARTITION_FACTOR is multiplied by the estimated number of
+ * partitions needed such that each partition will fit in memory. The factor
+ * is set higher than one because there's not a high cost to having a few too
+ * many partitions, and it makes it less likely that a partition will need to
+ * be spilled recursively. Another benefit of having more, smaller partitions
+ * is that small hash tables may perform better than large ones due to memory
+ * caching effects.
+ *
+ * We also specify a min and max number of partitions per spill. Too few might
+ * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
+ * many will result in lots of memory wasted buffering the spill files (and
+ * possibly pushing hidden costs to the OS for managing more files).
+ *
+ * For reading from tapes, the buffer size must be a multiple of
+ * BLCKSZ. Larger values help when reading from multiple tapes concurrently,
+ * but that doesn't happen in HashAgg, so we simply use BLCKSZ. Writing to a
+ * tape always uses a buffer of size BLCKSZ.
+ */
+#define HASHAGG_PARTITION_FACTOR 1.50
+#define HASHAGG_MIN_PARTITIONS 4
+#define HASHAGG_READ_BUFFER_SIZE BLCKSZ
+#define HASHAGG_WRITE_BUFFER_SIZE BLCKSZ
+
+/*
+ * Track all tapes needed for a HashAgg that spills. We don't know the maximum
+ * number of tapes needed at the start of the algorithm (because it can
+ * recurse), so one tape set is allocated and extended as needed for new
+ * tapes. When a particular tape has been fully read, it is rewound for
+ * writing and put on the free list.
+ *
+ * Tapes' buffers can take up substantial memory when many tapes are open at
+ * once. We only need one tape open at a time in read mode (using a buffer
+ * that's a multiple of BLCKSZ); but we need up to HASHAGG_MAX_PARTITIONS
+ * tapes open in write mode (each requiring a buffer of size BLCKSZ).
+ */
+typedef struct HashTapeInfo
+{
+ LogicalTapeSet *tapeset;
+ int ntapes;
+ int *freetapes;
+ int nfreetapes;
+} HashTapeInfo;
+
+/*
+ * Represents partitioned spill data for a single hashtable. Contains the
+ * necessary information to route tuples to the correct partition, and to
+ * transform the spilled data into new batches.
+ *
+ * The high bits are used for partition selection (when recursing, we ignore
+ * the bits that have already been used for partition selection at an earlier
+ * level).
+ */
+typedef struct HashAggSpill
+{
+ LogicalTapeSet *tapeset; /* borrowed reference to tape set */
+ int npartitions; /* number of partitions */
+ int *partitions; /* spill partition tape numbers */
+ int64 *ntuples; /* number of tuples in each partition */
+ uint32 mask; /* mask to find partition from hash value */
+ int shift; /* after masking, shift by this amount */
+} HashAggSpill;
+
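+/*
+ * To illustrate the partitioning scheme implemented by hashagg_spill_init()
+ * and hashagg_spill_tuple(): with used_bits = 0 and npartitions = 4, shift
+ * is 30 and mask is 0xC0000000, so a tuple goes to partition
+ * (hash & 0xC0000000) >> 30, i.e. it is routed by its top two hash bits. A
+ * recursive spill of one of those partitions then selects on the next bits
+ * down (used_bits = 2).
+ */
+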
+/*
+ * Represents work to be done for one pass of hash aggregation (with only one
+ * grouping set).
+ *
+ * Also tracks the bits of the hash already used for partition selection by
+ * earlier iterations, so that this batch can use new bits. If all bits have
+ * already been used, no partitioning will be done (any spilled data will go
+ * to a single output tape).
+ */
+typedef struct HashAggBatch
+{
+ int setno; /* grouping set */
+ int used_bits; /* number of bits of hash already used */
+ LogicalTapeSet *tapeset; /* borrowed reference to tape set */
+ int input_tapenum; /* input partition tape */
+ int64 input_tuples; /* number of tuples in this batch */
+} HashAggBatch;
+
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
static void initialize_phase(AggState *aggstate, int newphase);
static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -269,15 +373,53 @@ static void prepare_projection_slot(AggState *aggstate,
static void finalize_aggregates(AggState *aggstate,
AggStatePerAgg peragg,
AggStatePerGroup pergroup);
-static TupleTableSlot *project_aggregates(AggState *aggstate);
-static Bitmapset *find_unaggregated_cols(AggState *aggstate);
-static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
-static void build_hash_table(AggState *aggstate);
-static TupleHashEntryData *lookup_hash_entry(AggState *aggstate);
+static void build_hash_tables(AggState *aggstate);
+static void build_hash_table(AggState *aggstate, int setno,
+ long nbuckets);
+static void prepare_hash_slot(AggState *aggstate);
+static void hashagg_recompile_expressions(AggState *aggstate);
+static uint32 calculate_hash(AggState *aggstate);
+static long hash_choose_num_buckets(AggState *aggstate,
+ long estimated_nbuckets,
+ Size memory);
+static int hash_choose_num_partitions(uint64 input_groups,
+ double hashentrysize,
+ int used_bits,
+ int *log2_npartitions);
+static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
+static bool agg_refill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
+static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
+
+/* Hash Aggregation helpers */
+static TupleTableSlot *project_aggregates(AggState *aggstate);
+static Bitmapset *find_unaggregated_cols(AggState *aggstate);
+static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
+static void hashagg_set_limits(AggState *aggstate, uint64 input_groups,
+ int used_bits);
+static void hashagg_check_limits(AggState *aggstate);
+static void hashagg_finish_initial_spills(AggState *aggstate);
+static void hashagg_reset_spill_state(AggState *aggstate);
+
+/* Structure APIs */
+static HashAggBatch *hashagg_batch_new(LogicalTapeSet *tapeset,
+ int input_tapenum, int setno,
+ int64 input_tuples, int used_bits);
+static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
+static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
+ int used_bits, uint64 input_tuples,
+ double hashentrysize);
+static Size hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot,
+ uint32 hash);
+static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
+ int setno);
+static void hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *dest,
+ int ndest);
+static void hashagg_tapeinfo_release(HashTapeInfo *tapeinfo, int tapenum);
+
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
AggState *aggstate, EState *estate,
@@ -1235,7 +1377,7 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
}
/*
- * (Re-)initialize the hash table(s) to empty.
+ * Initialize the hash table(s).
*
* To implement hashed aggregation, we need a hashtable that stores a
* representative tuple and an array of AggStatePerGroup structs for each
@@ -1246,44 +1388,79 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
* We have a separate hashtable and associated perhash data structure for each
* grouping set for which we're doing hashing.
*
- * The contents of the hash tables always live in the hashcontext's per-tuple
- * memory context (there is only one of these for all tables together, since
- * they are all reset at the same time).
+ * The hash tables and their contents always live in the hashcontext's
+ * per-tuple memory context (there is only one of these for all tables
+ * together, since they are all reset at the same time).
*/
static void
-build_hash_table(AggState *aggstate)
+build_hash_tables(AggState *aggstate)
{
- MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
- Size additionalsize;
- int i;
-
- Assert(aggstate->aggstrategy == AGG_HASHED || aggstate->aggstrategy == AGG_MIXED);
-
- additionalsize = aggstate->numtrans * sizeof(AggStatePerGroupData);
+ int setno;
- for (i = 0; i < aggstate->num_hashes; ++i)
+ for (setno = 0; setno < aggstate->num_hashes; ++setno)
{
- AggStatePerHash perhash = &aggstate->perhash[i];
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ long nbuckets;
+ Size memory;
Assert(perhash->aggnode->numGroups > 0);
- if (perhash->hashtable)
- ResetTupleHashTable(perhash->hashtable);
- else
- perhash->hashtable = BuildTupleHashTableExt(&aggstate->ss.ps,
- perhash->hashslot->tts_tupleDescriptor,
- perhash->numCols,
- perhash->hashGrpColIdxHash,
- perhash->eqfuncoids,
- perhash->hashfunctions,
- perhash->aggnode->grpCollations,
- perhash->aggnode->numGroups,
- additionalsize,
- aggstate->ss.ps.state->es_query_cxt,
- aggstate->hashcontext->ecxt_per_tuple_memory,
- tmpmem,
- DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
+ memory = aggstate->hash_mem_limit / aggstate->num_hashes;
+
+ /* choose reasonable number of buckets per hashtable */
+ nbuckets = hash_choose_num_buckets(
+ aggstate, perhash->aggnode->numGroups, memory);
+
+ build_hash_table(aggstate, setno, nbuckets);
}
+
+ aggstate->hash_alloc_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_alloc_last = aggstate->hash_alloc_current;
+ aggstate->hash_ngroups_current = 0;
+}
+
+/*
+ * Build a single hashtable for this grouping set. Pass the hash memory
+ * context as both metacxt and tablecxt, so that resetting the hashcontext
+ * will free all memory including metadata. That means that we cannot reset
+ * the hash table to empty and reuse it, though (see execGrouping.c).
+ */
+static void
+build_hash_table(AggState *aggstate, int setno, long nbuckets)
+{
+ TupleHashTable table;
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ MemoryContext hashmem = aggstate->hashcontext->ecxt_per_tuple_memory;
+ MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
+ Size additionalsize;
+
+ Assert(aggstate->aggstrategy == AGG_HASHED ||
+ aggstate->aggstrategy == AGG_MIXED);
+
+ /*
+ * Used to make sure initial hash table allocation does not exceed
+ * work_mem. Note that the estimate does not include space for
+ * pass-by-reference transition data values, nor for the representative
+ * tuple of each group.
+ */
+ additionalsize = aggstate->numtrans * sizeof(AggStatePerGroupData);
+
+ table = BuildTupleHashTableExt(&aggstate->ss.ps,
+ perhash->hashslot->tts_tupleDescriptor,
+ perhash->numCols,
+ perhash->hashGrpColIdxHash,
+ perhash->eqfuncoids,
+ perhash->hashfunctions,
+ perhash->aggnode->grpCollations,
+ nbuckets,
+ additionalsize,
+ hashmem,
+ hashmem,
+ tmpmem,
+ DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
+
+ perhash->hashtable = table;
}
/*
@@ -1425,42 +1602,31 @@ find_hash_columns(AggState *aggstate)
/*
* Estimate per-hash-table-entry overhead for the planner.
- *
- * Note that the estimate does not include space for pass-by-reference
- * transition data values, nor for the representative tuple of each group.
- * Nor does this account of the target fill-factor and growth policy of the
- * hash table.
*/
Size
-hash_agg_entry_size(int numAggs)
+hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
{
- Size entrysize;
-
- /* This must match build_hash_table */
- entrysize = sizeof(TupleHashEntryData) +
- numAggs * sizeof(AggStatePerGroupData);
- entrysize = MAXALIGN(entrysize);
-
- return entrysize;
+ return
+ /* key */
+ MAXALIGN(SizeofMinimalTupleHeader) +
+ MAXALIGN(tupleWidth) +
+ /* data */
+ MAXALIGN(sizeof(TupleHashEntryData) +
+ numAggs * sizeof(AggStatePerGroupData)) +
+ transitionSpace;
}
/*
- * Find or create a hashtable entry for the tuple group containing the current
- * tuple (already set in tmpcontext's outertuple slot), in the current grouping
- * set (which the caller must have selected - note that initialize_aggregate
- * depends on this).
- *
- * When called, CurrentMemoryContext should be the per-query context.
+ * Extract the attributes that make up the grouping key into the
+ * hashslot. This is necessary to compute the hash of the grouping key.
*/
-static TupleHashEntryData *
-lookup_hash_entry(AggState *aggstate)
+static void
+prepare_hash_slot(AggState *aggstate)
{
- TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
- TupleTableSlot *hashslot = perhash->hashslot;
- TupleHashEntryData *entry;
- bool isnew;
- int i;
+ TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ int i;
/* transfer just the needed columns into hashslot */
slot_getsomeattrs(inputslot, perhash->largestGrpColIdx);
@@ -1474,14 +1640,315 @@ lookup_hash_entry(AggState *aggstate)
hashslot->tts_isnull[i] = inputslot->tts_isnull[varNumber];
}
ExecStoreVirtualTuple(hashslot);
+}
+
+/*
+ * Recompile the expressions for advancing aggregates while hashing. This is
+ * necessary for certain kinds of state changes that affect the resulting
+ * expression. For instance, changing aggstate->hash_ever_spilled or
+ * aggstate->ss.ps.outerops requires recompilation.
+ *
+ * A compiled expression where hash_ever_spilled is true will work even when
+ * hash_spill_mode is false, because it merely introduces additional branches
+ * that are unnecessary when hash_spill_mode is false. That allows us to only
+ * recompile when hash_ever_spilled changes, rather than every time
+ * hash_spill_mode changes.
+ */
+static void
+hashagg_recompile_expressions(AggState *aggstate)
+{
+ AggStatePerPhase phase;
+
+ Assert(aggstate->aggstrategy == AGG_HASHED ||
+ aggstate->aggstrategy == AGG_MIXED);
+
+ if (aggstate->aggstrategy == AGG_HASHED)
+ phase = &aggstate->phases[0];
+ else /* AGG_MIXED */
+ phase = &aggstate->phases[1];
+
+ phase->evaltrans = ExecBuildAggTrans(
+ aggstate, phase,
+ aggstate->aggstrategy == AGG_MIXED ? true : false, /* dosort */
+ true, /* dohash */
+ aggstate->hash_ever_spilled);
+}
+
+/*
+ * Calculate the hash value for a tuple. It's useful to do this outside of the
+ * hash table so that we can reuse saved hash values rather than recomputing.
+ */
+static uint32
+calculate_hash(AggState *aggstate)
+{
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleHashTable hashtable = perhash->hashtable;
+ MemoryContext oldContext;
+ uint32 hash;
+
+ /* set up data needed by hash and match functions */
+ hashtable->inputslot = perhash->hashslot;
+ hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ hashtable->cur_eq_func = hashtable->tab_eq_func;
+
+ /* Need to run the hash functions in short-lived context */
+ oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
+
+ hash = TupleHashTableHash(hashtable->hashtab, NULL);
+
+ MemoryContextSwitchTo(oldContext);
+
+ return hash;
+}
+
+/*
+ * Set limits that trigger spilling to avoid exceeding work_mem. Consider the
+ * number of partitions we expect to create (if we do spill).
+ *
+ * There are two limits: a memory limit, and also an ngroups limit. The
+ * ngroups limit becomes important when we expect transition values to grow
+ * substantially larger than the initial value.
+ */
+static void
+hashagg_set_limits(AggState *aggstate, uint64 input_groups, int used_bits)
+{
+ int npartitions;
+ Size partition_mem;
+
+ /* no attempt to obey work_mem */
+ if (hashagg_mem_overflow)
+ {
+ aggstate->hash_mem_limit = SIZE_MAX;
+ aggstate->hash_ngroups_limit = LONG_MAX;
+ return;
+ }
+
+ /* if not expected to spill, use all of work_mem */
+ if (input_groups * aggstate->hashentrysize < work_mem * 1024L)
+ {
+ aggstate->hash_mem_limit = work_mem * 1024L;
+ aggstate->hash_ngroups_limit =
+ aggstate->hash_mem_limit / aggstate->hashentrysize;
+ return;
+ }
+
+ /*
+ * Calculate expected memory requirements for spilling, which is the size
+ * of the buffers needed for all the tapes that need to be open at
+ * once. Then, subtract that from the memory available for holding hash
+ * tables.
+ */
+ npartitions = hash_choose_num_partitions(input_groups,
+ aggstate->hashentrysize,
+ used_bits,
+ NULL);
+ partition_mem =
+ HASHAGG_READ_BUFFER_SIZE +
+ HASHAGG_WRITE_BUFFER_SIZE * npartitions;
+
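+ /*
+ * For a sense of scale (assuming the default 8kB BLCKSZ): 32 write
+ * partitions need 8kB + 32 * 8kB = 264kB of buffer space, so with
+ * work_mem = 4MB the limit below becomes 4MB - 264kB.
+ */
+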
+ /*
+ * Don't set the limit below 3/4 of work_mem. In that case, we are at the
+ * minimum number of partitions, so we aren't going to dramatically exceed
+ * work mem anyway.
+ */
+ if (work_mem * 1024L > 4 * partition_mem)
+ aggstate->hash_mem_limit = work_mem * 1024L - partition_mem;
+ else
+ aggstate->hash_mem_limit = work_mem * 1024L * 0.75;
+
+ if (aggstate->hash_mem_limit > aggstate->hashentrysize)
+ aggstate->hash_ngroups_limit =
+ aggstate->hash_mem_limit / aggstate->hashentrysize;
+ else
+ aggstate->hash_ngroups_limit = 1;
+}
+
+/*
+ * hashagg_check_limits
+ *
+ * After adding a new group to the hash table, check whether we need to enter
+ * spill mode. Allocations may happen without adding new groups (for instance,
+ * if the transition state size grows), so this check is imperfect.
+ *
+ * Memory usage is tracked by how much is allocated to the underlying memory
+ * context, not individual chunks. This is more accurate because it accounts
+ * for all memory in the context, and also accounts for fragmentation and
+ * other forms of overhead and waste that can be difficult to estimate. It's
+ * also cheaper because we don't have to track each chunk.
+ *
+ * To measure usage, we check whether the context's total allocation has
+ * grown since the last time we observed it. If it has, we use the
+ * previously observed allocation, rather than the new total, as the
+ * current memory usage when deciding whether to enter spill mode, since
+ * the most recently allocated block may still be largely unused.
+ */
+static void
+hashagg_check_limits(AggState *aggstate)
+{
+ Size allocation;
+
+ /*
+ * Even if already in spill mode, it's possible for memory usage to grow,
+ * and we should still track it for the purposes of EXPLAIN ANALYZE.
+ */
+ allocation = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+
+ /* has allocation grown since the last observation? */
+ if (allocation > aggstate->hash_alloc_current)
+ {
+ aggstate->hash_alloc_last = aggstate->hash_alloc_current;
+ aggstate->hash_alloc_current = allocation;
+ }
+
+ if (aggstate->hash_alloc_last > aggstate->hash_mem_peak)
+ aggstate->hash_mem_peak = aggstate->hash_alloc_last;
+
+ /*
+ * Don't spill unless there's at least one group in the hash table so we
+ * can be sure to make progress even in edge cases.
+ */
+ if (aggstate->hash_ngroups_current > 0 &&
+ (aggstate->hash_alloc_last > aggstate->hash_mem_limit ||
+ aggstate->hash_ngroups_current > aggstate->hash_ngroups_limit))
+ {
+ aggstate->hash_spill_mode = true;
+
+ if (!aggstate->hash_ever_spilled)
+ {
+ aggstate->hash_ever_spilled = true;
+ aggstate->hash_spills = palloc0(
+ sizeof(HashAggSpill) * aggstate->num_hashes);
+ aggstate->hash_tapeinfo = palloc0(sizeof(HashTapeInfo));
+ hashagg_recompile_expressions(aggstate);
+ }
+ }
+}
+
+/*
+ * Choose a reasonable number of buckets for the initial hash table size.
+ */
+static long
+hash_choose_num_buckets(AggState *aggstate, long ngroups, Size memory)
+{
+ long max_nbuckets;
+ int log2_ngroups;
+ long nbuckets;
+
+ max_nbuckets = memory / aggstate->hashentrysize;
+
+ /*
+ * Leave room for slop to avoid a case where the initial hash table size
+ * exceeds the memory limit (though that may still happen in edge cases).
+ */
+ max_nbuckets *= 0.75;
+
+ /*
+ * Lowest power of two greater than ngroups, without exceeding
+ * max_nbuckets.
+ */
+ for (log2_ngroups = 1, nbuckets = 2;
+ nbuckets < ngroups && nbuckets < max_nbuckets;
+ log2_ngroups++, nbuckets <<= 1);
+
+ if (nbuckets > max_nbuckets && nbuckets > 2)
+ nbuckets >>= 1;
+
+ return nbuckets;
+}
+
+/*
+ * Determine the number of partitions to create when spilling, which will
+ * always be a power of two. If log2_npartitions is non-NULL, set
+ * *log2_npartitions to the log2() of the number of partitions.
+ */
+static int
+hash_choose_num_partitions(uint64 input_groups, double hashentrysize,
+ int used_bits, int *log2_npartitions)
+{
+ Size mem_wanted;
+ int partition_limit;
+ int npartitions;
+ int partition_bits;
+
+ /*
+ * Avoid creating so many partitions that the memory requirements of the
+ * open partition files (estimated at BLCKSZ for buffering) are greater
+ * than 1/4 of work_mem.
+ */
+ partition_limit =
+ (work_mem * 1024L * 0.25 - HASHAGG_READ_BUFFER_SIZE) /
+ HASHAGG_WRITE_BUFFER_SIZE;
+
+ /* pessimistically estimate that each input tuple creates a new group */
+ mem_wanted = HASHAGG_PARTITION_FACTOR * input_groups * hashentrysize;
+
+ /* make enough partitions so that each one is likely to fit in memory */
+ npartitions = 1 + (mem_wanted / (work_mem * 1024L));
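+
+ /*
+ * For example, with work_mem = 4MB, one million estimated groups at 64
+ * bytes per entry want roughly 96MB, giving 23 partitions here; the
+ * clamping and rounding below turn that into a power of two within the
+ * allowed range.
+ */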
+
+ if (npartitions > partition_limit)
+ npartitions = partition_limit;
+
+ if (npartitions < HASHAGG_MIN_PARTITIONS)
+ npartitions = HASHAGG_MIN_PARTITIONS;
+ if (npartitions > HASHAGG_MAX_PARTITIONS)
+ npartitions = HASHAGG_MAX_PARTITIONS;
+
+ /* ceil(log2(npartitions)) */
+ partition_bits = my_log2(npartitions);
+
+ /* make sure that we don't exhaust the hash bits */
+ if (partition_bits + used_bits >= 32)
+ partition_bits = 32 - used_bits;
+
+ if (log2_npartitions != NULL)
+ *log2_npartitions = partition_bits;
+
+ /* number of partitions will be a power of two */
+ npartitions = 1L << partition_bits;
+
+ return npartitions;
+}
+
+/*
+ * Find or create a hashtable entry for the tuple group containing the current
+ * tuple (already set in tmpcontext's outertuple slot), in the current grouping
+ * set (which the caller must have selected - note that initialize_aggregate
+ * depends on this).
+ *
+ * When called, CurrentMemoryContext should be the per-query context.
+ *
+ * If the hash table is at the memory limit, then only find existing hashtable
+ * entries; don't create new ones. If a tuple's group is not already present
+ * in the hash table for the current grouping set, return NULL and the caller
+ * will spill it to disk.
+ */
+static AggStatePerGroup
+lookup_hash_entry(AggState *aggstate, uint32 hash)
+{
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ TupleHashEntryData *entry;
+ bool isnew = false;
+ bool *p_isnew;
+
+ /* if hash table already spilled, don't create new entries */
+ p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;
/* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntry(perhash->hashtable, hashslot, &isnew);
+ entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, p_isnew,
+ hash);
+
+ if (entry == NULL)
+ return NULL;
if (isnew)
{
- AggStatePerGroup pergroup;
- int transno;
+ AggStatePerGroup pergroup;
+ int transno;
+
+ aggstate->hash_ngroups_current++;
+ hashagg_check_limits(aggstate);
pergroup = (AggStatePerGroup)
MemoryContextAlloc(perhash->hashtable->tablecxt,
@@ -1501,7 +1968,7 @@ lookup_hash_entry(AggState *aggstate)
}
}
- return entry;
+ return entry->additional;
}
/*
@@ -1509,18 +1976,49 @@ lookup_hash_entry(AggState *aggstate)
* returning an array of pergroup pointers suitable for advance_aggregates.
*
* Be aware that lookup_hash_entry can reset the tmpcontext.
+ *
+ * Some entries may be left NULL if we have reached the limit and have begun
+ * to spill. The same tuple belongs to a different group for each grouping
+ * set, so it may find its group in memory for one set but not for another.
+ * Once spilling has begun, a tuple that doesn't match an in-memory group
+ * for a particular set is spilled for that set.
+ *
+ * NB: It's possible to spill the same tuple for several different grouping
+ * sets. This may seem wasteful, but it's actually a trade-off: if we spill
+ * the tuple multiple times for multiple grouping sets, it can be partitioned
+ * for each grouping set, making the refilling of the hash table very
+ * efficient.
*/
static void
lookup_hash_entries(AggState *aggstate)
{
- int numHashes = aggstate->num_hashes;
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
int setno;
- for (setno = 0; setno < numHashes; setno++)
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
{
+ uint32 hash;
+
select_current_set(aggstate, setno, true);
- pergroup[setno] = lookup_hash_entry(aggstate)->additional;
+ prepare_hash_slot(aggstate);
+ hash = calculate_hash(aggstate);
+ pergroup[setno] = lookup_hash_entry(aggstate, hash);
+
+ /* check to see if we need to spill the tuple for this grouping set */
+ if (pergroup[setno] == NULL)
+ {
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ TupleTableSlot *slot = aggstate->tmpcontext->ecxt_outertuple;
+
+ if (spill->partitions == NULL)
+ hashagg_spill_init(spill, aggstate->hash_tapeinfo, 0,
+ perhash->aggnode->numGroups,
+ aggstate->hashentrysize);
+
+ aggstate->hash_disk_used += hashagg_spill_tuple(
+ spill, slot, hash);
+ }
}
}
@@ -1843,6 +2341,12 @@ agg_retrieve_direct(AggState *aggstate)
if (TupIsNull(outerslot))
{
/* no more outer-plan tuples available */
+
+ /* if we built hash tables, finalize any spills */
+ if (aggstate->aggstrategy == AGG_MIXED &&
+ aggstate->current_phase == 1)
+ hashagg_finish_initial_spills(aggstate);
+
if (hasGroupingSets)
{
aggstate->input_done = true;
@@ -1945,6 +2449,9 @@ agg_fill_hash_table(AggState *aggstate)
ResetExprContext(aggstate->tmpcontext);
}
+ /* finalize spills, if any */
+ hashagg_finish_initial_spills(aggstate);
+
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
@@ -1952,11 +2459,190 @@ agg_fill_hash_table(AggState *aggstate)
&aggstate->perhash[0].hashiter);
}
+/*
+ * If any data was spilled during hash aggregation, reset the hash table and
+ * reprocess one batch of spilled data. After reprocessing a batch, the hash
+ * table will again contain data, ready to be consumed by
+ * agg_retrieve_hash_table_in_memory().
+ *
+ * Should only be called after all in memory hash table entries have been
+ * consumed.
+ *
+ * Return false when input is exhausted and there's no more work to be done;
+ * otherwise return true.
+ */
+static bool
+agg_refill_hash_table(AggState *aggstate)
+{
+ HashAggBatch *batch;
+ HashAggSpill spill;
+ HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
+ LogicalTapeSet *tapeset;
+ long nbuckets;
+
+ if (aggstate->hash_batches == NIL)
+ return false;
+
+ tapeset = tapeinfo->tapeset;
+ spill.npartitions = 0;
+ spill.partitions = NULL;
+ /*
+ * Each spill file contains spilled data for only a single grouping
+ * set. We want to ignore all others, which is done by setting the other
+ * pergroups to NULL.
+ */
+ memset(aggstate->all_pergroups, 0,
+ sizeof(AggStatePerGroup) *
+ (aggstate->maxsets + aggstate->num_hashes));
+
+ batch = linitial(aggstate->hash_batches);
+ aggstate->hash_batches = list_delete_first(aggstate->hash_batches);
+
+ hashagg_set_limits(aggstate, batch->input_tuples, batch->used_bits);
+
+ /*
+ * Free memory and rebuild a single hash table for this batch's grouping
+ * set. Estimate the number of groups to be the number of input tuples in
+ * this batch.
+ */
+ ReScanExprContext(aggstate->hashcontext);
+
+ nbuckets = hash_choose_num_buckets(
+ aggstate, batch->input_tuples, aggstate->hash_mem_limit);
+ build_hash_table(aggstate, batch->setno, nbuckets);
+ aggstate->hash_alloc_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_alloc_last = aggstate->hash_alloc_current;
+ aggstate->hash_ngroups_current = 0;
+
+ Assert(aggstate->current_phase == 0);
+
+ if (aggstate->phase->aggstrategy == AGG_MIXED)
+ {
+ aggstate->current_phase = 1;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+ }
+
+ /*
+ * The first pass (agg_fill_hash_table) reads whatever kind of slot comes
+ * from the outer plan, and considers the slot fixed. But spilled tuples
+ * are always MinimalTuples, so if that's different from the outer plan we
+ * need to change it and recompile the aggregate expressions.
+ */
+ if (aggstate->ss.ps.outerops != &TTSOpsMinimalTuple)
+ {
+ aggstate->ss.ps.outerops = &TTSOpsMinimalTuple;
+ hashagg_recompile_expressions(aggstate);
+ }
+
+ LogicalTapeRewindForRead(tapeset, batch->input_tapenum, HASHAGG_READ_BUFFER_SIZE);
+ for (;;)
+ {
+ TupleTableSlot *slot = aggstate->hash_spill_slot;
+ MinimalTuple tuple;
+ uint32 hash;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tuple = hashagg_batch_read(batch, &hash);
+ if (tuple == NULL)
+ break;
+
+ ExecStoreMinimalTuple(tuple, slot, true);
+ aggstate->tmpcontext->ecxt_outertuple = slot;
+
+ select_current_set(aggstate, batch->setno, true);
+ prepare_hash_slot(aggstate);
+ aggstate->hash_pergroup[batch->setno] = lookup_hash_entry(aggstate, hash);
+
+ /* if there's no memory for a new group, spill */
+ if (aggstate->hash_pergroup[batch->setno] == NULL)
+ {
+ /*
+ * Estimate the number of groups for this batch as the total
+ * number of tuples in its input file. Although that's a worst
+ * case, it's not bad here for two reasons: (1) overestimating
+ * is better than underestimating; and (2) we've already
+ * scanned the relation once, so it's likely that we've
+ * already finalized many of the common values.
+ */
+ if (spill.partitions == NULL)
+ hashagg_spill_init(&spill, tapeinfo, batch->used_bits,
+ batch->input_tuples,
+ aggstate->hashentrysize);
+
+ aggstate->hash_disk_used += hashagg_spill_tuple(
+ &spill, slot, hash);
+ }
+
+ /* Advance the aggregates (or combine functions) */
+ advance_aggregates(aggstate);
+
+ /*
+ * Reset per-input-tuple context after each tuple, but note that the
+ * hash lookups do this too
+ */
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ hashagg_tapeinfo_release(tapeinfo, batch->input_tapenum);
+
+ aggstate->current_phase = 0;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+
+ /* update hashentrysize estimate based on contents */
+ if (aggstate->hash_ngroups_current > 0)
+ {
+ aggstate->hashentrysize = (double)aggstate->hash_alloc_last /
+ (double)aggstate->hash_ngroups_current;
+ }
+
+ hashagg_spill_finish(aggstate, &spill, batch->setno);
+ aggstate->hash_spill_mode = false;
+
+ pfree(batch);
+
+ /* Initialize to walk the first hash table */
+ select_current_set(aggstate, 0, true);
+ ResetTupleHashIterator(aggstate->perhash[0].hashtable,
+ &aggstate->perhash[0].hashiter);
+
+ return true;
+}
+
/*
* ExecAgg for hashed case: retrieving groups from hash table
+ *
+ * After exhausting in-memory tuples, also try refilling the hash table using
+ * previously-spilled tuples. Only returns NULL after all in-memory and
+ * spilled tuples are exhausted.
*/
static TupleTableSlot *
agg_retrieve_hash_table(AggState *aggstate)
+{
+ TupleTableSlot *result = NULL;
+
+ while (result == NULL)
+ {
+ result = agg_retrieve_hash_table_in_memory(aggstate);
+ if (result == NULL)
+ {
+ if (!agg_refill_hash_table(aggstate))
+ {
+ aggstate->agg_done = true;
+ break;
+ }
+ }
+ }
+
+ return result;
+}
+
+/*
+ * Retrieve the groups from the in-memory hash tables without considering any
+ * spilled tuples.
+ */
+static TupleTableSlot *
+agg_retrieve_hash_table_in_memory(AggState *aggstate)
{
ExprContext *econtext;
AggStatePerAgg peragg;
@@ -1985,7 +2671,7 @@ agg_retrieve_hash_table(AggState *aggstate)
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
- while (!aggstate->agg_done)
+ for (;;)
{
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -2016,8 +2702,6 @@ agg_retrieve_hash_table(AggState *aggstate)
}
else
{
- /* No more hashtables, so done */
- aggstate->agg_done = true;
return NULL;
}
}
@@ -2074,6 +2758,293 @@ agg_retrieve_hash_table(AggState *aggstate)
return NULL;
}
+/*
+ * Assign unused tapes to spill partitions, extending the tape set if
+ * necessary.
+ */
+static void
+hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *partitions,
+ int npartitions)
+{
+ int partidx = 0;
+
+ /* use free tapes if available */
+ while (partidx < npartitions && tapeinfo->nfreetapes > 0)
+ partitions[partidx++] = tapeinfo->freetapes[--tapeinfo->nfreetapes];
+
+ if (tapeinfo->tapeset == NULL)
+ tapeinfo->tapeset = LogicalTapeSetCreate(npartitions, NULL, NULL, -1);
+ else if (partidx < npartitions)
+ LogicalTapeSetExtend(tapeinfo->tapeset, npartitions - partidx);
+
+ while (partidx < npartitions)
+ partitions[partidx++] = tapeinfo->ntapes++;
+}
+
+/*
+ * After a tape has already been written to and then read, this function
+ * rewinds it for writing and adds it to the free list.
+ */
+static void
+hashagg_tapeinfo_release(HashTapeInfo *tapeinfo, int tapenum)
+{
+ LogicalTapeRewindForWrite(tapeinfo->tapeset, tapenum);
+ if (tapeinfo->freetapes == NULL)
+ tapeinfo->freetapes = palloc(sizeof(int));
+ else
+ tapeinfo->freetapes = repalloc(
+ tapeinfo->freetapes, sizeof(int) * (tapeinfo->nfreetapes + 1));
+ tapeinfo->freetapes[tapeinfo->nfreetapes++] = tapenum;
+}
+
+/*
+ * hashagg_spill_init
+ *
+ * Called after we determined that spilling is necessary. Chooses the number
+ * of partitions to create, and initializes them.
+ */
+static void
+hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo, int used_bits,
+ uint64 input_groups, double hashentrysize)
+{
+ int npartitions;
+ int partition_bits;
+
+ npartitions = hash_choose_num_partitions(
+ input_groups, hashentrysize, used_bits, &partition_bits);
+
+ spill->partitions = palloc0(sizeof(int) * npartitions);
+ spill->ntuples = palloc0(sizeof(int64) * npartitions);
+
+ hashagg_tapeinfo_assign(tapeinfo, spill->partitions, npartitions);
+
+ spill->tapeset = tapeinfo->tapeset;
+ spill->shift = 32 - used_bits - partition_bits;
+ spill->mask = (npartitions - 1) << spill->shift;
+ spill->npartitions = npartitions;
+}
+
+/*
+ * hashagg_spill_tuple
+ *
+ * No room for new groups in the hash table. Save for later in the appropriate
+ * partition.
+ */
+static Size
+hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot, uint32 hash)
+{
+ LogicalTapeSet *tapeset = spill->tapeset;
+ int partition;
+ MinimalTuple tuple;
+ int tapenum;
+ int total_written = 0;
+ bool shouldFree;
+
+ Assert(spill->partitions != NULL);
+
+ /* may contain unnecessary attributes, consider projecting? */
+ tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+ partition = (hash & spill->mask) >> spill->shift;
+ spill->ntuples[partition]++;
+
+ tapenum = spill->partitions[partition];
+
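+ /*
+ * On-tape format: the 32-bit hash value, followed by the MinimalTuple
+ * itself (whose first field is its length, t_len).
+ */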
+ LogicalTapeWrite(tapeset, tapenum, (void *) &hash, sizeof(uint32));
+ total_written += sizeof(uint32);
+
+ LogicalTapeWrite(tapeset, tapenum, (void *) tuple, tuple->t_len);
+ total_written += tuple->t_len;
+
+ if (shouldFree)
+ pfree(tuple);
+
+ return total_written;
+}
+
+/*
+ * hashagg_batch_read
+ * read the next tuple from a batch's tape. Return NULL if no more.
+ */
+static MinimalTuple
+hashagg_batch_read(HashAggBatch *batch, uint32 *hashp)
+{
+ LogicalTapeSet *tapeset = batch->tapeset;
+ int tapenum = batch->input_tapenum;
+ MinimalTuple tuple;
+ uint32 t_len;
+ size_t nread;
+ uint32 hash;
+
+ nread = LogicalTapeRead(tapeset, tapenum, &hash, sizeof(uint32));
+ if (nread == 0)
+ return NULL;
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, sizeof(uint32), nread)));
+ if (hashp != NULL)
+ *hashp = hash;
+
+ nread = LogicalTapeRead(tapeset, tapenum, &t_len, sizeof(t_len));
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, sizeof(uint32), nread)));
+
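+ /*
+ * The length word just read is the tuple's own t_len field (the first
+ * field of MinimalTupleData), so allocate the full tuple, set t_len,
+ * and then read the remaining bytes just past the length word.
+ */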
+ tuple = (MinimalTuple) palloc(t_len);
+ tuple->t_len = t_len;
+
+ nread = LogicalTapeRead(tapeset, tapenum,
+ (void *)((char *)tuple + sizeof(uint32)),
+ t_len - sizeof(uint32));
+ if (nread != t_len - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, t_len - sizeof(uint32), nread)));
+
+ return tuple;
+}
+
+/*
+ * hashagg_batch_new
+ *
+ * Construct a HashAggBatch item, which represents one iteration of HashAgg to
+ * be done. Should be called in the aggregate's memory context.
+ */
+static HashAggBatch *
+hashagg_batch_new(LogicalTapeSet *tapeset, int tapenum, int setno,
+ int64 input_tuples, int used_bits)
+{
+ HashAggBatch *batch = palloc0(sizeof(HashAggBatch));
+
+ batch->setno = setno;
+ batch->used_bits = used_bits;
+ batch->tapeset = tapeset;
+ batch->input_tapenum = tapenum;
+ batch->input_tuples = input_tuples;
+
+ return batch;
+}
+
+/*
+ * hashagg_finish_initial_spills
+ *
+ * After the initial pass over the input has completed, the hash tables may
+ * have spilled tuples to disk. If so, turn each spilled partition into a
+ * new batch that must later be executed.
+ */
+static void
+hashagg_finish_initial_spills(AggState *aggstate)
+{
+ int setno;
+
+ if (aggstate->hash_spills == NULL)
+ return;
+
+ /* update hashentrysize estimate based on contents */
+ Assert(aggstate->hash_ngroups_current > 0);
+ aggstate->hashentrysize = (double)aggstate->hash_alloc_last /
+ (double)aggstate->hash_ngroups_current;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hashagg_spill_finish(aggstate, &aggstate->hash_spills[setno], setno);
+
+ aggstate->hash_spill_mode = false;
+
+ /*
+ * We're not processing tuples from outer plan any more; only processing
+ * batches of spilled tuples. The initial spill structures are no longer
+ * needed.
+ */
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+}
+
+/*
+ * hashagg_spill_finish
+ *
+ * Transform spill partitions into new batches.
+ */
+static void
+hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno)
+{
+ int i;
+ int used_bits = 32 - spill->shift;
+
+ if (spill->npartitions == 0)
+ return; /* didn't spill */
+
+ for (i = 0; i < spill->npartitions; i++)
+ {
+ int tapenum = spill->partitions[i];
+ MemoryContext oldContext;
+ HashAggBatch *new_batch;
+
+ oldContext = MemoryContextSwitchTo(aggstate->ss.ps.state->es_query_cxt);
+ new_batch = hashagg_batch_new(aggstate->hash_tapeinfo->tapeset,
+ tapenum, setno, spill->ntuples[i],
+ used_bits);
+ aggstate->hash_batches = lcons(new_batch, aggstate->hash_batches);
+ aggstate->hash_batches_used++;
+ MemoryContextSwitchTo(oldContext);
+ }
+
+ pfree(spill->ntuples);
+ pfree(spill->partitions);
+}
+
+/*
+ * Free resources related to a spilled HashAgg.
+ */
+static void
+hashagg_reset_spill_state(AggState *aggstate)
+{
+ ListCell *lc;
+
+ /* free spills from initial pass */
+ if (aggstate->hash_spills != NULL)
+ {
+ int setno;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ {
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ if (spill->ntuples != NULL)
+ pfree(spill->ntuples);
+ if (spill->partitions != NULL)
+ pfree(spill->partitions);
+ }
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+ }
+
+ /* free batches */
+ foreach(lc, aggstate->hash_batches)
+ {
+ HashAggBatch *batch = (HashAggBatch*) lfirst(lc);
+ pfree(batch);
+ }
+ list_free(aggstate->hash_batches);
+ aggstate->hash_batches = NIL;
+
+ /* close tape set */
+ if (aggstate->hash_tapeinfo != NULL)
+ {
+ HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
+ if (tapeinfo->tapeset != NULL)
+ LogicalTapeSetClose(tapeinfo->tapeset);
+ if (tapeinfo->freetapes != NULL)
+ pfree(tapeinfo->freetapes);
+ pfree(tapeinfo);
+ aggstate->hash_tapeinfo = NULL;
+ }
+}
+
+
/* -----------------
* ExecInitAgg
*
@@ -2258,6 +3229,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->ss.ps.outeropsfixed = false;
}
+ if (use_hashing)
+ aggstate->hash_spill_slot = ExecInitExtraTupleSlot(estate, scanDesc,
+ &TTSOpsMinimalTuple);
+
/*
* Initialize result type, slot and projection.
*/
@@ -2483,11 +3458,22 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
*/
if (use_hashing)
{
+ Plan *outerplan = outerPlan(node);
+ long totalGroups = 0;
+ int i;
+
/* this is an array of pointers, not structures */
aggstate->hash_pergroup = pergroups;
+ aggstate->hashentrysize = hash_agg_entry_size(
+ aggstate->numtrans, outerplan->plan_width, node->transitionSpace);
+
+ for (i = 0; i < aggstate->num_hashes; i++)
+ totalGroups += aggstate->perhash[i].aggnode->numGroups;
+
+ hashagg_set_limits(aggstate, totalGroups, 0);
find_hash_columns(aggstate);
- build_hash_table(aggstate);
+ build_hash_tables(aggstate);
aggstate->table_filled = false;
}
@@ -2893,7 +3879,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
else
Assert(false);
- phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash);
+ phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash, false);
}
@@ -3388,6 +4374,8 @@ ExecEndAgg(AggState *node)
if (node->sort_out)
tuplesort_end(node->sort_out);
+ hashagg_reset_spill_state(node);
+
for (transno = 0; transno < node->numtrans; transno++)
{
AggStatePerTrans pertrans = &node->pertrans[transno];
@@ -3443,12 +4431,13 @@ ExecReScanAgg(AggState *node)
return;
/*
- * If we do have the hash table, and the subplan does not have any
- * parameter changes, and none of our own parameter changes affect
- * input expressions of the aggregated functions, then we can just
- * rescan the existing hash table; no need to build it again.
+ * If we do have the hash table, and it never spilled, and the subplan
+ * does not have any parameter changes, and none of our own parameter
+ * changes affect input expressions of the aggregated functions, then
+ * we can just rescan the existing hash table; no need to build it
+ * again.
*/
- if (outerPlan->chgParam == NULL &&
+ if (outerPlan->chgParam == NULL && !node->hash_ever_spilled &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
ResetTupleHashIterator(node->perhash[0].hashtable,
@@ -3505,11 +4494,29 @@ ExecReScanAgg(AggState *node)
*/
if (node->aggstrategy == AGG_HASHED || node->aggstrategy == AGG_MIXED)
{
+ hashagg_reset_spill_state(node);
+
+ node->hash_ever_spilled = false;
+ node->hash_spill_mode = false;
+ node->hash_alloc_last = 0;
+ node->hash_alloc_current = 0;
+ node->hash_ngroups_current = 0;
+
+ /* reset stats */
+ node->hash_mem_peak = 0;
+ node->hash_disk_used = 0;
+ node->hash_batches_used = 0;
+
ReScanExprContext(node->hashcontext);
/* Rebuild an empty hash table */
- build_hash_table(node);
+ build_hash_tables(node);
node->table_filled = false;
/* iterator will be reset when the table is filled */
+
+ node->ss.ps.outerops =
+ ExecGetResultSlotOps(outerPlanState(&node->ss),
+ &node->ss.ps.outeropsfixed);
+ hashagg_recompile_expressions(node);
}
if (node->aggstrategy != AGG_HASHED)
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 21a5ca4b404..6fa555ada88 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -2082,6 +2082,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_INIT_TRANS:
+ case EEOP_AGG_INIT_TRANS_SPILLED:
{
AggState *aggstate;
AggStatePerTrans pertrans;
@@ -2092,6 +2093,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_allpergroupsp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_setoff,
v_transno;
@@ -2119,11 +2121,32 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_init_trans.setoff);
v_transno = l_int32_const(op->d.agg_init_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_notransvalue = l_bb_before_v(
+ opblocks[i + 1], "op.%d.check_notransvalue", i);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(
+ b, v_pergroup_allaggs, TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[i + 1],
+ b_check_notransvalue);
+
+ LLVMPositionBuilderAtEnd(b, b_check_notransvalue);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_notransvalue =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_NOTRANSVALUE,
@@ -2180,6 +2203,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_STRICT_TRANS_CHECK:
+ case EEOP_AGG_STRICT_TRANS_CHECK_SPILLED:
{
AggState *aggstate;
LLVMValueRef v_setoff,
@@ -2190,6 +2214,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_transnull;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
int jumpnull = op->d.agg_strict_trans_check.jumpnull;
@@ -2209,11 +2234,32 @@ llvm_compile_expr(ExprState *state)
l_int32_const(op->d.agg_strict_trans_check.setoff);
v_transno =
l_int32_const(op->d.agg_strict_trans_check.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_transnull = l_bb_before_v(
+ opblocks[i + 1], "op.%d.check_transnull", i);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[jumpnull],
+ b_check_transnull);
+
+ LLVMPositionBuilderAtEnd(b, b_check_transnull);
+ }
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_transnull =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_TRANSVALUEISNULL,
@@ -2229,7 +2275,9 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_PLAIN_TRANS_BYVAL:
+ case EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED:
case EEOP_AGG_PLAIN_TRANS:
+ case EEOP_AGG_PLAIN_TRANS_SPILLED:
{
AggState *aggstate;
AggStatePerTrans pertrans;
@@ -2255,6 +2303,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_pertransp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_retval;
@@ -2282,10 +2331,33 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_trans.setoff);
v_transno = l_int32_const(op->d.agg_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED ||
+ opcode == EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_advance_transval = l_bb_before_v(
+ opblocks[i + 1], "op.%d.advance_transval", i);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[i + 1],
+ b_advance_transval);
+
+ LLVMPositionBuilderAtEnd(b, b_advance_transval);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_fcinfo = l_ptr_const(fcinfo,
l_ptr(StructFunctionCallInfoData));
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b5a0033721f..9575469800b 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -77,6 +77,7 @@
#include "access/htup_details.h"
#include "access/tsmapi.h"
#include "executor/executor.h"
+#include "executor/nodeAgg.h"
#include "executor/nodeHash.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -128,6 +129,7 @@ bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
bool enable_hashagg = true;
+bool enable_hashagg_spill = true;
bool enable_nestloop = true;
bool enable_material = true;
bool enable_mergejoin = true;
@@ -2153,7 +2155,7 @@ cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples)
+ double input_tuples, double input_width)
{
double output_tuples;
Cost startup_cost;
@@ -2219,20 +2221,69 @@ cost_agg(Path *path, PlannerInfo *root,
total_cost += aggcosts->finalCost.per_tuple * numGroups;
total_cost += cpu_tuple_cost * numGroups;
output_tuples = numGroups;
+
+ /*
+ * We don't need to compute the disk costs of hash aggregation here,
+ * because the planner does not choose hash aggregation for grouping
+ * sets that it doesn't expect to fit in memory.
+ */
}
else
{
+ double hashentrysize = hash_agg_entry_size(
+ aggcosts->numAggs, input_width, aggcosts->transitionSpace);
+ double nbatches =
+ (numGroups * hashentrysize) / (work_mem * 1024L);
+ double pages_written = 0.0;
+ double pages_read = 0.0;
+
/* must be AGG_HASHED */
startup_cost = input_total_cost;
if (!enable_hashagg)
startup_cost += disable_cost;
startup_cost += aggcosts->transCost.startup;
startup_cost += aggcosts->transCost.per_tuple * input_tuples;
+ /* cost of computing hash value */
startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;
startup_cost += aggcosts->finalCost.startup;
+
+ /*
+ * Add the disk costs of hash aggregation that spills to disk.
+ *
+ * Groups that go into the hash table stay in memory until finalized,
+ * so spilling and reprocessing tuples doesn't incur additional
+ * invocations of transCost or finalCost. Furthermore, the computed
+ * hash value is stored with the spilled tuples, so we don't incur
+ * extra invocations of the hash function.
+ *
+ * The disk cost depends on the depth of recursion; each level requires
+ * one additional write and then read of each tuple. Writes are random
+ * and reads are sequential, so we assume half of the I/O is random and
+ * half is sequential.
+ *
+ * Hash Agg begins returning tuples after the first batch is
+ * complete. Accrue writes (spilled tuples) to startup_cost and reads
+ * only to total_cost. This is not perfect; it penalizes startup_cost
+ * in the case of recursive spills. Also, transCost is entirely
+ * counted in startup_cost; but some of that cost could be counted
+ * only against total_cost.
+ */
+ if (!hashagg_mem_overflow && nbatches > 1.0)
+ {
+ double depth;
+ double pages;
+
+ pages = relation_byte_size(input_tuples, input_width) / BLCKSZ;
+ depth = ceil( log(nbatches - 1) / log(HASHAGG_MAX_PARTITIONS) );
+ pages_written = pages_read = pages * depth;
+ startup_cost += pages_written * random_page_cost;
+ }
+
total_cost = startup_cost;
total_cost += aggcosts->finalCost.per_tuple * numGroups;
+ /* cost of retrieving from hash table */
total_cost += cpu_tuple_cost * numGroups;
+ total_cost += pages_read * seq_page_cost;
output_tuples = numGroups;
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index dff826a8280..d2699dbc23c 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1644,6 +1644,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
NIL,
NIL,
best_path->path.rows,
+ 0,
subplan);
}
else
@@ -2096,6 +2097,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
NIL,
NIL,
best_path->numGroups,
+ best_path->transitionSpace,
subplan);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -2257,6 +2259,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
NIL,
rollup->numGroups,
+ best_path->transitionSpace,
sort_plan);
/*
@@ -2295,6 +2298,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
chain,
rollup->numGroups,
+ best_path->transitionSpace,
subplan);
/* Copy cost data from Path to Plan */
@@ -6194,8 +6198,8 @@ Agg *
make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree)
+ List *groupingSets, List *chain, double dNumGroups,
+ int32 transitionSpace, Plan *lefttree)
{
Agg *node = makeNode(Agg);
Plan *plan = &node->plan;
@@ -6211,6 +6215,7 @@ make_agg(List *tlist, List *qual,
node->grpOperators = grpOperators;
node->grpCollations = grpCollations;
node->numGroups = numGroups;
+ node->transitionSpace = transitionSpace;
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
node->groupingSets = groupingSets;
node->chain = chain;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index d6f21535937..913ad9335e5 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4867,13 +4867,8 @@ create_distinct_paths(PlannerInfo *root,
allow_hash = false; /* policy-based decision not to hash */
else
{
- Size hashentrysize;
-
- /* Estimate per-hash-entry space at tuple width... */
- hashentrysize = MAXALIGN(cheapest_input_path->pathtarget->width) +
- MAXALIGN(SizeofMinimalTupleHeader);
- /* plus the per-hash-entry overhead */
- hashentrysize += hash_agg_entry_size(0);
+ Size hashentrysize = hash_agg_entry_size(
+ 0, cheapest_input_path->pathtarget->width, 0);
/* Allow hashing only if hashtable is predicted to fit in work_mem */
allow_hash = (hashentrysize * numDistinctRows <= work_mem * 1024L);
@@ -6533,7 +6528,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
* were unable to sort above, then we'd better generate a Path, so
* that we at least have one.
*/
- if (hashaggtablesize < work_mem * 1024L ||
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L ||
grouped_rel->pathlist == NIL)
{
/*
@@ -6566,7 +6562,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
agg_final_costs,
dNumGroups);
- if (hashaggtablesize < work_mem * 1024L)
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6835,7 +6832,7 @@ create_partial_grouping_paths(PlannerInfo *root,
* Tentatively produce a partial HashAgg Path, depending on if it
* looks as if the hash table will fit in work_mem.
*/
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_total_path != NULL)
{
add_path(partially_grouped_rel, (Path *)
@@ -6862,7 +6859,7 @@ create_partial_grouping_paths(PlannerInfo *root,
dNumPartialPartialGroups);
/* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_partial_path != NULL)
{
add_partial_path(partially_grouped_rel, (Path *)
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 1a23e18970d..951aed80e7a 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1072,7 +1072,7 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
numGroupCols, dNumGroups,
NIL,
input_path->startup_cost, input_path->total_cost,
- input_path->rows);
+ input_path->rows, input_path->pathtarget->width);
/*
* Now for the sorted case. Note that the input is *always* unsorted,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e6d08aede56..8ba8122ee2f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1704,7 +1704,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
NIL,
subpath->startup_cost,
subpath->total_cost,
- rel->rows);
+ rel->rows,
+ subpath->pathtarget->width);
}
if (sjinfo->semi_can_btree && sjinfo->semi_can_hash)
@@ -2949,6 +2950,7 @@ create_agg_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->aggsplit = aggsplit;
pathnode->numGroups = numGroups;
+ pathnode->transitionSpace = aggcosts ? aggcosts->transitionSpace : 0;
pathnode->groupClause = groupClause;
pathnode->qual = qual;
@@ -2957,7 +2959,7 @@ create_agg_path(PlannerInfo *root,
list_length(groupClause), numGroups,
qual,
subpath->startup_cost, subpath->total_cost,
- subpath->rows);
+ subpath->rows, subpath->pathtarget->width);
/* add tlist eval cost for each output row */
pathnode->path.startup_cost += target->cost.startup;
@@ -3036,6 +3038,7 @@ create_groupingsets_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->rollups = rollups;
pathnode->qual = having_qual;
+ pathnode->transitionSpace = agg_costs ? agg_costs->transitionSpace : 0;
Assert(rollups != NIL);
Assert(aggstrategy != AGG_PLAIN || list_length(rollups) == 1);
@@ -3067,7 +3070,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
subpath->startup_cost,
subpath->total_cost,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
is_first = false;
if (!rollup->is_hashed)
is_first_sort = false;
@@ -3090,7 +3094,8 @@ create_groupingsets_path(PlannerInfo *root,
rollup->numGroups,
having_qual,
0.0, 0.0,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
if (!rollup->is_hashed)
is_first_sort = false;
}
@@ -3115,7 +3120,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
sort_path.startup_cost,
sort_path.total_cost,
- sort_path.rows);
+ sort_path.rows,
+ subpath->pathtarget->width);
}
pathnode->path.total_cost += agg_path.total_cost;
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 7c6f0574b37..0be26fe0378 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3526,16 +3526,8 @@ double
estimate_hashagg_tablesize(Path *path, const AggClauseCosts *agg_costs,
double dNumGroups)
{
- Size hashentrysize;
-
- /* Estimate per-hash-entry space at tuple width... */
- hashentrysize = MAXALIGN(path->pathtarget->width) +
- MAXALIGN(SizeofMinimalTupleHeader);
-
- /* plus space for pass-by-ref transition values... */
- hashentrysize += agg_costs->transitionSpace;
- /* plus the per-hash-entry overhead */
- hashentrysize += hash_agg_entry_size(agg_costs->numAggs);
+ Size hashentrysize = hash_agg_entry_size(
+ agg_costs->numAggs, path->pathtarget->width, agg_costs->transitionSpace);
/*
* Note that this disregards the effect of fill-factor and growth policy
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index b1f6291b99e..daaff08ceee 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -120,6 +120,7 @@ bool enableFsync = true;
bool allowSystemTableMods = false;
int work_mem = 1024;
int maintenance_work_mem = 16384;
+bool hashagg_mem_overflow = false;
int max_parallel_maintenance_workers = 2;
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 9f179a91295..73a052ab9d5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -963,6 +963,26 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_hashagg_spill", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of hashed aggregation plans that are expected to exceed work_mem."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_hashagg_spill,
+ true,
+ NULL, NULL, NULL
+ },
+ {
+ {"hashagg_mem_overflow", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables hashed aggregation to overflow work_mem at execution time."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &hashagg_mem_overflow,
+ false,
+ NULL, NULL, NULL
+ },
{
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of materialization."),
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 42cfb1f9f98..5a12ba623c6 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -202,7 +202,7 @@ struct LogicalTapeSet
/* The array of logical tapes. */
int nTapes; /* # of logical tapes in set */
- LogicalTape tapes[FLEXIBLE_ARRAY_MEMBER]; /* has nTapes nentries */
+ LogicalTape *tapes; /* has nTapes nentries */
};
static void ltsWriteBlock(LogicalTapeSet *lts, long blocknum, void *buffer);
@@ -211,6 +211,7 @@ static long ltsGetFreeBlock(LogicalTapeSet *lts);
static void ltsReleaseBlock(LogicalTapeSet *lts, long blocknum);
static void ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
SharedFileSet *fileset);
+static void ltsInitTape(LogicalTape *lt);
/*
@@ -486,6 +487,30 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
lts->nHoleBlocks = lts->nBlocksAllocated - nphysicalblocks;
}
+/*
+ * Initialize per-tape struct. Note we allocate the I/O buffer and the first
+ * block for a tape only when it is first actually written to. This avoids
+ * wasting memory space when tuplesort.c overestimates the number of tapes
+ * needed.
+ */
+static void
+ltsInitTape(LogicalTape *lt)
+{
+ lt->writing = true;
+ lt->frozen = false;
+ lt->dirty = false;
+ lt->firstBlockNumber = -1L;
+ lt->curBlockNumber = -1L;
+ lt->nextBlockNumber = -1L;
+ lt->offsetBlockNumber = 0L;
+ lt->buffer = NULL;
+ lt->buffer_size = 0;
+ /* palloc() larger than MaxAllocSize would fail */
+ lt->max_size = MaxAllocSize;
+ lt->pos = 0;
+ lt->nbytes = 0;
+}
+
/*
* Create a set of logical tapes in a temporary underlying file.
*
@@ -511,15 +536,13 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
int worker)
{
LogicalTapeSet *lts;
- LogicalTape *lt;
int i;
/*
* Create top-level struct including per-tape LogicalTape structs.
*/
Assert(ntapes > 0);
- lts = (LogicalTapeSet *) palloc(offsetof(LogicalTapeSet, tapes) +
- ntapes * sizeof(LogicalTape));
+ lts = (LogicalTapeSet *) palloc(sizeof(LogicalTapeSet));
lts->nBlocksAllocated = 0L;
lts->nBlocksWritten = 0L;
lts->nHoleBlocks = 0L;
@@ -529,30 +552,10 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
lts->freeBlocks = (long *) palloc(lts->freeBlocksLen * sizeof(long));
lts->nFreeBlocks = 0;
lts->nTapes = ntapes;
+ lts->tapes = (LogicalTape *) palloc(ntapes * sizeof(LogicalTape));
- /*
- * Initialize per-tape structs. Note we allocate the I/O buffer and the
- * first block for a tape only when it is first actually written to. This
- * avoids wasting memory space when tuplesort.c overestimates the number
- * of tapes needed.
- */
for (i = 0; i < ntapes; i++)
- {
- lt = <s->tapes[i];
- lt->writing = true;
- lt->frozen = false;
- lt->dirty = false;
- lt->firstBlockNumber = -1L;
- lt->curBlockNumber = -1L;
- lt->nextBlockNumber = -1L;
- lt->offsetBlockNumber = 0L;
- lt->buffer = NULL;
- lt->buffer_size = 0;
- /* palloc() larger than MaxAllocSize would fail */
- lt->max_size = MaxAllocSize;
- lt->pos = 0;
- lt->nbytes = 0;
- }
+ ltsInitTape(<s->tapes[i]);
/*
* Create temp BufFile storage as required.
@@ -773,15 +776,12 @@ LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum, size_t buffer_size)
lt->buffer_size = 0;
if (lt->firstBlockNumber != -1L)
{
- lt->buffer = palloc(buffer_size);
+ /*
+ * The buffer is lazily allocated in LogicalTapeRead(), but we set the
+ * size here.
+ */
lt->buffer_size = buffer_size;
}
-
- /* Read the first block, or reset if tape is empty */
- lt->nextBlockNumber = lt->firstBlockNumber;
- lt->pos = 0;
- lt->nbytes = 0;
- ltsReadFillBuffer(lts, lt);
}
/*
@@ -830,6 +830,22 @@ LogicalTapeRead(LogicalTapeSet *lts, int tapenum,
lt = <s->tapes[tapenum];
Assert(!lt->writing);
+ if (lt->buffer == NULL)
+ {
+ /* lazily allocate buffer */
+ if (lt->firstBlockNumber != -1L)
+ {
+ Assert(lt->buffer_size > 0);
+ lt->buffer = palloc(lt->buffer_size);
+ }
+
+ /* Read the first block, or reset if tape is empty */
+ lt->nextBlockNumber = lt->firstBlockNumber;
+ lt->pos = 0;
+ lt->nbytes = 0;
+ ltsReadFillBuffer(lts, lt);
+ }
+
while (size > 0)
{
if (lt->pos >= lt->nbytes)
@@ -943,6 +959,25 @@ LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum, TapeShare *share)
}
}
+/*
+ * Add additional tapes to this tape set.
+ */
+LogicalTapeSet *
+LogicalTapeSetExtend(LogicalTapeSet *lts, int ntoextend)
+{
+ int i;
+
+ Assert(ntoextend > 0);
+ lts->tapes = (LogicalTape *) repalloc(
+ lts->tapes, (lts->nTapes + ntoextend) * sizeof(LogicalTape));
+ lts->nTapes = lts->nTapes + ntoextend;
+
+ for (i = lts->nTapes - ntoextend; i < lts->nTapes; i++)
+ ltsInitTape(<s->tapes[i]);
+
+ return lts;
+}
+
/*
* Backspace the tape a given number of bytes. (We also support a more
* general seek interface, see below.)
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 7112558363f..2365f5bdafb 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -226,9 +226,13 @@ typedef enum ExprEvalOp
EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
EEOP_AGG_INIT_TRANS,
+ EEOP_AGG_INIT_TRANS_SPILLED,
EEOP_AGG_STRICT_TRANS_CHECK,
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
EEOP_AGG_PLAIN_TRANS_BYVAL,
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
EEOP_AGG_PLAIN_TRANS,
+ EEOP_AGG_PLAIN_TRANS_SPILLED,
EEOP_AGG_ORDERED_TRANS_DATUM,
EEOP_AGG_ORDERED_TRANS_TUPLE,
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 6ef3e1fe069..e21138b5f7c 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -140,11 +140,17 @@ extern TupleHashTable BuildTupleHashTableExt(PlanState *parent,
extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
bool *isnew);
+extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable,
+ TupleTableSlot *slot,
+ bool *isnew, uint32 hash);
extern TupleHashEntry FindTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
ExprState *eqcomp,
FmgrInfo *hashfunctions);
+extern uint32 TupleHashTableHash(struct tuplehash_hash *tb,
+ const MinimalTuple tuple);
extern void ResetTupleHashTable(TupleHashTable hashtable);
+extern void DestroyTupleHashTable(TupleHashTable hashtable);
/*
* prototypes from functions in execJunk.c
@@ -250,7 +256,7 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
- bool doSort, bool doHash);
+ bool doSort, bool doHash, bool spilled);
extern ExprState *ExecBuildGroupingEqual(TupleDesc ldesc, TupleDesc rdesc,
const TupleTableSlotOps *lops, const TupleTableSlotOps *rops,
int numCols,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 2fe82da6ff7..ae9fe05abc6 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -304,11 +304,13 @@ typedef struct AggStatePerHashData
Agg *aggnode; /* original Agg node, for numGroups etc. */
} AggStatePerHashData;
+#define HASHAGG_MAX_PARTITIONS 256
extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags);
extern void ExecEndAgg(AggState *node);
extern void ExecReScanAgg(AggState *node);
-extern Size hash_agg_entry_size(int numAggs);
+extern Size hash_agg_entry_size(int numAggs, Size tupleWidth,
+ Size transitionSpace);
#endif /* NODEAGG_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 62d64aa0a14..288764929ce 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -244,6 +244,7 @@ extern bool enableFsync;
extern PGDLLIMPORT bool allowSystemTableMods;
extern PGDLLIMPORT int work_mem;
extern PGDLLIMPORT int maintenance_work_mem;
+extern PGDLLIMPORT bool hashagg_mem_overflow;
extern PGDLLIMPORT int max_parallel_maintenance_workers;
extern int VacuumCostPageHit;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1f6f5bbc207..6288b5f38a9 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2074,13 +2074,32 @@ typedef struct AggState
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
/* these fields are used in AGG_HASHED and AGG_MIXED modes: */
bool table_filled; /* hash table filled yet? */
- int num_hashes;
+ int num_hashes; /* number of hash tables active at once */
+ double hashentrysize; /* estimate revised during execution */
+ struct HashTapeInfo *hash_tapeinfo; /* metadata for spill tapes */
+ struct HashAggSpill *hash_spills; /* HashAggSpill for each hash table,
+ exists only during first pass if spilled */
+ TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
+ bool hash_ever_spilled; /* ever spilled during this execution? */
+ bool hash_spill_mode; /* we hit a limit during the current batch
+ and we must not create new groups */
+ Size hash_alloc_last; /* previous total memory allocation */
+ Size hash_alloc_current; /* current total memory allocation */
+ Size hash_mem_limit; /* limit before spilling hash table */
+ Size hash_mem_peak; /* peak hash table memory usage */
+ long hash_ngroups_current; /* number of groups currently in
+ memory in all hash tables */
+ long hash_ngroups_limit; /* limit before spilling hash table */
+ uint64 hash_disk_used; /* bytes of disk space used */
+ int hash_batches_used; /* batches used during entire execution */
+ List *hash_batches; /* hash batches remaining to be processed */
+
AggStatePerHash perhash; /* array of per-hashtable data */
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 34
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 49
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 3d3be197e0e..be592d0fee4 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1663,6 +1663,7 @@ typedef struct AggPath
AggStrategy aggstrategy; /* basic strategy, see nodes.h */
AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
double numGroups; /* estimated number of groups in input */
+ int32 transitionSpace; /* estimated transition state size */
List *groupClause; /* a list of SortGroupClause's */
List *qual; /* quals (HAVING quals), if any */
} AggPath;
@@ -1700,6 +1701,7 @@ typedef struct GroupingSetsPath
AggStrategy aggstrategy; /* basic strategy */
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
+ int32 transitionSpace; /* estimated transition state size */
} GroupingSetsPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 32c0d87f80e..f4183e1efa5 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -813,6 +813,7 @@ typedef struct Agg
Oid *grpOperators; /* equality operators to compare with */
Oid *grpCollations;
long numGroups; /* estimated number of groups in input */
+ int32 transitionSpace; /* estimated transition state size */
Bitmapset *aggParams; /* IDs of Params used in Aggref inputs */
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
List *groupingSets; /* grouping sets to use */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index cb012ba1980..6572dc24699 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -54,6 +54,7 @@ extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
extern PGDLLIMPORT bool enable_hashagg;
+extern PGDLLIMPORT bool enable_hashagg_spill;
extern PGDLLIMPORT bool enable_nestloop;
extern PGDLLIMPORT bool enable_material;
extern PGDLLIMPORT bool enable_mergejoin;
@@ -114,7 +115,7 @@ extern void cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples);
+ double input_tuples, double input_width);
extern void cost_windowagg(Path *path, PlannerInfo *root,
List *windowFuncs, int numPartCols, int numOrderCols,
Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index eab486a6214..c7bda2b0917 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -54,8 +54,8 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree);
+ List *groupingSets, List *chain, double dNumGroups,
+ int32 transitionSpace, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
/*
diff --git a/src/include/utils/logtape.h b/src/include/utils/logtape.h
index 695d2c00ee4..3467b52c7f7 100644
--- a/src/include/utils/logtape.h
+++ b/src/include/utils/logtape.h
@@ -67,6 +67,8 @@ extern void LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum,
extern void LogicalTapeRewindForWrite(LogicalTapeSet *lts, int tapenum);
extern void LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum,
TapeShare *share);
+extern LogicalTapeSet *LogicalTapeSetExtend(LogicalTapeSet *lts,
+ int ntoextend);
extern size_t LogicalTapeBackspace(LogicalTapeSet *lts, int tapenum,
size_t size);
extern void LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index f457b5b150f..7eeeaaa5e4a 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2357,3 +2357,124 @@ explain (costs off)
-> Seq Scan on onek
(8 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+set work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------------
+ GroupAggregate
+ Group Key: ((g % 100000))
+ -> Sort
+ Sort Key: ((g % 100000))
+ -> Function Scan on generate_series g
+(5 rows)
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+create table agg_group_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+set jit_above_cost to default;
+create table agg_group_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_group_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+-- Produce results with hash aggregation
+set enable_hashagg = true;
+set enable_sort = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 100000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+create table agg_hash_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+set jit_above_cost to default;
+create table agg_hash_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_hash_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+set enable_sort = true;
+set work_mem to default;
+-- Compare group aggregation results to hash aggregation results
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+ a | c1 | c2 | c3
+---+----+----+----
+(0 rows)
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_4 except select * from agg_group_4)
+ union all
+(select * from agg_group_4 except select * from agg_hash_4);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_group_4;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
+drop table agg_hash_4;
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index c1f802c88a7..767f60a96c7 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1633,4 +1633,127 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
| 1 | 2
(4 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+-- Produce results with hash aggregation.
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+set enable_sort = true;
+set work_mem to default;
+-- Compare results
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+ g100 | g10 | unnest | c | m
+------+-----+--------+---+---
+(0 rows)
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
-- end
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1de..11c6f50fbfa 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -148,6 +148,68 @@ SELECT count(*) FROM
4
(1 row)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+SET enable_hashagg=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------------
+ Unique
+ -> Sort
+ Sort Key: ((g % 1000))
+ -> Function Scan on generate_series g
+(4 rows)
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_hashagg=TRUE;
+-- Produce results with hash aggregation.
+SET enable_sort=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 1000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_sort=TRUE;
+SET work_mem TO DEFAULT;
+-- Compare results
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+(SELECT * FROM distinct_hash_2 EXCEPT SELECT * FROM distinct_group_2)
+ UNION ALL
+(SELECT * FROM distinct_group_2 EXCEPT SELECT * FROM distinct_hash_2);
+ text
+------
+(0 rows)
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb9057..c40bf6c16eb 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -75,6 +75,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
+ enable_hashagg_spill | on
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 3e593f2d615..a4d476c4bb3 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1032,3 +1032,119 @@ select v||'a', case when v||'a' = 'aa' then 1 else 0 end, count(*)
explain (costs off)
select 1 from tenk1
where (hundred, thousand) in (select twothousand, twothousand from onek);
+
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+set work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+
+set jit_above_cost to default;
+
+create table agg_group_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_group_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+-- Produce results with hash aggregation
+
+set enable_hashagg = true;
+set enable_sort = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+
+set jit_above_cost to default;
+
+create table agg_hash_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_hash_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare group aggregation results to hash aggregation results
+
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+
+(select * from agg_hash_4 except select * from agg_group_4)
+ union all
+(select * from agg_group_4 except select * from agg_hash_4);
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_group_4;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
+drop table agg_hash_4;
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 95ac3fb52f6..bf8bce6ed31 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -441,4 +441,103 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
from unnest(array[1,1], array['a','b']) u(i,v)
group by rollup(i, v||'a') order by 1,3;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+-- Produce results with hash aggregation.
+
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare results
+
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
+
-- end
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449e..33102744ebf 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -45,6 +45,68 @@ SELECT count(*) FROM
SELECT count(*) FROM
(SELECT DISTINCT two, four, two FROM tenk1) ss;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+SET enable_hashagg=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_hashagg=TRUE;
+
+-- Produce results with hash aggregation.
+
+SET enable_sort=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_sort=TRUE;
+
+SET work_mem TO DEFAULT;
+
+-- Compare results
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+(SELECT * FROM distinct_hash_2 EXCEPT SELECT * FROM distinct_group_2)
+ UNION ALL
+(SELECT * FROM distinct_group_2 EXCEPT SELECT * FROM distinct_hash_2);
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
+
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
On Fri, Jan 24, 2020 at 5:01 PM Jeff Davis <pgsql@j-davis.com> wrote:
Unfortunately, I'm seeing some bad behavior (at least in some cases)
with logtape.c, where it's spending a lot of time qsorting the list of
free blocks. Adam, did you also see this during your perf tests? It
seems to be worst with lower work_mem settings and a large number of
input groups (perhaps there are just too many small tapes?).
That sounds weird. Might be pathological in some sense.
I have a wild guess for you. Maybe this has something to do with the
"test for presorted input" added by commit a3f0b3d68f9. That can
perform very badly when the input is almost sorted, but has a few
tuples that are out of order towards the end. (I have called these
"banana skin tuples" in the past.)
--
Peter Geoghegan
On Fri, 2020-01-24 at 17:16 -0800, Peter Geoghegan wrote:
That sounds weird. Might be pathological in some sense.
I have a wild guess for you. Maybe this has something to do with the
"test for presorted input" added by commit a3f0b3d68f9. That can
perform very badly when the input is almost sorted, but has a few
tuples that are out of order towards the end. (I have called these
"banana skin tuples" in the past.)
My simple test case is: 'explain analyze select i from big group by
i;', where "big" has 20M tuples.
I tried without that change and it helped (brought the time from 55s to
45s). But if I completely remove the sorting of the freelist, it goes
down to 12s. So it's something about the access pattern.
After digging a bit more, I see that, for Sort, the LogicalTapeSet's
freelist hovers around 300 entries and doesn't grow larger than that.
For HashAgg, it gets up to almost 60K. The pattern in HashAgg is that
the space required is at a maximum after the first spill, and after
that point the used space declines with each batch (because the groups
that fit in the hash table were finalized and emitted, and only the
ones that didn't fit were written out). As the amount of required space
declines, the size of the freelist grows.
That leaves a few options:
1) Cap the size of the LogicalTapeSet's freelist. If the freelist is
growing large, that's probably because it will never actually be used.
I'm not quite sure how to pick the cap though, and it seems a bit hacky
to just leak the freed space.
2) Use a different structure more capable of handling a large fraction
of free space. A compressed bitmap might make sense, but that seems
like overkill to waste effort tracking a lot of space that is unlikely
to ever be used.
3) Don't bother tracking free space for HashAgg at all. There's already
an API for that so I don't need to further hack logtape.c.
4) Try to be clever and shrink the file (or at least the tracked
portion of the file) if the freed blocks are at the end. This wouldn't
be very useful in the first recursive level, but the problem is worst
for the later levels anyway. Unfortunately, I think this requires a
breadth-first strategy to make sure that blocks at the end get freed.
If I also change the recursion to breadth-first, this approach does
yield a significant speedup.
I am leaning toward #1 or #3.
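For concreteness, a rough sketch of what #1 might look like (not part of
any patch; MAX_TRACKED_FREE_BLOCKS is a made-up constant, and choosing a
real value is exactly the part I'm unsure about):

/* hypothetical cap on the number of tracked free blocks */
#define MAX_TRACKED_FREE_BLOCKS	4096

static void
ltsReleaseBlock(LogicalTapeSet *lts, long blocknum)
{
	/* Do nothing if we're no longer interested in remembering free space. */
	if (lts->forgetFreeSpace)
		return;

	/*
	 * Sketch of the cap: once the freelist is large, forget (leak) the
	 * block instead of tracking it for reuse.
	 */
	if (lts->nFreeBlocks >= MAX_TRACKED_FREE_BLOCKS)
		return;

	/* ... otherwise, existing code: enlarge freeBlocks[] and add blocknum ... */
}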
As an aside, I'm curious why the freelist is managed the way it is.
Newly-released blocks are likely to be higher in number (or at least
not the lowest in number), but they are added to the end of an array.
The array is therefore likely to require repeated re-sorting to get
back to descending order. Wouldn't a minheap or something make more
sense?
Regards,
Jeff Davis
On Wed, 2020-01-29 at 14:48 -0800, Jeff Davis wrote:
2) Use a different structure more capable of handling a large fraction
of free space. A compressed bitmap might make sense, but that seems
like overkill to waste effort tracking a lot of space that is unlikely
to ever be used.
I ended up converting the freelist to a min heap.
Attached is a patch which makes three changes to better support
HashAgg:
1. Use a minheap for the freelist. The original design used an array
that had to be sorted between a read (which frees a block) and a write
(which needs to sort the array to consume the lowest block number). The
comments said:
* sorted. This is an efficient way to handle it because we expect cycles
* of releasing many blocks followed by re-using many blocks, due to
* the larger read buffer.
But I didn't find a case where that actually wins over a simple
minheap. With that in mind, a minheap seems closer to what one might
expect for that purpose, and more robust when the assumptions don't
hold up as well. If someone knows of a case where the re-sorting
behavior is important, please let me know.
Changing to a minheap effectively solves the problem for HashAgg,
though in theory the memory consumption of the freelist itself could
become significant (though it's only 0.1% of the free space being
tracked).
2. Lazily-allocate the read buffer. The write buffer was always lazily-
allocated, so this patch creates better symmetry. More importantly, it
means freshly-rewound tapes don't have any buffer allocated, so it
greatly expands the number of tapes that can be managed efficiently as
long as only a limited number are active at once.
3. Allow expanding the number of tapes for an existing tape set. This
is useful for HashAgg, which doesn't know how many tapes will be needed
in advance.
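As a usage sketch for #3 (hypothetical caller code; only
LogicalTapeSetExtend() itself comes from the attached patch, the helper
and its callers are made up for illustration):

#include "utils/logtape.h"

/*
 * Hypothetical helper: make sure the tape set has at least "needed"
 * tapes, extending it on demand.  The newly added tapes are initialized
 * but, thanks to lazy buffer allocation, cost nothing until written.
 */
static LogicalTapeSet *
ensure_tape_count(LogicalTapeSet *lts, int *ntapes, int needed)
{
	if (needed > *ntapes)
	{
		lts = LogicalTapeSetExtend(lts, needed - *ntapes);
		*ntapes = needed;
	}
	return lts;
}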
Regards,
Jeff Davis
Attachments:
logtape.patch (text/x-patch)
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 42cfb1f9f98..20b27b3558b 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -49,12 +49,8 @@
* when reading, and read multiple blocks from the same tape in one go,
* whenever the buffer becomes empty.
*
- * To support the above policy of writing to the lowest free block,
- * ltsGetFreeBlock sorts the list of free block numbers into decreasing
- * order each time it is asked for a block and the list isn't currently
- * sorted. This is an efficient way to handle it because we expect cycles
- * of releasing many blocks followed by re-using many blocks, due to
- * the larger read buffer.
+ * To support the above policy of writing to the lowest free block, the
+ * freelist is a min heap.
*
* Since all the bookkeeping and buffer memory is allocated with palloc(),
* and the underlying file(s) are made with OpenTemporaryFile, all resources
@@ -170,7 +166,7 @@ struct LogicalTapeSet
/*
* File size tracking. nBlocksWritten is the size of the underlying file,
* in BLCKSZ blocks. nBlocksAllocated is the number of blocks allocated
- * by ltsGetFreeBlock(), and it is always greater than or equal to
+ * by ltsReleaseBlock(), and it is always greater than or equal to
* nBlocksWritten. Blocks between nBlocksAllocated and nBlocksWritten are
* blocks that have been allocated for a tape, but have not been written
* to the underlying file yet. nHoleBlocks tracks the total number of
@@ -188,15 +184,9 @@ struct LogicalTapeSet
* If forgetFreeSpace is true then any freed blocks are simply forgotten
* rather than being remembered in freeBlocks[]. See notes for
* LogicalTapeSetForgetFreeSpace().
- *
- * If blocksSorted is true then the block numbers in freeBlocks are in
- * *decreasing* order, so that removing the last entry gives us the lowest
- * free block. We re-sort the blocks whenever a block is demanded; this
- * should be reasonably efficient given the expected usage pattern.
*/
bool forgetFreeSpace; /* are we remembering free blocks? */
- bool blocksSorted; /* is freeBlocks[] currently in order? */
- long *freeBlocks; /* resizable array */
+ long *freeBlocks; /* resizable array holding minheap */
int nFreeBlocks; /* # of currently free blocks */
int freeBlocksLen; /* current allocated length of freeBlocks[] */
@@ -211,6 +201,7 @@ static long ltsGetFreeBlock(LogicalTapeSet *lts);
static void ltsReleaseBlock(LogicalTapeSet *lts, long blocknum);
static void ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
SharedFileSet *fileset);
+static void ltsInitTape(LogicalTape *lt);
/*
@@ -321,46 +312,88 @@ ltsReadFillBuffer(LogicalTapeSet *lts, LogicalTape *lt)
return (lt->nbytes > 0);
}
-/*
- * qsort comparator for sorting freeBlocks[] into decreasing order.
- */
-static int
-freeBlocks_cmp(const void *a, const void *b)
+static inline void
+swap_nodes(long *heap, int a, int b)
{
- long ablk = *((const long *) a);
- long bblk = *((const long *) b);
-
- /* can't just subtract because long might be wider than int */
- if (ablk < bblk)
- return 1;
- if (ablk > bblk)
- return -1;
- return 0;
+ long swap;
+
+ swap = heap[a];
+ heap[a] = heap[b];
+ heap[b] = swap;
+}
+
+static inline int
+left_offset(int i)
+{
+ return 2 * i + 1;
+}
+
+static inline int
+right_offset(int i)
+{
+ return 2 * i + 2;
+}
+
+static inline int
+parent_offset(int i)
+{
+ return (i - 1) / 2;
}
/*
- * Select a currently unused block for writing to.
+ * Select the lowest currently unused block by taking the first element from
+ * the freelist min heap.
*/
static long
ltsGetFreeBlock(LogicalTapeSet *lts)
{
- /*
- * If there are multiple free blocks, we select the one appearing last in
- * freeBlocks[] (after sorting the array if needed). If there are none,
- * assign the next block at the end of the file.
- */
- if (lts->nFreeBlocks > 0)
+ long *heap = lts->freeBlocks;
+ long blocknum;
+ int heapsize;
+ int pos;
+
+ /* freelist empty; allocate a new block */
+ if (lts->nFreeBlocks == 0)
+ return lts->nBlocksAllocated++;
+
+ if (lts->nFreeBlocks == 1)
{
- if (!lts->blocksSorted)
- {
- qsort((void *) lts->freeBlocks, lts->nFreeBlocks,
- sizeof(long), freeBlocks_cmp);
- lts->blocksSorted = true;
- }
- return lts->freeBlocks[--lts->nFreeBlocks];
+ lts->nFreeBlocks--;
+ return lts->freeBlocks[0];
}
- else
- return lts->nBlocksAllocated++;
+
+ /* take top of minheap */
+ blocknum = heap[0];
+
+ /* replace with end of minheap array */
+ heap[0] = heap[--lts->nFreeBlocks];
+
+ /* sift down */
+ pos = 0;
+ heapsize = lts->nFreeBlocks;
+ while (true)
+ {
+ int left = left_offset(pos);
+ int right = right_offset(pos);
+ int min_child;
+
+ if (left < heapsize && right < heapsize)
+ min_child = (heap[left] < heap[right]) ? left : right;
+ else if (left < heapsize)
+ min_child = left;
+ else if (right < heapsize)
+ min_child = right;
+ else
+ break;
+
+ if (heap[min_child] >= heap[pos])
+ break;
+
+ swap_nodes(heap, min_child, pos);
+ pos = min_child;
+ }
+
+ return blocknum;
}
/*
@@ -369,7 +402,8 @@ ltsGetFreeBlock(LogicalTapeSet *lts)
static void
ltsReleaseBlock(LogicalTapeSet *lts, long blocknum)
{
- int ndx;
+ long *heap;
+ int pos;
/*
* Do nothing if we're no longer interested in remembering free space.
@@ -387,14 +421,23 @@ ltsReleaseBlock(LogicalTapeSet *lts, long blocknum)
lts->freeBlocksLen * sizeof(long));
}
- /*
- * Add blocknum to array, and mark the array unsorted if it's no longer in
- * decreasing order.
- */
- ndx = lts->nFreeBlocks++;
- lts->freeBlocks[ndx] = blocknum;
- if (ndx > 0 && lts->freeBlocks[ndx - 1] < blocknum)
- lts->blocksSorted = false;
+ heap = lts->freeBlocks;
+ pos = lts->nFreeBlocks;
+
+ /* place entry at end of minheap array */
+ heap[pos] = blocknum;
+ lts->nFreeBlocks++;
+
+ /* sift up */
+ while (pos != 0)
+ {
+ int parent = parent_offset(pos);
+ if (heap[parent] < heap[pos])
+ break;
+
+ swap_nodes(heap, parent, pos);
+ pos = parent;
+ }
}
/*
@@ -486,6 +529,30 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
lts->nHoleBlocks = lts->nBlocksAllocated - nphysicalblocks;
}
+/*
+ * Initialize per-tape struct. Note we allocate the I/O buffer and the first
+ * block for a tape only when it is first actually written to. This avoids
+ * wasting memory space when tuplesort.c overestimates the number of tapes
+ * needed.
+ */
+static void
+ltsInitTape(LogicalTape *lt)
+{
+ lt->writing = true;
+ lt->frozen = false;
+ lt->dirty = false;
+ lt->firstBlockNumber = -1L;
+ lt->curBlockNumber = -1L;
+ lt->nextBlockNumber = -1L;
+ lt->offsetBlockNumber = 0L;
+ lt->buffer = NULL;
+ lt->buffer_size = 0;
+ /* palloc() larger than MaxAllocSize would fail */
+ lt->max_size = MaxAllocSize;
+ lt->pos = 0;
+ lt->nbytes = 0;
+}
+
/*
* Create a set of logical tapes in a temporary underlying file.
*
@@ -511,7 +578,6 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
int worker)
{
LogicalTapeSet *lts;
- LogicalTape *lt;
int i;
/*
@@ -524,35 +590,13 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
lts->nBlocksWritten = 0L;
lts->nHoleBlocks = 0L;
lts->forgetFreeSpace = false;
- lts->blocksSorted = true; /* a zero-length array is sorted ... */
lts->freeBlocksLen = 32; /* reasonable initial guess */
lts->freeBlocks = (long *) palloc(lts->freeBlocksLen * sizeof(long));
lts->nFreeBlocks = 0;
lts->nTapes = ntapes;
- /*
- * Initialize per-tape structs. Note we allocate the I/O buffer and the
- * first block for a tape only when it is first actually written to. This
- * avoids wasting memory space when tuplesort.c overestimates the number
- * of tapes needed.
- */
for (i = 0; i < ntapes; i++)
- {
- lt = &lts->tapes[i];
- lt->writing = true;
- lt->frozen = false;
- lt->dirty = false;
- lt->firstBlockNumber = -1L;
- lt->curBlockNumber = -1L;
- lt->nextBlockNumber = -1L;
- lt->offsetBlockNumber = 0L;
- lt->buffer = NULL;
- lt->buffer_size = 0;
- /* palloc() larger than MaxAllocSize would fail */
- lt->max_size = MaxAllocSize;
- lt->pos = 0;
- lt->nbytes = 0;
- }
+ ltsInitTape(&lts->tapes[i]);
/*
* Create temp BufFile storage as required.
@@ -773,15 +817,12 @@ LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum, size_t buffer_size)
lt->buffer_size = 0;
if (lt->firstBlockNumber != -1L)
{
- lt->buffer = palloc(buffer_size);
+ /*
+ * The buffer is lazily allocated in LogicalTapeRead(), but we set the
+ * size here.
+ */
lt->buffer_size = buffer_size;
}
-
- /* Read the first block, or reset if tape is empty */
- lt->nextBlockNumber = lt->firstBlockNumber;
- lt->pos = 0;
- lt->nbytes = 0;
- ltsReadFillBuffer(lts, lt);
}
/*
@@ -830,6 +871,22 @@ LogicalTapeRead(LogicalTapeSet *lts, int tapenum,
lt = &lts->tapes[tapenum];
Assert(!lt->writing);
+ if (lt->buffer == NULL)
+ {
+ /* lazily allocate buffer */
+ if (lt->firstBlockNumber != -1L)
+ {
+ Assert(lt->buffer_size > 0);
+ lt->buffer = palloc(lt->buffer_size);
+ }
+
+ /* Read the first block, or reset if tape is empty */
+ lt->nextBlockNumber = lt->firstBlockNumber;
+ lt->pos = 0;
+ lt->nbytes = 0;
+ ltsReadFillBuffer(lts, lt);
+ }
+
while (size > 0)
{
if (lt->pos >= lt->nbytes)
@@ -943,6 +1000,28 @@ LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum, TapeShare *share)
}
}
+/*
+ * Add additional tapes to this tape set.
+ */
+LogicalTapeSet *
+LogicalTapeSetExtend(LogicalTapeSet *lts, int nAdditional)
+{
+ int i;
+ int nTapesOrig = lts->nTapes;
+ Size newSize;
+
+ lts->nTapes += nAdditional;
+ newSize = offsetof(LogicalTapeSet, tapes) +
+ lts->nTapes * sizeof(LogicalTape);
+
+ lts = (LogicalTapeSet *) repalloc(lts, newSize);
+
+ for (i = nTapesOrig; i < lts->nTapes; i++)
+ ltsInitTape(&lts->tapes[i]);
+
+ return lts;
+}
+
/*
* Backspace the tape a given number of bytes. (We also support a more
* general seek interface, see below.)
diff --git a/src/include/utils/logtape.h b/src/include/utils/logtape.h
index 695d2c00ee4..3ebe52239f8 100644
--- a/src/include/utils/logtape.h
+++ b/src/include/utils/logtape.h
@@ -67,6 +67,8 @@ extern void LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum,
extern void LogicalTapeRewindForWrite(LogicalTapeSet *lts, int tapenum);
extern void LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum,
TapeShare *share);
+extern LogicalTapeSet *LogicalTapeSetExtend(LogicalTapeSet *lts,
+ int nAdditional);
extern size_t LogicalTapeBackspace(LogicalTapeSet *lts, int tapenum,
size_t size);
extern void LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
On Mon, 2020-02-03 at 10:29 -0800, Jeff Davis wrote:
> I ended up converting the freelist to a min heap.
> Attached is a patch which makes three changes to better support
> HashAgg:
And now I'm attaching another version of the main Hash Aggregation
patch to be applied on top of the logtape.c patch.
Not a lot of changes from the last version; mostly some cleanup and
rebasing. But it's faster now with the logtape.c changes.
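As a rough illustration of how spilled tuples are routed to partitions (a
simplified, standalone sketch with a made-up function name, not code from
the patch): each spill level selects a partition from previously-unused
high bits of the grouping hash, so recursive spills partition on fresh
bits rather than re-splitting on the same ones. Assuming a power-of-two
partition count:

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Simplified sketch: pick a spill partition from the high bits of a
     * 32-bit grouping hash that earlier recursion levels (used_bits) have
     * not consumed yet, with npartitions = 2^partition_bits.
     */
    static int
    choose_spill_partition(uint32_t hash, int used_bits, int partition_bits)
    {
        int         shift = 32 - used_bits - partition_bits;
        uint32_t    mask = (((uint32_t) 1 << partition_bits) - 1) << shift;

        return (int) ((hash & mask) >> shift);
    }

    int
    main(void)
    {
        /* first spill level with 4 partitions: the top 2 bits decide */
        printf("%d\n", choose_spill_partition(0xC0000000u, 0, 2));  /* 3 */
        /* a recursive spill at used_bits = 2 looks at the next 2 bits */
        printf("%d\n", choose_spill_partition(0xF0000000u, 2, 2));  /* 3 */
        return 0;
    }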
Regards,
Jeff Davis
Attachments:
logtape.patch (text/x-patch; charset=UTF-8; identical to the version attached above)
hashagg-20200203.patch (text/x-patch; charset=UTF-8)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c1128f89ec7..85f559387f9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1751,6 +1751,23 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-hashagg-mem-overflow" xreflabel="hashagg_mem_overflow">
+ <term><varname>hashagg_mem_overflow</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>hashagg_mem_overflow</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ If hash aggregation exceeds <varname>work_mem</varname> at query
+ execution time, and <varname>hashagg_mem_overflow</varname> is set
+ to <literal>on</literal>, continue consuming more memory rather than
+ performing disk-based hash aggregation. The default
+ is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
@@ -4476,6 +4493,24 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-hashagg-spill" xreflabel="enable_hashagg_spill">
+ <term><varname>enable_hashagg_spill</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_hashagg_spill</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the memory usage is expected to
+ exceed <varname>work_mem</varname>. This only affects the planner
+ choice; actual behavior at execution time is dictated by
+ <xref linkend="guc-hashagg-mem-overflow"/>. The default
+ is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
<term><varname>enable_hashjoin</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c367c750b19..70fecaa7261 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -104,6 +104,7 @@ static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_hashagg_info(AggState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
static void show_instrumentation_count(const char *qlabel, int which,
@@ -1881,6 +1882,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Agg:
show_agg_keys(castNode(AggState, planstate), ancestors, es);
show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+ if (es->analyze)
+ show_hashagg_info((AggState *) planstate, es);
if (plan->qual)
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
@@ -2749,6 +2752,55 @@ show_hash_info(HashState *hashstate, ExplainState *es)
}
}
+/*
+ * If EXPLAIN ANALYZE, show information on hash aggregate memory usage and
+ * batches.
+ */
+static void
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+
+ Assert(IsA(aggstate, AggState));
+
+ if (agg->aggstrategy != AGG_HASHED &&
+ agg->aggstrategy != AGG_MIXED)
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(
+ es->str,
+ "Memory Usage: %ldkB",
+ memPeakKb);
+
+ if (aggstate->hash_batches_used > 0)
+ {
+ appendStringInfo(
+ es->str,
+ " Batches: %d Disk: %ldkB",
+ aggstate->hash_batches_used, aggstate->hash_disk_used);
+ }
+
+ appendStringInfo(
+ es->str,
+ "\n");
+ }
+ else
+ {
+ ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
+ if (aggstate->hash_batches_used > 0)
+ {
+ ExplainPropertyInteger("HashAgg Batches", NULL,
+ aggstate->hash_batches_used, es);
+ ExplainPropertyInteger("Disk Usage", "kB",
+ aggstate->hash_disk_used, es);
+ }
+ }
+}
+
/*
* If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
*/
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 8619246c8e0..6f64a2abd2f 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -79,7 +79,8 @@ static void ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash);
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled);
/*
@@ -2927,7 +2928,7 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
*/
ExprState *
ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
- bool doSort, bool doHash)
+ bool doSort, bool doHash, bool spilled)
{
ExprState *state = makeNode(ExprState);
PlanState *parent = &aggstate->ss.ps;
@@ -3160,7 +3161,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (setno = 0; setno < processGroupingSets; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, false);
+ pertrans, transno, setno, setoff, false,
+ spilled);
setoff++;
}
}
@@ -3178,7 +3180,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (setno = 0; setno < numHashes; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, true);
+ pertrans, transno, setno, setoff, true,
+ spilled);
setoff++;
}
}
@@ -3226,7 +3229,8 @@ static void
ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash)
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled)
{
int adjust_init_jumpnull = -1;
int adjust_strict_jumpnull = -1;
@@ -3248,7 +3252,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
fcinfo->flinfo->fn_strict &&
pertrans->initValueIsNull)
{
- scratch->opcode = EEOP_AGG_INIT_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_INIT_TRANS_SPILLED : EEOP_AGG_INIT_TRANS;
scratch->d.agg_init_trans.aggstate = aggstate;
scratch->d.agg_init_trans.pertrans = pertrans;
scratch->d.agg_init_trans.setno = setno;
@@ -3265,7 +3270,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
if (pertrans->numSortCols == 0 &&
fcinfo->flinfo->fn_strict)
{
- scratch->opcode = EEOP_AGG_STRICT_TRANS_CHECK;
+ scratch->opcode = spilled ?
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED : EEOP_AGG_STRICT_TRANS_CHECK;
scratch->d.agg_strict_trans_check.aggstate = aggstate;
scratch->d.agg_strict_trans_check.setno = setno;
scratch->d.agg_strict_trans_check.setoff = setoff;
@@ -3283,9 +3289,11 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
/* invoke appropriate transition implementation */
if (pertrans->numSortCols == 0 && pertrans->transtypeByVal)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS_BYVAL;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED : EEOP_AGG_PLAIN_TRANS_BYVAL;
else if (pertrans->numSortCols == 0)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_SPILLED : EEOP_AGG_PLAIN_TRANS;
else if (pertrans->numInputs == 1)
scratch->opcode = EEOP_AGG_ORDERED_TRANS_DATUM;
else
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 061752ea9c1..094ed7c34aa 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -430,9 +430,13 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
&&CASE_EEOP_AGG_INIT_TRANS,
+ &&CASE_EEOP_AGG_INIT_TRANS_SPILLED,
&&CASE_EEOP_AGG_STRICT_TRANS_CHECK,
+ &&CASE_EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_SPILLED,
&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
&&CASE_EEOP_LAST
@@ -1625,6 +1629,36 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ aggstate = op->d.agg_init_trans.aggstate;
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_init_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
+
+ /* If transValue has not yet been initialized, do so now. */
+ if (pergroup->noTransValue)
+ {
+ AggStatePerTrans pertrans = op->d.agg_init_trans.pertrans;
+
+ aggstate->curaggcontext = op->d.agg_init_trans.aggcontext;
+ aggstate->current_set = op->d.agg_init_trans.setno;
+
+ ExecAggInitGroup(aggstate, pertrans, pergroup);
+
+ /* copied trans value from input, done this round */
+ EEO_JUMP(op->d.agg_init_trans.jumpnull);
+ }
+
+ EEO_NEXT();
+ }
/* check that a strict aggregate's input isn't NULL */
EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK)
@@ -1642,6 +1676,25 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ aggstate = op->d.agg_strict_trans_check.aggstate;
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_strict_trans_check.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_strict_trans_check.transno];
+
+ if (unlikely(pergroup->transValueIsNull))
+ EEO_JUMP(op->d.agg_strict_trans_check.jumpnull);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1691,6 +1744,52 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ aggstate = op->d.agg_trans.aggstate;
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ Assert(pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1735,6 +1834,67 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
newVal = FunctionCallInvoke(fcinfo);
+ /*
+ * For pass-by-ref datatype, must copy the new value into
+ * aggcontext and free the prior transValue. But if transfn
+ * returned a pointer to its first input, we don't need to do
+ * anything. Also, if transfn returned a pointer to a R/W
+ * expanded object that is already a child of the aggcontext,
+ * assume we can adopt that value without copying it.
+ */
+ if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+ newVal = ExecAggTransReparent(aggstate, pertrans,
+ newVal, fcinfo->isnull,
+ pergroup->transValue,
+ pergroup->transValueIsNull);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ AggState *aggstate;
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ aggstate = op->d.agg_trans.aggstate;
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ Assert(!pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
/*
* For pass-by-ref datatype, must copy the new value into
* aggcontext and free the prior transValue. But if transfn
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 3603c58b63e..94439e2ab9e 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -25,8 +25,9 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
-static uint32 TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple);
static int TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
+static TupleHashEntry LookupTupleHashEntry_internal(
+ TupleHashTable hashtable, TupleTableSlot *slot, bool *isnew, uint32 hash);
/*
* Define parameters for tuple hash table code generation. The interface is
@@ -300,10 +301,9 @@ TupleHashEntry
LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
bool *isnew)
{
- TupleHashEntryData *entry;
- MemoryContext oldContext;
- bool found;
- MinimalTuple key;
+ TupleHashEntry entry;
+ MemoryContext oldContext;
+ uint32 hash;
/* Need to run the hash functions in short-lived context */
oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
@@ -313,32 +313,29 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
hashtable->cur_eq_func = hashtable->tab_eq_func;
- key = NULL; /* flag to reference inputslot */
+ hash = TupleHashTableHash(hashtable->hashtab, NULL);
+ entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
- if (isnew)
- {
- entry = tuplehash_insert(hashtable->hashtab, key, &found);
+ MemoryContextSwitchTo(oldContext);
- if (found)
- {
- /* found pre-existing entry */
- *isnew = false;
- }
- else
- {
- /* created new entry */
- *isnew = true;
- /* zero caller data */
- entry->additional = NULL;
- MemoryContextSwitchTo(hashtable->tablecxt);
- /* Copy the first tuple into the table context */
- entry->firstTuple = ExecCopySlotMinimalTuple(slot);
- }
- }
- else
- {
- entry = tuplehash_lookup(hashtable->hashtab, key);
- }
+ return entry;
+}
+
+/*
+ * A variant of LookupTupleHashEntry for callers that have already computed
+ * the hash value.
+ */
+TupleHashEntry
+LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot,
+ bool *isnew, uint32 hash)
+{
+ TupleHashEntry entry;
+ MemoryContext oldContext;
+
+ /* Need to run the hash functions in short-lived context */
+ oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
+
+ entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
MemoryContextSwitchTo(oldContext);
@@ -389,7 +386,7 @@ FindTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
* Also, the caller must select an appropriate memory context for running
* the hash functions. (dynahash.c doesn't change CurrentMemoryContext.)
*/
-static uint32
+uint32
TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
{
TupleHashTable hashtable = (TupleHashTable) tb->private_data;
@@ -450,6 +447,54 @@ TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
return murmurhash32(hashkey);
}
+/*
+ * Does the work of LookupTupleHashEntry and LookupTupleHashEntryHash. Useful
+ * so that we can avoid switching the memory context multiple times for
+ * LookupTupleHashEntry.
+ */
+static TupleHashEntry
+LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot,
+ bool *isnew, uint32 hash)
+{
+ TupleHashEntryData *entry;
+ bool found;
+ MinimalTuple key;
+
+ /* set up data needed by hash and match functions */
+ hashtable->inputslot = slot;
+ hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ hashtable->cur_eq_func = hashtable->tab_eq_func;
+
+ key = NULL; /* flag to reference inputslot */
+
+ if (isnew)
+ {
+ entry = tuplehash_insert_hash(hashtable->hashtab, key, hash, &found);
+
+ if (found)
+ {
+ /* found pre-existing entry */
+ *isnew = false;
+ }
+ else
+ {
+ /* created new entry */
+ *isnew = true;
+ /* zero caller data */
+ entry->additional = NULL;
+ MemoryContextSwitchTo(hashtable->tablecxt);
+ /* Copy the first tuple into the table context */
+ entry->firstTuple = ExecCopySlotMinimalTuple(slot);
+ }
+ }
+ else
+ {
+ entry = tuplehash_lookup_hash(hashtable->hashtab, key, hash);
+ }
+
+ return entry;
+}
+
/*
* See whether two tuples (presumably of the same hash value) match
*/
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 9073395eacf..8f801b3b53a 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -194,6 +194,29 @@
* transition values. hashcontext is the single context created to support
* all hash tables.
*
+ * Spilling To Disk
+ *
+ * When performing hash aggregation, if the hash table memory exceeds the
+ * limit (see hashagg_check_limits()), we enter "spill mode". In spill
+ * mode, we advance the transition states only for groups already in the
+ * hash table. For tuples that would need to create new hash table
+ * entries (and initialize new transition states), we instead spill them to
+ * disk to be processed later. The tuples are spilled in a partitioned
+ * manner, so that subsequent batches are smaller and less likely to exceed
+ * work_mem (if a batch does exceed work_mem, it must be spilled
+ * recursively).
+ *
+ * Spilled data is written to logical tapes. These provide better control
+ * over memory usage, disk space, and the number of files than if we were
+ * to use a BufFile for each spill.
+ *
+ * Note that it's possible for transition states to start small but then
+ * grow very large; for instance in the case of ARRAY_AGG. In such cases,
+ * it's still possible to significantly exceed work_mem. We try to avoid
+ * this situation by estimating what will fit in the available memory, and
+ * imposing a limit on the number of groups separately from the amount of
+ * memory consumed.
+ *
* Transition / Combine function invocation:
*
* For performance reasons transition functions, including combine
@@ -233,12 +256,98 @@
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/datum.h"
+#include "utils/dynahash.h"
#include "utils/expandeddatum.h"
+#include "utils/logtape.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+/*
+ * Control how many partitions are created when spilling HashAgg to
+ * disk.
+ *
+ * HASHAGG_PARTITION_FACTOR is multiplied by the estimated number of
+ * partitions needed such that each partition will fit in memory. The factor
+ * is set higher than one because there's not a high cost to having a few too
+ * many partitions, and it makes it less likely that a partition will need to
+ * be spilled recursively. Another benefit of having more, smaller partitions
+ * is that small hash tables may perform better than large ones due to memory
+ * caching effects.
+ *
+ * We also specify a min and max number of partitions per spill. Too few might
+ * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
+ * many will result in lots of memory wasted buffering the spill files (which
+ * could instead be spent on a larger hash table).
+ *
+ * For reading from tapes, the buffer size must be a multiple of
+ * BLCKSZ. Larger values help when reading from multiple tapes concurrently,
+ * but that doesn't happen in HashAgg, so we simply use BLCKSZ. Writing to a
+ * tape always uses a buffer of size BLCKSZ.
+ */
+#define HASHAGG_PARTITION_FACTOR 1.50
+#define HASHAGG_MIN_PARTITIONS 4
+#define HASHAGG_READ_BUFFER_SIZE BLCKSZ
+#define HASHAGG_WRITE_BUFFER_SIZE BLCKSZ
+
+/*
+ * Track all tapes needed for a HashAgg that spills. We don't know the maximum
+ * number of tapes needed at the start of the algorithm (because it can
+ * recurse), so one tape set is allocated and extended as needed for new
+ * tapes. When a particular tape is already read, rewind it for write mode and
+ * put it in the free list.
+ *
+ * Tapes' buffers can take up substantial memory when many tapes are open at
+ * once. We only need one tape open at a time in read mode (using a buffer
+ * that's a multiple of BLCKSZ); but we need up to HASHAGG_MAX_PARTITIONS
+ * tapes open in write mode (each requiring a buffer of size BLCKSZ).
+ */
+typedef struct HashTapeInfo
+{
+ LogicalTapeSet *tapeset;
+ int ntapes;
+ int *freetapes;
+ int nfreetapes;
+} HashTapeInfo;
+
+/*
+ * Represents partitioned spill data for a single hashtable. Contains the
+ * necessary information to route tuples to the correct partition, and to
+ * transform the spilled data into new batches.
+ *
+ * The high bits are used for partition selection (when recursing, we ignore
+ * the bits that have already been used for partition selection at an earlier
+ * level).
+ */
+typedef struct HashAggSpill
+{
+ HashTapeInfo *tapeinfo; /* borrowed reference to tape info */
+ int npartitions; /* number of partitions */
+ int *partitions; /* spill partition tape numbers */
+ int64 *ntuples; /* number of tuples in each partition */
+ uint32 mask; /* mask to find partition from hash value */
+ int shift; /* after masking, shift by this amount */
+} HashAggSpill;
+
+/*
+ * Represents work to be done for one pass of hash aggregation (with only one
+ * grouping set).
+ *
+ * Also tracks the bits of the hash already used for partition selection by
+ * earlier iterations, so that this batch can use new bits. If all bits have
+ * already been used, no partitioning will be done (any spilled data will go
+ * to a single output tape).
+ */
+typedef struct HashAggBatch
+{
+ int setno; /* grouping set */
+ int used_bits; /* number of bits of hash already used */
+ HashTapeInfo *tapeinfo; /* borrowed reference to tape info */
+ int input_tapenum; /* input partition tape */
+ int64 input_tuples; /* number of tuples in this batch */
+} HashAggBatch;
+
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
static void initialize_phase(AggState *aggstate, int newphase);
static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -269,15 +378,53 @@ static void prepare_projection_slot(AggState *aggstate,
static void finalize_aggregates(AggState *aggstate,
AggStatePerAgg peragg,
AggStatePerGroup pergroup);
-static TupleTableSlot *project_aggregates(AggState *aggstate);
-static Bitmapset *find_unaggregated_cols(AggState *aggstate);
-static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
-static void build_hash_table(AggState *aggstate);
-static TupleHashEntryData *lookup_hash_entry(AggState *aggstate);
+static void build_hash_tables(AggState *aggstate);
+static void build_hash_table(AggState *aggstate, int setno,
+ long nbuckets);
+static void prepare_hash_slot(AggState *aggstate);
+static void hashagg_recompile_expressions(AggState *aggstate);
+static uint32 calculate_hash(AggState *aggstate);
+static long hash_choose_num_buckets(AggState *aggstate,
+ long estimated_nbuckets,
+ Size memory);
+static int hash_choose_num_partitions(uint64 input_groups,
+ double hashentrysize,
+ int used_bits,
+ int *log2_npartitions);
+static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
+static bool agg_refill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
+static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
+
+/* Hash Aggregation helpers */
+static TupleTableSlot *project_aggregates(AggState *aggstate);
+static Bitmapset *find_unaggregated_cols(AggState *aggstate);
+static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
+static void hashagg_set_limits(AggState *aggstate, uint64 input_groups,
+ int used_bits);
+static void hashagg_check_limits(AggState *aggstate);
+static void hashagg_finish_initial_spills(AggState *aggstate);
+static void hashagg_reset_spill_state(AggState *aggstate);
+
+/* Structure APIs */
+static HashAggBatch *hashagg_batch_new(HashTapeInfo *tapeinfo,
+ int input_tapenum, int setno,
+ int64 input_tuples, int used_bits);
+static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
+static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
+ int used_bits, uint64 input_tuples,
+ double hashentrysize);
+static Size hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot,
+ uint32 hash);
+static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
+ int setno);
+static void hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *dest,
+ int ndest);
+static void hashagg_tapeinfo_release(HashTapeInfo *tapeinfo, int tapenum);
+
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
AggState *aggstate, EState *estate,
@@ -1233,7 +1380,7 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
}
/*
- * (Re-)initialize the hash table(s) to empty.
+ * Initialize the hash table(s).
*
* To implement hashed aggregation, we need a hashtable that stores a
* representative tuple and an array of AggStatePerGroup structs for each
@@ -1244,44 +1391,79 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
* We have a separate hashtable and associated perhash data structure for each
* grouping set for which we're doing hashing.
*
- * The contents of the hash tables always live in the hashcontext's per-tuple
- * memory context (there is only one of these for all tables together, since
- * they are all reset at the same time).
+ * The hash tables and their contents always live in the hashcontext's
+ * per-tuple memory context (there is only one of these for all tables
+ * together, since they are all reset at the same time).
*/
static void
-build_hash_table(AggState *aggstate)
+build_hash_tables(AggState *aggstate)
{
- MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
- Size additionalsize;
- int i;
-
- Assert(aggstate->aggstrategy == AGG_HASHED || aggstate->aggstrategy == AGG_MIXED);
-
- additionalsize = aggstate->numtrans * sizeof(AggStatePerGroupData);
+ int setno;
- for (i = 0; i < aggstate->num_hashes; ++i)
+ for (setno = 0; setno < aggstate->num_hashes; ++setno)
{
- AggStatePerHash perhash = &aggstate->perhash[i];
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ long nbuckets;
+ Size memory;
Assert(perhash->aggnode->numGroups > 0);
- if (perhash->hashtable)
- ResetTupleHashTable(perhash->hashtable);
- else
- perhash->hashtable = BuildTupleHashTableExt(&aggstate->ss.ps,
- perhash->hashslot->tts_tupleDescriptor,
- perhash->numCols,
- perhash->hashGrpColIdxHash,
- perhash->eqfuncoids,
- perhash->hashfunctions,
- perhash->aggnode->grpCollations,
- perhash->aggnode->numGroups,
- additionalsize,
- aggstate->ss.ps.state->es_query_cxt,
- aggstate->hashcontext->ecxt_per_tuple_memory,
- tmpmem,
- DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
+ memory = aggstate->hash_mem_limit / aggstate->num_hashes;
+
+ /* choose reasonable number of buckets per hashtable */
+ nbuckets = hash_choose_num_buckets(
+ aggstate, perhash->aggnode->numGroups, memory);
+
+ build_hash_table(aggstate, setno, nbuckets);
}
+
+ aggstate->hash_alloc_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_alloc_last = aggstate->hash_alloc_current;
+ aggstate->hash_ngroups_current = 0;
+}
+
+/*
+ * Build a single hashtable for this grouping set. Pass the hash memory
+ * context as both metacxt and tablecxt, so that resetting the hashcontext
+ * will free all memory including metadata. That means that we cannot reset
+ * the hash table to empty and reuse it, though (see execGrouping.c).
+ */
+static void
+build_hash_table(AggState *aggstate, int setno, long nbuckets)
+{
+ TupleHashTable table;
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ MemoryContext hashmem = aggstate->hashcontext->ecxt_per_tuple_memory;
+ MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
+ Size additionalsize;
+
+ Assert(aggstate->aggstrategy == AGG_HASHED ||
+ aggstate->aggstrategy == AGG_MIXED);
+
+ /*
+ * Used to make sure initial hash table allocation does not exceed
+ * work_mem. Note that the estimate does not include space for
+ * pass-by-reference transition data values, nor for the representative
+ * tuple of each group.
+ */
+ additionalsize = aggstate->numtrans * sizeof(AggStatePerGroupData);
+
+ table = BuildTupleHashTableExt(&aggstate->ss.ps,
+ perhash->hashslot->tts_tupleDescriptor,
+ perhash->numCols,
+ perhash->hashGrpColIdxHash,
+ perhash->eqfuncoids,
+ perhash->hashfunctions,
+ perhash->aggnode->grpCollations,
+ nbuckets,
+ additionalsize,
+ hashmem,
+ hashmem,
+ tmpmem,
+ DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
+
+ perhash->hashtable = table;
}
/*
@@ -1423,42 +1605,31 @@ find_hash_columns(AggState *aggstate)
/*
* Estimate per-hash-table-entry overhead for the planner.
- *
- * Note that the estimate does not include space for pass-by-reference
- * transition data values, nor for the representative tuple of each group.
- * Nor does this account of the target fill-factor and growth policy of the
- * hash table.
*/
Size
-hash_agg_entry_size(int numAggs)
+hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
{
- Size entrysize;
-
- /* This must match build_hash_table */
- entrysize = sizeof(TupleHashEntryData) +
- numAggs * sizeof(AggStatePerGroupData);
- entrysize = MAXALIGN(entrysize);
-
- return entrysize;
+ return
+ /* key */
+ MAXALIGN(SizeofMinimalTupleHeader) +
+ MAXALIGN(tupleWidth) +
+ /* data */
+ MAXALIGN(sizeof(TupleHashEntryData) +
+ numAggs * sizeof(AggStatePerGroupData)) +
+ transitionSpace;
}
/*
- * Find or create a hashtable entry for the tuple group containing the current
- * tuple (already set in tmpcontext's outertuple slot), in the current grouping
- * set (which the caller must have selected - note that initialize_aggregate
- * depends on this).
- *
- * When called, CurrentMemoryContext should be the per-query context.
+ * Extract the attributes that make up the grouping key into the
+ * hashslot. This is necessary to compute the hash of the grouping key.
*/
-static TupleHashEntryData *
-lookup_hash_entry(AggState *aggstate)
+static void
+prepare_hash_slot(AggState *aggstate)
{
- TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
- TupleTableSlot *hashslot = perhash->hashslot;
- TupleHashEntryData *entry;
- bool isnew;
- int i;
+ TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ int i;
/* transfer just the needed columns into hashslot */
slot_getsomeattrs(inputslot, perhash->largestGrpColIdx);
@@ -1472,14 +1643,313 @@ lookup_hash_entry(AggState *aggstate)
hashslot->tts_isnull[i] = inputslot->tts_isnull[varNumber];
}
ExecStoreVirtualTuple(hashslot);
+}
+
+/*
+ * Recompile the expressions for advancing aggregates while hashing. This is
+ * necessary for certain kinds of state changes that affect the resulting
+ * expression. For instance, changing aggstate->hash_ever_spilled or
+ * aggstate->ss.ps.outerops requires recompilation.
+ *
+ * A compiled expression where hash_ever_spilled is true will work even when
+ * hash_spill_mode is false, because it merely introduces additional branches
+ * that are unnecessary when hash_spill_mode is false. That allows us to only
+ * recompile when hash_ever_spilled changes, rather than every time
+ * hash_spill_mode changes.
+ */
+static void
+hashagg_recompile_expressions(AggState *aggstate)
+{
+ AggStatePerPhase phase;
+
+ Assert(aggstate->aggstrategy == AGG_HASHED ||
+ aggstate->aggstrategy == AGG_MIXED);
+
+ if (aggstate->aggstrategy == AGG_HASHED)
+ phase = &aggstate->phases[0];
+ else /* AGG_MIXED */
+ phase = &aggstate->phases[1];
+
+ phase->evaltrans = ExecBuildAggTrans(
+ aggstate, phase,
+ aggstate->aggstrategy == AGG_MIXED ? true : false, /* dosort */
+ true, /* dohash */
+ aggstate->hash_ever_spilled);
+}
+
+/*
+ * Calculate the hash value for a tuple. It's useful to do this outside of the
+ * hash table so that we can reuse saved hash values rather than recomputing.
+ */
+static uint32
+calculate_hash(AggState *aggstate)
+{
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleHashTable hashtable = perhash->hashtable;
+ MemoryContext oldContext;
+ uint32 hash;
+
+ /* set up data needed by hash and match functions */
+ hashtable->inputslot = perhash->hashslot;
+ hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ hashtable->cur_eq_func = hashtable->tab_eq_func;
+
+ /* Need to run the hash functions in short-lived context */
+ oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
+
+ hash = TupleHashTableHash(hashtable->hashtab, NULL);
+
+ MemoryContextSwitchTo(oldContext);
+
+ return hash;
+}
+
+/*
+ * Set limits that trigger spilling to avoid exceeding work_mem. Consider the
+ * number of partitions we expect to create (if we do spill).
+ *
+ * There are two limits: a memory limit, and also an ngroups limit. The
+ * ngroups limit becomes important when we expect transition values to grow
+ * substantially larger than the initial value.
+ */
+static void
+hashagg_set_limits(AggState *aggstate, uint64 input_groups, int used_bits)
+{
+ int npartitions;
+ Size partition_mem;
+
+ /* no attempt to obey work_mem */
+ if (hashagg_mem_overflow)
+ {
+ aggstate->hash_mem_limit = SIZE_MAX;
+ aggstate->hash_ngroups_limit = LONG_MAX;
+ return;
+ }
+
+ /* if not expected to spill, use all of work_mem */
+ if (input_groups * aggstate->hashentrysize < work_mem * 1024L)
+ {
+ aggstate->hash_mem_limit = work_mem * 1024L;
+ aggstate->hash_ngroups_limit =
+ aggstate->hash_mem_limit / aggstate->hashentrysize;
+ return;
+ }
+
+ /*
+ * Calculate expected memory requirements for spilling, which is the size
+ * of the buffers needed for all the tapes that need to be open at
+ * once. Then, subtract that from the memory available for holding hash
+ * tables.
+ */
+ npartitions = hash_choose_num_partitions(input_groups,
+ aggstate->hashentrysize,
+ used_bits,
+ NULL);
+ partition_mem =
+ HASHAGG_READ_BUFFER_SIZE +
+ HASHAGG_WRITE_BUFFER_SIZE * npartitions;
+
+ /*
+ * Don't set the limit below 3/4 of work_mem. In that case, we are at the
+ * minimum number of partitions, so we aren't going to dramatically exceed
+ * work mem anyway.
+ */
+ if (work_mem * 1024L > 4 * partition_mem)
+ aggstate->hash_mem_limit = work_mem * 1024L - partition_mem;
+ else
+ aggstate->hash_mem_limit = work_mem * 1024L * 0.75;
+
+ if (aggstate->hash_mem_limit > aggstate->hashentrysize)
+ aggstate->hash_ngroups_limit =
+ aggstate->hash_mem_limit / aggstate->hashentrysize;
+ else
+ aggstate->hash_ngroups_limit = 1;
+}
+
+/*
+ * hashagg_check_limits
+ *
+ * After adding a new group to the hash table, check whether we need to enter
+ * spill mode. Allocations may happen without adding new groups (for instance,
+ * if the transition state size grows), so this check is imperfect.
+ *
+ * Memory usage is tracked by how much is allocated to the underlying memory
+ * context, not individual chunks. This is more accurate because it accounts
+ * for all memory in the context, and also accounts for fragmentation and
+ * other forms of overhead and waste that can be difficult to estimate. It's
+ * also cheaper because we don't have to track each chunk.
+ *
+ * When memory is first allocated to a memory context, it is not actually
+ * used. So when the next allocation happens, we consider the
+ * previously-allocated amount to be the memory currently used.
+ */
+static void
+hashagg_check_limits(AggState *aggstate)
+{
+ Size allocation;
+
+ /*
+ * Even if already in spill mode, it's possible for memory usage to grow,
+ * and we should still track it for the purposes of EXPLAIN ANALYZE.
+ */
+ allocation = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+
+ /* has allocation grown since the last observation? */
+ if (allocation > aggstate->hash_alloc_current)
+ {
+ aggstate->hash_alloc_last = aggstate->hash_alloc_current;
+ aggstate->hash_alloc_current = allocation;
+ }
+
+ if (aggstate->hash_alloc_last > aggstate->hash_mem_peak)
+ aggstate->hash_mem_peak = aggstate->hash_alloc_last;
+
+ /*
+ * Don't spill unless there's at least one group in the hash table so we
+ * can be sure to make progress even in edge cases.
+ */
+ if (aggstate->hash_ngroups_current > 0 &&
+ (aggstate->hash_alloc_last > aggstate->hash_mem_limit ||
+ aggstate->hash_ngroups_current > aggstate->hash_ngroups_limit))
+ {
+ aggstate->hash_spill_mode = true;
+
+ if (!aggstate->hash_ever_spilled)
+ {
+ aggstate->hash_ever_spilled = true;
+ aggstate->hash_spills = palloc0(
+ sizeof(HashAggSpill) * aggstate->num_hashes);
+ aggstate->hash_tapeinfo = palloc0(sizeof(HashTapeInfo));
+ hashagg_recompile_expressions(aggstate);
+ }
+ }
+}
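
As a toy illustration of the lagged accounting described above (the numbers
and the 100-byte limit are made up), only the previously-observed allocation
total is compared against the limit:

    #include <stdio.h>
    #include <stddef.h>

    int
    main(void)
    {
        size_t limit = 100;
        size_t alloc_last = 0;
        size_t alloc_current = 0;
        size_t observed[] = {40, 80, 120, 160};   /* successive allocation totals */

        for (int i = 0; i < 4; i++)
        {
            /* has allocation grown since the last observation? */
            if (observed[i] > alloc_current)
            {
                alloc_last = alloc_current;   /* previous total is treated as "used" */
                alloc_current = observed[i];
            }
            printf("allocated=%zu considered-used=%zu spill=%s\n",
                   observed[i], alloc_last, alloc_last > limit ? "yes" : "no");
        }
        return 0;
    }

So the spill decision lags one observation behind the raw allocation total,
which is the intended behavior.
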
+
+/*
+ * Choose a reasonable number of buckets for the initial hash table size.
+ */
+static long
+hash_choose_num_buckets(AggState *aggstate, long ngroups, Size memory)
+{
+ long max_nbuckets;
+ int log2_ngroups;
+ long nbuckets;
+
+ max_nbuckets = memory / aggstate->hashentrysize;
+
+ /*
+ * Leave room for slop to avoid a case where the initial hash table size
+ * exceeds the memory limit (though that may still happen in edge cases).
+ */
+ max_nbuckets *= 0.75;
+
+ /*
+ * Find the lowest power of two that is at least ngroups, without
+ * exceeding max_nbuckets.
+ */
+ for (log2_ngroups = 1, nbuckets = 2;
+ nbuckets < ngroups && nbuckets < max_nbuckets;
+ log2_ngroups++, nbuckets <<= 1);
+
+ if (nbuckets > max_nbuckets && nbuckets > 2)
+ nbuckets >>= 1;
+
+ return nbuckets;
+}
+
+/*
+ * Determine the number of partitions to create when spilling, which will
+ * always be a power of two. If log2_npartitions is non-NULL, set
+ * *log2_npartitions to the log2() of the number of partitions.
+ */
+static int
+hash_choose_num_partitions(uint64 input_groups, double hashentrysize,
+ int used_bits, int *log2_npartitions)
+{
+ Size mem_wanted;
+ int partition_limit;
+ int npartitions;
+ int partition_bits;
+
+ /*
+ * Avoid creating so many partitions that the memory requirements of the
+ * open partition files (estimated at BLCKSZ for buffering) are greater
+ * than 1/4 of work_mem.
+ */
+ partition_limit =
+ (work_mem * 1024L * 0.25 - HASHAGG_READ_BUFFER_SIZE) /
+ HASHAGG_WRITE_BUFFER_SIZE;
+
+ /* pessimistically estimate that each input tuple creates a new group */
+ mem_wanted = HASHAGG_PARTITION_FACTOR * input_groups * hashentrysize;
+
+ /* make enough partitions so that each one is likely to fit in memory */
+ npartitions = 1 + (mem_wanted / (work_mem * 1024L));
+
+ if (npartitions > partition_limit)
+ npartitions = partition_limit;
+
+ if (npartitions < HASHAGG_MIN_PARTITIONS)
+ npartitions = HASHAGG_MIN_PARTITIONS;
+ if (npartitions > HASHAGG_MAX_PARTITIONS)
+ npartitions = HASHAGG_MAX_PARTITIONS;
+
+ /* ceil(log2(npartitions)) */
+ partition_bits = my_log2(npartitions);
+
+ /* make sure that we don't exhaust the hash bits */
+ if (partition_bits + used_bits >= 32)
+ partition_bits = 32 - used_bits;
+
+ if (log2_npartitions != NULL)
+ *log2_npartitions = partition_bits;
+
+ /* number of partitions will be a power of two */
+ npartitions = 1L << partition_bits;
+
+ return npartitions;
+}
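
For example (a standalone sketch: the partition factor, minimum partition
count, and buffer sizes below are assumed for illustration; only
HASHAGG_MAX_PARTITIONS = 256 appears in this patch), 1M estimated groups at
200 bytes each against work_mem = 4MB works out to 72 partitions, rounded up
to 128:

    #include <stdio.h>
    #include <stdint.h>

    #define PARTITION_FACTOR   1.5      /* assumed stand-in for HASHAGG_PARTITION_FACTOR */
    #define MIN_PARTITIONS     4        /* assumed stand-in for HASHAGG_MIN_PARTITIONS */
    #define MAX_PARTITIONS     256      /* HASHAGG_MAX_PARTITIONS */
    #define READ_BUFFER_SIZE   8192
    #define WRITE_BUFFER_SIZE  8192

    static int
    ceil_log2(int n)                    /* behaves like my_log2() */
    {
        int bits = 0;

        while ((1 << bits) < n)
            bits++;
        return bits;
    }

    int
    main(void)
    {
        double   work_mem_bytes = 4096.0 * 1024.0;
        uint64_t input_groups = 1000000;
        double   hashentrysize = 200.0;
        int      partition_limit =
            (work_mem_bytes * 0.25 - READ_BUFFER_SIZE) / WRITE_BUFFER_SIZE;
        double   mem_wanted = PARTITION_FACTOR * input_groups * hashentrysize;
        int      npartitions = 1 + (int) (mem_wanted / work_mem_bytes);

        if (npartitions > partition_limit)
            npartitions = partition_limit;
        if (npartitions < MIN_PARTITIONS)
            npartitions = MIN_PARTITIONS;
        if (npartitions > MAX_PARTITIONS)
            npartitions = MAX_PARTITIONS;

        npartitions = 1 << ceil_log2(npartitions);
        printf("npartitions = %d\n", npartitions);   /* 72, rounded up to 128 */
        return 0;
    }
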
+
+/*
+ * Find or create a hashtable entry for the tuple group containing the current
+ * tuple (already set in tmpcontext's outertuple slot), in the current grouping
+ * set (which the caller must have selected - note that initialize_aggregate
+ * depends on this).
+ *
+ * When called, CurrentMemoryContext should be the per-query context.
+ *
+ * If the hash table is at the memory limit, then only find existing hashtable
+ * entries; don't create new ones. If a tuple's group is not already present
+ * in the hash table for the current grouping set, return NULL and the caller
+ * will spill it to disk.
+ */
+static AggStatePerGroup
+lookup_hash_entry(AggState *aggstate, uint32 hash)
+{
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ TupleHashEntryData *entry;
+ bool isnew = false;
+ bool *p_isnew;
+
+ /* if hash table already spilled, don't create new entries */
+ p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;
/* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntry(perhash->hashtable, hashslot, &isnew);
+ entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, p_isnew,
+ hash);
+
+ if (entry == NULL)
+ return NULL;
if (isnew)
{
- AggStatePerGroup pergroup;
- int transno;
+ AggStatePerGroup pergroup;
+ int transno;
+
+ aggstate->hash_ngroups_current++;
+ hashagg_check_limits(aggstate);
pergroup = (AggStatePerGroup)
MemoryContextAlloc(perhash->hashtable->tablecxt,
@@ -1499,7 +1969,7 @@ lookup_hash_entry(AggState *aggstate)
}
}
- return entry;
+ return entry->additional;
}
/*
@@ -1507,18 +1977,51 @@ lookup_hash_entry(AggState *aggstate)
* returning an array of pergroup pointers suitable for advance_aggregates.
*
* Be aware that lookup_hash_entry can reset the tmpcontext.
+ *
+ * Some entries may be left NULL if we have reached the limit and have begun
+ * to spill. The same tuple will belong to a different group for each set, so
+ * it may match a group already in memory for one set and a group not in
+ * memory for another set. If we have begun to spill and a tuple doesn't match
+ * a group in memory for a particular set, it will be spilled.
+ *
+ * NB: It's possible to spill the same tuple for several different grouping
+ * sets. This may seem wasteful, but it's actually a trade-off: if we spill
+ * the tuple multiple times for multiple grouping sets, it can be partitioned
+ * for each grouping set, making the refilling of the hash table very
+ * efficient.
*/
static void
lookup_hash_entries(AggState *aggstate)
{
- int numHashes = aggstate->num_hashes;
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
int setno;
- for (setno = 0; setno < numHashes; setno++)
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
{
+ uint32 hash;
+
select_current_set(aggstate, setno, true);
- pergroup[setno] = lookup_hash_entry(aggstate)->additional;
+ prepare_hash_slot(aggstate);
+ hash = calculate_hash(aggstate);
+ pergroup[setno] = lookup_hash_entry(aggstate, hash);
+
+ /* check to see if we need to spill the tuple for this grouping set */
+ if (pergroup[setno] == NULL)
+ {
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ TupleTableSlot *slot = aggstate->tmpcontext->ecxt_outertuple;
+
+ if (spill->partitions == NULL)
+ hashagg_spill_init(spill, aggstate->hash_tapeinfo, 0,
+ perhash->aggnode->numGroups,
+ aggstate->hashentrysize);
+
+ hashagg_spill_tuple(spill, slot, hash);
+
+ aggstate->hash_disk_used = LogicalTapeSetBlocks(
+ aggstate->hash_tapeinfo->tapeset) * (BLCKSZ / 1024);
+ }
}
}
@@ -1841,6 +2344,12 @@ agg_retrieve_direct(AggState *aggstate)
if (TupIsNull(outerslot))
{
/* no more outer-plan tuples available */
+
+ /* if we built hash tables, finalize any spills */
+ if (aggstate->aggstrategy == AGG_MIXED &&
+ aggstate->current_phase == 1)
+ hashagg_finish_initial_spills(aggstate);
+
if (hasGroupingSets)
{
aggstate->input_done = true;
@@ -1943,6 +2452,9 @@ agg_fill_hash_table(AggState *aggstate)
ResetExprContext(aggstate->tmpcontext);
}
+ /* finalize spills, if any */
+ hashagg_finish_initial_spills(aggstate);
+
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
@@ -1950,11 +2462,191 @@ agg_fill_hash_table(AggState *aggstate)
&aggstate->perhash[0].hashiter);
}
+/*
+ * If any data was spilled during hash aggregation, reset the hash table and
+ * reprocess one batch of spilled data. After reprocessing a batch, the hash
+ * table will again contain data, ready to be consumed by
+ * agg_retrieve_hash_table_in_memory().
+ *
+ * Should only be called after all in-memory hash table entries have been
+ * consumed.
+ *
+ * Return false when input is exhausted and there's no more work to be done;
+ * otherwise return true.
+ */
+static bool
+agg_refill_hash_table(AggState *aggstate)
+{
+ HashAggBatch *batch;
+ HashAggSpill spill;
+ HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
+ long nbuckets;
+
+ if (aggstate->hash_batches == NIL)
+ return false;
+
+ spill.npartitions = 0;
+ spill.partitions = NULL;
+ /*
+ * Each spill file contains spilled data for only a single grouping
+ * set. We want to ignore all others, which is done by setting the other
+ * pergroups to NULL.
+ */
+ memset(aggstate->all_pergroups, 0,
+ sizeof(AggStatePerGroup) *
+ (aggstate->maxsets + aggstate->num_hashes));
+
+ batch = linitial(aggstate->hash_batches);
+ aggstate->hash_batches = list_delete_first(aggstate->hash_batches);
+
+ hashagg_set_limits(aggstate, batch->input_tuples, batch->used_bits);
+
+ /*
+ * Free memory and rebuild a single hash table for this batch's grouping
+ * set. Estimate the number of groups to be the number of input tuples in
+ * this batch.
+ */
+ ReScanExprContext(aggstate->hashcontext);
+
+ nbuckets = hash_choose_num_buckets(
+ aggstate, batch->input_tuples, aggstate->hash_mem_limit);
+ build_hash_table(aggstate, batch->setno, nbuckets);
+ aggstate->hash_alloc_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_alloc_last = aggstate->hash_alloc_current;
+ aggstate->hash_ngroups_current = 0;
+
+ Assert(aggstate->current_phase == 0);
+
+ if (aggstate->phase->aggstrategy == AGG_MIXED)
+ {
+ aggstate->current_phase = 1;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+ }
+
+ /*
+ * The first pass (agg_fill_hash_table) reads whatever kind of slot comes
+ * from the outer plan, and considers the slot fixed. But spilled tuples
+ * are always MinimalTuples, so if that's different from the outer plan we
+ * need to change it and recompile the aggregate expressions.
+ */
+ if (aggstate->ss.ps.outerops != &TTSOpsMinimalTuple)
+ {
+ aggstate->ss.ps.outerops = &TTSOpsMinimalTuple;
+ hashagg_recompile_expressions(aggstate);
+ }
+
+ LogicalTapeRewindForRead(tapeinfo->tapeset, batch->input_tapenum,
+ HASHAGG_READ_BUFFER_SIZE);
+ for (;;)
+ {
+ TupleTableSlot *slot = aggstate->hash_spill_slot;
+ MinimalTuple tuple;
+ uint32 hash;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tuple = hashagg_batch_read(batch, &hash);
+ if (tuple == NULL)
+ break;
+
+ ExecStoreMinimalTuple(tuple, slot, true);
+ aggstate->tmpcontext->ecxt_outertuple = slot;
+
+ select_current_set(aggstate, batch->setno, true);
+ prepare_hash_slot(aggstate);
+ aggstate->hash_pergroup[batch->setno] = lookup_hash_entry(aggstate, hash);
+
+ /* if there's no memory for a new group, spill */
+ if (aggstate->hash_pergroup[batch->setno] == NULL)
+ {
+ /*
+ * Estimate the number of groups for this batch as the total
+ * number of tuples in its input file. Although that's a worst
+ * case, it's not bad here for two reasons: (1) overestimating
+ * is better than underestimating; and (2) we've already
+ * scanned the relation once, so it's likely that we've
+ * already finalized many of the common values.
+ */
+ if (spill.partitions == NULL)
+ hashagg_spill_init(&spill, tapeinfo, batch->used_bits,
+ batch->input_tuples,
+ aggstate->hashentrysize);
+
+ hashagg_spill_tuple(&spill, slot, hash);
+
+ aggstate->hash_disk_used = LogicalTapeSetBlocks(
+ aggstate->hash_tapeinfo->tapeset) * (BLCKSZ / 1024);
+ }
+
+ /* Advance the aggregates (or combine functions) */
+ advance_aggregates(aggstate);
+
+ /*
+ * Reset per-input-tuple context after each tuple, but note that the
+ * hash lookups do this too
+ */
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ hashagg_tapeinfo_release(tapeinfo, batch->input_tapenum);
+
+ aggstate->current_phase = 0;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+
+ /* update hashentrysize estimate based on contents */
+ if (aggstate->hash_ngroups_current > 0)
+ {
+ aggstate->hashentrysize = (double)aggstate->hash_alloc_last /
+ (double)aggstate->hash_ngroups_current;
+ }
+
+ hashagg_spill_finish(aggstate, &spill, batch->setno);
+ aggstate->hash_spill_mode = false;
+
+ pfree(batch);
+
+ /* Initialize to walk the first hash table */
+ select_current_set(aggstate, 0, true);
+ ResetTupleHashIterator(aggstate->perhash[0].hashtable,
+ &aggstate->perhash[0].hashiter);
+
+ return true;
+}
+
/*
* ExecAgg for hashed case: retrieving groups from hash table
+ *
+ * After exhausting in-memory tuples, also try refilling the hash table using
+ * previously-spilled tuples. Only returns NULL after all in-memory and
+ * spilled tuples are exhausted.
*/
static TupleTableSlot *
agg_retrieve_hash_table(AggState *aggstate)
+{
+ TupleTableSlot *result = NULL;
+
+ while (result == NULL)
+ {
+ result = agg_retrieve_hash_table_in_memory(aggstate);
+ if (result == NULL)
+ {
+ if (!agg_refill_hash_table(aggstate))
+ {
+ aggstate->agg_done = true;
+ break;
+ }
+ }
+ }
+
+ return result;
+}
+
+/*
+ * Retrieve the groups from the in-memory hash tables without considering any
+ * spilled tuples.
+ */
+static TupleTableSlot *
+agg_retrieve_hash_table_in_memory(AggState *aggstate)
{
ExprContext *econtext;
AggStatePerAgg peragg;
@@ -1983,7 +2675,7 @@ agg_retrieve_hash_table(AggState *aggstate)
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
- while (!aggstate->agg_done)
+ for (;;)
{
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -2014,8 +2706,6 @@ agg_retrieve_hash_table(AggState *aggstate)
}
else
{
- /* No more hashtables, so done */
- aggstate->agg_done = true;
return NULL;
}
}
@@ -2072,6 +2762,296 @@ agg_retrieve_hash_table(AggState *aggstate)
return NULL;
}
+/*
+ * Assign unused tapes to spill partitions, extending the tape set if
+ * necessary.
+ */
+static void
+hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *partitions,
+ int npartitions)
+{
+ int partidx = 0;
+
+ /* use free tapes if available */
+ while (partidx < npartitions && tapeinfo->nfreetapes > 0)
+ partitions[partidx++] = tapeinfo->freetapes[--tapeinfo->nfreetapes];
+
+ if (tapeinfo->tapeset == NULL)
+ tapeinfo->tapeset = LogicalTapeSetCreate(npartitions, NULL, NULL, -1);
+ else if (partidx < npartitions)
+ {
+ tapeinfo->tapeset = LogicalTapeSetExtend(
+ tapeinfo->tapeset, npartitions - partidx);
+ }
+
+ while (partidx < npartitions)
+ partitions[partidx++] = tapeinfo->ntapes++;
+}
+
+/*
+ * After a tape has already been written to and then read, this function
+ * rewinds it for writing and adds it to the free list.
+ */
+static void
+hashagg_tapeinfo_release(HashTapeInfo *tapeinfo, int tapenum)
+{
+ LogicalTapeRewindForWrite(tapeinfo->tapeset, tapenum);
+ if (tapeinfo->freetapes == NULL)
+ tapeinfo->freetapes = palloc(sizeof(int));
+ else
+ tapeinfo->freetapes = repalloc(
+ tapeinfo->freetapes, sizeof(int) * (tapeinfo->nfreetapes + 1));
+ tapeinfo->freetapes[tapeinfo->nfreetapes++] = tapenum;
+}
+
+/*
+ * hashagg_spill_init
+ *
+ * Called after we determined that spilling is necessary. Chooses the number
+ * of partitions to create, and initializes them.
+ */
+static void
+hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo, int used_bits,
+ uint64 input_groups, double hashentrysize)
+{
+ int npartitions;
+ int partition_bits;
+
+ npartitions = hash_choose_num_partitions(
+ input_groups, hashentrysize, used_bits, &partition_bits);
+
+ spill->partitions = palloc0(sizeof(int) * npartitions);
+ spill->ntuples = palloc0(sizeof(int64) * npartitions);
+
+ hashagg_tapeinfo_assign(tapeinfo, spill->partitions, npartitions);
+
+ spill->tapeinfo = tapeinfo;
+ spill->shift = 32 - used_bits - partition_bits;
+ spill->mask = (npartitions - 1) << spill->shift;
+ spill->npartitions = npartitions;
+}
+
+/*
+ * hashagg_spill_tuple
+ *
+ * No room for new groups in the hash table. Save the tuple for later
+ * processing in the appropriate spill partition.
+ */
+static Size
+hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot, uint32 hash)
+{
+ LogicalTapeSet *tapeset = spill->tapeinfo->tapeset;
+ int partition;
+ MinimalTuple tuple;
+ int tapenum;
+ int total_written = 0;
+ bool shouldFree;
+
+ Assert(spill->partitions != NULL);
+
+ /* may contain unnecessary attributes, consider projecting? */
+ tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+ partition = (hash & spill->mask) >> spill->shift;
+ spill->ntuples[partition]++;
+
+ tapenum = spill->partitions[partition];
+
+ LogicalTapeWrite(tapeset, tapenum, (void *) &hash, sizeof(uint32));
+ total_written += sizeof(uint32);
+
+ LogicalTapeWrite(tapeset, tapenum, (void *) tuple, tuple->t_len);
+ total_written += tuple->t_len;
+
+ if (shouldFree)
+ pfree(tuple);
+
+ return total_written;
+}
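
The mask/shift arithmetic above just carves the next partition_bits out of
the 32-bit hash, below whatever bits an earlier spill level already used. A
standalone sketch with illustrative values (4 bits already used, 32
partitions):

    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
        uint32_t hash = 0xDEADBEEF;
        int      used_bits = 4;          /* bits consumed by the previous spill level */
        int      partition_bits = 5;     /* log2 of 32 partitions */
        int      shift = 32 - used_bits - partition_bits;
        uint32_t mask = ((1u << partition_bits) - 1) << shift;
        uint32_t partition = (hash & mask) >> shift;

        /* skips the top 4 bits of the hash and takes the next 5 */
        printf("partition = %u of %u\n", partition, 1u << partition_bits);   /* 29 of 32 */
        return 0;
    }

Because each recursion level consumes a disjoint slice of hash bits, groups
that spill repeatedly keep getting subdivided into smaller partitions.
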
+
+/*
+ * hashagg_batch_read
+ * Read the next tuple from a batch file. Return NULL if no more.
+ */
+static MinimalTuple
+hashagg_batch_read(HashAggBatch *batch, uint32 *hashp)
+{
+ LogicalTapeSet *tapeset = batch->tapeinfo->tapeset;
+ int tapenum = batch->input_tapenum;
+ MinimalTuple tuple;
+ uint32 t_len;
+ size_t nread;
+ uint32 hash;
+
+ nread = LogicalTapeRead(tapeset, tapenum, &hash, sizeof(uint32));
+ if (nread == 0)
+ return NULL;
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, sizeof(uint32), nread)));
+ if (hashp != NULL)
+ *hashp = hash;
+
+ nread = LogicalTapeRead(tapeset, tapenum, &t_len, sizeof(t_len));
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, sizeof(uint32), nread)));
+
+ tuple = (MinimalTuple) palloc(t_len);
+ tuple->t_len = t_len;
+
+ nread = LogicalTapeRead(tapeset, tapenum,
+ (void *)((char *)tuple + sizeof(uint32)),
+ t_len - sizeof(uint32));
+ if (nread != t_len - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, t_len - sizeof(uint32), nread)));
+
+ return tuple;
+}
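
The on-tape record format is simply the 32-bit hash followed by the
MinimalTuple, whose own first field is its total length. That is why the
reader above re-reads t_len first and then copies only the remaining
t_len - sizeof(uint32) bytes into the freshly allocated tuple. A standalone
sketch of the same framing against an in-memory buffer (FakeTuple is a
simplified stand-in, not the real MinimalTuple layout):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct FakeTuple
    {
        uint32_t t_len;                  /* total length, including this field */
        char     data[12];
    } FakeTuple;

    int
    main(void)
    {
        FakeTuple in = {sizeof(FakeTuple), "hello spill"};
        FakeTuple out;
        uint32_t  hash = 0xC0FFEE42;
        uint32_t  rhash;
        uint32_t  t_len;
        char      tape[64];

        /* write side: the hash, then the whole tuple (t_len bytes) */
        memcpy(tape, &hash, sizeof(uint32_t));
        memcpy(tape + sizeof(uint32_t), &in, in.t_len);

        /* read side: the hash, then t_len, then the rest of the tuple */
        memcpy(&rhash, tape, sizeof(uint32_t));
        memcpy(&t_len, tape + sizeof(uint32_t), sizeof(uint32_t));
        out.t_len = t_len;
        memcpy((char *) &out + sizeof(uint32_t),
               tape + 2 * sizeof(uint32_t),
               t_len - sizeof(uint32_t));

        printf("hash=%x len=%u data=%s\n",
               (unsigned) rhash, (unsigned) out.t_len, out.data);
        return 0;
    }
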
+
+/*
+ * hashagg_batch_new
+ *
+ * Construct a HashAggBatch item, which represents one iteration of HashAgg to
+ * be done. Should be called in the aggregate's memory context.
+ */
+static HashAggBatch *
+hashagg_batch_new(HashTapeInfo *tapeinfo, int tapenum, int setno,
+ int64 input_tuples, int used_bits)
+{
+ HashAggBatch *batch = palloc0(sizeof(HashAggBatch));
+
+ batch->setno = setno;
+ batch->used_bits = used_bits;
+ batch->tapeinfo = tapeinfo;
+ batch->input_tapenum = tapenum;
+ batch->input_tuples = input_tuples;
+
+ return batch;
+}
+
+/*
+ * hashagg_finish_initial_spills
+ *
+ * After the initial pass over the input is complete, some tuples may have
+ * been spilled to disk. If so, turn the spilled partitions into new batches
+ * that must be processed later.
+ */
+static void
+hashagg_finish_initial_spills(AggState *aggstate)
+{
+ int setno;
+
+ if (aggstate->hash_spills == NULL)
+ return;
+
+ /* update hashentrysize estimate based on contents */
+ Assert(aggstate->hash_ngroups_current > 0);
+ aggstate->hashentrysize = (double)aggstate->hash_alloc_last /
+ (double)aggstate->hash_ngroups_current;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hashagg_spill_finish(aggstate, &aggstate->hash_spills[setno], setno);
+
+ aggstate->hash_spill_mode = false;
+
+ /*
+ * We're not processing tuples from the outer plan any more; only processing
+ * batches of spilled tuples. The initial spill structures are no longer
+ * needed.
+ */
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+}
+
+/*
+ * hashagg_spill_finish
+ *
+ * Transform spill partitions into new batches.
+ */
+static void
+hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno)
+{
+ int i;
+ int used_bits = 32 - spill->shift;
+
+ if (spill->npartitions == 0)
+ return; /* didn't spill */
+
+ for (i = 0; i < spill->npartitions; i++)
+ {
+ int tapenum = spill->partitions[i];
+ MemoryContext oldContext;
+ HashAggBatch *new_batch;
+
+ oldContext = MemoryContextSwitchTo(aggstate->ss.ps.state->es_query_cxt);
+ new_batch = hashagg_batch_new(aggstate->hash_tapeinfo,
+ tapenum, setno, spill->ntuples[i],
+ used_bits);
+ aggstate->hash_batches = lcons(new_batch, aggstate->hash_batches);
+ aggstate->hash_batches_used++;
+ MemoryContextSwitchTo(oldContext);
+ }
+
+ pfree(spill->ntuples);
+ pfree(spill->partitions);
+}
+
+/*
+ * Free resources related to a spilled HashAgg.
+ */
+static void
+hashagg_reset_spill_state(AggState *aggstate)
+{
+ ListCell *lc;
+
+ /* free spills from initial pass */
+ if (aggstate->hash_spills != NULL)
+ {
+ int setno;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ {
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ if (spill->ntuples != NULL)
+ pfree(spill->ntuples);
+ if (spill->partitions != NULL)
+ pfree(spill->partitions);
+ }
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+ }
+
+ /* free batches */
+ foreach(lc, aggstate->hash_batches)
+ {
+ HashAggBatch *batch = (HashAggBatch*) lfirst(lc);
+ pfree(batch);
+ }
+ list_free(aggstate->hash_batches);
+ aggstate->hash_batches = NIL;
+
+ /* close tape set */
+ if (aggstate->hash_tapeinfo != NULL)
+ {
+ HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
+ if (tapeinfo->tapeset != NULL)
+ LogicalTapeSetClose(tapeinfo->tapeset);
+ if (tapeinfo->freetapes != NULL)
+ pfree(tapeinfo->freetapes);
+ pfree(tapeinfo);
+ aggstate->hash_tapeinfo = NULL;
+ }
+}
+
+
/* -----------------
* ExecInitAgg
*
@@ -2256,6 +3236,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->ss.ps.outeropsfixed = false;
}
+ if (use_hashing)
+ aggstate->hash_spill_slot = ExecInitExtraTupleSlot(estate, scanDesc,
+ &TTSOpsMinimalTuple);
+
/*
* Initialize result type, slot and projection.
*/
@@ -2481,11 +3465,22 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
*/
if (use_hashing)
{
+ Plan *outerplan = outerPlan(node);
+ long totalGroups = 0;
+ int i;
+
/* this is an array of pointers, not structures */
aggstate->hash_pergroup = pergroups;
+ aggstate->hashentrysize = hash_agg_entry_size(
+ aggstate->numtrans, outerplan->plan_width, node->transitionSpace);
+
+ for (i = 0; i < aggstate->num_hashes; i++)
+ totalGroups += aggstate->perhash[i].aggnode->numGroups;
+
+ hashagg_set_limits(aggstate, totalGroups, 0);
find_hash_columns(aggstate);
- build_hash_table(aggstate);
+ build_hash_tables(aggstate);
aggstate->table_filled = false;
}
@@ -2891,7 +3886,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
else
Assert(false);
- phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash);
+ phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash, false);
}
@@ -3386,6 +4381,8 @@ ExecEndAgg(AggState *node)
if (node->sort_out)
tuplesort_end(node->sort_out);
+ hashagg_reset_spill_state(node);
+
for (transno = 0; transno < node->numtrans; transno++)
{
AggStatePerTrans pertrans = &node->pertrans[transno];
@@ -3441,12 +4438,13 @@ ExecReScanAgg(AggState *node)
return;
/*
- * If we do have the hash table, and the subplan does not have any
- * parameter changes, and none of our own parameter changes affect
- * input expressions of the aggregated functions, then we can just
- * rescan the existing hash table; no need to build it again.
+ * If we do have the hash table, and it never spilled, and the subplan
+ * does not have any parameter changes, and none of our own parameter
+ * changes affect input expressions of the aggregated functions, then
+ * we can just rescan the existing hash table; no need to build it
+ * again.
*/
- if (outerPlan->chgParam == NULL &&
+ if (outerPlan->chgParam == NULL && !node->hash_ever_spilled &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
ResetTupleHashIterator(node->perhash[0].hashtable,
@@ -3503,11 +4501,29 @@ ExecReScanAgg(AggState *node)
*/
if (node->aggstrategy == AGG_HASHED || node->aggstrategy == AGG_MIXED)
{
+ hashagg_reset_spill_state(node);
+
+ node->hash_ever_spilled = false;
+ node->hash_spill_mode = false;
+ node->hash_alloc_last = 0;
+ node->hash_alloc_current = 0;
+ node->hash_ngroups_current = 0;
+
+ /* reset stats */
+ node->hash_mem_peak = 0;
+ node->hash_disk_used = 0;
+ node->hash_batches_used = 0;
+
ReScanExprContext(node->hashcontext);
/* Rebuild an empty hash table */
- build_hash_table(node);
+ build_hash_tables(node);
node->table_filled = false;
/* iterator will be reset when the table is filled */
+
+ node->ss.ps.outerops =
+ ExecGetResultSlotOps(outerPlanState(&node->ss),
+ &node->ss.ps.outeropsfixed);
+ hashagg_recompile_expressions(node);
}
if (node->aggstrategy != AGG_HASHED)
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 21a5ca4b404..6fa555ada88 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -2082,6 +2082,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_INIT_TRANS:
+ case EEOP_AGG_INIT_TRANS_SPILLED:
{
AggState *aggstate;
AggStatePerTrans pertrans;
@@ -2092,6 +2093,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_allpergroupsp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_setoff,
v_transno;
@@ -2119,11 +2121,32 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_init_trans.setoff);
v_transno = l_int32_const(op->d.agg_init_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_notransvalue = l_bb_before_v(
+ opblocks[i + 1], "op.%d.check_notransvalue", i);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(
+ b, v_pergroup_allaggs, TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[i + 1],
+ b_check_notransvalue);
+
+ LLVMPositionBuilderAtEnd(b, b_check_notransvalue);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_notransvalue =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_NOTRANSVALUE,
@@ -2180,6 +2203,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_STRICT_TRANS_CHECK:
+ case EEOP_AGG_STRICT_TRANS_CHECK_SPILLED:
{
AggState *aggstate;
LLVMValueRef v_setoff,
@@ -2190,6 +2214,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_transnull;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
int jumpnull = op->d.agg_strict_trans_check.jumpnull;
@@ -2209,11 +2234,32 @@ llvm_compile_expr(ExprState *state)
l_int32_const(op->d.agg_strict_trans_check.setoff);
v_transno =
l_int32_const(op->d.agg_strict_trans_check.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_transnull = l_bb_before_v(
+ opblocks[i + 1], "op.%d.check_transnull", i);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[jumpnull],
+ b_check_transnull);
+
+ LLVMPositionBuilderAtEnd(b, b_check_transnull);
+ }
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_transnull =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_TRANSVALUEISNULL,
@@ -2229,7 +2275,9 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_PLAIN_TRANS_BYVAL:
+ case EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED:
case EEOP_AGG_PLAIN_TRANS:
+ case EEOP_AGG_PLAIN_TRANS_SPILLED:
{
AggState *aggstate;
AggStatePerTrans pertrans;
@@ -2255,6 +2303,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_pertransp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_retval;
@@ -2282,10 +2331,33 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_trans.setoff);
v_transno = l_int32_const(op->d.agg_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED ||
+ opcode == EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_advance_transval = l_bb_before_v(
+ opblocks[i + 1], "op.%d.advance_transval", i);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[i + 1],
+ b_advance_transval);
+
+ LLVMPositionBuilderAtEnd(b, b_advance_transval);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_fcinfo = l_ptr_const(fcinfo,
l_ptr(StructFunctionCallInfoData));
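
In plain C terms, the extra branch generated for the *_SPILLED opcodes simply
skips the transition work when the per-group pointer for the current grouping
set is NULL, i.e. when the tuple was spilled for that set. A stripped-down
standalone sketch of that behavior (the types and function are simplified
stand-ins, not the real executor structures):

    #include <stdio.h>
    #include <stddef.h>

    typedef struct PerGroup
    {
        long transValue;
        int  noTransValue;
    } PerGroup;

    static void
    advance_one(PerGroup *pergroup, long input)
    {
        if (pergroup == NULL)
            return;                        /* tuple was spilled for this set: skip */
        if (pergroup->noTransValue)
        {
            pergroup->transValue = input;  /* the "init trans" step */
            pergroup->noTransValue = 0;
            return;
        }
        pergroup->transValue += input;     /* the "plain trans" step */
    }

    int
    main(void)
    {
        PerGroup g = {0, 1};

        advance_one(&g, 5);                /* initializes the transition value */
        advance_one(&g, 7);                /* accumulates */
        advance_one(NULL, 9);              /* group not in memory: no-op */
        printf("transValue = %ld\n", g.transValue);   /* prints 12 */
        return 0;
    }

Keeping the non-spilled opcodes unchanged means the common in-memory case
pays nothing for this check.
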
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b5a0033721f..9575469800b 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -77,6 +77,7 @@
#include "access/htup_details.h"
#include "access/tsmapi.h"
#include "executor/executor.h"
+#include "executor/nodeAgg.h"
#include "executor/nodeHash.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -128,6 +129,7 @@ bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
bool enable_hashagg = true;
+bool enable_hashagg_spill = true;
bool enable_nestloop = true;
bool enable_material = true;
bool enable_mergejoin = true;
@@ -2153,7 +2155,7 @@ cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples)
+ double input_tuples, double input_width)
{
double output_tuples;
Cost startup_cost;
@@ -2219,20 +2221,69 @@ cost_agg(Path *path, PlannerInfo *root,
total_cost += aggcosts->finalCost.per_tuple * numGroups;
total_cost += cpu_tuple_cost * numGroups;
output_tuples = numGroups;
+
+ /*
+ * We don't need to compute the disk costs of hash aggregation here,
+ * because the planner does not choose hash aggregation for grouping
+ * sets that it doesn't expect to fit in memory.
+ */
}
else
{
+ double hashentrysize = hash_agg_entry_size(
+ aggcosts->numAggs, input_width, aggcosts->transitionSpace);
+ double nbatches =
+ (numGroups * hashentrysize) / (work_mem * 1024L);
+ double pages_written = 0.0;
+ double pages_read = 0.0;
+
/* must be AGG_HASHED */
startup_cost = input_total_cost;
if (!enable_hashagg)
startup_cost += disable_cost;
startup_cost += aggcosts->transCost.startup;
startup_cost += aggcosts->transCost.per_tuple * input_tuples;
+ /* cost of computing hash value */
startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;
startup_cost += aggcosts->finalCost.startup;
+
+ /*
+ * Add the disk costs of hash aggregation that spills to disk.
+ *
+ * Groups that go into the hash table stay in memory until finalized,
+ * so spilling and reprocessing tuples doesn't incur additional
+ * invocations of transCost or finalCost. Furthermore, the computed
+ * hash value is stored with the spilled tuples, so we don't incur
+ * extra invocations of the hash function.
+ *
+ * The disk cost depends on the depth of recursion: each additional level
+ * requires one more write and one more read of every input tuple. Writes
+ * are random and reads are sequential, so we assume half of the page
+ * accesses are random and half are sequential.
+ *
+ * Hash Agg begins returning tuples after the first batch is
+ * complete. Accrue writes (spilled tuples) to startup_cost and reads
+ * only to total_cost. This is not perfect; it penalizes startup_cost
+ * in the case of recursive spills. Also, transCost is counted entirely
+ * in startup_cost, though some of it could arguably be charged to
+ * total_cost instead.
+ */
+ if (!hashagg_mem_overflow && nbatches > 1.0)
+ {
+ double depth;
+ double pages;
+
+ pages = relation_byte_size(input_tuples, input_width) / BLCKSZ;
+ depth = ceil( log(nbatches - 1) / log(HASHAGG_MAX_PARTITIONS) );
+ pages_written = pages_read = pages * depth;
+ startup_cost += pages_written * random_page_cost;
+ }
+
total_cost = startup_cost;
total_cost += aggcosts->finalCost.per_tuple * numGroups;
+ /* cost of retrieving from hash table */
total_cost += cpu_tuple_cost * numGroups;
+ total_cost += pages_read * seq_page_cost;
output_tuples = numGroups;
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e048d200bb4..090919e39a0 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1644,6 +1644,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
NIL,
NIL,
best_path->path.rows,
+ 0,
subplan);
}
else
@@ -2096,6 +2097,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
NIL,
NIL,
best_path->numGroups,
+ best_path->transitionSpace,
subplan);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -2257,6 +2259,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
NIL,
rollup->numGroups,
+ best_path->transitionSpace,
sort_plan);
/*
@@ -2295,6 +2298,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
chain,
rollup->numGroups,
+ best_path->transitionSpace,
subplan);
/* Copy cost data from Path to Plan */
@@ -6192,8 +6196,8 @@ Agg *
make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree)
+ List *groupingSets, List *chain, double dNumGroups,
+ int32 transitionSpace, Plan *lefttree)
{
Agg *node = makeNode(Agg);
Plan *plan = &node->plan;
@@ -6209,6 +6213,7 @@ make_agg(List *tlist, List *qual,
node->grpOperators = grpOperators;
node->grpCollations = grpCollations;
node->numGroups = numGroups;
+ node->transitionSpace = transitionSpace;
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
node->groupingSets = groupingSets;
node->chain = chain;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index d6f21535937..913ad9335e5 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4867,13 +4867,8 @@ create_distinct_paths(PlannerInfo *root,
allow_hash = false; /* policy-based decision not to hash */
else
{
- Size hashentrysize;
-
- /* Estimate per-hash-entry space at tuple width... */
- hashentrysize = MAXALIGN(cheapest_input_path->pathtarget->width) +
- MAXALIGN(SizeofMinimalTupleHeader);
- /* plus the per-hash-entry overhead */
- hashentrysize += hash_agg_entry_size(0);
+ Size hashentrysize = hash_agg_entry_size(
+ 0, cheapest_input_path->pathtarget->width, 0);
/* Allow hashing only if hashtable is predicted to fit in work_mem */
allow_hash = (hashentrysize * numDistinctRows <= work_mem * 1024L);
@@ -6533,7 +6528,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
* were unable to sort above, then we'd better generate a Path, so
* that we at least have one.
*/
- if (hashaggtablesize < work_mem * 1024L ||
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L ||
grouped_rel->pathlist == NIL)
{
/*
@@ -6566,7 +6562,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
agg_final_costs,
dNumGroups);
- if (hashaggtablesize < work_mem * 1024L)
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6835,7 +6832,7 @@ create_partial_grouping_paths(PlannerInfo *root,
* Tentatively produce a partial HashAgg Path, depending on if it
* looks as if the hash table will fit in work_mem.
*/
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_total_path != NULL)
{
add_path(partially_grouped_rel, (Path *)
@@ -6862,7 +6859,7 @@ create_partial_grouping_paths(PlannerInfo *root,
dNumPartialPartialGroups);
/* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_partial_path != NULL)
{
add_partial_path(partially_grouped_rel, (Path *)
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 1a23e18970d..951aed80e7a 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1072,7 +1072,7 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
numGroupCols, dNumGroups,
NIL,
input_path->startup_cost, input_path->total_cost,
- input_path->rows);
+ input_path->rows, input_path->pathtarget->width);
/*
* Now for the sorted case. Note that the input is *always* unsorted,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e6d08aede56..8ba8122ee2f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1704,7 +1704,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
NIL,
subpath->startup_cost,
subpath->total_cost,
- rel->rows);
+ rel->rows,
+ subpath->pathtarget->width);
}
if (sjinfo->semi_can_btree && sjinfo->semi_can_hash)
@@ -2949,6 +2950,7 @@ create_agg_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->aggsplit = aggsplit;
pathnode->numGroups = numGroups;
+ pathnode->transitionSpace = aggcosts ? aggcosts->transitionSpace : 0;
pathnode->groupClause = groupClause;
pathnode->qual = qual;
@@ -2957,7 +2959,7 @@ create_agg_path(PlannerInfo *root,
list_length(groupClause), numGroups,
qual,
subpath->startup_cost, subpath->total_cost,
- subpath->rows);
+ subpath->rows, subpath->pathtarget->width);
/* add tlist eval cost for each output row */
pathnode->path.startup_cost += target->cost.startup;
@@ -3036,6 +3038,7 @@ create_groupingsets_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->rollups = rollups;
pathnode->qual = having_qual;
+ pathnode->transitionSpace = agg_costs ? agg_costs->transitionSpace : 0;
Assert(rollups != NIL);
Assert(aggstrategy != AGG_PLAIN || list_length(rollups) == 1);
@@ -3067,7 +3070,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
subpath->startup_cost,
subpath->total_cost,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
is_first = false;
if (!rollup->is_hashed)
is_first_sort = false;
@@ -3090,7 +3094,8 @@ create_groupingsets_path(PlannerInfo *root,
rollup->numGroups,
having_qual,
0.0, 0.0,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
if (!rollup->is_hashed)
is_first_sort = false;
}
@@ -3115,7 +3120,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
sort_path.startup_cost,
sort_path.total_cost,
- sort_path.rows);
+ sort_path.rows,
+ subpath->pathtarget->width);
}
pathnode->path.total_cost += agg_path.total_cost;
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 7c6f0574b37..0be26fe0378 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3526,16 +3526,8 @@ double
estimate_hashagg_tablesize(Path *path, const AggClauseCosts *agg_costs,
double dNumGroups)
{
- Size hashentrysize;
-
- /* Estimate per-hash-entry space at tuple width... */
- hashentrysize = MAXALIGN(path->pathtarget->width) +
- MAXALIGN(SizeofMinimalTupleHeader);
-
- /* plus space for pass-by-ref transition values... */
- hashentrysize += agg_costs->transitionSpace;
- /* plus the per-hash-entry overhead */
- hashentrysize += hash_agg_entry_size(agg_costs->numAggs);
+ Size hashentrysize = hash_agg_entry_size(
+ agg_costs->numAggs, path->pathtarget->width, agg_costs->transitionSpace);
/*
* Note that this disregards the effect of fill-factor and growth policy
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index b1f6291b99e..daaff08ceee 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -120,6 +120,7 @@ bool enableFsync = true;
bool allowSystemTableMods = false;
int work_mem = 1024;
int maintenance_work_mem = 16384;
+bool hashagg_mem_overflow = false;
int max_parallel_maintenance_workers = 2;
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 9630866a5f9..f7cce62eac0 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1002,6 +1002,26 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_hashagg_spill", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of hashed aggregation plans that are expected to exceed work_mem."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_hashagg_spill,
+ true,
+ NULL, NULL, NULL
+ },
+ {
+ {"hashagg_mem_overflow", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables hashed aggregation to overflow work_mem at execution time."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &hashagg_mem_overflow,
+ false,
+ NULL, NULL, NULL
+ },
{
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of materialization."),
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 7112558363f..2365f5bdafb 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -226,9 +226,13 @@ typedef enum ExprEvalOp
EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
EEOP_AGG_INIT_TRANS,
+ EEOP_AGG_INIT_TRANS_SPILLED,
EEOP_AGG_STRICT_TRANS_CHECK,
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
EEOP_AGG_PLAIN_TRANS_BYVAL,
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
EEOP_AGG_PLAIN_TRANS,
+ EEOP_AGG_PLAIN_TRANS_SPILLED,
EEOP_AGG_ORDERED_TRANS_DATUM,
EEOP_AGG_ORDERED_TRANS_TUPLE,
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 6ef3e1fe069..6e7f358fa2f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -140,10 +140,15 @@ extern TupleHashTable BuildTupleHashTableExt(PlanState *parent,
extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
bool *isnew);
+extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable,
+ TupleTableSlot *slot,
+ bool *isnew, uint32 hash);
extern TupleHashEntry FindTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
ExprState *eqcomp,
FmgrInfo *hashfunctions);
+extern uint32 TupleHashTableHash(struct tuplehash_hash *tb,
+ const MinimalTuple tuple);
extern void ResetTupleHashTable(TupleHashTable hashtable);
/*
@@ -250,7 +255,7 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
- bool doSort, bool doHash);
+ bool doSort, bool doHash, bool spilled);
extern ExprState *ExecBuildGroupingEqual(TupleDesc ldesc, TupleDesc rdesc,
const TupleTableSlotOps *lops, const TupleTableSlotOps *rops,
int numCols,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 2fe82da6ff7..ae9fe05abc6 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -304,11 +304,13 @@ typedef struct AggStatePerHashData
Agg *aggnode; /* original Agg node, for numGroups etc. */
} AggStatePerHashData;
+#define HASHAGG_MAX_PARTITIONS 256
extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags);
extern void ExecEndAgg(AggState *node);
extern void ExecReScanAgg(AggState *node);
-extern Size hash_agg_entry_size(int numAggs);
+extern Size hash_agg_entry_size(int numAggs, Size tupleWidth,
+ Size transitionSpace);
#endif /* NODEAGG_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 62d64aa0a14..288764929ce 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -244,6 +244,7 @@ extern bool enableFsync;
extern PGDLLIMPORT bool allowSystemTableMods;
extern PGDLLIMPORT int work_mem;
extern PGDLLIMPORT int maintenance_work_mem;
+extern PGDLLIMPORT bool hashagg_mem_overflow;
extern PGDLLIMPORT int max_parallel_maintenance_workers;
extern int VacuumCostPageHit;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1f6f5bbc207..772d9b8d5c6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2074,13 +2074,32 @@ typedef struct AggState
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
/* these fields are used in AGG_HASHED and AGG_MIXED modes: */
bool table_filled; /* hash table filled yet? */
- int num_hashes;
+ int num_hashes; /* number of hash tables active at once */
+ double hashentrysize; /* estimate revised during execution */
+ struct HashTapeInfo *hash_tapeinfo; /* metadata for spill tapes */
+ struct HashAggSpill *hash_spills; /* HashAggSpill for each hash table,
+ exists only during first pass if spilled */
+ TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
+ bool hash_ever_spilled; /* ever spilled during this execution? */
+ bool hash_spill_mode; /* we hit a limit during the current batch
+ and we must not create new groups */
+ Size hash_alloc_last; /* previous total memory allocation */
+ Size hash_alloc_current; /* current total memory allocation */
+ Size hash_mem_limit; /* limit before spilling hash table */
+ Size hash_mem_peak; /* peak hash table memory usage */
+ long hash_ngroups_current; /* number of groups currently in
+ memory in all hash tables */
+ long hash_ngroups_limit; /* limit before spilling hash table */
+ long hash_disk_used; /* kB of disk space used */
+ int hash_batches_used; /* batches used during entire execution */
+ List *hash_batches; /* hash batches remaining to be processed */
+
AggStatePerHash perhash; /* array of per-hashtable data */
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 34
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 49
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 3d3be197e0e..be592d0fee4 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1663,6 +1663,7 @@ typedef struct AggPath
AggStrategy aggstrategy; /* basic strategy, see nodes.h */
AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
double numGroups; /* estimated number of groups in input */
+ int32 transitionSpace; /* estimated transition state size */
List *groupClause; /* a list of SortGroupClause's */
List *qual; /* quals (HAVING quals), if any */
} AggPath;
@@ -1700,6 +1701,7 @@ typedef struct GroupingSetsPath
AggStrategy aggstrategy; /* basic strategy */
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
+ int32 transitionSpace; /* estimated transition state size */
} GroupingSetsPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 32c0d87f80e..f4183e1efa5 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -813,6 +813,7 @@ typedef struct Agg
Oid *grpOperators; /* equality operators to compare with */
Oid *grpCollations;
long numGroups; /* estimated number of groups in input */
+ int32 transitionSpace; /* estimated transition state size */
Bitmapset *aggParams; /* IDs of Params used in Aggref inputs */
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
List *groupingSets; /* grouping sets to use */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index cb012ba1980..6572dc24699 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -54,6 +54,7 @@ extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
extern PGDLLIMPORT bool enable_hashagg;
+extern PGDLLIMPORT bool enable_hashagg_spill;
extern PGDLLIMPORT bool enable_nestloop;
extern PGDLLIMPORT bool enable_material;
extern PGDLLIMPORT bool enable_mergejoin;
@@ -114,7 +115,7 @@ extern void cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples);
+ double input_tuples, double input_width);
extern void cost_windowagg(Path *path, PlannerInfo *root,
List *windowFuncs, int numPartCols, int numOrderCols,
Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index eab486a6214..c7bda2b0917 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -54,8 +54,8 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree);
+ List *groupingSets, List *chain, double dNumGroups,
+ int32 transitionSpace, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
/*
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index f457b5b150f..7eeeaaa5e4a 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2357,3 +2357,124 @@ explain (costs off)
-> Seq Scan on onek
(8 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+set work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------------
+ GroupAggregate
+ Group Key: ((g % 100000))
+ -> Sort
+ Sort Key: ((g % 100000))
+ -> Function Scan on generate_series g
+(5 rows)
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+create table agg_group_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+set jit_above_cost to default;
+create table agg_group_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_group_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+-- Produce results with hash aggregation
+set enable_hashagg = true;
+set enable_sort = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 100000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+create table agg_hash_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+set jit_above_cost to default;
+create table agg_hash_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_hash_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+set enable_sort = true;
+set work_mem to default;
+-- Compare group aggregation results to hash aggregation results
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+ a | c1 | c2 | c3
+---+----+----+----
+(0 rows)
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_4 except select * from agg_group_4)
+ union all
+(select * from agg_group_4 except select * from agg_hash_4);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_group_4;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
+drop table agg_hash_4;
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index c1f802c88a7..767f60a96c7 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1633,4 +1633,127 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
| 1 | 2
(4 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+-- Produce results with hash aggregation.
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+set enable_sort = true;
+set work_mem to default;
+-- Compare results
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+ g100 | g10 | unnest | c | m
+------+-----+--------+---+---
+(0 rows)
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
-- end
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1de..11c6f50fbfa 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -148,6 +148,68 @@ SELECT count(*) FROM
4
(1 row)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+SET enable_hashagg=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------------
+ Unique
+ -> Sort
+ Sort Key: ((g % 1000))
+ -> Function Scan on generate_series g
+(4 rows)
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_hashagg=TRUE;
+-- Produce results with hash aggregation.
+SET enable_sort=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 1000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_sort=TRUE;
+SET work_mem TO DEFAULT;
+-- Compare results
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+(SELECT * FROM distinct_hash_2 EXCEPT SELECT * FROM distinct_group_2)
+ UNION ALL
+(SELECT * FROM distinct_group_2 EXCEPT SELECT * FROM distinct_hash_2);
+ text
+------
+(0 rows)
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb9057..c40bf6c16eb 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -75,6 +75,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
+ enable_hashagg_spill | on
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 3e593f2d615..a4d476c4bb3 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1032,3 +1032,119 @@ select v||'a', case when v||'a' = 'aa' then 1 else 0 end, count(*)
explain (costs off)
select 1 from tenk1
where (hundred, thousand) in (select twothousand, twothousand from onek);
+
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+set work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+
+set jit_above_cost to default;
+
+create table agg_group_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_group_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+-- Produce results with hash aggregation
+
+set enable_hashagg = true;
+set enable_sort = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+
+set jit_above_cost to default;
+
+create table agg_hash_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_hash_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare group aggregation results to hash aggregation results
+
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+
+(select * from agg_hash_4 except select * from agg_group_4)
+ union all
+(select * from agg_group_4 except select * from agg_hash_4);
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_group_4;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
+drop table agg_hash_4;
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 95ac3fb52f6..bf8bce6ed31 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -441,4 +441,103 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
from unnest(array[1,1], array['a','b']) u(i,v)
group by rollup(i, v||'a') order by 1,3;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+-- Produce results with hash aggregation.
+
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare results
+
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
+
-- end
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449e..33102744ebf 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -45,6 +45,68 @@ SELECT count(*) FROM
SELECT count(*) FROM
(SELECT DISTINCT two, four, two FROM tenk1) ss;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+SET enable_hashagg=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_hashagg=TRUE;
+
+-- Produce results with hash aggregation.
+
+SET enable_sort=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_sort=TRUE;
+
+SET work_mem TO DEFAULT;
+
+-- Compare results
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
+
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
On Mon, Feb 03, 2020 at 06:24:14PM -0800, Jeff Davis wrote:
On Mon, 2020-02-03 at 10:29 -0800, Jeff Davis wrote:
I ended up converting the freelist to a min heap.
Attached is a patch which makes three changes to better support
HashAgg:

And now I'm attaching another version of the main Hash Aggregation
patch to be applied on top of the logtape.c patch.

Not a lot of changes from the last version; mostly some cleanup and
rebasing. But it's faster now with the logtape.c changes.
Nice!
Just back from the holiday. I had the perf test with Tomas's script,
didn't notice the freelist sorting regression at that time.
The minheap looks good, have you tested the performance and aggregate
validation?
About the "Cap the size of the LogicalTapeSet's freelist" and "Don't
bother tracking free space for HashAgg at all" you mentioned in last
mail, I suppose these two options will lose the disk space saving
benefit since some blocks are not reusable then?
--
Adam Lee
On 03/02/2020 20:29, Jeff Davis wrote:
1. Use a minheap for the freelist. The original design used an array
that had to be sorted between a read (which frees a block) and a write
(which needs to sort the array to consume the lowest block number). The
comments said:

 * sorted. This is an efficient way to handle it because we expect cycles
 * of releasing many blocks followed by re-using many blocks, due to
 * the larger read buffer.

But I didn't find a case where that actually wins over a simple
minheap. With that in mind, a minheap seems closer to what one might
expect for that purpose, and more robust when the assumptions don't
hold up as well. If someone knows of a case where the re-sorting
behavior is important, please let me know.
A minheap certainly seems more natural for that. I guess re-sorting the
array would be faster in the extreme case that you free almost all of
the blocks, and then consume almost all of the blocks, but I don't think
the usage pattern is ever that extreme. Because if all the data fit in
memory, we wouldn't be spilling in the first place.
I wonder if a more advanced heap like the pairing heap or fibonacci heap
would perform better? Probably doesn't matter in practice, so better
keep it simple...
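To make the alternative concrete, here is a minimal sketch of the two
freelist operations as a binary min heap over plain block numbers. This
is illustrative only -- the function and variable names are made up, and
the 0002 logtape.c patch elsewhere in this thread has the real
ltsGetFreeBlock()/ltsReleaseBlock() implementation:

static void
minheap_push(long *heap, int *nblocks, long blocknum)
{
    int         pos = (*nblocks)++;

    /* place the new entry at the end, then sift up */
    heap[pos] = blocknum;
    while (pos > 0)
    {
        int         parent = (pos - 1) / 2;
        long        tmp;

        if (heap[parent] <= heap[pos])
            break;
        tmp = heap[parent];
        heap[parent] = heap[pos];
        heap[pos] = tmp;
        pos = parent;
    }
}

static long
minheap_pop(long *heap, int *nblocks)
{
    long        result = heap[0];   /* lowest free block number */
    int         pos = 0;

    /* caller must ensure *nblocks > 0 */
    /* move the last entry to the root, then sift down */
    heap[0] = heap[--(*nblocks)];
    for (;;)
    {
        int         left = 2 * pos + 1;
        int         right = 2 * pos + 2;
        int         smallest = pos;
        long        tmp;

        if (left < *nblocks && heap[left] < heap[smallest])
            smallest = left;
        if (right < *nblocks && heap[right] < heap[smallest])
            smallest = right;
        if (smallest == pos)
            break;
        tmp = heap[smallest];
        heap[smallest] = heap[pos];
        heap[pos] = tmp;
        pos = smallest;
    }
    return result;
}

With this shape, each release or reuse of a block is O(log n) in the
freelist size, regardless of how reads and writes interleave.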
Changing to a minheap effectively solves the problem for HashAgg,
though in theory the memory consumption of the freelist itself could
become significant (though it's only 0.1% of the free space being
tracked).
We could fairly easily spill parts of the freelist to disk, too, if
necessary. But it's probably not worth the trouble.
2. Lazily-allocate the read buffer. The write buffer was always lazily-
allocated, so this patch creates better symmetry. More importantly, it
means freshly-rewound tapes don't have any buffer allocated, so it
greatly expands the number of tapes that can be managed efficiently as
long as only a limited number are active at once.
Makes sense.
3. Allow expanding the number of tapes for an existing tape set. This
is useful for HashAgg, which doesn't know how many tapes will be needed
in advance.
I'd love to change the LogicalTape API so that you could allocate and
free tapes more freely. I wrote a patch to do that, as part of replacing
tuplesort.c's polyphase algorithm with a simpler one (see [1]), but I
never got around to committing it. Maybe the time is ripe to do that now?
[1]: /messages/by-id/420a0ec7-602c-d406-1e75-1ef7ddc58d83@iki.fi
- Heikki
On Mon, Feb 3, 2020 at 6:24 PM Jeff Davis <pgsql@j-davis.com> wrote:
And now I'm attaching another version of the main Hash Aggregation
patch to be applied on top of the logtape.c patch.
Have you tested this against tuplesort.c, particularly parallel CREATE
INDEX? It would be worth trying to measure any performance impact.
Note that most parallel CREATE INDEX tuplesorts will do a merge within
each worker, and one big merge in the leader. It's much more likely to
have multiple passes than a regular serial external sort.
Parallel CREATE INDEX is currently accidentally disabled on the master
branch. That should be fixed in the next couple of days. You can
temporarily revert 74618e77 if you want to get it back for testing
purposes today.
Have you thought about integer overflow in your heap related routines?
This isn't as unlikely as you might think. See commit 512f67c8, for
example.
Have you thought about the MaxAllocSize restriction as it concerns
lts->freeBlocks? Will that be okay when you have many more tapes than
before?
Not a lot of changes from the last version; mostly some cleanup and
rebasing. But it's faster now with the logtape.c changes.
LogicalTapeSetExtend() seems to work in a way that assumes that the
tape is frozen. It would be good to document that assumption, and
possibly enforce it by way of an assertion. The same remark applies to
any other assumptions you're making there.
--
Peter Geoghegan
On Wed, Feb 5, 2020 at 12:08 PM Peter Geoghegan <pg@bowt.ie> wrote:
Parallel CREATE INDEX is currently accidentally disabled on the master
branch. That should be fixed in the next couple of days. You can
temporarily revert 74618e77 if you want to get it back for testing
purposes today.
(Fixed -- sorry for the disruption.)
On Tue, 2020-02-04 at 18:42 +0800, Adam Lee wrote:
The minheap looks good, have you tested the performance and aggregate
validation?
Not sure exactly what you mean, but I tested the min heap with both
Sort and HashAgg and it performs well.
About the "Cap the size of the LogicalTapeSet's freelist" and "Don't
bother tracking free space for HashAgg at all" you mentioned in last
mail, I suppose these two options will lose the disk space saving
benefit since some blocks are not reusable then?
No freelist at all will, of course, leak the blocks and not reuse the
space.
A capped freelist is not bad in practice; it seems to still work as
long as the cap is reasonable. But it feels too arbitrary, and could
cause unexpected leaks when our assumptions change. I think a minheap
just makes more sense unless the freelist just becomes way too large.
Regards,
Jeff Davis
On Tue, 2020-02-04 at 15:08 -0800, Peter Geoghegan wrote:
Have you tested this against tuplesort.c, particularly parallel CREATE
INDEX? It would be worth trying to measure any performance impact.
Note that most parallel CREATE INDEX tuplesorts will do a merge within
each worker, and one big merge in the leader. It's much more likely to
have multiple passes than a regular serial external sort.
I did not observe any performance regression when creating an index in
parallel over 20M ints (random ints in random order). I tried 2
parallel workers with work_mem=4MB and also 4 parallel workers with
work_mem=256kB.
Have you thought about integer overflow in your heap related routines?
This isn't as unlikely as you might think. See commit 512f67c8, for
example.
It's dealing with blocks rather than tuples, so it's a bit less likely.
But changed it to use "unsigned long" instead.
Have you thought about the MaxAllocSize restriction as it concerns
lts->freeBlocks? Will that be okay when you have many more tapes than
before?
I added a check. If it exceeds MaxAllocSize, before trying to perform
the allocation, just leak the block rather than adding it to the
freelist. Perhaps there's a usecase for an extraordinarily-long
freelist, but it's outside the scope of this patch.
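For reference, the relevant hunk of ltsReleaseBlock() in the attached
0002 logtape.c patch now reads:

	if (lts->nFreeBlocks >= lts->freeBlocksLen)
	{
		/*
		 * If the freelist becomes very large, just return and leak this free
		 * block.
		 */
		if (lts->freeBlocksLen * 2 > MaxAllocSize)
			return;

		lts->freeBlocksLen *= 2;
		lts->freeBlocks = (long *) repalloc(lts->freeBlocks,
											lts->freeBlocksLen * sizeof(long));
	}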
LogicalTapeSetExtend() seems to work in a way that assumes that the
tape is frozen. It would be good to document that assumption, and
possibly enforce it by way of an assertion. The same remark applies to
any other assumptions you're making there.
Can you explain? I am not freezing any tapes in Hash Aggregation, so
what about LogicalTapeSetExtend() assumes the tape is frozen?
Attached new logtape.c patches.
Regards,
Jeff Davis
Attachments:
0001-Logical-Tape-Set-lazily-allocate-read-buffer.patch (text/x-patch)
From d3593ff34c83c20c75165624faf6d84803390b36 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 31 Jan 2020 16:43:41 -0800
Subject: [PATCH 1/3] Logical Tape Set: lazily allocate read buffer.
The write buffer was already lazily-allocated, so this is more
symmetric. It also means that a freshly-rewound tape (whether for
reading or writing) is not consuming memory for the buffer.
---
src/backend/utils/sort/logtape.c | 27 ++++++++++++++++++++-------
1 file changed, 20 insertions(+), 7 deletions(-)
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 42cfb1f9f98..ba6d6e1f80a 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -773,15 +773,12 @@ LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum, size_t buffer_size)
lt->buffer_size = 0;
if (lt->firstBlockNumber != -1L)
{
- lt->buffer = palloc(buffer_size);
+ /*
+ * The buffer is lazily allocated in LogicalTapeRead(), but we set the
+ * size here.
+ */
lt->buffer_size = buffer_size;
}
-
- /* Read the first block, or reset if tape is empty */
- lt->nextBlockNumber = lt->firstBlockNumber;
- lt->pos = 0;
- lt->nbytes = 0;
- ltsReadFillBuffer(lts, lt);
}
/*
@@ -830,6 +827,22 @@ LogicalTapeRead(LogicalTapeSet *lts, int tapenum,
lt = <s->tapes[tapenum];
Assert(!lt->writing);
+ if (lt->buffer == NULL)
+ {
+ /* lazily allocate buffer */
+ if (lt->firstBlockNumber != -1L)
+ {
+ Assert(lt->buffer_size > 0);
+ lt->buffer = palloc(lt->buffer_size);
+ }
+
+ /* Read the first block, or reset if tape is empty */
+ lt->nextBlockNumber = lt->firstBlockNumber;
+ lt->pos = 0;
+ lt->nbytes = 0;
+ ltsReadFillBuffer(lts, lt);
+ }
+
while (size > 0)
{
if (lt->pos >= lt->nbytes)
--
2.17.1
0002-Logical-Tape-Set-change-freelist-to-min-heap.patch (text/x-patch)
From 667335c98e3ee830b042801b29b62053325070d2 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 31 Jan 2020 16:44:40 -0800
Subject: [PATCH 2/3] Logical Tape Set: change freelist to min heap.
Previously, the freelist of blocks was tracked as an
occasionally-sorted array. A min heap is more resilient to larger
freelists or more frequent changes between reading and writing.
---
src/backend/utils/sort/logtape.c | 160 ++++++++++++++++++++-----------
1 file changed, 104 insertions(+), 56 deletions(-)
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index ba6d6e1f80a..8d934f6d44e 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -49,12 +49,8 @@
* when reading, and read multiple blocks from the same tape in one go,
* whenever the buffer becomes empty.
*
- * To support the above policy of writing to the lowest free block,
- * ltsGetFreeBlock sorts the list of free block numbers into decreasing
- * order each time it is asked for a block and the list isn't currently
- * sorted. This is an efficient way to handle it because we expect cycles
- * of releasing many blocks followed by re-using many blocks, due to
- * the larger read buffer.
+ * To support the above policy of writing to the lowest free block, the
+ * freelist is a min heap.
*
* Since all the bookkeeping and buffer memory is allocated with palloc(),
* and the underlying file(s) are made with OpenTemporaryFile, all resources
@@ -170,7 +166,7 @@ struct LogicalTapeSet
/*
* File size tracking. nBlocksWritten is the size of the underlying file,
* in BLCKSZ blocks. nBlocksAllocated is the number of blocks allocated
- * by ltsGetFreeBlock(), and it is always greater than or equal to
+ * by ltsReleaseBlock(), and it is always greater than or equal to
* nBlocksWritten. Blocks between nBlocksAllocated and nBlocksWritten are
* blocks that have been allocated for a tape, but have not been written
* to the underlying file yet. nHoleBlocks tracks the total number of
@@ -188,17 +184,11 @@ struct LogicalTapeSet
* If forgetFreeSpace is true then any freed blocks are simply forgotten
* rather than being remembered in freeBlocks[]. See notes for
* LogicalTapeSetForgetFreeSpace().
- *
- * If blocksSorted is true then the block numbers in freeBlocks are in
- * *decreasing* order, so that removing the last entry gives us the lowest
- * free block. We re-sort the blocks whenever a block is demanded; this
- * should be reasonably efficient given the expected usage pattern.
*/
bool forgetFreeSpace; /* are we remembering free blocks? */
- bool blocksSorted; /* is freeBlocks[] currently in order? */
- long *freeBlocks; /* resizable array */
- int nFreeBlocks; /* # of currently free blocks */
- int freeBlocksLen; /* current allocated length of freeBlocks[] */
+ long *freeBlocks; /* resizable array holding minheap */
+ long nFreeBlocks; /* # of currently free blocks */
+ Size freeBlocksLen; /* current allocated length of freeBlocks[] */
/* The array of logical tapes. */
int nTapes; /* # of logical tapes in set */
@@ -321,46 +311,88 @@ ltsReadFillBuffer(LogicalTapeSet *lts, LogicalTape *lt)
return (lt->nbytes > 0);
}
-/*
- * qsort comparator for sorting freeBlocks[] into decreasing order.
- */
-static int
-freeBlocks_cmp(const void *a, const void *b)
+static inline void
+swap_nodes(long *heap, unsigned long a, unsigned long b)
+{
+ unsigned long swap;
+
+ swap = heap[a];
+ heap[a] = heap[b];
+ heap[b] = swap;
+}
+
+static inline unsigned long
+left_offset(unsigned long i)
+{
+ return 2 * i + 1;
+}
+
+static inline unsigned long
+right_offset(unsigned long i)
+{
+ return 2 * i + 2;
+}
+
+static inline unsigned long
+parent_offset(unsigned long i)
{
- long ablk = *((const long *) a);
- long bblk = *((const long *) b);
-
- /* can't just subtract because long might be wider than int */
- if (ablk < bblk)
- return 1;
- if (ablk > bblk)
- return -1;
- return 0;
+ return (i - 1) / 2;
}
/*
- * Select a currently unused block for writing to.
+ * Select the lowest currently unused block by taking the first element from
+ * the freelist min heap.
*/
static long
ltsGetFreeBlock(LogicalTapeSet *lts)
{
- /*
- * If there are multiple free blocks, we select the one appearing last in
- * freeBlocks[] (after sorting the array if needed). If there are none,
- * assign the next block at the end of the file.
- */
- if (lts->nFreeBlocks > 0)
+ long *heap = lts->freeBlocks;
+ long blocknum;
+ int heapsize;
+ unsigned long pos;
+
+ /* freelist empty; allocate a new block */
+ if (lts->nFreeBlocks == 0)
+ return lts->nBlocksAllocated++;
+
+ if (lts->nFreeBlocks == 1)
{
- if (!lts->blocksSorted)
- {
- qsort((void *) lts->freeBlocks, lts->nFreeBlocks,
- sizeof(long), freeBlocks_cmp);
- lts->blocksSorted = true;
- }
- return lts->freeBlocks[--lts->nFreeBlocks];
+ lts->nFreeBlocks--;
+ return lts->freeBlocks[0];
}
- else
- return lts->nBlocksAllocated++;
+
+ /* take top of minheap */
+ blocknum = heap[0];
+
+ /* replace with end of minheap array */
+ heap[0] = heap[--lts->nFreeBlocks];
+
+ /* sift down */
+ pos = 0;
+ heapsize = lts->nFreeBlocks;
+ while (true)
+ {
+ unsigned long left = left_offset(pos);
+ unsigned long right = right_offset(pos);
+ unsigned long min_child;
+
+ if (left < heapsize && right < heapsize)
+ min_child = (heap[left] < heap[right]) ? left : right;
+ else if (left < heapsize)
+ min_child = left;
+ else if (right < heapsize)
+ min_child = right;
+ else
+ break;
+
+ if (heap[min_child] >= heap[pos])
+ break;
+
+ swap_nodes(heap, min_child, pos);
+ pos = min_child;
+ }
+
+ return blocknum;
}
/*
@@ -369,7 +401,8 @@ ltsGetFreeBlock(LogicalTapeSet *lts)
static void
ltsReleaseBlock(LogicalTapeSet *lts, long blocknum)
{
- int ndx;
+ long *heap;
+ unsigned long pos;
/*
* Do nothing if we're no longer interested in remembering free space.
@@ -382,19 +415,35 @@ ltsReleaseBlock(LogicalTapeSet *lts, long blocknum)
*/
if (lts->nFreeBlocks >= lts->freeBlocksLen)
{
+ /*
+ * If the freelist becomes very large, just return and leak this free
+ * block.
+ */
+ if (lts->freeBlocksLen * 2 > MaxAllocSize)
+ return;
+
lts->freeBlocksLen *= 2;
lts->freeBlocks = (long *) repalloc(lts->freeBlocks,
lts->freeBlocksLen * sizeof(long));
}
- /*
- * Add blocknum to array, and mark the array unsorted if it's no longer in
- * decreasing order.
- */
- ndx = lts->nFreeBlocks++;
- lts->freeBlocks[ndx] = blocknum;
- if (ndx > 0 && lts->freeBlocks[ndx - 1] < blocknum)
- lts->blocksSorted = false;
+ heap = lts->freeBlocks;
+ pos = lts->nFreeBlocks;
+
+ /* place entry at end of minheap array */
+ heap[pos] = blocknum;
+ lts->nFreeBlocks++;
+
+ /* sift up */
+ while (pos != 0)
+ {
+ unsigned long parent = parent_offset(pos);
+ if (heap[parent] < heap[pos])
+ break;
+
+ swap_nodes(heap, parent, pos);
+ pos = parent;
+ }
}
/*
@@ -524,7 +573,6 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
lts->nBlocksWritten = 0L;
lts->nHoleBlocks = 0L;
lts->forgetFreeSpace = false;
- lts->blocksSorted = true; /* a zero-length array is sorted ... */
lts->freeBlocksLen = 32; /* reasonable initial guess */
lts->freeBlocks = (long *) palloc(lts->freeBlocksLen * sizeof(long));
lts->nFreeBlocks = 0;
--
2.17.1
0003-Logical-Tape-Set-add-API-to-extend-with-additional-t.patch (text/x-patch)
From 6e7b9f8b246150e36e7f7df513dbfcbef9f6102d Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 31 Jan 2020 16:42:38 -0800
Subject: [PATCH 3/3] Logical Tape Set: add API to extend with additional
tapes.
---
src/backend/utils/sort/logtape.c | 71 +++++++++++++++++++++-----------
src/include/utils/logtape.h | 2 +
2 files changed, 50 insertions(+), 23 deletions(-)
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 8d934f6d44e..7556abdb4aa 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -201,6 +201,7 @@ static long ltsGetFreeBlock(LogicalTapeSet *lts);
static void ltsReleaseBlock(LogicalTapeSet *lts, long blocknum);
static void ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
SharedFileSet *fileset);
+static void ltsInitTape(LogicalTape *lt);
/*
@@ -535,6 +536,30 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
lts->nHoleBlocks = lts->nBlocksAllocated - nphysicalblocks;
}
+/*
+ * Initialize per-tape struct. Note we allocate the I/O buffer and the first
+ * block for a tape only when it is first actually written to. This avoids
+ * wasting memory space when tuplesort.c overestimates the number of tapes
+ * needed.
+ */
+static void
+ltsInitTape(LogicalTape *lt)
+{
+ lt->writing = true;
+ lt->frozen = false;
+ lt->dirty = false;
+ lt->firstBlockNumber = -1L;
+ lt->curBlockNumber = -1L;
+ lt->nextBlockNumber = -1L;
+ lt->offsetBlockNumber = 0L;
+ lt->buffer = NULL;
+ lt->buffer_size = 0;
+ /* palloc() larger than MaxAllocSize would fail */
+ lt->max_size = MaxAllocSize;
+ lt->pos = 0;
+ lt->nbytes = 0;
+}
+
/*
* Create a set of logical tapes in a temporary underlying file.
*
@@ -560,7 +585,6 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
int worker)
{
LogicalTapeSet *lts;
- LogicalTape *lt;
int i;
/*
@@ -578,29 +602,8 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
lts->nFreeBlocks = 0;
lts->nTapes = ntapes;
- /*
- * Initialize per-tape structs. Note we allocate the I/O buffer and the
- * first block for a tape only when it is first actually written to. This
- * avoids wasting memory space when tuplesort.c overestimates the number
- * of tapes needed.
- */
for (i = 0; i < ntapes; i++)
- {
- lt = <s->tapes[i];
- lt->writing = true;
- lt->frozen = false;
- lt->dirty = false;
- lt->firstBlockNumber = -1L;
- lt->curBlockNumber = -1L;
- lt->nextBlockNumber = -1L;
- lt->offsetBlockNumber = 0L;
- lt->buffer = NULL;
- lt->buffer_size = 0;
- /* palloc() larger than MaxAllocSize would fail */
- lt->max_size = MaxAllocSize;
- lt->pos = 0;
- lt->nbytes = 0;
- }
+ ltsInitTape(<s->tapes[i]);
/*
* Create temp BufFile storage as required.
@@ -1004,6 +1007,28 @@ LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum, TapeShare *share)
}
}
+/*
+ * Add additional tapes to this tape set.
+ */
+LogicalTapeSet *
+LogicalTapeSetExtend(LogicalTapeSet *lts, int nAdditional)
+{
+ int i;
+ int nTapesOrig = lts->nTapes;
+ Size newSize;
+
+ lts->nTapes += nAdditional;
+ newSize = offsetof(LogicalTapeSet, tapes) +
+ lts->nTapes * sizeof(LogicalTape);
+
+ lts = (LogicalTapeSet *) repalloc(lts, newSize);
+
+ for (i = nTapesOrig; i < lts->nTapes; i++)
+ ltsInitTape(<s->tapes[i]);
+
+ return lts;
+}
+
/*
* Backspace the tape a given number of bytes. (We also support a more
* general seek interface, see below.)
diff --git a/src/include/utils/logtape.h b/src/include/utils/logtape.h
index 695d2c00ee4..3ebe52239f8 100644
--- a/src/include/utils/logtape.h
+++ b/src/include/utils/logtape.h
@@ -67,6 +67,8 @@ extern void LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum,
extern void LogicalTapeRewindForWrite(LogicalTapeSet *lts, int tapenum);
extern void LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum,
TapeShare *share);
+extern LogicalTapeSet *LogicalTapeSetExtend(LogicalTapeSet *lts,
+ int nAdditional);
extern size_t LogicalTapeBackspace(LogicalTapeSet *lts, int tapenum,
size_t size);
extern void LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
--
2.17.1
On Wed, Feb 5, 2020 at 10:37 AM Jeff Davis <pgsql@j-davis.com> wrote:
LogicalTapeSetExtend() seems to work in a way that assumes that the
tape is frozen. It would be good to document that assumption, and
possibly enforce it by way of an assertion. The same remark applies to
any other assumptions you're making there.

Can you explain? I am not freezing any tapes in Hash Aggregation, so
what about LogicalTapeSetExtend() assumes the tape is frozen?
Sorry, I was very unclear. I meant to write just the opposite: you
assume that the tapes are *not* frozen. If you're adding a new
capability to logtape.c, it makes sense to be clear on the
requirements on tapeset state or individual tape state.
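For instance, if the assumption is that none of the existing tapes have
been frozen, that could be spelled out at the top of the new function.
Purely illustrative, borrowing field names from the 0003 patch:

	/* Hypothetical addition to LogicalTapeSetExtend() */
	for (i = 0; i < lts->nTapes; i++)
		Assert(!lts->tapes[i].frozen);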
--
Peter Geoghegan
On Tue, 2020-02-04 at 18:10 +0200, Heikki Linnakangas wrote:
I'd love to change the LogicalTape API so that you could allocate and
free tapes more freely. I wrote a patch to do that, as part of replacing
tuplesort.c's polyphase algorithm with a simpler one (see [1]), but I
never got around to committing it. Maybe the time is ripe to do that
now?
It's interesting that you wrote a patch to pause the tapes a while ago.
Did it just fall through the cracks or was there a problem with it?
Is pause/resume functionality required, or is it good enough that
rewinding a tape frees the buffer, to be lazily allocated later?
Regarding the API, I'd like to change it, but I'm running into some
performance challenges when adding a layer of indirection. If I apply
the very simple attached patch, which simply makes a separate
allocation for the tapes array, it seems to slow down sort by ~5%.
Regards,
Jeff Davis
Attachments:
tapes-indirection.patch (text/x-patch)
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 42cfb1f9f98..5a47835024e 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -202,7 +202,7 @@ struct LogicalTapeSet
/* The array of logical tapes. */
int nTapes; /* # of logical tapes in set */
- LogicalTape tapes[FLEXIBLE_ARRAY_MEMBER]; /* has nTapes nentries */
+ LogicalTape *tapes; /* has nTapes nentries */
};
static void ltsWriteBlock(LogicalTapeSet *lts, long blocknum, void *buffer);
@@ -518,8 +518,7 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
* Create top-level struct including per-tape LogicalTape structs.
*/
Assert(ntapes > 0);
- lts = (LogicalTapeSet *) palloc(offsetof(LogicalTapeSet, tapes) +
- ntapes * sizeof(LogicalTape));
+ lts = (LogicalTapeSet *) palloc(sizeof(LogicalTapeSet));
lts->nBlocksAllocated = 0L;
lts->nBlocksWritten = 0L;
lts->nHoleBlocks = 0L;
@@ -529,6 +528,7 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
lts->freeBlocks = (long *) palloc(lts->freeBlocksLen * sizeof(long));
lts->nFreeBlocks = 0;
lts->nTapes = ntapes;
+ lts->tapes = (LogicalTape *) palloc(ntapes * sizeof(LogicalTape));
/*
* Initialize per-tape structs. Note we allocate the I/O buffer and the
On Wed, 2020-02-05 at 11:56 -0800, Jeff Davis wrote:
Regarding the API, I'd like to change it, but I'm running into some
performance challenges when adding a layer of indirection. If I apply
the very simple attached patch, which simply makes a separate
allocation for the tapes array, it seems to slow down sort by ~5%.
I tried a few different approaches to allow a flexible number of tapes
without regressing normal Sort performance. I found some odd hacks, but
I can't explain why they perform better than the more obvious approach.
The LogicalTapeSetExtend() API is a natural evolution of what's already
there, so I think I'll stick with that to keep the scope of Hash
Aggregation under control.
If we improve the API later I'm happy to adapt the HashAgg work to use
it -- anything to take more code out of nodeAgg.c!
Regards,
Jeff Davis
On Fri, 2020-01-24 at 17:01 -0800, Jeff Davis wrote:
New patch attached.
Three minor independent refactoring patches:
1. Add new entry points for the tuple hash table:
TupleHashTableHash()
LookupTupleHashEntryHash()
which are useful for saving and reusing hash values to avoid
recomputing (see the caller-side sketch below).
2. Refactor hash_agg_entry_size() so that the callers don't need to do
as much work.
3. Save calculated aggcosts->transitionSpace in the Agg node for later
use, rather than discarding it.
These are helpful for the upcoming Hash Aggregation work.
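As a rough caller-side sketch of how the two new entry points are meant
to be combined -- hypothetical code, not part of the patches, and it
assumes the caller prepares the hashtable fields and runs the hash
functions in a short-lived memory context the same way
LookupTupleHashEntry() does:

	TupleHashEntry entry;
	bool		isnew;
	uint32		hash;

	/*
	 * Hypothetical: compute the hash value once, so it can be stored
	 * alongside a spilled tuple and reused later instead of being
	 * recomputed.
	 */
	hashtable->inputslot = slot;
	hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
	hash = TupleHashTableHash(hashtable->hashtab, NULL);

	/* ... the tuple and its hash value could be written to a spill file ... */

	/* later: look up (or create) the entry using the saved hash value */
	entry = LookupTupleHashEntryHash(hashtable, slot, &isnew, hash);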
Regards,
Jeff Davis
Attachments:
0001-HashAgg-TupleHashTableHash-and-LookupTupleHashEntryH.patch (text/x-patch)
From f47cdd10f04baa3b41eaf0fb8c17f41dda4d0bd4 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 3 Feb 2020 14:45:25 -0800
Subject: [PATCH 1/2] HashAgg: TupleHashTableHash() and
LookupTupleHashEntryHash().
Expose two new entry points; one for only calculating the hash value
of a tuple, and another for looking up a hash entry when the hash
value is already known.
---
src/backend/executor/execGrouping.c | 105 ++++++++++++++++++++--------
src/include/executor/executor.h | 5 ++
2 files changed, 80 insertions(+), 30 deletions(-)
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 3603c58b63e..94439e2ab9e 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -25,8 +25,9 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
-static uint32 TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple);
static int TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
+static TupleHashEntry LookupTupleHashEntry_internal(
+ TupleHashTable hashtable, TupleTableSlot *slot, bool *isnew, uint32 hash);
/*
* Define parameters for tuple hash table code generation. The interface is
@@ -300,10 +301,9 @@ TupleHashEntry
LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
bool *isnew)
{
- TupleHashEntryData *entry;
- MemoryContext oldContext;
- bool found;
- MinimalTuple key;
+ TupleHashEntry entry;
+ MemoryContext oldContext;
+ uint32 hash;
/* Need to run the hash functions in short-lived context */
oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
@@ -313,32 +313,29 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
hashtable->cur_eq_func = hashtable->tab_eq_func;
- key = NULL; /* flag to reference inputslot */
+ hash = TupleHashTableHash(hashtable->hashtab, NULL);
+ entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
- if (isnew)
- {
- entry = tuplehash_insert(hashtable->hashtab, key, &found);
+ MemoryContextSwitchTo(oldContext);
- if (found)
- {
- /* found pre-existing entry */
- *isnew = false;
- }
- else
- {
- /* created new entry */
- *isnew = true;
- /* zero caller data */
- entry->additional = NULL;
- MemoryContextSwitchTo(hashtable->tablecxt);
- /* Copy the first tuple into the table context */
- entry->firstTuple = ExecCopySlotMinimalTuple(slot);
- }
- }
- else
- {
- entry = tuplehash_lookup(hashtable->hashtab, key);
- }
+ return entry;
+}
+
+/*
+ * A variant of LookupTupleHashEntry for callers that have already computed
+ * the hash value.
+ */
+TupleHashEntry
+LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot,
+ bool *isnew, uint32 hash)
+{
+ TupleHashEntry entry;
+ MemoryContext oldContext;
+
+ /* Need to run the hash functions in short-lived context */
+ oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
+
+ entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
MemoryContextSwitchTo(oldContext);
@@ -389,7 +386,7 @@ FindTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
* Also, the caller must select an appropriate memory context for running
* the hash functions. (dynahash.c doesn't change CurrentMemoryContext.)
*/
-static uint32
+uint32
TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
{
TupleHashTable hashtable = (TupleHashTable) tb->private_data;
@@ -450,6 +447,54 @@ TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
return murmurhash32(hashkey);
}
+/*
+ * Does the work of LookupTupleHashEntry and LookupTupleHashEntryHash. Useful
+ * so that we can avoid switching the memory context multiple times for
+ * LookupTupleHashEntry.
+ */
+static TupleHashEntry
+LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot,
+ bool *isnew, uint32 hash)
+{
+ TupleHashEntryData *entry;
+ bool found;
+ MinimalTuple key;
+
+ /* set up data needed by hash and match functions */
+ hashtable->inputslot = slot;
+ hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ hashtable->cur_eq_func = hashtable->tab_eq_func;
+
+ key = NULL; /* flag to reference inputslot */
+
+ if (isnew)
+ {
+ entry = tuplehash_insert_hash(hashtable->hashtab, key, hash, &found);
+
+ if (found)
+ {
+ /* found pre-existing entry */
+ *isnew = false;
+ }
+ else
+ {
+ /* created new entry */
+ *isnew = true;
+ /* zero caller data */
+ entry->additional = NULL;
+ MemoryContextSwitchTo(hashtable->tablecxt);
+ /* Copy the first tuple into the table context */
+ entry->firstTuple = ExecCopySlotMinimalTuple(slot);
+ }
+ }
+ else
+ {
+ entry = tuplehash_lookup_hash(hashtable->hashtab, key, hash);
+ }
+
+ return entry;
+}
+
/*
* See whether two tuples (presumably of the same hash value) match
*/
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 6ef3e1fe069..76215992647 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -140,10 +140,15 @@ extern TupleHashTable BuildTupleHashTableExt(PlanState *parent,
extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
bool *isnew);
+extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable,
+ TupleTableSlot *slot,
+ bool *isnew, uint32 hash);
extern TupleHashEntry FindTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
ExprState *eqcomp,
FmgrInfo *hashfunctions);
+extern uint32 TupleHashTableHash(struct tuplehash_hash *tb,
+ const MinimalTuple tuple);
extern void ResetTupleHashTable(TupleHashTable hashtable);
/*
--
2.17.1
0002-HashAgg-make-hash_agg_entry_size-account-for-all-spa.patch (text/x-patch)
From 2db9ae43db11fb28d6b8397a1858c996a8a00b19 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 3 Feb 2020 15:12:41 -0800
Subject: [PATCH 2/2] HashAgg: make hash_agg_entry_size() account for all
space.
Previously, it neglected to account for pass-by-reference transition
data values and the representative tuple, requiring the caller to do
so.
---
src/backend/executor/nodeAgg.c | 23 +++++++++--------------
src/backend/optimizer/plan/planner.c | 9 ++-------
src/backend/utils/adt/selfuncs.c | 12 ++----------
src/include/executor/nodeAgg.h | 3 ++-
4 files changed, 15 insertions(+), 32 deletions(-)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 9073395eacf..ac3908ba142 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1423,23 +1423,18 @@ find_hash_columns(AggState *aggstate)
/*
* Estimate per-hash-table-entry overhead for the planner.
- *
- * Note that the estimate does not include space for pass-by-reference
- * transition data values, nor for the representative tuple of each group.
- * Nor does this account of the target fill-factor and growth policy of the
- * hash table.
*/
Size
-hash_agg_entry_size(int numAggs)
+hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
{
- Size entrysize;
-
- /* This must match build_hash_table */
- entrysize = sizeof(TupleHashEntryData) +
- numAggs * sizeof(AggStatePerGroupData);
- entrysize = MAXALIGN(entrysize);
-
- return entrysize;
+ return
+ /* key */
+ MAXALIGN(SizeofMinimalTupleHeader) +
+ MAXALIGN(tupleWidth) +
+ /* data */
+ MAXALIGN(sizeof(TupleHashEntryData) +
+ numAggs * sizeof(AggStatePerGroupData)) +
+ transitionSpace;
}
/*
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index d6f21535937..b44efd6314c 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4867,13 +4867,8 @@ create_distinct_paths(PlannerInfo *root,
allow_hash = false; /* policy-based decision not to hash */
else
{
- Size hashentrysize;
-
- /* Estimate per-hash-entry space at tuple width... */
- hashentrysize = MAXALIGN(cheapest_input_path->pathtarget->width) +
- MAXALIGN(SizeofMinimalTupleHeader);
- /* plus the per-hash-entry overhead */
- hashentrysize += hash_agg_entry_size(0);
+ Size hashentrysize = hash_agg_entry_size(
+ 0, cheapest_input_path->pathtarget->width, 0);
/* Allow hashing only if hashtable is predicted to fit in work_mem */
allow_hash = (hashentrysize * numDistinctRows <= work_mem * 1024L);
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 7c6f0574b37..0be26fe0378 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3526,16 +3526,8 @@ double
estimate_hashagg_tablesize(Path *path, const AggClauseCosts *agg_costs,
double dNumGroups)
{
- Size hashentrysize;
-
- /* Estimate per-hash-entry space at tuple width... */
- hashentrysize = MAXALIGN(path->pathtarget->width) +
- MAXALIGN(SizeofMinimalTupleHeader);
-
- /* plus space for pass-by-ref transition values... */
- hashentrysize += agg_costs->transitionSpace;
- /* plus the per-hash-entry overhead */
- hashentrysize += hash_agg_entry_size(agg_costs->numAggs);
+ Size hashentrysize = hash_agg_entry_size(
+ agg_costs->numAggs, path->pathtarget->width, agg_costs->transitionSpace);
/*
* Note that this disregards the effect of fill-factor and growth policy
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 2fe82da6ff7..264916f9a92 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -309,6 +309,7 @@ extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags);
extern void ExecEndAgg(AggState *node);
extern void ExecReScanAgg(AggState *node);
-extern Size hash_agg_entry_size(int numAggs);
+extern Size hash_agg_entry_size(int numAggs, Size tupleWidth,
+ Size transitionSpace);
#endif /* NODEAGG_H */
--
2.17.1
0003-HashAgg-save-calculated-transitionSpace-in-Agg-node.patch (text/x-patch)
From b69e6fbcb8a92d0af4d9ba2f24cfb7d07dfdff9d Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 3 Feb 2020 15:18:52 -0800
Subject: [PATCH 3/3] HashAgg: save calculated transitionSpace in Agg node.
This is useful to improve estimates of how many groups can fit in the
hash table without exceeding work_mem.
---
src/backend/optimizer/plan/createplan.c | 9 +++++++--
src/backend/optimizer/util/pathnode.c | 2 ++
src/include/nodes/pathnodes.h | 2 ++
src/include/nodes/plannodes.h | 1 +
src/include/optimizer/planmain.h | 4 ++--
5 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e048d200bb4..090919e39a0 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1644,6 +1644,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
NIL,
NIL,
best_path->path.rows,
+ 0,
subplan);
}
else
@@ -2096,6 +2097,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
NIL,
NIL,
best_path->numGroups,
+ best_path->transitionSpace,
subplan);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -2257,6 +2259,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
NIL,
rollup->numGroups,
+ best_path->transitionSpace,
sort_plan);
/*
@@ -2295,6 +2298,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
chain,
rollup->numGroups,
+ best_path->transitionSpace,
subplan);
/* Copy cost data from Path to Plan */
@@ -6192,8 +6196,8 @@ Agg *
make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree)
+ List *groupingSets, List *chain, double dNumGroups,
+ int32 transitionSpace, Plan *lefttree)
{
Agg *node = makeNode(Agg);
Plan *plan = &node->plan;
@@ -6209,6 +6213,7 @@ make_agg(List *tlist, List *qual,
node->grpOperators = grpOperators;
node->grpCollations = grpCollations;
node->numGroups = numGroups;
+ node->transitionSpace = transitionSpace;
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
node->groupingSets = groupingSets;
node->chain = chain;
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e6d08aede56..d9ce5162116 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2949,6 +2949,7 @@ create_agg_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->aggsplit = aggsplit;
pathnode->numGroups = numGroups;
+ pathnode->transitionSpace = aggcosts ? aggcosts->transitionSpace : 0;
pathnode->groupClause = groupClause;
pathnode->qual = qual;
@@ -3036,6 +3037,7 @@ create_groupingsets_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->rollups = rollups;
pathnode->qual = having_qual;
+ pathnode->transitionSpace = agg_costs ? agg_costs->transitionSpace : 0;
Assert(rollups != NIL);
Assert(aggstrategy != AGG_PLAIN || list_length(rollups) == 1);
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 3d3be197e0e..be592d0fee4 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1663,6 +1663,7 @@ typedef struct AggPath
AggStrategy aggstrategy; /* basic strategy, see nodes.h */
AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
double numGroups; /* estimated number of groups in input */
+ int32 transitionSpace; /* estimated transition state size */
List *groupClause; /* a list of SortGroupClause's */
List *qual; /* quals (HAVING quals), if any */
} AggPath;
@@ -1700,6 +1701,7 @@ typedef struct GroupingSetsPath
AggStrategy aggstrategy; /* basic strategy */
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
+ int32 transitionSpace; /* estimated transition state size */
} GroupingSetsPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 32c0d87f80e..f4183e1efa5 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -813,6 +813,7 @@ typedef struct Agg
Oid *grpOperators; /* equality operators to compare with */
Oid *grpCollations;
long numGroups; /* estimated number of groups in input */
+ int32 transitionSpace; /* estimated transition state size */
Bitmapset *aggParams; /* IDs of Params used in Aggref inputs */
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
List *groupingSets; /* grouping sets to use */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index eab486a6214..c7bda2b0917 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -54,8 +54,8 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree);
+ List *groupingSets, List *chain, double dNumGroups,
+ int32 transitionSpace, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
/*
--
2.17.1
On 05/02/2020 21:56, Jeff Davis wrote:
On Tue, 2020-02-04 at 18:10 +0200, Heikki Linnakangas wrote:
I'd love to change the LogicalTape API so that you could allocate and
free tapes more freely. I wrote a patch to do that, as part of replacing
tuplesort.c's polyphase algorithm with a simpler one (see [1]), but I
never got around to committing it. Maybe the time is ripe to do that
now?

It's interesting that you wrote a patch to pause the tapes a while ago.
Did it just fall through the cracks or was there a problem with it?

Is pause/resume functionality required, or is it good enough that
rewinding a tape frees the buffer, to be lazily allocated later?
It wasn't strictly required for what I was hacking on then. IIRC it
would have saved some memory during sorting, but Peter G felt that it
wasn't worth the trouble, because he made some other changes around the
same time, which made it less important
(/messages/by-id/CAM3SWZS0nwOPoJQHvxugA9kKPzky2QC2348TTWdSStZOkke5tg@mail.gmail.com).
I dropped the ball on both patches then, but I still think they would be
worthwhile.
- Heikki
On Thu, Feb 6, 2020 at 12:01 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
It wasn't strictly required for what I was hacking on then. IIRC it
would have saved some memory during sorting, but Peter G felt that it
wasn't worth the trouble, because he made some other changes around the
same time, which made it less important
FWIW, I am not opposed to the patch at all. I would be quite happy to
get rid of a bunch of code in tuplesort.c that apparently isn't really
necessary anymore (by removing polyphase merge).
All I meant back in 2016 was that "pausing" tapes was orthogonal to my
own idea of capping the number of tapes that could be used by
tuplesort.c. The 500 MAXORDER cap thing hadn't been committed yet when
I explained this in the message you linked to, and it wasn't clear if
it would ever be committed (Robert committed it about a month
afterwards, as it turned out). Capping the size of the merge heap made
marginal sorts faster overall, since a more cache efficient merge heap
more than made up for having more than one merge pass overall (thanks
to numerous optimizations added in 2016, some of which were your
work).
I also said that the absolute overhead of tapes was not that important
back in 2016. Using many tapes within tuplesort.c can never happen
anyway (with the 500 MAXORDER cap). Maybe the use of logtape.c by hash
aggregate changes the picture there now. Even if it doesn't, I still
think that your patch is a good idea.
--
Peter Geoghegan
On Mon, 2020-02-03 at 18:24 -0800, Jeff Davis wrote:
On Mon, 2020-02-03 at 10:29 -0800, Jeff Davis wrote:
I ended up converting the freelist to a min heap.
Attached is a patch which makes three changes to better support
HashAgg:

And now I'm attaching another version of the main Hash Aggregation
patch to be applied on top of the logtape.c patch.

Not a lot of changes from the last version; mostly some cleanup and
rebasing. But it's faster now with the logtape.c changes.
Attaching latest version (combined logtape changes along with main
HashAgg patch).
I believe I've addressed all of the comments, except for Heikki's
question about changing the logtape.c API. I think big changes to the
API (such as Heikki's proposal) are out of scope for this patch,
although I do favor the changes in general. This patch just includes
the LogicalTapeSetExtend() API by Adam Lee, which is less intrusive.
I noticed (and fixed) a small regression for some in-memory hashagg
queries due to the way I was choosing the number of buckets when
creating the hash table. I don't think that it is necessarily worse in
general, but given that there is at least one case of a regression, I
made it more closely match the old behavior, and the regression
disappeared.
I improved costing by taking into account the actual number of
partitions and the memory limits, at least for the first pass (in
recursive passes the number of partitions can change).
Aside from that, just some cleanup and rebasing.
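To make the bucket-count choice mentioned above concrete: the initial
table size is now capped by the memory budget instead of being taken
directly from the planner's group estimate. A minimal standalone
sketch of that calculation (the helper name and signature here are for
illustration only; the real logic is hash_choose_num_buckets() in the
attached patch):

#include <stddef.h>

/* Cap the initial bucket count by the memory budget, with 25% slop. */
static long
choose_initial_nbuckets(long estimated_ngroups, size_t memory_budget,
                        double hashentrysize)
{
    long    max_nbuckets = (long) (memory_budget / hashentrysize);

    /* leave room so the initial allocation doesn't blow past the limit */
    max_nbuckets *= 0.75;

    return estimated_ngroups > max_nbuckets ? max_nbuckets : estimated_ngroups;
}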
Regards,
Jeff Davis
Attachments:
hashagg-20200210.patch (text/x-patch)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c1128f89ec7..85f559387f9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1751,6 +1751,23 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-hashagg-mem-overflow" xreflabel="hashagg_mem_overflow">
+ <term><varname>hashagg_mem_overflow</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>hashagg_mem_overflow</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ If hash aggregation exceeds <varname>work_mem</varname> at query
+ execution time, and <varname>hashagg_mem_overflow</varname> is set
+ to <literal>on</literal>, continue consuming more memory rather than
+ performing disk-based hash aggregation. The default
+ is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
@@ -4476,6 +4493,24 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-hashagg-spill" xreflabel="enable_hashagg_spill">
+ <term><varname>enable_hashagg_spill</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_hashagg_spill</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the memory usage is expected to
+ exceed <varname>work_mem</varname>. This only affects the planner
+ choice; actual behavior at execution time is dictated by
+ <xref linkend="guc-hashagg-mem-overflow"/>. The default
+ is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
<term><varname>enable_hashjoin</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d901dc4a50e..2923f4ba46d 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -104,6 +104,7 @@ static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_hashagg_info(AggState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
static void show_instrumentation_count(const char *qlabel, int which,
@@ -1882,6 +1883,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Agg:
show_agg_keys(castNode(AggState, planstate), ancestors, es);
show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+ if (es->analyze)
+ show_hashagg_info((AggState *) planstate, es);
if (plan->qual)
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
@@ -2769,6 +2772,55 @@ show_hash_info(HashState *hashstate, ExplainState *es)
}
}
+/*
+ * If EXPLAIN ANALYZE, show information on hash aggregate memory usage and
+ * batches.
+ */
+static void
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+
+ Assert(IsA(aggstate, AggState));
+
+ if (agg->aggstrategy != AGG_HASHED &&
+ agg->aggstrategy != AGG_MIXED)
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(
+ es->str,
+ "Memory Usage: %ldkB",
+ memPeakKb);
+
+ if (aggstate->hash_batches_used > 0)
+ {
+ appendStringInfo(
+ es->str,
+ " Batches: %d Disk: %ldkB",
+ aggstate->hash_batches_used, aggstate->hash_disk_used);
+ }
+
+ appendStringInfo(
+ es->str,
+ "\n");
+ }
+ else
+ {
+ ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
+ if (aggstate->hash_batches_used > 0)
+ {
+ ExplainPropertyInteger("HashAgg Batches", NULL,
+ aggstate->hash_batches_used, es);
+ ExplainPropertyInteger("Disk Usage", "kB",
+ aggstate->hash_disk_used, es);
+ }
+ }
+}
+
/*
* If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
*/
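For reference, the text-format output added by show_hashagg_info()
shows up as an extra line under the aggregate node in EXPLAIN ANALYZE,
roughly like the following (numbers are illustrative only; the
Batches/Disk part appears only if the node actually spilled):

  Memory Usage: 4096kB Batches: 4 Disk: 9856kB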
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 121eff97a0c..9dff7990742 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -79,7 +79,8 @@ static void ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash);
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled);
/*
@@ -2927,7 +2928,7 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
*/
ExprState *
ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
- bool doSort, bool doHash)
+ bool doSort, bool doHash, bool spilled)
{
ExprState *state = makeNode(ExprState);
PlanState *parent = &aggstate->ss.ps;
@@ -3158,7 +3159,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (int setno = 0; setno < processGroupingSets; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, false);
+ pertrans, transno, setno, setoff, false,
+ spilled);
setoff++;
}
}
@@ -3177,7 +3179,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (int setno = 0; setno < numHashes; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, true);
+ pertrans, transno, setno, setoff, true,
+ spilled);
setoff++;
}
}
@@ -3227,7 +3230,8 @@ static void
ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash)
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled)
{
int adjust_init_jumpnull = -1;
int adjust_strict_jumpnull = -1;
@@ -3249,7 +3253,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
fcinfo->flinfo->fn_strict &&
pertrans->initValueIsNull)
{
- scratch->opcode = EEOP_AGG_INIT_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_INIT_TRANS_SPILLED : EEOP_AGG_INIT_TRANS;
scratch->d.agg_init_trans.pertrans = pertrans;
scratch->d.agg_init_trans.setno = setno;
scratch->d.agg_init_trans.setoff = setoff;
@@ -3265,7 +3270,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
if (pertrans->numSortCols == 0 &&
fcinfo->flinfo->fn_strict)
{
- scratch->opcode = EEOP_AGG_STRICT_TRANS_CHECK;
+ scratch->opcode = spilled ?
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED : EEOP_AGG_STRICT_TRANS_CHECK;
scratch->d.agg_strict_trans_check.setno = setno;
scratch->d.agg_strict_trans_check.setoff = setoff;
scratch->d.agg_strict_trans_check.transno = transno;
@@ -3282,9 +3288,11 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
/* invoke appropriate transition implementation */
if (pertrans->numSortCols == 0 && pertrans->transtypeByVal)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS_BYVAL;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED : EEOP_AGG_PLAIN_TRANS_BYVAL;
else if (pertrans->numSortCols == 0)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_SPILLED : EEOP_AGG_PLAIN_TRANS;
else if (pertrans->numInputs == 1)
scratch->opcode = EEOP_AGG_ORDERED_TRANS_DATUM;
else
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 35eb8b99f69..e21e0c440ea 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -426,9 +426,13 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
&&CASE_EEOP_AGG_INIT_TRANS,
+ &&CASE_EEOP_AGG_INIT_TRANS_SPILLED,
&&CASE_EEOP_AGG_STRICT_TRANS_CHECK,
+ &&CASE_EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_SPILLED,
&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
&&CASE_EEOP_LAST
@@ -1619,6 +1623,35 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_init_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
+
+ /* If transValue has not yet been initialized, do so now. */
+ if (pergroup->noTransValue)
+ {
+ AggStatePerTrans pertrans = op->d.agg_init_trans.pertrans;
+
+ aggstate->curaggcontext = op->d.agg_init_trans.aggcontext;
+ aggstate->current_set = op->d.agg_init_trans.setno;
+
+ ExecAggInitGroup(aggstate, pertrans, pergroup);
+
+ /* copied trans value from input, done this round */
+ EEO_JUMP(op->d.agg_init_trans.jumpnull);
+ }
+
+ EEO_NEXT();
+ }
/* check that a strict aggregate's input isn't NULL */
EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK)
@@ -1635,6 +1668,24 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_strict_trans_check.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_strict_trans_check.transno];
+
+ if (unlikely(pergroup->transValueIsNull))
+ EEO_JUMP(op->d.agg_strict_trans_check.jumpnull);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1683,6 +1734,51 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ Assert(pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1726,6 +1822,66 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
newVal = FunctionCallInvoke(fcinfo);
+ /*
+ * For pass-by-ref datatype, must copy the new value into
+ * aggcontext and free the prior transValue. But if transfn
+ * returned a pointer to its first input, we don't need to do
+ * anything. Also, if transfn returned a pointer to a R/W
+ * expanded object that is already a child of the aggcontext,
+ * assume we can adopt that value without copying it.
+ */
+ if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+ newVal = ExecAggTransReparent(aggstate, pertrans,
+ newVal, fcinfo->isnull,
+ pergroup->transValue,
+ pergroup->transValueIsNull);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ Assert(!pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
/*
* For pass-by-ref datatype, must copy the new value into
* aggcontext and free the prior transValue. But if transfn
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index b7f49ceddf8..5f78abc5aca 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -194,6 +194,29 @@
* transition values. hashcontext is the single context created to support
* all hash tables.
*
+ * Spilling To Disk
+ *
+ * When performing hash aggregation, if the hash table memory exceeds the
+ * limit (see hash_agg_check_limits()), we enter "spill mode". In spill
+ * mode, we advance the transition states only for groups already in the
+ * hash table. For tuples that would need to create new hash table
+ * entries (and initialize new transition states), we instead spill them to
+ * disk to be processed later. The tuples are spilled in a partitioned
+ * manner, so that subsequent batches are smaller and less likely to exceed
+ * work_mem (if a batch does exceed work_mem, it must be spilled
+ * recursively).
+ *
+ * Spilled data is written to logical tapes. These provide better control
+ * over memory usage, disk space, and the number of files than if we were
+ * to use a BufFile for each spill.
+ *
+ * Note that it's possible for transition states to start small but then
+ * grow very large; for instance in the case of ARRAY_AGG. In such cases,
+ * it's still possible to significantly exceed work_mem. We try to avoid
+ * this situation by estimating what will fit in the available memory, and
+ * imposing a limit on the number of groups separately from the amount of
+ * memory consumed.
+ *
* Transition / Combine function invocation:
*
* For performance reasons transition functions, including combine
@@ -233,12 +256,99 @@
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/datum.h"
+#include "utils/dynahash.h"
#include "utils/expandeddatum.h"
+#include "utils/logtape.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+/*
+ * Control how many partitions are created when spilling HashAgg to
+ * disk.
+ *
+ * HASHAGG_PARTITION_FACTOR is multiplied by the estimated number of
+ * partitions needed such that each partition will fit in memory. The factor
+ * is set higher than one because there's not a high cost to having a few too
+ * many partitions, and it makes it less likely that a partition will need to
+ * be spilled recursively. Another benefit of having more, smaller partitions
+ * is that small hash tables may perform better than large ones due to memory
+ * caching effects.
+ *
+ * We also specify a min and max number of partitions per spill. Too few might
+ * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
+ * many will result in lots of memory wasted buffering the spill files (which
+ * could instead be spent on a larger hash table).
+ *
+ * For reading from tapes, the buffer size must be a multiple of
+ * BLCKSZ. Larger values help when reading from multiple tapes concurrently,
+ * but that doesn't happen in HashAgg, so we simply use BLCKSZ. Writing to a
+ * tape always uses a buffer of size BLCKSZ.
+ */
+#define HASHAGG_PARTITION_FACTOR 1.50
+#define HASHAGG_MIN_PARTITIONS 4
+#define HASHAGG_MAX_PARTITIONS 256
+#define HASHAGG_READ_BUFFER_SIZE BLCKSZ
+#define HASHAGG_WRITE_BUFFER_SIZE BLCKSZ
+
+/*
+ * Track all tapes needed for a HashAgg that spills. We don't know the maximum
+ * number of tapes needed at the start of the algorithm (because it can
+ * recurse), so one tape set is allocated and extended as needed for new
+ * tapes. When a particular tape is already read, rewind it for write mode and
+ * put it in the free list.
+ *
+ * Tapes' buffers can take up substantial memory when many tapes are open at
+ * once. We only need one tape open at a time in read mode (using a buffer
+ * that's a multiple of BLCKSZ); but we need up to HASHAGG_MAX_PARTITIONS
+ * tapes open in write mode (each requiring a buffer of size BLCKSZ).
+ */
+typedef struct HashTapeInfo
+{
+ LogicalTapeSet *tapeset;
+ int ntapes;
+ int *freetapes;
+ int nfreetapes;
+} HashTapeInfo;
+
+/*
+ * Represents partitioned spill data for a single hashtable. Contains the
+ * necessary information to route tuples to the correct partition, and to
+ * transform the spilled data into new batches.
+ *
+ * The high bits are used for partition selection (when recursing, we ignore
+ * the bits that have already been used for partition selection at an earlier
+ * level).
+ */
+typedef struct HashAggSpill
+{
+ HashTapeInfo *tapeinfo; /* borrowed reference to tape info */
+ int npartitions; /* number of partitions */
+ int *partitions; /* spill partition tape numbers */
+ int64 *ntuples; /* number of tuples in each partition */
+ uint32 mask; /* mask to find partition from hash value */
+ int shift; /* after masking, shift by this amount */
+} HashAggSpill;
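
As a rough illustration of the bit usage described above, the
following standalone sketch (not part of the patch) mirrors the
shift/mask arithmetic set up in hashagg_spill_init() and applied in
hashagg_spill_tuple() further down:

/* uint32 stands in for PostgreSQL's uint32 typedef from c.h */
static int
partition_for_hash(uint32 hash, int used_bits, int partition_bits)
{
    int     shift = 32 - used_bits - partition_bits;
    uint32  mask = ((1 << partition_bits) - 1) << shift;

    return (hash & mask) >> shift;
}

With used_bits = 0 and 64 partitions (partition_bits = 6), the top six
bits of the hash select the partition; a recursive spill then starts
from used_bits = 6 and consumes the next-lower bits.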
+
+/*
+ * Represents work to be done for one pass of hash aggregation (with only one
+ * grouping set).
+ *
+ * Also tracks the bits of the hash already used for partition selection by
+ * earlier iterations, so that this batch can use new bits. If all bits have
+ * already been used, no partitioning will be done (any spilled data will go
+ * to a single output tape).
+ */
+typedef struct HashAggBatch
+{
+ int setno; /* grouping set */
+ int used_bits; /* number of bits of hash already used */
+ HashTapeInfo *tapeinfo; /* borrowed reference to tape info */
+ int input_tapenum; /* input partition tape */
+ int64 input_tuples; /* number of tuples in this batch */
+} HashAggBatch;
+
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
static void initialize_phase(AggState *aggstate, int newphase);
static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -263,21 +373,56 @@ static void finalize_partialaggregate(AggState *aggstate,
AggStatePerAgg peragg,
AggStatePerGroup pergroupstate,
Datum *resultVal, bool *resultIsNull);
+static void prepare_hash_slot(AggState *aggstate);
static void prepare_projection_slot(AggState *aggstate,
TupleTableSlot *slot,
int currentSet);
static void finalize_aggregates(AggState *aggstate,
AggStatePerAgg peragg,
AggStatePerGroup pergroup);
-static TupleTableSlot *project_aggregates(AggState *aggstate);
-static Bitmapset *find_unaggregated_cols(AggState *aggstate);
-static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
-static void build_hash_table(AggState *aggstate);
-static TupleHashEntryData *lookup_hash_entry(AggState *aggstate);
+static void build_hash_tables(AggState *aggstate);
+static void build_hash_table(AggState *aggstate, int setno,
+ int64 ngroups_estimate);
+static void hashagg_recompile_expressions(AggState *aggstate);
+static long hash_choose_num_buckets(AggState *aggstate,
+ long estimated_nbuckets,
+ Size memory);
+static int hash_choose_num_partitions(uint64 input_groups,
+ double hashentrysize,
+ int used_bits,
+ int *log2_npartitions);
+static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
+static bool agg_refill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
+static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
+
+/* Hash Aggregation helpers */
+static TupleTableSlot *project_aggregates(AggState *aggstate);
+static Bitmapset *find_unaggregated_cols(AggState *aggstate);
+static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
+static void hash_agg_check_limits(AggState *aggstate);
+static void hashagg_finish_initial_spills(AggState *aggstate);
+static void hashagg_reset_spill_state(AggState *aggstate);
+
+/* Structure APIs */
+static HashAggBatch *hashagg_batch_new(HashTapeInfo *tapeinfo,
+ int input_tapenum, int setno,
+ int64 input_tuples, int used_bits);
+static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
+static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
+ int used_bits, uint64 input_tuples,
+ double hashentrysize);
+static Size hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot,
+ uint32 hash);
+static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
+ int setno);
+static void hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *dest,
+ int ndest);
+static void hashagg_tapeinfo_release(HashTapeInfo *tapeinfo, int tapenum);
+
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
AggState *aggstate, EState *estate,
@@ -1035,6 +1180,32 @@ finalize_partialaggregate(AggState *aggstate,
MemoryContextSwitchTo(oldContext);
}
+/*
+ * Extract the attributes that make up the grouping key into the
+ * hashslot. This is necessary to compute the hash of the grouping key.
+ */
+static void
+prepare_hash_slot(AggState *aggstate)
+{
+ TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ int i;
+
+ /* transfer just the needed columns into hashslot */
+ slot_getsomeattrs(inputslot, perhash->largestGrpColIdx);
+ ExecClearTuple(hashslot);
+
+ for (i = 0; i < perhash->numhashGrpCols; i++)
+ {
+ int varNumber = perhash->hashGrpColIdxInput[i] - 1;
+
+ hashslot->tts_values[i] = inputslot->tts_values[varNumber];
+ hashslot->tts_isnull[i] = inputslot->tts_isnull[varNumber];
+ }
+ ExecStoreVirtualTuple(hashslot);
+}
+
/*
* Prepare to finalize and project based on the specified representative tuple
* slot and grouping set.
@@ -1233,7 +1404,7 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
}
/*
- * (Re-)initialize the hash table(s) to empty.
+ * Initialize the hash table(s).
*
* To implement hashed aggregation, we need a hashtable that stores a
* representative tuple and an array of AggStatePerGroup structs for each
@@ -1244,44 +1415,79 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
* We have a separate hashtable and associated perhash data structure for each
* grouping set for which we're doing hashing.
*
- * The contents of the hash tables always live in the hashcontext's per-tuple
- * memory context (there is only one of these for all tables together, since
- * they are all reset at the same time).
+ * The hash tables and their contents always live in the hashcontext's
+ * per-tuple memory context (there is only one of these for all tables
+ * together, since they are all reset at the same time).
*/
static void
-build_hash_table(AggState *aggstate)
+build_hash_tables(AggState *aggstate)
{
- MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
- Size additionalsize;
- int i;
-
- Assert(aggstate->aggstrategy == AGG_HASHED || aggstate->aggstrategy == AGG_MIXED);
+ int setno;
- additionalsize = aggstate->numtrans * sizeof(AggStatePerGroupData);
-
- for (i = 0; i < aggstate->num_hashes; ++i)
+ for (setno = 0; setno < aggstate->num_hashes; ++setno)
{
- AggStatePerHash perhash = &aggstate->perhash[i];
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ long nbuckets;
+ Size memory;
Assert(perhash->aggnode->numGroups > 0);
- if (perhash->hashtable)
- ResetTupleHashTable(perhash->hashtable);
- else
- perhash->hashtable = BuildTupleHashTableExt(&aggstate->ss.ps,
- perhash->hashslot->tts_tupleDescriptor,
- perhash->numCols,
- perhash->hashGrpColIdxHash,
- perhash->eqfuncoids,
- perhash->hashfunctions,
- perhash->aggnode->grpCollations,
- perhash->aggnode->numGroups,
- additionalsize,
- aggstate->ss.ps.state->es_query_cxt,
- aggstate->hashcontext->ecxt_per_tuple_memory,
- tmpmem,
- DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
+ memory = aggstate->hash_mem_limit / aggstate->num_hashes;
+
+ /* choose reasonable number of buckets per hashtable */
+ nbuckets = hash_choose_num_buckets(
+ aggstate, perhash->aggnode->numGroups, memory);
+
+ build_hash_table(aggstate, setno, nbuckets);
}
+
+ aggstate->hash_alloc_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_alloc_last = aggstate->hash_alloc_current;
+ aggstate->hash_ngroups_current = 0;
+}
+
+/*
+ * Build a single hashtable for this grouping set. Pass the hash memory
+ * context as both metacxt and tablecxt, so that resetting the hashcontext
+ * will free all memory including metadata. That means that we cannot reset
+ * the hash table to empty and reuse it, though (see execGrouping.c).
+ */
+static void
+build_hash_table(AggState *aggstate, int setno, long nbuckets)
+{
+ TupleHashTable table;
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ MemoryContext hashmem = aggstate->hashcontext->ecxt_per_tuple_memory;
+ MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
+ Size additionalsize;
+
+ Assert(aggstate->aggstrategy == AGG_HASHED ||
+ aggstate->aggstrategy == AGG_MIXED);
+
+ /*
+ * Used to make sure initial hash table allocation does not exceed
+ * work_mem. Note that the estimate does not include space for
+ * pass-by-reference transition data values, nor for the representative
+ * tuple of each group.
+ */
+ additionalsize = aggstate->numtrans * sizeof(AggStatePerGroupData);
+
+ table = BuildTupleHashTableExt(&aggstate->ss.ps,
+ perhash->hashslot->tts_tupleDescriptor,
+ perhash->numCols,
+ perhash->hashGrpColIdxHash,
+ perhash->eqfuncoids,
+ perhash->hashfunctions,
+ perhash->aggnode->grpCollations,
+ nbuckets,
+ additionalsize,
+ hashmem,
+ hashmem,
+ tmpmem,
+ DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
+
+ perhash->hashtable = table;
}
/*
@@ -1435,6 +1641,233 @@ hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
transitionSpace;
}
+/*
+ * Recompile the expressions for advancing aggregates while hashing. This is
+ * necessary for certain kinds of state changes that affect the resulting
+ * expression. For instance, changing aggstate->hash_ever_spilled or
+ * aggstate->ss.ps.outerops requires recompilation.
+ *
+ * A compiled expression where hash_ever_spilled is true will work even when
+ * hash_spill_mode is false, because it merely introduces additional branches
+ * that are unnecessary when hash_spill_mode is false. That allows us to only
+ * recompile when hash_ever_spilled changes, rather than every time
+ * hash_spill_mode changes.
+ */
+static void
+hashagg_recompile_expressions(AggState *aggstate)
+{
+ AggStatePerPhase phase;
+
+ Assert(aggstate->aggstrategy == AGG_HASHED ||
+ aggstate->aggstrategy == AGG_MIXED);
+
+ if (aggstate->aggstrategy == AGG_HASHED)
+ phase = &aggstate->phases[0];
+ else /* AGG_MIXED */
+ phase = &aggstate->phases[1];
+
+ phase->evaltrans = ExecBuildAggTrans(
+ aggstate, phase,
+ aggstate->aggstrategy == AGG_MIXED ? true : false, /* dosort */
+ true, /* dohash */
+ aggstate->hash_ever_spilled);
+}
+
+/*
+ * Set limits that trigger spilling to avoid exceeding work_mem. Consider the
+ * number of partitions we expect to create (if we do spill).
+ *
+ * There are two limits: a memory limit, and also an ngroups limit. The
+ * ngroups limit becomes important when we expect transition values to grow
+ * substantially larger than the initial value.
+ */
+void
+hash_agg_set_limits(double hashentrysize, uint64 input_groups, int used_bits,
+ Size *mem_limit, long *ngroups_limit, int *num_partitions)
+{
+ int npartitions;
+ Size partition_mem;
+
+ /* no attempt to obey work_mem */
+ if (hashagg_mem_overflow)
+ {
+ *mem_limit = SIZE_MAX;
+ *ngroups_limit = LONG_MAX;
+ return;
+ }
+
+ /* if not expected to spill, use all of work_mem */
+ if (input_groups * hashentrysize < work_mem * 1024L)
+ {
+ *mem_limit = work_mem * 1024L;
+ *ngroups_limit = *mem_limit / hashentrysize;
+ return;
+ }
+
+ /*
+ * Calculate expected memory requirements for spilling, which is the size
+ * of the buffers needed for all the tapes that need to be open at
+ * once. Then, subtract that from the memory available for holding hash
+ * tables.
+ */
+ npartitions = hash_choose_num_partitions(input_groups,
+ hashentrysize,
+ used_bits,
+ NULL);
+ if (num_partitions != NULL)
+ *num_partitions = npartitions;
+
+ partition_mem =
+ HASHAGG_READ_BUFFER_SIZE +
+ HASHAGG_WRITE_BUFFER_SIZE * npartitions;
+
+ /*
+ * Don't set the limit below 3/4 of work_mem. In that case, we are at the
+ * minimum number of partitions, so we aren't going to dramatically exceed
+ * work mem anyway.
+ */
+ if (work_mem * 1024L > 4 * partition_mem)
+ *mem_limit = work_mem * 1024L - partition_mem;
+ else
+ *mem_limit = work_mem * 1024L * 0.75;
+
+ if (*mem_limit > hashentrysize)
+ *ngroups_limit = *mem_limit / hashentrysize;
+ else
+ *ngroups_limit = 1;
+}
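
For a rough sense of the numbers (illustrative values only): with
work_mem = 4MB, a 128-byte hashentrysize, and an estimated 1 million
input groups, the expected table size (~122MB) exceeds work_mem, so we
plan for spilling. hash_choose_num_partitions() below picks 64
partitions, giving partition_mem = 8kB + 64 * 8kB = 520kB; since
work_mem is more than four times that, mem_limit becomes 4MB - 520kB
(roughly 3.5MB) and ngroups_limit about 28,600.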
+
+/*
+ * hash_agg_check_limits
+ *
+ * After adding a new group to the hash table, check whether we need to enter
+ * spill mode. Allocations may happen without adding new groups (for instance,
+ * if the transition state size grows), so this check is imperfect.
+ *
+ * Memory usage is tracked by how much is allocated to the underlying memory
+ * context, not individual chunks. This is more accurate because it accounts
+ * for all memory in the context, and also accounts for fragmentation and
+ * other forms of overhead and waste that can be difficult to estimate. It's
+ * also cheaper because we don't have to track each chunk.
+ *
+ * When memory is first allocated to a memory context, it is not actually
+ * used. So when the next allocation happens, we consider the
+ * previously-allocated amount to be the memory currently used.
+ */
+static void
+hash_agg_check_limits(AggState *aggstate)
+{
+ Size allocation;
+
+ /*
+ * Even if already in spill mode, it's possible for memory usage to grow,
+ * and we should still track it for the purposes of EXPLAIN ANALYZE.
+ */
+ allocation = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+
+ /* has allocation grown since the last observation? */
+ if (allocation > aggstate->hash_alloc_current)
+ {
+ aggstate->hash_alloc_last = aggstate->hash_alloc_current;
+ aggstate->hash_alloc_current = allocation;
+ }
+
+ if (aggstate->hash_alloc_last > aggstate->hash_mem_peak)
+ aggstate->hash_mem_peak = aggstate->hash_alloc_last;
+
+ /*
+ * Don't spill unless there's at least one group in the hash table so we
+ * can be sure to make progress even in edge cases.
+ */
+ if (aggstate->hash_ngroups_current > 0 &&
+ (aggstate->hash_alloc_last > aggstate->hash_mem_limit ||
+ aggstate->hash_ngroups_current > aggstate->hash_ngroups_limit))
+ {
+ aggstate->hash_spill_mode = true;
+
+ if (!aggstate->hash_ever_spilled)
+ {
+ aggstate->hash_ever_spilled = true;
+ aggstate->hash_spills = palloc0(
+ sizeof(HashAggSpill) * aggstate->num_hashes);
+ aggstate->hash_tapeinfo = palloc0(sizeof(HashTapeInfo));
+ hashagg_recompile_expressions(aggstate);
+ }
+ }
+}
+
+/*
+ * Choose a reasonable number of buckets for the initial hash table size.
+ */
+static long
+hash_choose_num_buckets(AggState *aggstate, long ngroups, Size memory)
+{
+ long max_nbuckets;
+
+ max_nbuckets = memory / aggstate->hashentrysize;
+
+ /*
+ * Leave room for slop to avoid a case where the initial hash table size
+ * exceeds the memory limit (though that may still happen in edge cases).
+ */
+ max_nbuckets *= 0.75;
+
+ return ngroups > max_nbuckets ? max_nbuckets : ngroups;
+}
+
+/*
+ * Determine the number of partitions to create when spilling, which will
+ * always be a power of two. If log2_npartitions is non-NULL, set
+ * *log2_npartitions to the log2() of the number of partitions.
+ */
+static int
+hash_choose_num_partitions(uint64 input_groups, double hashentrysize,
+ int used_bits, int *log2_npartitions)
+{
+ Size mem_wanted;
+ int partition_limit;
+ int npartitions;
+ int partition_bits;
+
+ /*
+ * Avoid creating so many partitions that the memory requirements of the
+ * open partition files are greater than 1/4 of work_mem.
+ */
+ partition_limit =
+ (work_mem * 1024L * 0.25 - HASHAGG_READ_BUFFER_SIZE) /
+ HASHAGG_WRITE_BUFFER_SIZE;
+
+ /* pessimistically estimate that each input tuple creates a new group */
+ mem_wanted = HASHAGG_PARTITION_FACTOR * input_groups * hashentrysize;
+
+ /* make enough partitions so that each one is likely to fit in memory */
+ npartitions = 1 + (mem_wanted / (work_mem * 1024L));
+
+ if (npartitions > partition_limit)
+ npartitions = partition_limit;
+
+ if (npartitions < HASHAGG_MIN_PARTITIONS)
+ npartitions = HASHAGG_MIN_PARTITIONS;
+ if (npartitions > HASHAGG_MAX_PARTITIONS)
+ npartitions = HASHAGG_MAX_PARTITIONS;
+
+ /* ceil(log2(npartitions)) */
+ partition_bits = my_log2(npartitions);
+
+ /* make sure that we don't exhaust the hash bits */
+ if (partition_bits + used_bits >= 32)
+ partition_bits = 32 - used_bits;
+
+ if (log2_npartitions != NULL)
+ *log2_npartitions = partition_bits;
+
+ /* number of partitions will be a power of two */
+ npartitions = 1L << partition_bits;
+
+ return npartitions;
+}
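
Continuing the illustrative numbers from hash_agg_set_limits() above
(1 million input groups, 128-byte entries, work_mem = 4MB):
mem_wanted = 1.5 * 1,000,000 * 128, about 183MB, so the initial
estimate is 1 + 183MB/4MB = 46 partitions; that is under the open-file
limit of 127 and within [4, 256], and my_log2(46) = 6, so the count
rounds up to 64.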
+
/*
* Find or create a hashtable entry for the tuple group containing the current
* tuple (already set in tmpcontext's outertuple slot), in the current grouping
@@ -1442,37 +1875,39 @@ hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
* depends on this).
*
* When called, CurrentMemoryContext should be the per-query context.
+ *
+ * If the hash table is at the memory limit, then only find existing hashtable
+ * entries; don't create new ones. If a tuple's group is not already present
+ * in the hash table for the current grouping set, return NULL and the caller
+ * will spill it to disk.
*/
-static TupleHashEntryData *
-lookup_hash_entry(AggState *aggstate)
+static AggStatePerGroup
+lookup_hash_entry(AggState *aggstate, uint32 hash)
{
- TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
TupleTableSlot *hashslot = perhash->hashslot;
TupleHashEntryData *entry;
- bool isnew;
- int i;
-
- /* transfer just the needed columns into hashslot */
- slot_getsomeattrs(inputslot, perhash->largestGrpColIdx);
- ExecClearTuple(hashslot);
+ bool isnew = false;
+ bool *p_isnew;
- for (i = 0; i < perhash->numhashGrpCols; i++)
- {
- int varNumber = perhash->hashGrpColIdxInput[i] - 1;
-
- hashslot->tts_values[i] = inputslot->tts_values[varNumber];
- hashslot->tts_isnull[i] = inputslot->tts_isnull[varNumber];
- }
- ExecStoreVirtualTuple(hashslot);
+ /* if hash table already spilled, don't create new entries */
+ p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;
/* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntry(perhash->hashtable, hashslot, &isnew);
+ entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, p_isnew,
+ hash);
+
+ if (entry == NULL)
+ return NULL;
if (isnew)
{
- AggStatePerGroup pergroup;
- int transno;
+ AggStatePerGroup pergroup;
+ int transno;
+
+ aggstate->hash_ngroups_current++;
+ if (!hashagg_mem_overflow)
+ hash_agg_check_limits(aggstate);
pergroup = (AggStatePerGroup)
MemoryContextAlloc(perhash->hashtable->tablecxt,
@@ -1492,7 +1927,7 @@ lookup_hash_entry(AggState *aggstate)
}
}
- return entry;
+ return entry->additional;
}
/*
@@ -1500,18 +1935,51 @@ lookup_hash_entry(AggState *aggstate)
* returning an array of pergroup pointers suitable for advance_aggregates.
*
* Be aware that lookup_hash_entry can reset the tmpcontext.
+ *
+ * Some entries may be left NULL if we have reached the limit and have begun
+ * to spill. The same tuple will belong to different groups for each set, so
+ * may match a group already in memory for one set and match a group not in
+ * memory for another set. If we have begun to spill and a tuple doesn't match
+ * a group in memory for a particular set, it will be spilled.
+ *
+ * NB: It's possible to spill the same tuple for several different grouping
+ * sets. This may seem wasteful, but it's actually a trade-off: if we spill
+ * the tuple multiple times for multiple grouping sets, it can be partitioned
+ * for each grouping set, making the refilling of the hash table very
+ * efficient.
*/
static void
lookup_hash_entries(AggState *aggstate)
{
- int numHashes = aggstate->num_hashes;
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
int setno;
- for (setno = 0; setno < numHashes; setno++)
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
{
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ uint32 hash;
+
select_current_set(aggstate, setno, true);
- pergroup[setno] = lookup_hash_entry(aggstate)->additional;
+ prepare_hash_slot(aggstate);
+ hash = TupleHashTableHash(perhash->hashtable, perhash->hashslot);
+ pergroup[setno] = lookup_hash_entry(aggstate, hash);
+
+ /* check to see if we need to spill the tuple for this grouping set */
+ if (pergroup[setno] == NULL)
+ {
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ TupleTableSlot *slot = aggstate->tmpcontext->ecxt_outertuple;
+
+ if (spill->partitions == NULL)
+ hashagg_spill_init(spill, aggstate->hash_tapeinfo, 0,
+ perhash->aggnode->numGroups,
+ aggstate->hashentrysize);
+
+ hashagg_spill_tuple(spill, slot, hash);
+
+ aggstate->hash_disk_used = LogicalTapeSetBlocks(
+ aggstate->hash_tapeinfo->tapeset) * (BLCKSZ / 1024);
+ }
}
}
@@ -1834,6 +2302,12 @@ agg_retrieve_direct(AggState *aggstate)
if (TupIsNull(outerslot))
{
/* no more outer-plan tuples available */
+
+ /* if we built hash tables, finalize any spills */
+ if (aggstate->aggstrategy == AGG_MIXED &&
+ aggstate->current_phase == 1)
+ hashagg_finish_initial_spills(aggstate);
+
if (hasGroupingSets)
{
aggstate->input_done = true;
@@ -1936,6 +2410,9 @@ agg_fill_hash_table(AggState *aggstate)
ResetExprContext(aggstate->tmpcontext);
}
+ /* finalize spills, if any */
+ hashagg_finish_initial_spills(aggstate);
+
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
@@ -1943,11 +2420,193 @@ agg_fill_hash_table(AggState *aggstate)
&aggstate->perhash[0].hashiter);
}
+/*
+ * If any data was spilled during hash aggregation, reset the hash table and
+ * reprocess one batch of spilled data. After reprocessing a batch, the hash
+ * table will again contain data, ready to be consumed by
+ * agg_retrieve_hash_table_in_memory().
+ *
+ * Should only be called after all in memory hash table entries have been
+ * consumed.
+ *
+ * Return false when input is exhausted and there's no more work to be done;
+ * otherwise return true.
+ */
+static bool
+agg_refill_hash_table(AggState *aggstate)
+{
+ HashAggBatch *batch;
+ HashAggSpill spill;
+ HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
+ long nbuckets;
+
+ if (aggstate->hash_batches == NIL)
+ return false;
+
+ spill.npartitions = 0;
+ spill.partitions = NULL;
+ /*
+ * Each spill file contains spilled data for only a single grouping
+ * set. We want to ignore all others, which is done by setting the other
+ * pergroups to NULL.
+ */
+ memset(aggstate->all_pergroups, 0,
+ sizeof(AggStatePerGroup) *
+ (aggstate->maxsets + aggstate->num_hashes));
+
+ batch = linitial(aggstate->hash_batches);
+ aggstate->hash_batches = list_delete_first(aggstate->hash_batches);
+
+ hash_agg_set_limits(aggstate->hashentrysize, batch->input_tuples,
+ batch->used_bits, &aggstate->hash_mem_limit,
+ &aggstate->hash_ngroups_limit, NULL);
+
+ /*
+ * Free memory and rebuild a single hash table for this batch's grouping
+ * set. Estimate the number of groups to be the number of input tuples in
+ * this batch.
+ */
+ ReScanExprContext(aggstate->hashcontext);
+
+ nbuckets = hash_choose_num_buckets(
+ aggstate, batch->input_tuples, aggstate->hash_mem_limit);
+ build_hash_table(aggstate, batch->setno, nbuckets);
+ aggstate->hash_alloc_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_alloc_last = aggstate->hash_alloc_current;
+ aggstate->hash_ngroups_current = 0;
+
+ Assert(aggstate->current_phase == 0);
+
+ if (aggstate->phase->aggstrategy == AGG_MIXED)
+ {
+ aggstate->current_phase = 1;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+ }
+
+ /*
+ * The first pass (agg_fill_hash_table) reads whatever kind of slot comes
+ * from the outer plan, and considers the slot fixed. But spilled tuples
+ * are always MinimalTuples, so if that's different from the outer plan we
+ * need to change it and recompile the aggregate expressions.
+ */
+ if (aggstate->ss.ps.outerops != &TTSOpsMinimalTuple)
+ {
+ aggstate->ss.ps.outerops = &TTSOpsMinimalTuple;
+ hashagg_recompile_expressions(aggstate);
+ }
+
+ LogicalTapeRewindForRead(tapeinfo->tapeset, batch->input_tapenum,
+ HASHAGG_READ_BUFFER_SIZE);
+ for (;;) {
+ TupleTableSlot *slot = aggstate->hash_spill_slot;
+ MinimalTuple tuple;
+ uint32 hash;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tuple = hashagg_batch_read(batch, &hash);
+ if (tuple == NULL)
+ break;
+
+ ExecStoreMinimalTuple(tuple, slot, true);
+ aggstate->tmpcontext->ecxt_outertuple = slot;
+
+ select_current_set(aggstate, batch->setno, true);
+ prepare_hash_slot(aggstate);
+ aggstate->hash_pergroup[batch->setno] = lookup_hash_entry(aggstate, hash);
+
+ /* if there's no memory for a new group, spill */
+ if (aggstate->hash_pergroup[batch->setno] == NULL)
+ {
+ /*
+ * Estimate the number of groups for this batch as the total
+ * number of tuples in its input file. Although that's a worst
+ * case, it's not bad here for two reasons: (1) overestimating
+ * is better than underestimating; and (2) we've already
+ * scanned the relation once, so it's likely that we've
+ * already finalized many of the common values.
+ */
+ if (spill.partitions == NULL)
+ hashagg_spill_init(&spill, tapeinfo, batch->used_bits,
+ batch->input_tuples,
+ aggstate->hashentrysize);
+
+ hashagg_spill_tuple(&spill, slot, hash);
+
+ aggstate->hash_disk_used = LogicalTapeSetBlocks(
+ aggstate->hash_tapeinfo->tapeset) * (BLCKSZ / 1024);
+ }
+
+ /* Advance the aggregates (or combine functions) */
+ advance_aggregates(aggstate);
+
+ /*
+ * Reset per-input-tuple context after each tuple, but note that the
+ * hash lookups do this too
+ */
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ hashagg_tapeinfo_release(tapeinfo, batch->input_tapenum);
+
+ aggstate->current_phase = 0;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+
+ /* update hashentrysize estimate based on contents */
+ if (aggstate->hash_ngroups_current > 0)
+ {
+ aggstate->hashentrysize = (double)aggstate->hash_alloc_last /
+ (double)aggstate->hash_ngroups_current;
+ }
+
+ hashagg_spill_finish(aggstate, &spill, batch->setno);
+ aggstate->hash_spill_mode = false;
+
+ pfree(batch);
+
+ /* Initialize to walk the first hash table */
+ select_current_set(aggstate, 0, true);
+ ResetTupleHashIterator(aggstate->perhash[0].hashtable,
+ &aggstate->perhash[0].hashiter);
+
+ return true;
+}
+
/*
* ExecAgg for hashed case: retrieving groups from hash table
+ *
+ * After exhausting in-memory tuples, also try refilling the hash table using
+ * previously-spilled tuples. Only returns NULL after all in-memory and
+ * spilled tuples are exhausted.
*/
static TupleTableSlot *
agg_retrieve_hash_table(AggState *aggstate)
+{
+ TupleTableSlot *result = NULL;
+
+ while (result == NULL)
+ {
+ result = agg_retrieve_hash_table_in_memory(aggstate);
+ if (result == NULL)
+ {
+ if (!agg_refill_hash_table(aggstate))
+ {
+ aggstate->agg_done = true;
+ break;
+ }
+ }
+ }
+
+ return result;
+}
+
+/*
+ * Retrieve the groups from the in-memory hash tables without considering any
+ * spilled tuples.
+ */
+static TupleTableSlot *
+agg_retrieve_hash_table_in_memory(AggState *aggstate)
{
ExprContext *econtext;
AggStatePerAgg peragg;
@@ -1976,7 +2635,7 @@ agg_retrieve_hash_table(AggState *aggstate)
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
- while (!aggstate->agg_done)
+ for (;;)
{
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -2007,8 +2666,6 @@ agg_retrieve_hash_table(AggState *aggstate)
}
else
{
- /* No more hashtables, so done */
- aggstate->agg_done = true;
return NULL;
}
}
@@ -2065,6 +2722,296 @@ agg_retrieve_hash_table(AggState *aggstate)
return NULL;
}
+/*
+ * Assign unused tapes to spill partitions, extending the tape set if
+ * necessary.
+ */
+static void
+hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *partitions,
+ int npartitions)
+{
+ int partidx = 0;
+
+ /* use free tapes if available */
+ while (partidx < npartitions && tapeinfo->nfreetapes > 0)
+ partitions[partidx++] = tapeinfo->freetapes[--tapeinfo->nfreetapes];
+
+ if (tapeinfo->tapeset == NULL)
+ tapeinfo->tapeset = LogicalTapeSetCreate(npartitions, NULL, NULL, -1);
+ else if (partidx < npartitions)
+ {
+ tapeinfo->tapeset = LogicalTapeSetExtend(
+ tapeinfo->tapeset, npartitions - partidx);
+ }
+
+ while (partidx < npartitions)
+ partitions[partidx++] = tapeinfo->ntapes++;
+}
+
+/*
+ * After a tape has already been written to and then read, this function
+ * rewinds it for writing and adds it to the free list.
+ */
+static void
+hashagg_tapeinfo_release(HashTapeInfo *tapeinfo, int tapenum)
+{
+ LogicalTapeRewindForWrite(tapeinfo->tapeset, tapenum);
+ if (tapeinfo->freetapes == NULL)
+ tapeinfo->freetapes = palloc(sizeof(int));
+ else
+ tapeinfo->freetapes = repalloc(
+ tapeinfo->freetapes, sizeof(int) * (tapeinfo->nfreetapes + 1));
+ tapeinfo->freetapes[tapeinfo->nfreetapes++] = tapenum;
+}
+
+/*
+ * hashagg_spill_init
+ *
+ * Called after we determined that spilling is necessary. Chooses the number
+ * of partitions to create, and initializes them.
+ */
+static void
+hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo, int used_bits,
+ uint64 input_groups, double hashentrysize)
+{
+ int npartitions;
+ int partition_bits;
+
+ npartitions = hash_choose_num_partitions(
+ input_groups, hashentrysize, used_bits, &partition_bits);
+
+ spill->partitions = palloc0(sizeof(int) * npartitions);
+ spill->ntuples = palloc0(sizeof(int64) * npartitions);
+
+ hashagg_tapeinfo_assign(tapeinfo, spill->partitions, npartitions);
+
+ spill->tapeinfo = tapeinfo;
+ spill->shift = 32 - used_bits - partition_bits;
+ spill->mask = (npartitions - 1) << spill->shift;
+ spill->npartitions = npartitions;
+}
+
+/*
+ * hashagg_spill_tuple
+ *
+ * No room for new groups in the hash table. Save for later in the appropriate
+ * partition.
+ */
+static Size
+hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot, uint32 hash)
+{
+ LogicalTapeSet *tapeset = spill->tapeinfo->tapeset;
+ int partition;
+ MinimalTuple tuple;
+ int tapenum;
+ int total_written = 0;
+ bool shouldFree;
+
+ Assert(spill->partitions != NULL);
+
+ /* may contain unnecessary attributes, consider projecting? */
+ tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+ partition = (hash & spill->mask) >> spill->shift;
+ spill->ntuples[partition]++;
+
+ tapenum = spill->partitions[partition];
+
+ LogicalTapeWrite(tapeset, tapenum, (void *) &hash, sizeof(uint32));
+ total_written += sizeof(uint32);
+
+ LogicalTapeWrite(tapeset, tapenum, (void *) tuple, tuple->t_len);
+ total_written += tuple->t_len;
+
+ if (shouldFree)
+ pfree(tuple);
+
+ return total_written;
+}
+
+/*
+ * read_spilled_tuple
+ * read the next tuple from a batch file. Return NULL if no more.
+ */
+static MinimalTuple
+hashagg_batch_read(HashAggBatch *batch, uint32 *hashp)
+{
+ LogicalTapeSet *tapeset = batch->tapeinfo->tapeset;
+ int tapenum = batch->input_tapenum;
+ MinimalTuple tuple;
+ uint32 t_len;
+ size_t nread;
+ uint32 hash;
+
+ nread = LogicalTapeRead(tapeset, tapenum, &hash, sizeof(uint32));
+ if (nread == 0)
+ return NULL;
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, sizeof(uint32), nread)));
+ if (hashp != NULL)
+ *hashp = hash;
+
+ nread = LogicalTapeRead(tapeset, tapenum, &t_len, sizeof(t_len));
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, sizeof(uint32), nread)));
+
+ tuple = (MinimalTuple) palloc(t_len);
+ tuple->t_len = t_len;
+
+ nread = LogicalTapeRead(tapeset, tapenum,
+ (void *)((char *)tuple + sizeof(uint32)),
+ t_len - sizeof(uint32));
+ if (nread != t_len - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, t_len - sizeof(uint32), nread)));
+
+ return tuple;
+}
+
+/*
+ * new_hashagg_batch
+ *
+ * Construct a HashAggBatch item, which represents one iteration of HashAgg to
+ * be done. Should be called in the aggregate's memory context.
+ */
+static HashAggBatch *
+hashagg_batch_new(HashTapeInfo *tapeinfo, int tapenum, int setno,
+ int64 input_tuples, int used_bits)
+{
+ HashAggBatch *batch = palloc0(sizeof(HashAggBatch));
+
+ batch->setno = setno;
+ batch->used_bits = used_bits;
+ batch->tapeinfo = tapeinfo;
+ batch->input_tapenum = tapenum;
+ batch->input_tuples = input_tuples;
+
+ return batch;
+}
+
+/*
+ * hashagg_finish_initial_spills
+ *
+ * After a HashAggBatch has been processed, it may have spilled tuples to
+ * disk. If so, turn the spilled partitions into new batches that must later
+ * be executed.
+ */
+static void
+hashagg_finish_initial_spills(AggState *aggstate)
+{
+ int setno;
+
+ if (aggstate->hash_spills == NULL)
+ return;
+
+ /* update hashentrysize estimate based on contents */
+ Assert(aggstate->hash_ngroups_current > 0);
+ aggstate->hashentrysize = (double)aggstate->hash_alloc_last /
+ (double)aggstate->hash_ngroups_current;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hashagg_spill_finish(aggstate, &aggstate->hash_spills[setno], setno);
+
+ aggstate->hash_spill_mode = false;
+
+ /*
+ * We're not processing tuples from outer plan any more; only processing
+ * batches of spilled tuples. The initial spill structures are no longer
+ * needed.
+ */
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+}
+
+/*
+ * hashagg_spill_finish
+ *
+ * Transform spill partitions into new batches.
+ */
+static void
+hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno)
+{
+ int i;
+ int used_bits = 32 - spill->shift;
+
+ if (spill->npartitions == 0)
+ return; /* didn't spill */
+
+ for (i = 0; i < spill->npartitions; i++)
+ {
+ int tapenum = spill->partitions[i];
+ MemoryContext oldContext;
+ HashAggBatch *new_batch;
+
+ oldContext = MemoryContextSwitchTo(aggstate->ss.ps.state->es_query_cxt);
+ new_batch = hashagg_batch_new(aggstate->hash_tapeinfo,
+ tapenum, setno, spill->ntuples[i],
+ used_bits);
+ aggstate->hash_batches = lcons(new_batch, aggstate->hash_batches);
+ aggstate->hash_batches_used++;
+ MemoryContextSwitchTo(oldContext);
+ }
+
+ pfree(spill->ntuples);
+ pfree(spill->partitions);
+}
+
+/*
+ * Free resources related to a spilled HashAgg.
+ */
+static void
+hashagg_reset_spill_state(AggState *aggstate)
+{
+ ListCell *lc;
+
+ /* free spills from initial pass */
+ if (aggstate->hash_spills != NULL)
+ {
+ int setno;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ {
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ if (spill->ntuples != NULL)
+ pfree(spill->ntuples);
+ if (spill->partitions != NULL)
+ pfree(spill->partitions);
+ }
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+ }
+
+ /* free batches */
+ foreach(lc, aggstate->hash_batches)
+ {
+ HashAggBatch *batch = (HashAggBatch*) lfirst(lc);
+ pfree(batch);
+ }
+ list_free(aggstate->hash_batches);
+ aggstate->hash_batches = NIL;
+
+ /* close tape set */
+ if (aggstate->hash_tapeinfo != NULL)
+ {
+ HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
+ if (tapeinfo->tapeset != NULL)
+ LogicalTapeSetClose(tapeinfo->tapeset);
+ if (tapeinfo->freetapes != NULL)
+ pfree(tapeinfo->freetapes);
+ pfree(tapeinfo);
+ aggstate->hash_tapeinfo = NULL;
+ }
+}
+
+
/* -----------------
* ExecInitAgg
*
@@ -2249,6 +3196,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->ss.ps.outeropsfixed = false;
}
+ if (use_hashing)
+ aggstate->hash_spill_slot = ExecInitExtraTupleSlot(estate, scanDesc,
+ &TTSOpsMinimalTuple);
+
/*
* Initialize result type, slot and projection.
*/
@@ -2474,11 +3425,24 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
*/
if (use_hashing)
{
+ Plan *outerplan = outerPlan(node);
+ long totalGroups = 0;
+ int i;
+
/* this is an array of pointers, not structures */
aggstate->hash_pergroup = pergroups;
+ aggstate->hashentrysize = hash_agg_entry_size(
+ aggstate->numtrans, outerplan->plan_width, node->transitionSpace);
+
+ for (i = 0; i < aggstate->num_hashes; i++)
+ totalGroups += aggstate->perhash[i].aggnode->numGroups;
+
+ hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
+ &aggstate->hash_mem_limit,
+ &aggstate->hash_ngroups_limit, NULL);
find_hash_columns(aggstate);
- build_hash_table(aggstate);
+ build_hash_tables(aggstate);
aggstate->table_filled = false;
}
@@ -2884,7 +3848,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
else
Assert(false);
- phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash);
+ phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash,
+ false);
}
@@ -3379,6 +4344,8 @@ ExecEndAgg(AggState *node)
if (node->sort_out)
tuplesort_end(node->sort_out);
+ hashagg_reset_spill_state(node);
+
for (transno = 0; transno < node->numtrans; transno++)
{
AggStatePerTrans pertrans = &node->pertrans[transno];
@@ -3434,12 +4401,13 @@ ExecReScanAgg(AggState *node)
return;
/*
- * If we do have the hash table, and the subplan does not have any
- * parameter changes, and none of our own parameter changes affect
- * input expressions of the aggregated functions, then we can just
- * rescan the existing hash table; no need to build it again.
+ * If we do have the hash table, and it never spilled, and the subplan
+ * does not have any parameter changes, and none of our own parameter
+ * changes affect input expressions of the aggregated functions, then
+ * we can just rescan the existing hash table; no need to build it
+ * again.
*/
- if (outerPlan->chgParam == NULL &&
+ if (outerPlan->chgParam == NULL && !node->hash_ever_spilled &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
ResetTupleHashIterator(node->perhash[0].hashtable,
@@ -3496,11 +4464,29 @@ ExecReScanAgg(AggState *node)
*/
if (node->aggstrategy == AGG_HASHED || node->aggstrategy == AGG_MIXED)
{
+ hashagg_reset_spill_state(node);
+
+ node->hash_ever_spilled = false;
+ node->hash_spill_mode = false;
+ node->hash_alloc_last = 0;
+ node->hash_alloc_current = 0;
+ node->hash_ngroups_current = 0;
+
+ /* reset stats */
+ node->hash_mem_peak = 0;
+ node->hash_disk_used = 0;
+ node->hash_batches_used = 0;
+
ReScanExprContext(node->hashcontext);
/* Rebuild an empty hash table */
- build_hash_table(node);
+ build_hash_tables(node);
node->table_filled = false;
/* iterator will be reset when the table is filled */
+
+ node->ss.ps.outerops =
+ ExecGetResultSlotOps(outerPlanState(&node->ss),
+ &node->ss.ps.outeropsfixed);
+ hashagg_recompile_expressions(node);
}
if (node->aggstrategy != AGG_HASHED)
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index cea0d6fa5ce..7246fc2b33f 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -2047,6 +2047,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_INIT_TRANS:
+ case EEOP_AGG_INIT_TRANS_SPILLED:
{
AggStatePerTrans pertrans;
@@ -2056,6 +2057,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_allpergroupsp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_setoff,
v_transno;
@@ -2082,11 +2084,32 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_init_trans.setoff);
v_transno = l_int32_const(op->d.agg_init_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_notransvalue = l_bb_before_v(
+ opblocks[opno + 1], "op.%d.check_notransvalue", opno);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(
+ b, v_pergroup_allaggs, TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[opno + 1],
+ b_check_notransvalue);
+
+ LLVMPositionBuilderAtEnd(b, b_check_notransvalue);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_notransvalue =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_NOTRANSVALUE,
@@ -2143,6 +2166,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_STRICT_TRANS_CHECK:
+ case EEOP_AGG_STRICT_TRANS_CHECK_SPILLED:
{
LLVMValueRef v_setoff,
v_transno;
@@ -2152,6 +2176,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_transnull;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
int jumpnull = op->d.agg_strict_trans_check.jumpnull;
@@ -2171,11 +2196,32 @@ llvm_compile_expr(ExprState *state)
l_int32_const(op->d.agg_strict_trans_check.setoff);
v_transno =
l_int32_const(op->d.agg_strict_trans_check.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_transnull = l_bb_before_v(
+ opblocks[opno + 1], "op.%d.check_transnull", opno);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[jumpnull],
+ b_check_transnull);
+
+ LLVMPositionBuilderAtEnd(b, b_check_transnull);
+ }
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_transnull =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_TRANSVALUEISNULL,
@@ -2191,7 +2237,9 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_PLAIN_TRANS_BYVAL:
+ case EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED:
case EEOP_AGG_PLAIN_TRANS:
+ case EEOP_AGG_PLAIN_TRANS_SPILLED:
{
AggState *aggstate;
AggStatePerTrans pertrans;
@@ -2217,6 +2265,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_pertransp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_retval;
@@ -2244,10 +2293,33 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_trans.setoff);
v_transno = l_int32_const(op->d.agg_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED ||
+ opcode == EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_advance_transval = l_bb_before_v(
+ opblocks[opno + 1], "op.%d.advance_transval", opno);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[opno + 1],
+ b_advance_transval);
+
+ LLVMPositionBuilderAtEnd(b, b_advance_transval);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_fcinfo = l_ptr_const(fcinfo,
l_ptr(StructFunctionCallInfoData));
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b5a0033721f..8d58780bf6a 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -77,6 +77,7 @@
#include "access/htup_details.h"
#include "access/tsmapi.h"
#include "executor/executor.h"
+#include "executor/nodeAgg.h"
#include "executor/nodeHash.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -128,6 +129,7 @@ bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
bool enable_hashagg = true;
+bool enable_hashagg_spill = true;
bool enable_nestloop = true;
bool enable_material = true;
bool enable_mergejoin = true;
@@ -2153,7 +2155,7 @@ cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples)
+ double input_tuples, double input_width)
{
double output_tuples;
Cost startup_cost;
@@ -2219,21 +2221,88 @@ cost_agg(Path *path, PlannerInfo *root,
total_cost += aggcosts->finalCost.per_tuple * numGroups;
total_cost += cpu_tuple_cost * numGroups;
output_tuples = numGroups;
+
+ /*
+ * We don't need to compute the disk costs of hash aggregation here,
+ * because the planner does not choose hash aggregation for grouping
+ * sets that it doesn't expect to fit in memory.
+ */
}
else
{
+ double pages_written = 0.0;
+ double pages_read = 0.0;
+ double hashentrysize;
+ double nbatches;
+ Size mem_limit;
+ long ngroups_limit;
+ int num_partitions;
+
/* must be AGG_HASHED */
startup_cost = input_total_cost;
if (!enable_hashagg)
startup_cost += disable_cost;
startup_cost += aggcosts->transCost.startup;
startup_cost += aggcosts->transCost.per_tuple * input_tuples;
+ /* cost of computing hash value */
startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;
startup_cost += aggcosts->finalCost.startup;
+
total_cost = startup_cost;
total_cost += aggcosts->finalCost.per_tuple * numGroups;
+ /* cost of retrieving from hash table */
total_cost += cpu_tuple_cost * numGroups;
output_tuples = numGroups;
+
+ /*
+ * Estimate number of batches based on the computed limits. If less
+ * than or equal to one, all groups are expected to fit in memory;
+ * otherwise we expect to spill.
+ */
+ hashentrysize = hash_agg_entry_size(
+ aggcosts->numAggs, input_width, aggcosts->transitionSpace);
+ hash_agg_set_limits(hashentrysize, numGroups, 0, &mem_limit,
+ &ngroups_limit, &num_partitions);
+
+ nbatches = Max( (numGroups * hashentrysize) / mem_limit,
+ numGroups / ngroups_limit );
+
+ /*
+ * Estimate number of pages read and written. For each level of
+ * recursion, a tuple must be written and then later read.
+ */
+ if (!hashagg_mem_overflow && nbatches > 1.0)
+ {
+ double depth;
+ double pages;
+
+ pages = relation_byte_size(input_tuples, input_width) / BLCKSZ;
+
+ /*
+ * The number of partitions can change at different levels of
+ * recursion; but for the purposes of this calculation assume it
+ * stays constant.
+ */
+ depth = ceil( log(nbatches - 1) / log(num_partitions) );
+ pages_written = pages_read = pages * depth;
+ }
+
+ /*
+ * Add the disk costs of hash aggregation that spills to disk.
+ *
+ * Groups that go into the hash table stay in memory until finalized,
+ * so spilling and reprocessing tuples doesn't incur additional
+ * invocations of transCost or finalCost. Furthermore, the computed
+ * hash value is stored with the spilled tuples, so we don't incur
+ * extra invocations of the hash function.
+ *
+ * Hash Agg begins returning tuples after the first batch is
+ * complete. Accrue writes (spilled tuples) to startup_cost and to
+ * total_cost; accrue reads only to total_cost.
+ */
+ startup_cost += pages_written * random_page_cost;
+ total_cost += pages_written * random_page_cost;
+ total_cost += pages_read * seq_page_cost;
}
/*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e048d200bb4..090919e39a0 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1644,6 +1644,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
NIL,
NIL,
best_path->path.rows,
+ 0,
subplan);
}
else
@@ -2096,6 +2097,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
NIL,
NIL,
best_path->numGroups,
+ best_path->transitionSpace,
subplan);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -2257,6 +2259,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
NIL,
rollup->numGroups,
+ best_path->transitionSpace,
sort_plan);
/*
@@ -2295,6 +2298,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
chain,
rollup->numGroups,
+ best_path->transitionSpace,
subplan);
/* Copy cost data from Path to Plan */
@@ -6192,8 +6196,8 @@ Agg *
make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree)
+ List *groupingSets, List *chain, double dNumGroups,
+ int32 transitionSpace, Plan *lefttree)
{
Agg *node = makeNode(Agg);
Plan *plan = &node->plan;
@@ -6209,6 +6213,7 @@ make_agg(List *tlist, List *qual,
node->grpOperators = grpOperators;
node->grpCollations = grpCollations;
node->numGroups = numGroups;
+ node->transitionSpace = transitionSpace;
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
node->groupingSets = groupingSets;
node->chain = chain;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b44efd6314c..913ad9335e5 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6528,7 +6528,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
* were unable to sort above, then we'd better generate a Path, so
* that we at least have one.
*/
- if (hashaggtablesize < work_mem * 1024L ||
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L ||
grouped_rel->pathlist == NIL)
{
/*
@@ -6561,7 +6562,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
agg_final_costs,
dNumGroups);
- if (hashaggtablesize < work_mem * 1024L)
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6830,7 +6832,7 @@ create_partial_grouping_paths(PlannerInfo *root,
* Tentatively produce a partial HashAgg Path, depending on if it
* looks as if the hash table will fit in work_mem.
*/
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_total_path != NULL)
{
add_path(partially_grouped_rel, (Path *)
@@ -6857,7 +6859,7 @@ create_partial_grouping_paths(PlannerInfo *root,
dNumPartialPartialGroups);
/* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_partial_path != NULL)
{
add_partial_path(partially_grouped_rel, (Path *)
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 1a23e18970d..951aed80e7a 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1072,7 +1072,7 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
numGroupCols, dNumGroups,
NIL,
input_path->startup_cost, input_path->total_cost,
- input_path->rows);
+ input_path->rows, input_path->pathtarget->width);
/*
* Now for the sorted case. Note that the input is *always* unsorted,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e6d08aede56..8ba8122ee2f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1704,7 +1704,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
NIL,
subpath->startup_cost,
subpath->total_cost,
- rel->rows);
+ rel->rows,
+ subpath->pathtarget->width);
}
if (sjinfo->semi_can_btree && sjinfo->semi_can_hash)
@@ -2949,6 +2950,7 @@ create_agg_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->aggsplit = aggsplit;
pathnode->numGroups = numGroups;
+ pathnode->transitionSpace = aggcosts ? aggcosts->transitionSpace : 0;
pathnode->groupClause = groupClause;
pathnode->qual = qual;
@@ -2957,7 +2959,7 @@ create_agg_path(PlannerInfo *root,
list_length(groupClause), numGroups,
qual,
subpath->startup_cost, subpath->total_cost,
- subpath->rows);
+ subpath->rows, subpath->pathtarget->width);
/* add tlist eval cost for each output row */
pathnode->path.startup_cost += target->cost.startup;
@@ -3036,6 +3038,7 @@ create_groupingsets_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->rollups = rollups;
pathnode->qual = having_qual;
+ pathnode->transitionSpace = agg_costs ? agg_costs->transitionSpace : 0;
Assert(rollups != NIL);
Assert(aggstrategy != AGG_PLAIN || list_length(rollups) == 1);
@@ -3067,7 +3070,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
subpath->startup_cost,
subpath->total_cost,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
is_first = false;
if (!rollup->is_hashed)
is_first_sort = false;
@@ -3090,7 +3094,8 @@ create_groupingsets_path(PlannerInfo *root,
rollup->numGroups,
having_qual,
0.0, 0.0,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
if (!rollup->is_hashed)
is_first_sort = false;
}
@@ -3115,7 +3120,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
sort_path.startup_cost,
sort_path.total_cost,
- sort_path.rows);
+ sort_path.rows,
+ subpath->pathtarget->width);
}
pathnode->path.total_cost += agg_path.total_cost;
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb196444198..1151b807418 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -120,6 +120,7 @@ bool enableFsync = true;
bool allowSystemTableMods = false;
int work_mem = 1024;
int maintenance_work_mem = 16384;
+bool hashagg_mem_overflow = false;
int max_parallel_maintenance_workers = 2;
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8228e1f3903..ed6737a8ac9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -998,6 +998,26 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_hashagg_spill", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of hashed aggregation plans that are expected to exceed work_mem."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_hashagg_spill,
+ true,
+ NULL, NULL, NULL
+ },
+ {
+ {"hashagg_mem_overflow", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables hashed aggregation to overflow work_mem at execution time."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &hashagg_mem_overflow,
+ false,
+ NULL, NULL, NULL
+ },
{
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of materialization."),
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 33495f8b4b3..16f6762086b 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -201,6 +201,8 @@ static long ltsGetFreeBlock(LogicalTapeSet *lts);
static void ltsReleaseBlock(LogicalTapeSet *lts, long blocknum);
static void ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
SharedFileSet *fileset);
+static void ltsInitTape(LogicalTape *lt);
+static void ltsInitReadBuffer(LogicalTapeSet *lts, LogicalTape *lt);
/*
@@ -535,6 +537,51 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
lts->nHoleBlocks = lts->nBlocksAllocated - nphysicalblocks;
}
+/*
+ * Initialize per-tape struct. Note we allocate the I/O buffer and the first
+ * block for a tape only when it is first actually written to. This avoids
+ * wasting memory space when tuplesort.c overestimates the number of tapes
+ * needed.
+ */
+static void
+ltsInitTape(LogicalTape *lt)
+{
+ lt->writing = true;
+ lt->frozen = false;
+ lt->dirty = false;
+ lt->firstBlockNumber = -1L;
+ lt->curBlockNumber = -1L;
+ lt->nextBlockNumber = -1L;
+ lt->offsetBlockNumber = 0L;
+ lt->buffer = NULL;
+ lt->buffer_size = 0;
+ /* palloc() larger than MaxAllocSize would fail */
+ lt->max_size = MaxAllocSize;
+ lt->pos = 0;
+ lt->nbytes = 0;
+}
+
+/*
+ * Lazily allocate and initialize the read buffer. This avoids waste when many
+ * tapes are open at once, but not all are active between rewinding and
+ * reading.
+ */
+static void
+ltsInitReadBuffer(LogicalTapeSet *lts, LogicalTape *lt)
+{
+ if (lt->firstBlockNumber != -1L)
+ {
+ Assert(lt->buffer_size > 0);
+ lt->buffer = palloc(lt->buffer_size);
+ }
+
+ /* Read the first block, or reset if tape is empty */
+ lt->nextBlockNumber = lt->firstBlockNumber;
+ lt->pos = 0;
+ lt->nbytes = 0;
+ ltsReadFillBuffer(lts, lt);
+}
+
/*
* Create a set of logical tapes in a temporary underlying file.
*
@@ -560,7 +607,6 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
int worker)
{
LogicalTapeSet *lts;
- LogicalTape *lt;
int i;
/*
@@ -578,29 +624,8 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
lts->nFreeBlocks = 0;
lts->nTapes = ntapes;
- /*
- * Initialize per-tape structs. Note we allocate the I/O buffer and the
- * first block for a tape only when it is first actually written to. This
- * avoids wasting memory space when tuplesort.c overestimates the number
- * of tapes needed.
- */
for (i = 0; i < ntapes; i++)
- {
- lt = &lts->tapes[i];
- lt->writing = true;
- lt->frozen = false;
- lt->dirty = false;
- lt->firstBlockNumber = -1L;
- lt->curBlockNumber = -1L;
- lt->nextBlockNumber = -1L;
- lt->offsetBlockNumber = 0L;
- lt->buffer = NULL;
- lt->buffer_size = 0;
- /* palloc() larger than MaxAllocSize would fail */
- lt->max_size = MaxAllocSize;
- lt->pos = 0;
- lt->nbytes = 0;
- }
+ ltsInitTape(&lts->tapes[i]);
/*
* Create temp BufFile storage as required.
@@ -821,15 +846,9 @@ LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum, size_t buffer_size)
lt->buffer_size = 0;
if (lt->firstBlockNumber != -1L)
{
- lt->buffer = palloc(buffer_size);
+ /* the buffer is lazily allocated, but set the size here */
lt->buffer_size = buffer_size;
}
-
- /* Read the first block, or reset if tape is empty */
- lt->nextBlockNumber = lt->firstBlockNumber;
- lt->pos = 0;
- lt->nbytes = 0;
- ltsReadFillBuffer(lts, lt);
}
/*
@@ -878,6 +897,9 @@ LogicalTapeRead(LogicalTapeSet *lts, int tapenum,
lt = &lts->tapes[tapenum];
Assert(!lt->writing);
+ if (lt->buffer == NULL)
+ ltsInitReadBuffer(lts, lt);
+
while (size > 0)
{
if (lt->pos >= lt->nbytes)
@@ -991,6 +1013,29 @@ LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum, TapeShare *share)
}
}
+/*
+ * Add additional tapes to this tape set. Not intended to be used when any
+ * tapes are frozen.
+ */
+LogicalTapeSet *
+LogicalTapeSetExtend(LogicalTapeSet *lts, int nAdditional)
+{
+ int i;
+ int nTapesOrig = lts->nTapes;
+ Size newSize;
+
+ lts->nTapes += nAdditional;
+ newSize = offsetof(LogicalTapeSet, tapes) +
+ lts->nTapes * sizeof(LogicalTape);
+
+ lts = (LogicalTapeSet *) repalloc(lts, newSize);
+
+ for (i = nTapesOrig; i < lts->nTapes; i++)
+ ltsInitTape(&lts->tapes[i]);
+
+ return lts;
+}
+
/*
* Backspace the tape a given number of bytes. (We also support a more
* general seek interface, see below.)
@@ -1015,6 +1060,9 @@ LogicalTapeBackspace(LogicalTapeSet *lts, int tapenum, size_t size)
Assert(lt->frozen);
Assert(lt->buffer_size == BLCKSZ);
+ if (lt->buffer == NULL)
+ ltsInitReadBuffer(lts, lt);
+
/*
* Easy case for seek within current block.
*/
@@ -1087,6 +1135,9 @@ LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
Assert(offset >= 0 && offset <= TapeBlockPayloadSize);
Assert(lt->buffer_size == BLCKSZ);
+ if (lt->buffer == NULL)
+ ltsInitReadBuffer(lts, lt);
+
if (blocknum != lt->curBlockNumber)
{
ltsReadBlock(lts, blocknum, (void *) lt->buffer);
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 73a2ca8c6dd..d70bc048c46 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -226,9 +226,13 @@ typedef enum ExprEvalOp
EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
EEOP_AGG_INIT_TRANS,
+ EEOP_AGG_INIT_TRANS_SPILLED,
EEOP_AGG_STRICT_TRANS_CHECK,
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
EEOP_AGG_PLAIN_TRANS_BYVAL,
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
EEOP_AGG_PLAIN_TRANS,
+ EEOP_AGG_PLAIN_TRANS_SPILLED,
EEOP_AGG_ORDERED_TRANS_DATUM,
EEOP_AGG_ORDERED_TRANS_TUPLE,
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 81fdfa4add3..d6eb2abb60b 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -255,7 +255,7 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
- bool doSort, bool doHash);
+ bool doSort, bool doHash, bool spilled);
extern ExprState *ExecBuildGroupingEqual(TupleDesc ldesc, TupleDesc rdesc,
const TupleTableSlotOps *lops, const TupleTableSlotOps *rops,
int numCols,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 264916f9a92..307987a45ab 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -311,5 +311,8 @@ extern void ExecReScanAgg(AggState *node);
extern Size hash_agg_entry_size(int numAggs, Size tupleWidth,
Size transitionSpace);
+extern void hash_agg_set_limits(double hashentrysize, uint64 input_groups,
+ int used_bits, Size *mem_limit,
+ long *ngroups_limit, int *num_partitions);
#endif /* NODEAGG_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f985453ec32..707a07a2de4 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -244,6 +244,7 @@ extern bool enableFsync;
extern PGDLLIMPORT bool allowSystemTableMods;
extern PGDLLIMPORT int work_mem;
extern PGDLLIMPORT int maintenance_work_mem;
+extern PGDLLIMPORT bool hashagg_mem_overflow;
extern PGDLLIMPORT int max_parallel_maintenance_workers;
extern int VacuumCostPageHit;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5d5b38b8799..a04ea19b112 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2075,13 +2075,32 @@ typedef struct AggState
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
/* these fields are used in AGG_HASHED and AGG_MIXED modes: */
bool table_filled; /* hash table filled yet? */
- int num_hashes;
+ int num_hashes; /* number of hash tables active at once */
+ double hashentrysize; /* estimate revised during execution */
+ struct HashTapeInfo *hash_tapeinfo; /* metadata for spill tapes */
+ struct HashAggSpill *hash_spills; /* HashAggSpill for each hash table,
+ exists only during first pass if spilled */
+ TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
+ bool hash_ever_spilled; /* ever spilled during this execution? */
+ bool hash_spill_mode; /* we hit a limit during the current batch
+ and we must not create new groups */
+ Size hash_alloc_last; /* previous total memory allocation */
+ Size hash_alloc_current; /* current total memory allocation */
+ Size hash_mem_limit; /* limit before spilling hash table */
+ Size hash_mem_peak; /* peak hash table memory usage */
+ long hash_ngroups_current; /* number of groups currently in
+ memory in all hash tables */
+ long hash_ngroups_limit; /* limit before spilling hash table */
+ long hash_disk_used; /* kB of disk space used */
+ int hash_batches_used; /* batches used during entire execution */
+ List *hash_batches; /* hash batches remaining to be processed */
+
AggStatePerHash perhash; /* array of per-hashtable data */
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 34
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 49
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 3d3be197e0e..be592d0fee4 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1663,6 +1663,7 @@ typedef struct AggPath
AggStrategy aggstrategy; /* basic strategy, see nodes.h */
AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
double numGroups; /* estimated number of groups in input */
+ int32 transitionSpace; /* estimated transition state size */
List *groupClause; /* a list of SortGroupClause's */
List *qual; /* quals (HAVING quals), if any */
} AggPath;
@@ -1700,6 +1701,7 @@ typedef struct GroupingSetsPath
AggStrategy aggstrategy; /* basic strategy */
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
+ int32 transitionSpace; /* estimated transition state size */
} GroupingSetsPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 32c0d87f80e..f4183e1efa5 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -813,6 +813,7 @@ typedef struct Agg
Oid *grpOperators; /* equality operators to compare with */
Oid *grpCollations;
long numGroups; /* estimated number of groups in input */
+ int32 transitionSpace; /* estimated transition state size */
Bitmapset *aggParams; /* IDs of Params used in Aggref inputs */
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
List *groupingSets; /* grouping sets to use */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index cb012ba1980..6572dc24699 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -54,6 +54,7 @@ extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
extern PGDLLIMPORT bool enable_hashagg;
+extern PGDLLIMPORT bool enable_hashagg_spill;
extern PGDLLIMPORT bool enable_nestloop;
extern PGDLLIMPORT bool enable_material;
extern PGDLLIMPORT bool enable_mergejoin;
@@ -114,7 +115,7 @@ extern void cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples);
+ double input_tuples, double input_width);
extern void cost_windowagg(Path *path, PlannerInfo *root,
List *windowFuncs, int numPartCols, int numOrderCols,
Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index eab486a6214..c7bda2b0917 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -54,8 +54,8 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree);
+ List *groupingSets, List *chain, double dNumGroups,
+ int32 transitionSpace, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
/*
diff --git a/src/include/utils/logtape.h b/src/include/utils/logtape.h
index 695d2c00ee4..3ebe52239f8 100644
--- a/src/include/utils/logtape.h
+++ b/src/include/utils/logtape.h
@@ -67,6 +67,8 @@ extern void LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum,
extern void LogicalTapeRewindForWrite(LogicalTapeSet *lts, int tapenum);
extern void LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum,
TapeShare *share);
+extern LogicalTapeSet *LogicalTapeSetExtend(LogicalTapeSet *lts,
+ int nAdditional);
extern size_t LogicalTapeBackspace(LogicalTapeSet *lts, int tapenum,
size_t size);
extern void LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index f457b5b150f..7eeeaaa5e4a 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2357,3 +2357,124 @@ explain (costs off)
-> Seq Scan on onek
(8 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+set work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------------
+ GroupAggregate
+ Group Key: ((g % 100000))
+ -> Sort
+ Sort Key: ((g % 100000))
+ -> Function Scan on generate_series g
+(5 rows)
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+create table agg_group_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+set jit_above_cost to default;
+create table agg_group_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_group_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+-- Produce results with hash aggregation
+set enable_hashagg = true;
+set enable_sort = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 100000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+create table agg_hash_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+set jit_above_cost to default;
+create table agg_hash_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_hash_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+set enable_sort = true;
+set work_mem to default;
+-- Compare group aggregation results to hash aggregation results
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+ a | c1 | c2 | c3
+---+----+----+----
+(0 rows)
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_4 except select * from agg_group_4)
+ union all
+(select * from agg_group_4 except select * from agg_hash_4);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_group_4;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
+drop table agg_hash_4;
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index c1f802c88a7..767f60a96c7 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1633,4 +1633,127 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
| 1 | 2
(4 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+-- Produce results with hash aggregation.
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+set enable_sort = true;
+set work_mem to default;
+-- Compare results
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+ g100 | g10 | unnest | c | m
+------+-----+--------+---+---
+(0 rows)
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
-- end
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1de..11c6f50fbfa 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -148,6 +148,68 @@ SELECT count(*) FROM
4
(1 row)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+SET enable_hashagg=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------------
+ Unique
+ -> Sort
+ Sort Key: ((g % 1000))
+ -> Function Scan on generate_series g
+(4 rows)
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_hashagg=TRUE;
+-- Produce results with hash aggregation.
+SET enable_sort=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 1000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_sort=TRUE;
+SET work_mem TO DEFAULT;
+-- Compare results
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+(SELECT * FROM distinct_hash_2 EXCEPT SELECT * FROM distinct_group_2)
+ UNION ALL
+(SELECT * FROM distinct_group_2 EXCEPT SELECT * FROM distinct_hash_2);
+ text
+------
+(0 rows)
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb9057..c40bf6c16eb 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -75,6 +75,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
+ enable_hashagg_spill | on
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 3e593f2d615..a4d476c4bb3 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1032,3 +1032,119 @@ select v||'a', case when v||'a' = 'aa' then 1 else 0 end, count(*)
explain (costs off)
select 1 from tenk1
where (hundred, thousand) in (select twothousand, twothousand from onek);
+
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+set work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+
+set jit_above_cost to default;
+
+create table agg_group_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_group_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+-- Produce results with hash aggregation
+
+set enable_hashagg = true;
+set enable_sort = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+
+set jit_above_cost to default;
+
+create table agg_hash_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_hash_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare group aggregation results to hash aggregation results
+
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+
+(select * from agg_hash_4 except select * from agg_group_4)
+ union all
+(select * from agg_group_4 except select * from agg_hash_4);
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_group_4;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
+drop table agg_hash_4;
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 95ac3fb52f6..bf8bce6ed31 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -441,4 +441,103 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
from unnest(array[1,1], array['a','b']) u(i,v)
group by rollup(i, v||'a') order by 1,3;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+-- Produce results with hash aggregation.
+
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare results
+
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
+
-- end
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449e..33102744ebf 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -45,6 +45,68 @@ SELECT count(*) FROM
SELECT count(*) FROM
(SELECT DISTINCT two, four, two FROM tenk1) ss;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+SET enable_hashagg=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_hashagg=TRUE;
+
+-- Produce results with hash aggregation.
+
+SET enable_sort=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_sort=TRUE;
+
+SET work_mem TO DEFAULT;
+
+-- Compare results
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+(SELECT * FROM distinct_hash_2 EXCEPT SELECT * FROM distinct_group_2)
+ UNION ALL
+(SELECT * FROM distinct_group_2 EXCEPT SELECT * FROM distinct_hash_2);
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
+
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
On Mon, 2020-02-10 at 15:57 -0800, Jeff Davis wrote:
> Attaching latest version (combined logtape changes along with main
> HashAgg patch).
I ran a matrix of small performance tests to look for regressions.
The goal was to find out if the refactoring or additional branches
introduced by this patch caused regressions in in-memory HashAgg, Sort,
or the JIT paths. Fortunately, I didn't find any.
This is *not* supposed to represent the performance benefits of the
patch, only to see if I regressed somewhere else. The performance
benefits will be shown in the next round of tests.
I tried JIT on/off; work_mem='4MB' and also a value high enough to
fit the entire working set; enable_hashagg on/off; and 4 different
tables.
The 4 tables are (each containing 20 million tuples):
t1k_20k_int4:
1K groups of 20K tuples each (randomly generated and ordered)
t20m_1_int4:
20M groups of 1 tuple each (randomly generated and ordered)
t1k_20k_text:
the same as t1k_20k_int4 but cast to text (collation C.UTF-8)
t20m_1_text:
the same as t20m_1_int4 but cast to text (collation C.UTF-8)
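For concreteness, tables of this shape could be generated with something
like the sketch below. The actual DDL isn't included in this message, so
the expressions, the use of random(), and the "C.UTF-8" collation are
assumptions rather than the exact setup:

  create table t1k_20k_int4 as
    select (random() * 999)::int4 as i
    from generate_series(1, 20000000);

  create table t20m_1_int4 as
    select g::int4 as i
    from generate_series(1, 20000000) g
    order by random();

  -- text variants: same data cast to text (assuming a "C.UTF-8"
  -- collation is present in pg_collation)
  create table t1k_20k_text as
    select i::text collate "C.UTF-8" as i from t1k_20k_int4;
  create table t20m_1_text as
    select i::text collate "C.UTF-8" as i from t20m_1_int4;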
The query is:
select count(*) from (select i, count(*) from $TABLE group by i) s;
I just did 3 runs in psql and took the median result.
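For reference, one cell of that matrix amounts to a psql session along
these lines (a sketch under the settings described above, not the exact
script I used):

  set jit = on;              -- or off
  set work_mem = '4MB';      -- or a value large enough for the working set
  set enable_hashagg = on;   -- or off
  \timing on
  select count(*) from (select i, count(*) from t1k_20k_int4 group by i) s;
  -- run three times; take the median of the reported times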
I ran against master (cac8ce4a, slightly older, before any of my
patches went in) and my dev branch (attached patch applied against
0973f560).
Results were pretty boring, in a good way. All results were within the
noise, and about as many results were better on dev than on master as
the other way around.
I also did some JIT-specific tests against only t1k_20k_int4. For that,
the hash table fits in memory anyway, so I didn't vary work_mem. The
query I ran included more aggregates to better test JIT:
select i, sum(i), avg(i), min(i)
from t1k_20k_int4
group by i
offset 1000000; -- offset so it doesn't return result
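A sketch of how that run can be forced to use JIT (assuming a build
configured with --with-llvm; the exact session isn't shown here):

  set jit = on;
  set jit_above_cost = 0;   -- force JIT compilation regardless of plan cost
  \timing on
  -- then the query above; again 3 runs, median taken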
I know these tests are simplistic, but I also think they represent a
lot of areas where regressions could have potentially been introduced.
If someone else can find a regression, please let me know.
The new patch is basically just a rebase, with a few other very minor
changes.
Regards,
Jeff Davis
Attachments:
hashagg-20200212-1.patch (text/x-patch)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c1128f89ec7..85f559387f9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1751,6 +1751,23 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-hashagg-mem-overflow" xreflabel="hashagg_mem_overflow">
+ <term><varname>hashagg_mem_overflow</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>hashagg_mem_overflow</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ If hash aggregation exceeds <varname>work_mem</varname> at query
+ execution time, and <varname>hashagg_mem_overflow</varname> is set
+ to <literal>on</literal>, hash aggregation continues consuming
+ memory rather than spilling to disk. The default
+ is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
@@ -4476,6 +4493,24 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-hashagg-spill" xreflabel="enable_hashagg_spill">
+ <term><varname>enable_hashagg_spill</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_hashagg_spill</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the memory usage is expected to
+ exceed <varname>work_mem</varname>. This only affects the planner
+ choice; actual behavior at execution time is dictated by
+ <xref linkend="guc-hashagg-mem-overflow"/>. The default
+ is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
<term><varname>enable_hashjoin</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d901dc4a50e..2923f4ba46d 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -104,6 +104,7 @@ static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_hashagg_info(AggState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
static void show_instrumentation_count(const char *qlabel, int which,
@@ -1882,6 +1883,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Agg:
show_agg_keys(castNode(AggState, planstate), ancestors, es);
show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+ if (es->analyze)
+ show_hashagg_info((AggState *) planstate, es);
if (plan->qual)
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
@@ -2769,6 +2772,55 @@ show_hash_info(HashState *hashstate, ExplainState *es)
}
}
+/*
+ * If EXPLAIN ANALYZE, show information on hash aggregate memory usage and
+ * batches.
+ */
+static void
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+
+ Assert(IsA(aggstate, AggState));
+
+ if (agg->aggstrategy != AGG_HASHED &&
+ agg->aggstrategy != AGG_MIXED)
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(
+ es->str,
+ "Memory Usage: %ldkB",
+ memPeakKb);
+
+ if (aggstate->hash_batches_used > 0)
+ {
+ appendStringInfo(
+ es->str,
+ " Batches: %d Disk: %ldkB",
+ aggstate->hash_batches_used, aggstate->hash_disk_used);
+ }
+
+ appendStringInfo(
+ es->str,
+ "\n");
+ }
+ else
+ {
+ ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
+ if (aggstate->hash_batches_used > 0)
+ {
+ ExplainPropertyInteger("HashAgg Batches", NULL,
+ aggstate->hash_batches_used, es);
+ ExplainPropertyInteger("Disk Usage", "kB",
+ aggstate->hash_disk_used, es);
+ }
+ }
+}
+
/*
* If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
*/
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 121eff97a0c..9dff7990742 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -79,7 +79,8 @@ static void ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash);
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled);
/*
@@ -2927,7 +2928,7 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
*/
ExprState *
ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
- bool doSort, bool doHash)
+ bool doSort, bool doHash, bool spilled)
{
ExprState *state = makeNode(ExprState);
PlanState *parent = &aggstate->ss.ps;
@@ -3158,7 +3159,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (int setno = 0; setno < processGroupingSets; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, false);
+ pertrans, transno, setno, setoff, false,
+ spilled);
setoff++;
}
}
@@ -3177,7 +3179,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (int setno = 0; setno < numHashes; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, true);
+ pertrans, transno, setno, setoff, true,
+ spilled);
setoff++;
}
}
@@ -3227,7 +3230,8 @@ static void
ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash)
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled)
{
int adjust_init_jumpnull = -1;
int adjust_strict_jumpnull = -1;
@@ -3249,7 +3253,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
fcinfo->flinfo->fn_strict &&
pertrans->initValueIsNull)
{
- scratch->opcode = EEOP_AGG_INIT_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_INIT_TRANS_SPILLED : EEOP_AGG_INIT_TRANS;
scratch->d.agg_init_trans.pertrans = pertrans;
scratch->d.agg_init_trans.setno = setno;
scratch->d.agg_init_trans.setoff = setoff;
@@ -3265,7 +3270,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
if (pertrans->numSortCols == 0 &&
fcinfo->flinfo->fn_strict)
{
- scratch->opcode = EEOP_AGG_STRICT_TRANS_CHECK;
+ scratch->opcode = spilled ?
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED : EEOP_AGG_STRICT_TRANS_CHECK;
scratch->d.agg_strict_trans_check.setno = setno;
scratch->d.agg_strict_trans_check.setoff = setoff;
scratch->d.agg_strict_trans_check.transno = transno;
@@ -3282,9 +3288,11 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
/* invoke appropriate transition implementation */
if (pertrans->numSortCols == 0 && pertrans->transtypeByVal)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS_BYVAL;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED : EEOP_AGG_PLAIN_TRANS_BYVAL;
else if (pertrans->numSortCols == 0)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_SPILLED : EEOP_AGG_PLAIN_TRANS;
else if (pertrans->numInputs == 1)
scratch->opcode = EEOP_AGG_ORDERED_TRANS_DATUM;
else
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 35eb8b99f69..e21e0c440ea 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -426,9 +426,13 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
&&CASE_EEOP_AGG_INIT_TRANS,
+ &&CASE_EEOP_AGG_INIT_TRANS_SPILLED,
&&CASE_EEOP_AGG_STRICT_TRANS_CHECK,
+ &&CASE_EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_SPILLED,
&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
&&CASE_EEOP_LAST
@@ -1619,6 +1623,35 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_init_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
+
+ /* If transValue has not yet been initialized, do so now. */
+ if (pergroup->noTransValue)
+ {
+ AggStatePerTrans pertrans = op->d.agg_init_trans.pertrans;
+
+ aggstate->curaggcontext = op->d.agg_init_trans.aggcontext;
+ aggstate->current_set = op->d.agg_init_trans.setno;
+
+ ExecAggInitGroup(aggstate, pertrans, pergroup);
+
+ /* copied trans value from input, done this round */
+ EEO_JUMP(op->d.agg_init_trans.jumpnull);
+ }
+
+ EEO_NEXT();
+ }
/* check that a strict aggregate's input isn't NULL */
EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK)
@@ -1635,6 +1668,24 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_strict_trans_check.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_strict_trans_check.transno];
+
+ if (unlikely(pergroup->transValueIsNull))
+ EEO_JUMP(op->d.agg_strict_trans_check.jumpnull);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1683,6 +1734,51 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ Assert(pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1726,6 +1822,66 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
newVal = FunctionCallInvoke(fcinfo);
+ /*
+ * For pass-by-ref datatype, must copy the new value into
+ * aggcontext and free the prior transValue. But if transfn
+ * returned a pointer to its first input, we don't need to do
+ * anything. Also, if transfn returned a pointer to a R/W
+ * expanded object that is already a child of the aggcontext,
+ * assume we can adopt that value without copying it.
+ */
+ if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+ newVal = ExecAggTransReparent(aggstate, pertrans,
+ newVal, fcinfo->isnull,
+ pergroup->transValue,
+ pergroup->transValueIsNull);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ Assert(!pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
/*
* For pass-by-ref datatype, must copy the new value into
* aggcontext and free the prior transValue. But if transfn
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index b7f49ceddf8..f85ce25415e 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -194,6 +194,29 @@
* transition values. hashcontext is the single context created to support
* all hash tables.
*
+ * Spilling To Disk
+ *
+ * When performing hash aggregation, if the hash table memory exceeds the
+ * limit (see hash_agg_check_limits()), we enter "spill mode". In spill
+ * mode, we advance the transition states only for groups already in the
+ * hash table. For tuples that would need to create a new hash table
+ * entries (and initialize new transition states), we instead spill them to
+ * disk to be processed later. The tuples are spilled in a partitioned
+ * manner, so that subsequent batches are smaller and less likely to exceed
+ * work_mem (if a batch does exceed work_mem, it must be spilled
+ * recursively).
+ *
+ * Spilled data is written to logical tapes. These provide better control
+ * over memory usage, disk space, and the number of files than if we were
+ * to use a BufFile for each spill.
+ *
+ * Note that it's possible for transition states to start small but then
+ * grow very large; for instance in the case of ARRAY_AGG. In such cases,
+ * it's still possible to significantly exceed work_mem. We try to avoid
+ * this situation by estimating what will fit in the available memory, and
+ * imposing a limit on the number of groups separately from the amount of
+ * memory consumed.
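+ *
+ * In outline (a simplification of the code below): fill the hash table
+ * until one of the limits is reached; from then on, route any tuple that
+ * would need a new entry to one of several partition tapes, chosen from
+ * as-yet-unused bits of its hash value; when the input is exhausted,
+ * finalize and emit the in-memory groups, then treat each partition tape
+ * as a new, smaller batch and repeat.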
+ *
* Transition / Combine function invocation:
*
* For performance reasons transition functions, including combine
@@ -233,12 +256,99 @@
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/datum.h"
+#include "utils/dynahash.h"
#include "utils/expandeddatum.h"
+#include "utils/logtape.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+/*
+ * Control how many partitions are created when spilling HashAgg to
+ * disk.
+ *
+ * HASHAGG_PARTITION_FACTOR is multiplied by the estimated number of
+ * partitions needed such that each partition will fit in memory. The factor
+ * is set higher than one because there's not a high cost to having a few too
+ * many partitions, and it makes it less likely that a partition will need to
+ * be spilled recursively. Another benefit of having more, smaller partitions
+ * is that small hash tables may perform better than large ones due to memory
+ * caching effects.
+ *
+ * We also specify a min and max number of partitions per spill. Too few might
+ * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
+ * many will result in lots of memory wasted buffering the spill files (which
+ * could instead be spent on a larger hash table).
+ *
+ * For reading from tapes, the buffer size must be a multiple of
+ * BLCKSZ. Larger values help when reading from multiple tapes concurrently,
+ * but that doesn't happen in HashAgg, so we simply use BLCKSZ. Writing to a
+ * tape always uses a buffer of size BLCKSZ.
+ */
+#define HASHAGG_PARTITION_FACTOR 1.50
+#define HASHAGG_MIN_PARTITIONS 4
+#define HASHAGG_MAX_PARTITIONS 256
+#define HASHAGG_READ_BUFFER_SIZE BLCKSZ
+#define HASHAGG_WRITE_BUFFER_SIZE BLCKSZ
+
+/*
+ * Track all tapes needed for a HashAgg that spills. We don't know the maximum
+ * number of tapes needed at the start of the algorithm (because it can
+ * recurse), so one tape set is allocated and extended as needed for new
+ * tapes. Once a particular tape has been fully read, it is rewound for
+ * writing and put on the free list.
+ *
+ * Tapes' buffers can take up substantial memory when many tapes are open at
+ * once. We only need one tape open at a time in read mode (using a buffer
+ * that's a multiple of BLCKSZ); but we need up to HASHAGG_MAX_PARTITIONS
+ * tapes open in write mode (each requiring a buffer of size BLCKSZ).
+ */
+typedef struct HashTapeInfo
+{
+ LogicalTapeSet *tapeset;
+ int ntapes;
+ int *freetapes;
+ int nfreetapes;
+} HashTapeInfo;
+
+/*
+ * Represents partitioned spill data for a single hashtable. Contains the
+ * necessary information to route tuples to the correct partition, and to
+ * transform the spilled data into new batches.
+ *
+ * The high bits are used for partition selection (when recursing, we ignore
+ * the bits that have already been used for partition selection at an earlier
+ * level).
+ */
+typedef struct HashAggSpill
+{
+ HashTapeInfo *tapeinfo; /* borrowed reference to tape info */
+ int npartitions; /* number of partitions */
+ int *partitions; /* spill partition tape numbers */
+ int64 *ntuples; /* number of tuples in each partition */
+ uint32 mask; /* mask to find partition from hash value */
+ int shift; /* after masking, shift by this amount */
+} HashAggSpill;
+
+/*
+ * Represents work to be done for one pass of hash aggregation (with only one
+ * grouping set).
+ *
+ * Also tracks the bits of the hash already used for partition selection by
+ * earlier iterations, so that this batch can use new bits. If all bits have
+ * already been used, no partitioning will be done (any spilled data will go
+ * to a single output tape).
+ */
+typedef struct HashAggBatch
+{
+ int setno; /* grouping set */
+ int used_bits; /* number of bits of hash already used */
+ HashTapeInfo *tapeinfo; /* borrowed reference to tape info */
+ int input_tapenum; /* input partition tape */
+ int64 input_tuples; /* number of tuples in this batch */
+} HashAggBatch;
+
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
static void initialize_phase(AggState *aggstate, int newphase);
static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -263,6 +373,7 @@ static void finalize_partialaggregate(AggState *aggstate,
AggStatePerAgg peragg,
AggStatePerGroup pergroupstate,
Datum *resultVal, bool *resultIsNull);
+static void prepare_hash_slot(AggState *aggstate);
static void prepare_projection_slot(AggState *aggstate,
TupleTableSlot *slot,
int currentSet);
@@ -272,12 +383,41 @@ static void finalize_aggregates(AggState *aggstate,
static TupleTableSlot *project_aggregates(AggState *aggstate);
static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
-static void build_hash_table(AggState *aggstate);
-static TupleHashEntryData *lookup_hash_entry(AggState *aggstate);
+static void build_hash_tables(AggState *aggstate);
+static void build_hash_table(AggState *aggstate, int setno,
+ long nbuckets);
+static void hashagg_recompile_expressions(AggState *aggstate);
+static long hash_choose_num_buckets(AggState *aggstate,
+ long estimated_nbuckets,
+ Size memory);
+static int hash_choose_num_partitions(uint64 input_groups,
+ double hashentrysize,
+ int used_bits,
+ int *log2_npartitions);
+static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
+static bool agg_refill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
+static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
+static void hash_agg_check_limits(AggState *aggstate);
+static void hashagg_finish_initial_spills(AggState *aggstate);
+static void hashagg_reset_spill_state(AggState *aggstate);
+static HashAggBatch *hashagg_batch_new(HashTapeInfo *tapeinfo,
+ int input_tapenum, int setno,
+ int64 input_tuples, int used_bits);
+static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
+static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
+ int used_bits, uint64 input_tuples,
+ double hashentrysize);
+static Size hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot,
+ uint32 hash);
+static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
+ int setno);
+static void hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *dest,
+ int ndest);
+static void hashagg_tapeinfo_release(HashTapeInfo *tapeinfo, int tapenum);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
AggState *aggstate, EState *estate,
@@ -1035,6 +1175,32 @@ finalize_partialaggregate(AggState *aggstate,
MemoryContextSwitchTo(oldContext);
}
+/*
+ * Extract the attributes that make up the grouping key into the
+ * hashslot. This is necessary to compute the hash of the grouping key.
+ */
+static void
+prepare_hash_slot(AggState *aggstate)
+{
+ TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ int i;
+
+ /* transfer just the needed columns into hashslot */
+ slot_getsomeattrs(inputslot, perhash->largestGrpColIdx);
+ ExecClearTuple(hashslot);
+
+ for (i = 0; i < perhash->numhashGrpCols; i++)
+ {
+ int varNumber = perhash->hashGrpColIdxInput[i] - 1;
+
+ hashslot->tts_values[i] = inputslot->tts_values[varNumber];
+ hashslot->tts_isnull[i] = inputslot->tts_isnull[varNumber];
+ }
+ ExecStoreVirtualTuple(hashslot);
+}
+
/*
* Prepare to finalize and project based on the specified representative tuple
* slot and grouping set.
@@ -1233,7 +1399,7 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
}
/*
- * (Re-)initialize the hash table(s) to empty.
+ * Initialize the hash table(s).
*
* To implement hashed aggregation, we need a hashtable that stores a
* representative tuple and an array of AggStatePerGroup structs for each
@@ -1244,44 +1410,79 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
* We have a separate hashtable and associated perhash data structure for each
* grouping set for which we're doing hashing.
*
- * The contents of the hash tables always live in the hashcontext's per-tuple
- * memory context (there is only one of these for all tables together, since
- * they are all reset at the same time).
+ * The hash tables and their contents always live in the hashcontext's
+ * per-tuple memory context (there is only one of these for all tables
+ * together, since they are all reset at the same time).
*/
static void
-build_hash_table(AggState *aggstate)
+build_hash_tables(AggState *aggstate)
{
- MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
- Size additionalsize;
- int i;
-
- Assert(aggstate->aggstrategy == AGG_HASHED || aggstate->aggstrategy == AGG_MIXED);
+ int setno;
- additionalsize = aggstate->numtrans * sizeof(AggStatePerGroupData);
-
- for (i = 0; i < aggstate->num_hashes; ++i)
+ for (setno = 0; setno < aggstate->num_hashes; ++setno)
{
- AggStatePerHash perhash = &aggstate->perhash[i];
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ long nbuckets;
+ Size memory;
Assert(perhash->aggnode->numGroups > 0);
- if (perhash->hashtable)
- ResetTupleHashTable(perhash->hashtable);
- else
- perhash->hashtable = BuildTupleHashTableExt(&aggstate->ss.ps,
- perhash->hashslot->tts_tupleDescriptor,
- perhash->numCols,
- perhash->hashGrpColIdxHash,
- perhash->eqfuncoids,
- perhash->hashfunctions,
- perhash->aggnode->grpCollations,
- perhash->aggnode->numGroups,
- additionalsize,
- aggstate->ss.ps.state->es_query_cxt,
- aggstate->hashcontext->ecxt_per_tuple_memory,
- tmpmem,
- DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
+ memory = aggstate->hash_mem_limit / aggstate->num_hashes;
+
+ /* choose reasonable number of buckets per hashtable */
+ nbuckets = hash_choose_num_buckets(
+ aggstate, perhash->aggnode->numGroups, memory);
+
+ build_hash_table(aggstate, setno, nbuckets);
}
+
+ aggstate->hash_alloc_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_alloc_last = aggstate->hash_alloc_current;
+ aggstate->hash_ngroups_current = 0;
+}
+
+/*
+ * Build a single hashtable for this grouping set. Pass the hash memory
+ * context as both metacxt and tablecxt, so that resetting the hashcontext
+ * will free all memory including metadata. That means that we cannot reset
+ * the hash table to empty and reuse it, though (see execGrouping.c).
+ */
+static void
+build_hash_table(AggState *aggstate, int setno, long nbuckets)
+{
+ TupleHashTable table;
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ MemoryContext hashmem = aggstate->hashcontext->ecxt_per_tuple_memory;
+ MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
+ Size additionalsize;
+
+ Assert(aggstate->aggstrategy == AGG_HASHED ||
+ aggstate->aggstrategy == AGG_MIXED);
+
+ /*
+ * Used to make sure initial hash table allocation does not exceed
+ * work_mem. Note that the estimate does not include space for
+ * pass-by-reference transition data values, nor for the representative
+ * tuple of each group.
+ */
+ additionalsize = aggstate->numtrans * sizeof(AggStatePerGroupData);
+
+ table = BuildTupleHashTableExt(&aggstate->ss.ps,
+ perhash->hashslot->tts_tupleDescriptor,
+ perhash->numCols,
+ perhash->hashGrpColIdxHash,
+ perhash->eqfuncoids,
+ perhash->hashfunctions,
+ perhash->aggnode->grpCollations,
+ nbuckets,
+ additionalsize,
+ hashmem,
+ hashmem,
+ tmpmem,
+ DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
+
+ perhash->hashtable = table;
}
/*
@@ -1435,6 +1636,233 @@ hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
transitionSpace;
}
+/*
+ * Recompile the expressions for advancing aggregates while hashing. This is
+ * necessary for certain kinds of state changes that affect the resulting
+ * expression. For instance, changing aggstate->hash_ever_spilled or
+ * aggstate->ss.ps.outerops requires recompilation.
+ *
+ * A compiled expression where hash_ever_spilled is true will work even when
+ * hash_spill_mode is false, because it merely introduces additional branches
+ * that are unnecessary when hash_spill_mode is false. That allows us to only
+ * recompile when hash_ever_spilled changes, rather than every time
+ * hash_spill_mode changes.
+ */
+static void
+hashagg_recompile_expressions(AggState *aggstate)
+{
+ AggStatePerPhase phase;
+
+ Assert(aggstate->aggstrategy == AGG_HASHED ||
+ aggstate->aggstrategy == AGG_MIXED);
+
+ if (aggstate->aggstrategy == AGG_HASHED)
+ phase = &aggstate->phases[0];
+ else /* AGG_MIXED */
+ phase = &aggstate->phases[1];
+
+ phase->evaltrans = ExecBuildAggTrans(
+ aggstate, phase,
+ aggstate->aggstrategy == AGG_MIXED ? true : false, /* dosort */
+ true, /* dohash */
+ aggstate->hash_ever_spilled);
+}
+
+/*
+ * Set limits that trigger spilling to avoid exceeding work_mem. Consider the
+ * number of partitions we expect to create (if we do spill).
+ *
+ * There are two limits: a memory limit, and also an ngroups limit. The
+ * ngroups limit becomes important when we expect transition values to grow
+ * substantially larger than the initial value.
+ */
+void
+hash_agg_set_limits(double hashentrysize, uint64 input_groups, int used_bits,
+ Size *mem_limit, long *ngroups_limit, int *num_partitions)
+{
+ int npartitions;
+ Size partition_mem;
+
+ /* no attempt to obey work_mem */
+ if (hashagg_mem_overflow)
+ {
+ *mem_limit = SIZE_MAX;
+ *ngroups_limit = LONG_MAX;
+ return;
+ }
+
+ /* if not expected to spill, use all of work_mem */
+ if (input_groups * hashentrysize < work_mem * 1024L)
+ {
+ *mem_limit = work_mem * 1024L;
+ *ngroups_limit = *mem_limit / hashentrysize;
+ return;
+ }
+
+ /*
+ * Calculate expected memory requirements for spilling, which is the size
+ * of the buffers needed for all the tapes that need to be open at
+ * once. Then, subtract that from the memory available for holding hash
+ * tables.
+ */
+ npartitions = hash_choose_num_partitions(input_groups,
+ hashentrysize,
+ used_bits,
+ NULL);
+ if (num_partitions != NULL)
+ *num_partitions = npartitions;
+
+ partition_mem =
+ HASHAGG_READ_BUFFER_SIZE +
+ HASHAGG_WRITE_BUFFER_SIZE * npartitions;
+
+ /*
+ * Don't set the limit below 3/4 of work_mem. If the partition buffers
+ * would consume more than 1/4 of work_mem, we are presumably near the
+ * minimum number of partitions already, so we aren't going to
+ * dramatically exceed work_mem anyway.
+ */
+ if (work_mem * 1024L > 4 * partition_mem)
+ *mem_limit = work_mem * 1024L - partition_mem;
+ else
+ *mem_limit = work_mem * 1024L * 0.75;
+
+ if (*mem_limit > hashentrysize)
+ *ngroups_limit = *mem_limit / hashentrysize;
+ else
+ *ngroups_limit = 1;
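+
+ /*
+ * Illustrative numbers only (assuming the default 8kB BLCKSZ): with
+ * work_mem = 4MB, hashentrysize = 100 bytes and input_groups =
+ * 1,000,000, we expect to spill, choose 64 partitions, reserve
+ * 65 * 8kB = 520kB for tape buffers, and end up with a memory limit of
+ * roughly 3.5MB and an ngroups limit of about 36,000.
+ */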
+}
+
+/*
+ * hash_agg_check_limits
+ *
+ * After adding a new group to the hash table, check whether we need to enter
+ * spill mode. Allocations may happen without adding new groups (for instance,
+ * if the transition state size grows), so this check is imperfect.
+ *
+ * Memory usage is tracked by how much is allocated to the underlying memory
+ * context, not individual chunks. This is more accurate because it accounts
+ * for all memory in the context, and also accounts for fragmentation and
+ * other forms of overhead and waste that can be difficult to estimate. It's
+ * also cheaper because we don't have to track each chunk.
+ *
+ * When memory is first allocated to a memory context, it is not actually
+ * used. So when the next allocation happens, we consider the
+ * previously-allocated amount to be the memory currently used.
+ */
+static void
+hash_agg_check_limits(AggState *aggstate)
+{
+ Size allocation;
+
+ /*
+ * Even if already in spill mode, it's possible for memory usage to grow,
+ * and we should still track it for the purposes of EXPLAIN ANALYZE.
+ */
+ allocation = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+
+ /* has allocation grown since the last observation? */
+ if (allocation > aggstate->hash_alloc_current)
+ {
+ aggstate->hash_alloc_last = aggstate->hash_alloc_current;
+ aggstate->hash_alloc_current = allocation;
+ }
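+
+ /*
+ * For example (illustrative): if the context's allocation grows
+ * 1MB -> 2MB -> 4MB as blocks are added, then while 4MB is the current
+ * allocation we treat 2MB as the memory actually in use, since much of
+ * the newest block may still be empty.
+ */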
+
+ if (aggstate->hash_alloc_last > aggstate->hash_mem_peak)
+ aggstate->hash_mem_peak = aggstate->hash_alloc_last;
+
+ /*
+ * Don't spill unless there's at least one group in the hash table so we
+ * can be sure to make progress even in edge cases.
+ */
+ if (aggstate->hash_ngroups_current > 0 &&
+ (aggstate->hash_alloc_last > aggstate->hash_mem_limit ||
+ aggstate->hash_ngroups_current > aggstate->hash_ngroups_limit))
+ {
+ aggstate->hash_spill_mode = true;
+
+ if (!aggstate->hash_ever_spilled)
+ {
+ aggstate->hash_ever_spilled = true;
+ aggstate->hash_spills = palloc0(
+ sizeof(HashAggSpill) * aggstate->num_hashes);
+ aggstate->hash_tapeinfo = palloc0(sizeof(HashTapeInfo));
+ hashagg_recompile_expressions(aggstate);
+ }
+ }
+}
+
+/*
+ * Choose a reasonable number of buckets for the initial hash table size.
+ */
+static long
+hash_choose_num_buckets(AggState *aggstate, long ngroups, Size memory)
+{
+ long max_nbuckets;
+
+ max_nbuckets = memory / aggstate->hashentrysize;
+
+ /*
+ * Leave room for slop to avoid a case where the initial hash table size
+ * exceeds the memory limit (though that may still happen in edge cases).
+ */
+ max_nbuckets *= 0.75;
+
+ return ngroups > max_nbuckets ? max_nbuckets : ngroups;
+}
+
+/*
+ * Determine the number of partitions to create when spilling, which will
+ * always be a power of two. If log2_npartitions is non-NULL, set
+ * *log2_npartitions to the log2() of the number of partitions.
+ */
+static int
+hash_choose_num_partitions(uint64 input_groups, double hashentrysize,
+ int used_bits, int *log2_npartitions)
+{
+ Size mem_wanted;
+ int partition_limit;
+ int npartitions;
+ int partition_bits;
+
+ /*
+ * Avoid creating so many partitions that the memory requirements of the
+ * open partition files are greater than 1/4 of work_mem.
+ */
+ partition_limit =
+ (work_mem * 1024L * 0.25 - HASHAGG_READ_BUFFER_SIZE) /
+ HASHAGG_WRITE_BUFFER_SIZE;
+
+ /* pessimistically estimate that each input tuple creates a new group */
+ mem_wanted = HASHAGG_PARTITION_FACTOR * input_groups * hashentrysize;
+
+ /* make enough partitions so that each one is likely to fit in memory */
+ npartitions = 1 + (mem_wanted / (work_mem * 1024L));
+
+ if (npartitions > partition_limit)
+ npartitions = partition_limit;
+
+ if (npartitions < HASHAGG_MIN_PARTITIONS)
+ npartitions = HASHAGG_MIN_PARTITIONS;
+ if (npartitions > HASHAGG_MAX_PARTITIONS)
+ npartitions = HASHAGG_MAX_PARTITIONS;
+
+ /* ceil(log2(npartitions)) */
+ partition_bits = my_log2(npartitions);
+
+ /* make sure that we don't exhaust the hash bits */
+ if (partition_bits + used_bits >= 32)
+ partition_bits = 32 - used_bits;
+
+ if (log2_npartitions != NULL)
+ *log2_npartitions = partition_bits;
+
+ /* number of partitions will be a power of two */
+ npartitions = 1L << partition_bits;
+
+ return npartitions;
+}
+
/*
* Find or create a hashtable entry for the tuple group containing the current
* tuple (already set in tmpcontext's outertuple slot), in the current grouping
@@ -1442,37 +1870,39 @@ hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
* depends on this).
*
* When called, CurrentMemoryContext should be the per-query context.
+ *
+ * If the hash table is at the memory limit, then only find existing hashtable
+ * entries; don't create new ones. If a tuple's group is not already present
+ * in the hash table for the current grouping set, return NULL and the caller
+ * will spill it to disk.
*/
-static TupleHashEntryData *
-lookup_hash_entry(AggState *aggstate)
+static AggStatePerGroup
+lookup_hash_entry(AggState *aggstate, uint32 hash)
{
- TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
TupleTableSlot *hashslot = perhash->hashslot;
TupleHashEntryData *entry;
- bool isnew;
- int i;
+ bool isnew = false;
+ bool *p_isnew;
- /* transfer just the needed columns into hashslot */
- slot_getsomeattrs(inputslot, perhash->largestGrpColIdx);
- ExecClearTuple(hashslot);
-
- for (i = 0; i < perhash->numhashGrpCols; i++)
- {
- int varNumber = perhash->hashGrpColIdxInput[i] - 1;
-
- hashslot->tts_values[i] = inputslot->tts_values[varNumber];
- hashslot->tts_isnull[i] = inputslot->tts_isnull[varNumber];
- }
- ExecStoreVirtualTuple(hashslot);
+ /* if hash table already spilled, don't create new entries */
+ p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;
/* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntry(perhash->hashtable, hashslot, &isnew);
+ entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, p_isnew,
+ hash);
+
+ if (entry == NULL)
+ return NULL;
if (isnew)
{
- AggStatePerGroup pergroup;
- int transno;
+ AggStatePerGroup pergroup;
+ int transno;
+
+ aggstate->hash_ngroups_current++;
+ if (!hashagg_mem_overflow)
+ hash_agg_check_limits(aggstate);
pergroup = (AggStatePerGroup)
MemoryContextAlloc(perhash->hashtable->tablecxt,
@@ -1492,7 +1922,7 @@ lookup_hash_entry(AggState *aggstate)
}
}
- return entry;
+ return entry->additional;
}
/*
@@ -1500,18 +1930,51 @@ lookup_hash_entry(AggState *aggstate)
* returning an array of pergroup pointers suitable for advance_aggregates.
*
* Be aware that lookup_hash_entry can reset the tmpcontext.
+ *
+ * Some entries may be left NULL if we have reached the limit and have begun
+ * to spill. The same tuple will belong to different groups for each set, so
+ * may match a group already in memory for one set and match a group not in
+ * memory for another set. If we have begun to spill and a tuple doesn't match
+ * a group in memory for a particular set, it will be spilled.
+ *
+ * NB: It's possible to spill the same tuple for several different grouping
+ * sets. This may seem wasteful, but it's actually a trade-off: if we spill
+ * the tuple multiple times for multiple grouping sets, it can be partitioned
+ * for each grouping set, making the refilling of the hash table very
+ * efficient.
*/
static void
lookup_hash_entries(AggState *aggstate)
{
- int numHashes = aggstate->num_hashes;
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
int setno;
- for (setno = 0; setno < numHashes; setno++)
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
{
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ uint32 hash;
+
select_current_set(aggstate, setno, true);
- pergroup[setno] = lookup_hash_entry(aggstate)->additional;
+ prepare_hash_slot(aggstate);
+ hash = TupleHashTableHash(perhash->hashtable, perhash->hashslot);
+ pergroup[setno] = lookup_hash_entry(aggstate, hash);
+
+ /* check to see if we need to spill the tuple for this grouping set */
+ if (pergroup[setno] == NULL)
+ {
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ TupleTableSlot *slot = aggstate->tmpcontext->ecxt_outertuple;
+
+ if (spill->partitions == NULL)
+ hashagg_spill_init(spill, aggstate->hash_tapeinfo, 0,
+ perhash->aggnode->numGroups,
+ aggstate->hashentrysize);
+
+ hashagg_spill_tuple(spill, slot, hash);
+
+ aggstate->hash_disk_used = LogicalTapeSetBlocks(
+ aggstate->hash_tapeinfo->tapeset) * (BLCKSZ / 1024);
+ }
}
}
@@ -1834,6 +2297,12 @@ agg_retrieve_direct(AggState *aggstate)
if (TupIsNull(outerslot))
{
/* no more outer-plan tuples available */
+
+ /* if we built hash tables, finalize any spills */
+ if (aggstate->aggstrategy == AGG_MIXED &&
+ aggstate->current_phase == 1)
+ hashagg_finish_initial_spills(aggstate);
+
if (hasGroupingSets)
{
aggstate->input_done = true;
@@ -1936,6 +2405,9 @@ agg_fill_hash_table(AggState *aggstate)
ResetExprContext(aggstate->tmpcontext);
}
+ /* finalize spills, if any */
+ hashagg_finish_initial_spills(aggstate);
+
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
@@ -1943,11 +2415,193 @@ agg_fill_hash_table(AggState *aggstate)
&aggstate->perhash[0].hashiter);
}
+/*
+ * If any data was spilled during hash aggregation, reset the hash table and
+ * reprocess one batch of spilled data. After reprocessing a batch, the hash
+ * table will again contain data, ready to be consumed by
+ * agg_retrieve_hash_table_in_memory().
+ *
+ * Should only be called after all in-memory hash table entries have been
+ * consumed.
+ *
+ * Return false when input is exhausted and there's no more work to be done;
+ * otherwise return true.
+ */
+static bool
+agg_refill_hash_table(AggState *aggstate)
+{
+ HashAggBatch *batch;
+ HashAggSpill spill;
+ HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
+ long nbuckets;
+
+ if (aggstate->hash_batches == NIL)
+ return false;
+
+ spill.npartitions = 0;
+ spill.partitions = NULL;
+ /*
+ * Each spill file contains spilled data for only a single grouping
+ * set. We want to ignore all others, which is done by setting the other
+ * pergroups to NULL.
+ */
+ memset(aggstate->all_pergroups, 0,
+ sizeof(AggStatePerGroup) *
+ (aggstate->maxsets + aggstate->num_hashes));
+
+ batch = linitial(aggstate->hash_batches);
+ aggstate->hash_batches = list_delete_first(aggstate->hash_batches);
+
+ hash_agg_set_limits(aggstate->hashentrysize, batch->input_tuples,
+ batch->used_bits, &aggstate->hash_mem_limit,
+ &aggstate->hash_ngroups_limit, NULL);
+
+ /*
+ * Free memory and rebuild a single hash table for this batch's grouping
+ * set. Estimate the number of groups to be the number of input tuples in
+ * this batch.
+ */
+ ReScanExprContext(aggstate->hashcontext);
+
+ nbuckets = hash_choose_num_buckets(
+ aggstate, batch->input_tuples, aggstate->hash_mem_limit);
+ build_hash_table(aggstate, batch->setno, nbuckets);
+ aggstate->hash_alloc_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_alloc_last = aggstate->hash_alloc_current;
+ aggstate->hash_ngroups_current = 0;
+
+ Assert(aggstate->current_phase == 0);
+
+ if (aggstate->phase->aggstrategy == AGG_MIXED)
+ {
+ aggstate->current_phase = 1;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+ }
+
+ /*
+ * The first pass (agg_fill_hash_table) reads whatever kind of slot comes
+ * from the outer plan, and considers the slot fixed. But spilled tuples
+ * are always MinimalTuples, so if that's different from the outer plan we
+ * need to change it and recompile the aggregate expressions.
+ */
+ if (aggstate->ss.ps.outerops != &TTSOpsMinimalTuple)
+ {
+ aggstate->ss.ps.outerops = &TTSOpsMinimalTuple;
+ hashagg_recompile_expressions(aggstate);
+ }
+
+ LogicalTapeRewindForRead(tapeinfo->tapeset, batch->input_tapenum,
+ HASHAGG_READ_BUFFER_SIZE);
+ for (;;)
+ {
+ TupleTableSlot *slot = aggstate->hash_spill_slot;
+ MinimalTuple tuple;
+ uint32 hash;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tuple = hashagg_batch_read(batch, &hash);
+ if (tuple == NULL)
+ break;
+
+ ExecStoreMinimalTuple(tuple, slot, true);
+ aggstate->tmpcontext->ecxt_outertuple = slot;
+
+ select_current_set(aggstate, batch->setno, true);
+ prepare_hash_slot(aggstate);
+ aggstate->hash_pergroup[batch->setno] = lookup_hash_entry(aggstate, hash);
+
+ /* if there's no memory for a new group, spill */
+ if (aggstate->hash_pergroup[batch->setno] == NULL)
+ {
+ /*
+ * Estimate the number of groups for this batch as the total
+ * number of tuples in its input file. Although that's a worst
+ * case, it's not bad here for two reasons: (1) overestimating
+ * is better than underestimating; and (2) we've already
+ * scanned the relation once, so it's likely that we've
+ * already finalized many of the common values.
+ */
+ if (spill.partitions == NULL)
+ hashagg_spill_init(&spill, tapeinfo, batch->used_bits,
+ batch->input_tuples,
+ aggstate->hashentrysize);
+
+ hashagg_spill_tuple(&spill, slot, hash);
+
+ aggstate->hash_disk_used = LogicalTapeSetBlocks(
+ aggstate->hash_tapeinfo->tapeset) * (BLCKSZ / 1024);
+ }
+
+ /* Advance the aggregates (or combine functions) */
+ advance_aggregates(aggstate);
+
+ /*
+ * Reset per-input-tuple context after each tuple, but note that the
+ * hash lookups do this too
+ */
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ hashagg_tapeinfo_release(tapeinfo, batch->input_tapenum);
+
+ aggstate->current_phase = 0;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+
+ /* update hashentrysize estimate based on contents */
+ if (aggstate->hash_ngroups_current > 0)
+ {
+ aggstate->hashentrysize = (double)aggstate->hash_alloc_last /
+ (double)aggstate->hash_ngroups_current;
+ }
+
+ hashagg_spill_finish(aggstate, &spill, batch->setno);
+ aggstate->hash_spill_mode = false;
+
+ pfree(batch);
+
+ /* Initialize to walk the first hash table */
+ select_current_set(aggstate, 0, true);
+ ResetTupleHashIterator(aggstate->perhash[0].hashtable,
+ &aggstate->perhash[0].hashiter);
+
+ return true;
+}
+
/*
* ExecAgg for hashed case: retrieving groups from hash table
+ *
+ * After exhausting in-memory tuples, also try refilling the hash table using
+ * previously-spilled tuples. Only returns NULL after all in-memory and
+ * spilled tuples are exhausted.
*/
static TupleTableSlot *
agg_retrieve_hash_table(AggState *aggstate)
+{
+ TupleTableSlot *result = NULL;
+
+ while (result == NULL)
+ {
+ result = agg_retrieve_hash_table_in_memory(aggstate);
+ if (result == NULL)
+ {
+ if (!agg_refill_hash_table(aggstate))
+ {
+ aggstate->agg_done = true;
+ break;
+ }
+ }
+ }
+
+ return result;
+}
+
+/*
+ * Retrieve the groups from the in-memory hash tables without considering any
+ * spilled tuples.
+ */
+static TupleTableSlot *
+agg_retrieve_hash_table_in_memory(AggState *aggstate)
{
ExprContext *econtext;
AggStatePerAgg peragg;
@@ -1976,7 +2630,7 @@ agg_retrieve_hash_table(AggState *aggstate)
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
- while (!aggstate->agg_done)
+ for (;;)
{
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -2007,8 +2661,6 @@ agg_retrieve_hash_table(AggState *aggstate)
}
else
{
- /* No more hashtables, so done */
- aggstate->agg_done = true;
return NULL;
}
}
@@ -2065,6 +2717,296 @@ agg_retrieve_hash_table(AggState *aggstate)
return NULL;
}
+/*
+ * Assign unused tapes to spill partitions, extending the tape set if
+ * necessary.
+ */
+static void
+hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *partitions,
+ int npartitions)
+{
+ int partidx = 0;
+
+ /* use free tapes if available */
+ while (partidx < npartitions && tapeinfo->nfreetapes > 0)
+ partitions[partidx++] = tapeinfo->freetapes[--tapeinfo->nfreetapes];
+
+ if (tapeinfo->tapeset == NULL)
+ tapeinfo->tapeset = LogicalTapeSetCreate(npartitions, NULL, NULL, -1);
+ else if (partidx < npartitions)
+ {
+ tapeinfo->tapeset = LogicalTapeSetExtend(
+ tapeinfo->tapeset, npartitions - partidx);
+ }
+
+ while (partidx < npartitions)
+ partitions[partidx++] = tapeinfo->ntapes++;
+}
+
+/*
+ * After a tape has already been written to and then read, this function
+ * rewinds it for writing and adds it to the free list.
+ */
+static void
+hashagg_tapeinfo_release(HashTapeInfo *tapeinfo, int tapenum)
+{
+ LogicalTapeRewindForWrite(tapeinfo->tapeset, tapenum);
+ if (tapeinfo->freetapes == NULL)
+ tapeinfo->freetapes = palloc(sizeof(int));
+ else
+ tapeinfo->freetapes = repalloc(
+ tapeinfo->freetapes, sizeof(int) * (tapeinfo->nfreetapes + 1));
+ tapeinfo->freetapes[tapeinfo->nfreetapes++] = tapenum;
+}
+
+/*
+ * hashagg_spill_init
+ *
+ * Called after we determined that spilling is necessary. Chooses the number
+ * of partitions to create, and initializes them.
+ */
+static void
+hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo, int used_bits,
+ uint64 input_groups, double hashentrysize)
+{
+ int npartitions;
+ int partition_bits;
+
+ npartitions = hash_choose_num_partitions(
+ input_groups, hashentrysize, used_bits, &partition_bits);
+
+ spill->partitions = palloc0(sizeof(int) * npartitions);
+ spill->ntuples = palloc0(sizeof(int64) * npartitions);
+
+ hashagg_tapeinfo_assign(tapeinfo, spill->partitions, npartitions);
+
+ spill->tapeinfo = tapeinfo;
+ spill->shift = 32 - used_bits - partition_bits;
+ spill->mask = (npartitions - 1) << spill->shift;
+ spill->npartitions = npartitions;
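+
+ /*
+ * Worked example (illustrative): with used_bits = 0 and npartitions = 4,
+ * partition_bits = 2, so shift = 30 and mask = 0xC0000000; a tuple's
+ * partition is the top two bits of its hash. A recursive spill of one of
+ * those partitions would then be called with used_bits = 2 and select
+ * from the next-lower bits.
+ */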
+}
+
+/*
+ * hashagg_spill_tuple
+ *
+ * No room for new groups in the hash table. Save for later in the appropriate
+ * partition.
+ */
+static Size
+hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot, uint32 hash)
+{
+ LogicalTapeSet *tapeset = spill->tapeinfo->tapeset;
+ int partition;
+ MinimalTuple tuple;
+ int tapenum;
+ int total_written = 0;
+ bool shouldFree;
+
+ Assert(spill->partitions != NULL);
+
+ /* XXX: may contain unneeded attributes; consider projecting them out */
+ tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+ partition = (hash & spill->mask) >> spill->shift;
+ spill->ntuples[partition]++;
+
+ tapenum = spill->partitions[partition];
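+
+ /*
+ * On-tape layout: the uint32 hash value, immediately followed by the
+ * MinimalTuple verbatim (whose first field is its own uint32 t_len).
+ * hashagg_batch_read() reads these back in the same order.
+ */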
+
+ LogicalTapeWrite(tapeset, tapenum, (void *) &hash, sizeof(uint32));
+ total_written += sizeof(uint32);
+
+ LogicalTapeWrite(tapeset, tapenum, (void *) tuple, tuple->t_len);
+ total_written += tuple->t_len;
+
+ if (shouldFree)
+ pfree(tuple);
+
+ return total_written;
+}
+
+/*
+ * hashagg_batch_new
+ *
+ * Construct a HashAggBatch item, which represents one iteration of HashAgg to
+ * be done. Should be called in the aggregate's memory context.
+ */
+static HashAggBatch *
+hashagg_batch_new(HashTapeInfo *tapeinfo, int tapenum, int setno,
+ int64 input_tuples, int used_bits)
+{
+ HashAggBatch *batch = palloc0(sizeof(HashAggBatch));
+
+ batch->setno = setno;
+ batch->used_bits = used_bits;
+ batch->tapeinfo = tapeinfo;
+ batch->input_tapenum = tapenum;
+ batch->input_tuples = input_tuples;
+
+ return batch;
+}
+
+/*
+ * hashagg_batch_read
+ * Read the next tuple from a batch's tape. Returns NULL if no more.
+ */
+static MinimalTuple
+hashagg_batch_read(HashAggBatch *batch, uint32 *hashp)
+{
+ LogicalTapeSet *tapeset = batch->tapeinfo->tapeset;
+ int tapenum = batch->input_tapenum;
+ MinimalTuple tuple;
+ uint32 t_len;
+ size_t nread;
+ uint32 hash;
+
+ nread = LogicalTapeRead(tapeset, tapenum, &hash, sizeof(uint32));
+ if (nread == 0)
+ return NULL;
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, sizeof(uint32), nread)));
+ if (hashp != NULL)
+ *hashp = hash;
+
+ nread = LogicalTapeRead(tapeset, tapenum, &t_len, sizeof(t_len));
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, sizeof(uint32), nread)));
+
+ tuple = (MinimalTuple) palloc(t_len);
+ tuple->t_len = t_len;
+
+ nread = LogicalTapeRead(tapeset, tapenum,
+ (void *)((char *)tuple + sizeof(uint32)),
+ t_len - sizeof(uint32));
+ if (nread != t_len - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, t_len - sizeof(uint32), nread)));
+
+ return tuple;
+}
+
+/*
+ * hashagg_finish_initial_spills
+ *
+ * After a HashAggBatch has been processed, it may have spilled tuples to
+ * disk. If so, turn the spilled partitions into new batches that must later
+ * be executed.
+ */
+static void
+hashagg_finish_initial_spills(AggState *aggstate)
+{
+ int setno;
+
+ if (aggstate->hash_spills == NULL)
+ return;
+
+ /* update hashentrysize estimate based on contents */
+ Assert(aggstate->hash_ngroups_current > 0);
+ aggstate->hashentrysize = (double)aggstate->hash_alloc_last /
+ (double)aggstate->hash_ngroups_current;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hashagg_spill_finish(aggstate, &aggstate->hash_spills[setno], setno);
+
+ aggstate->hash_spill_mode = false;
+
+ /*
+ * We're not processing tuples from outer plan any more; only processing
+ * batches of spilled tuples. The initial spill structures are no longer
+ * needed.
+ */
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+}
+
+/*
+ * hashagg_spill_finish
+ *
+ * Transform spill partitions into new batches.
+ */
+static void
+hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno)
+{
+ int i;
+ int used_bits = 32 - spill->shift;
+
+ if (spill->npartitions == 0)
+ return; /* didn't spill */
+
+ for (i = 0; i < spill->npartitions; i++)
+ {
+ int tapenum = spill->partitions[i];
+ MemoryContext oldContext;
+ HashAggBatch *new_batch;
+
+ oldContext = MemoryContextSwitchTo(aggstate->ss.ps.state->es_query_cxt);
+ new_batch = hashagg_batch_new(aggstate->hash_tapeinfo,
+ tapenum, setno, spill->ntuples[i],
+ used_bits);
+ aggstate->hash_batches = lcons(new_batch, aggstate->hash_batches);
+ aggstate->hash_batches_used++;
+ MemoryContextSwitchTo(oldContext);
+ }
+
+ pfree(spill->ntuples);
+ pfree(spill->partitions);
+}
+
+/*
+ * Free resources related to a spilled HashAgg.
+ */
+static void
+hashagg_reset_spill_state(AggState *aggstate)
+{
+ ListCell *lc;
+
+ /* free spills from initial pass */
+ if (aggstate->hash_spills != NULL)
+ {
+ int setno;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ {
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ if (spill->ntuples != NULL)
+ pfree(spill->ntuples);
+ if (spill->partitions != NULL)
+ pfree(spill->partitions);
+ }
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+ }
+
+ /* free batches */
+ foreach(lc, aggstate->hash_batches)
+ {
+ HashAggBatch *batch = (HashAggBatch*) lfirst(lc);
+ pfree(batch);
+ }
+ list_free(aggstate->hash_batches);
+ aggstate->hash_batches = NIL;
+
+ /* close tape set */
+ if (aggstate->hash_tapeinfo != NULL)
+ {
+ HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
+ if (tapeinfo->tapeset != NULL)
+ LogicalTapeSetClose(tapeinfo->tapeset);
+ if (tapeinfo->freetapes != NULL)
+ pfree(tapeinfo->freetapes);
+ pfree(tapeinfo);
+ aggstate->hash_tapeinfo = NULL;
+ }
+}
+
+
/* -----------------
* ExecInitAgg
*
@@ -2249,6 +3191,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->ss.ps.outeropsfixed = false;
}
+ if (use_hashing)
+ aggstate->hash_spill_slot = ExecInitExtraTupleSlot(estate, scanDesc,
+ &TTSOpsMinimalTuple);
+
/*
* Initialize result type, slot and projection.
*/
@@ -2474,11 +3420,24 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
*/
if (use_hashing)
{
+ Plan *outerplan = outerPlan(node);
+ long totalGroups = 0;
+ int i;
+
/* this is an array of pointers, not structures */
aggstate->hash_pergroup = pergroups;
+ aggstate->hashentrysize = hash_agg_entry_size(
+ aggstate->numtrans, outerplan->plan_width, node->transitionSpace);
+
+ for (i = 0; i < aggstate->num_hashes; i++)
+ totalGroups += aggstate->perhash[i].aggnode->numGroups;
+
+ hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
+ &aggstate->hash_mem_limit,
+ &aggstate->hash_ngroups_limit, NULL);
find_hash_columns(aggstate);
- build_hash_table(aggstate);
+ build_hash_tables(aggstate);
aggstate->table_filled = false;
}
@@ -2884,7 +3843,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
else
Assert(false);
- phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash);
+ phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash,
+ false);
}
@@ -3379,6 +4339,8 @@ ExecEndAgg(AggState *node)
if (node->sort_out)
tuplesort_end(node->sort_out);
+ hashagg_reset_spill_state(node);
+
for (transno = 0; transno < node->numtrans; transno++)
{
AggStatePerTrans pertrans = &node->pertrans[transno];
@@ -3434,12 +4396,13 @@ ExecReScanAgg(AggState *node)
return;
/*
- * If we do have the hash table, and the subplan does not have any
- * parameter changes, and none of our own parameter changes affect
- * input expressions of the aggregated functions, then we can just
- * rescan the existing hash table; no need to build it again.
+ * If we do have the hash table, and it never spilled, and the subplan
+ * does not have any parameter changes, and none of our own parameter
+ * changes affect input expressions of the aggregated functions, then
+ * we can just rescan the existing hash table; no need to build it
+ * again.
*/
- if (outerPlan->chgParam == NULL &&
+ if (outerPlan->chgParam == NULL && !node->hash_ever_spilled &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
ResetTupleHashIterator(node->perhash[0].hashtable,
@@ -3496,11 +4459,33 @@ ExecReScanAgg(AggState *node)
*/
if (node->aggstrategy == AGG_HASHED || node->aggstrategy == AGG_MIXED)
{
+ const TupleTableSlotOps *outerops = ExecGetResultSlotOps(
+ outerPlanState(&node->ss), &node->ss.ps.outeropsfixed);
+
+ hashagg_reset_spill_state(node);
+
+ node->hash_ever_spilled = false;
+ node->hash_spill_mode = false;
+ node->hash_alloc_last = 0;
+ node->hash_alloc_current = 0;
+ node->hash_ngroups_current = 0;
+
+ /* reset stats */
+ node->hash_mem_peak = 0;
+ node->hash_disk_used = 0;
+ node->hash_batches_used = 0;
+
ReScanExprContext(node->hashcontext);
/* Rebuild an empty hash table */
- build_hash_table(node);
+ build_hash_tables(node);
node->table_filled = false;
/* iterator will be reset when the table is filled */
+
+ if (node->ss.ps.outerops != outerops)
+ {
+ node->ss.ps.outerops = outerops;
+ hashagg_recompile_expressions(node);
+ }
}
if (node->aggstrategy != AGG_HASHED)
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index cea0d6fa5ce..7246fc2b33f 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -2047,6 +2047,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_INIT_TRANS:
+ case EEOP_AGG_INIT_TRANS_SPILLED:
{
AggStatePerTrans pertrans;
@@ -2056,6 +2057,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_allpergroupsp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_setoff,
v_transno;
@@ -2082,11 +2084,32 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_init_trans.setoff);
v_transno = l_int32_const(op->d.agg_init_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_notransvalue = l_bb_before_v(
+ opblocks[opno + 1], "op.%d.check_notransvalue", opno);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(
+ b, v_pergroup_allaggs, TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[opno + 1],
+ b_check_notransvalue);
+
+ LLVMPositionBuilderAtEnd(b, b_check_notransvalue);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_notransvalue =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_NOTRANSVALUE,
@@ -2143,6 +2166,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_STRICT_TRANS_CHECK:
+ case EEOP_AGG_STRICT_TRANS_CHECK_SPILLED:
{
LLVMValueRef v_setoff,
v_transno;
@@ -2152,6 +2176,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_transnull;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
int jumpnull = op->d.agg_strict_trans_check.jumpnull;
@@ -2171,11 +2196,32 @@ llvm_compile_expr(ExprState *state)
l_int32_const(op->d.agg_strict_trans_check.setoff);
v_transno =
l_int32_const(op->d.agg_strict_trans_check.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_transnull = l_bb_before_v(
+ opblocks[opno + 1], "op.%d.check_transnull", opno);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[jumpnull],
+ b_check_transnull);
+
+ LLVMPositionBuilderAtEnd(b, b_check_transnull);
+ }
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_transnull =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_TRANSVALUEISNULL,
@@ -2191,7 +2237,9 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_PLAIN_TRANS_BYVAL:
+ case EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED:
case EEOP_AGG_PLAIN_TRANS:
+ case EEOP_AGG_PLAIN_TRANS_SPILLED:
{
AggState *aggstate;
AggStatePerTrans pertrans;
@@ -2217,6 +2265,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_pertransp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_retval;
@@ -2244,10 +2293,33 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_trans.setoff);
v_transno = l_int32_const(op->d.agg_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED ||
+ opcode == EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_advance_transval = l_bb_before_v(
+ opblocks[opno + 1], "op.%d.advance_transval", opno);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[opno + 1],
+ b_advance_transval);
+
+ LLVMPositionBuilderAtEnd(b, b_advance_transval);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_fcinfo = l_ptr_const(fcinfo,
l_ptr(StructFunctionCallInfoData));
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b5a0033721f..8d58780bf6a 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -77,6 +77,7 @@
#include "access/htup_details.h"
#include "access/tsmapi.h"
#include "executor/executor.h"
+#include "executor/nodeAgg.h"
#include "executor/nodeHash.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -128,6 +129,7 @@ bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
bool enable_hashagg = true;
+bool enable_hashagg_spill = true;
bool enable_nestloop = true;
bool enable_material = true;
bool enable_mergejoin = true;
@@ -2153,7 +2155,7 @@ cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples)
+ double input_tuples, double input_width)
{
double output_tuples;
Cost startup_cost;
@@ -2219,21 +2221,88 @@ cost_agg(Path *path, PlannerInfo *root,
total_cost += aggcosts->finalCost.per_tuple * numGroups;
total_cost += cpu_tuple_cost * numGroups;
output_tuples = numGroups;
+
+ /*
+ * We don't need to compute the disk costs of hash aggregation here,
+ * because the planner does not choose hash aggregation for grouping
+ * sets that it doesn't expect to fit in memory.
+ */
}
else
{
+ double pages_written = 0.0;
+ double pages_read = 0.0;
+ double hashentrysize;
+ double nbatches;
+ Size mem_limit;
+ long ngroups_limit;
+ int num_partitions;
+
/* must be AGG_HASHED */
startup_cost = input_total_cost;
if (!enable_hashagg)
startup_cost += disable_cost;
startup_cost += aggcosts->transCost.startup;
startup_cost += aggcosts->transCost.per_tuple * input_tuples;
+ /* cost of computing hash value */
startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;
startup_cost += aggcosts->finalCost.startup;
+
total_cost = startup_cost;
total_cost += aggcosts->finalCost.per_tuple * numGroups;
+ /* cost of retrieving from hash table */
total_cost += cpu_tuple_cost * numGroups;
output_tuples = numGroups;
+
+ /*
+ * Estimate number of batches based on the computed limits. If less
+ * than or equal to one, all groups are expected to fit in memory;
+ * otherwise we expect to spill.
+ */
+ hashentrysize = hash_agg_entry_size(
+ aggcosts->numAggs, input_width, aggcosts->transitionSpace);
+ hash_agg_set_limits(hashentrysize, numGroups, 0, &mem_limit,
+ &ngroups_limit, &num_partitions);
+
+ nbatches = Max( (numGroups * hashentrysize) / mem_limit,
+ numGroups / ngroups_limit );
+
+ /*
+ * Estimate number of pages read and written. For each level of
+ * recursion, a tuple must be written and then later read.
+ */
+ if (!hashagg_mem_overflow && nbatches > 1.0)
+ {
+ double depth;
+ double pages;
+
+ pages = relation_byte_size(input_tuples, input_width) / BLCKSZ;
+
+ /*
+ * The number of partitions can change at different levels of
+ * recursion; but for the purposes of this calculation assume it
+ * stays constant.
+ */
+ depth = ceil( log(nbatches - 1) / log(num_partitions) );
+ pages_written = pages_read = pages * depth;
+ }
+
+ /*
+ * Add the disk costs of hash aggregation that spills to disk.
+ *
+ * Groups that go into the hash table stay in memory until finalized,
+ * so spilling and reprocessing tuples doesn't incur additional
+ * invocations of transCost or finalCost. Furthermore, the computed
+ * hash value is stored with the spilled tuples, so we don't incur
+ * extra invocations of the hash function.
+ *
+ * Hash Agg begins returning tuples after the first batch is
+ * complete. Accrue writes (spilled tuples) to startup_cost and to
+ * total_cost; accrue reads only to total_cost.
+ */
+ startup_cost += pages_written * random_page_cost;
+ total_cost += pages_written * random_page_cost;
+ total_cost += pages_read * seq_page_cost;
}
/*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e048d200bb4..090919e39a0 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1644,6 +1644,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
NIL,
NIL,
best_path->path.rows,
+ 0,
subplan);
}
else
@@ -2096,6 +2097,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
NIL,
NIL,
best_path->numGroups,
+ best_path->transitionSpace,
subplan);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -2257,6 +2259,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
NIL,
rollup->numGroups,
+ best_path->transitionSpace,
sort_plan);
/*
@@ -2295,6 +2298,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
chain,
rollup->numGroups,
+ best_path->transitionSpace,
subplan);
/* Copy cost data from Path to Plan */
@@ -6192,8 +6196,8 @@ Agg *
make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree)
+ List *groupingSets, List *chain, double dNumGroups,
+ int32 transitionSpace, Plan *lefttree)
{
Agg *node = makeNode(Agg);
Plan *plan = &node->plan;
@@ -6209,6 +6213,7 @@ make_agg(List *tlist, List *qual,
node->grpOperators = grpOperators;
node->grpCollations = grpCollations;
node->numGroups = numGroups;
+ node->transitionSpace = transitionSpace;
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
node->groupingSets = groupingSets;
node->chain = chain;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b44efd6314c..913ad9335e5 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6528,7 +6528,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
* were unable to sort above, then we'd better generate a Path, so
* that we at least have one.
*/
- if (hashaggtablesize < work_mem * 1024L ||
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L ||
grouped_rel->pathlist == NIL)
{
/*
@@ -6561,7 +6562,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
agg_final_costs,
dNumGroups);
- if (hashaggtablesize < work_mem * 1024L)
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6830,7 +6832,7 @@ create_partial_grouping_paths(PlannerInfo *root,
* Tentatively produce a partial HashAgg Path, depending on if it
* looks as if the hash table will fit in work_mem.
*/
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_total_path != NULL)
{
add_path(partially_grouped_rel, (Path *)
@@ -6857,7 +6859,7 @@ create_partial_grouping_paths(PlannerInfo *root,
dNumPartialPartialGroups);
/* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_partial_path != NULL)
{
add_partial_path(partially_grouped_rel, (Path *)
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 1a23e18970d..951aed80e7a 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1072,7 +1072,7 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
numGroupCols, dNumGroups,
NIL,
input_path->startup_cost, input_path->total_cost,
- input_path->rows);
+ input_path->rows, input_path->pathtarget->width);
/*
* Now for the sorted case. Note that the input is *always* unsorted,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e6d08aede56..8ba8122ee2f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1704,7 +1704,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
NIL,
subpath->startup_cost,
subpath->total_cost,
- rel->rows);
+ rel->rows,
+ subpath->pathtarget->width);
}
if (sjinfo->semi_can_btree && sjinfo->semi_can_hash)
@@ -2949,6 +2950,7 @@ create_agg_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->aggsplit = aggsplit;
pathnode->numGroups = numGroups;
+ pathnode->transitionSpace = aggcosts ? aggcosts->transitionSpace : 0;
pathnode->groupClause = groupClause;
pathnode->qual = qual;
@@ -2957,7 +2959,7 @@ create_agg_path(PlannerInfo *root,
list_length(groupClause), numGroups,
qual,
subpath->startup_cost, subpath->total_cost,
- subpath->rows);
+ subpath->rows, subpath->pathtarget->width);
/* add tlist eval cost for each output row */
pathnode->path.startup_cost += target->cost.startup;
@@ -3036,6 +3038,7 @@ create_groupingsets_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->rollups = rollups;
pathnode->qual = having_qual;
+ pathnode->transitionSpace = agg_costs ? agg_costs->transitionSpace : 0;
Assert(rollups != NIL);
Assert(aggstrategy != AGG_PLAIN || list_length(rollups) == 1);
@@ -3067,7 +3070,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
subpath->startup_cost,
subpath->total_cost,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
is_first = false;
if (!rollup->is_hashed)
is_first_sort = false;
@@ -3090,7 +3094,8 @@ create_groupingsets_path(PlannerInfo *root,
rollup->numGroups,
having_qual,
0.0, 0.0,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
if (!rollup->is_hashed)
is_first_sort = false;
}
@@ -3115,7 +3120,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
sort_path.startup_cost,
sort_path.total_cost,
- sort_path.rows);
+ sort_path.rows,
+ subpath->pathtarget->width);
}
pathnode->path.total_cost += agg_path.total_cost;
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb196444198..1151b807418 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -120,6 +120,7 @@ bool enableFsync = true;
bool allowSystemTableMods = false;
int work_mem = 1024;
int maintenance_work_mem = 16384;
+bool hashagg_mem_overflow = false;
int max_parallel_maintenance_workers = 2;
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8228e1f3903..ed6737a8ac9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -998,6 +998,26 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_hashagg_spill", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of hashed aggregation plans that are expected to exceed work_mem."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_hashagg_spill,
+ true,
+ NULL, NULL, NULL
+ },
+ {
+ {"hashagg_mem_overflow", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables hashed aggregation to overflow work_mem at execution time."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &hashagg_mem_overflow,
+ false,
+ NULL, NULL, NULL
+ },
{
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of materialization."),
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 33495f8b4b3..16f6762086b 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -201,6 +201,8 @@ static long ltsGetFreeBlock(LogicalTapeSet *lts);
static void ltsReleaseBlock(LogicalTapeSet *lts, long blocknum);
static void ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
SharedFileSet *fileset);
+static void ltsInitTape(LogicalTape *lt);
+static void ltsInitReadBuffer(LogicalTapeSet *lts, LogicalTape *lt);
/*
@@ -535,6 +537,51 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
lts->nHoleBlocks = lts->nBlocksAllocated - nphysicalblocks;
}
+/*
+ * Initialize per-tape struct. Note we allocate the I/O buffer and the first
+ * block for a tape only when it is first actually written to. This avoids
+ * wasting memory space when tuplesort.c overestimates the number of tapes
+ * needed.
+ */
+static void
+ltsInitTape(LogicalTape *lt)
+{
+ lt->writing = true;
+ lt->frozen = false;
+ lt->dirty = false;
+ lt->firstBlockNumber = -1L;
+ lt->curBlockNumber = -1L;
+ lt->nextBlockNumber = -1L;
+ lt->offsetBlockNumber = 0L;
+ lt->buffer = NULL;
+ lt->buffer_size = 0;
+ /* palloc() larger than MaxAllocSize would fail */
+ lt->max_size = MaxAllocSize;
+ lt->pos = 0;
+ lt->nbytes = 0;
+}
+
+/*
+ * Lazily allocate and initialize the read buffer. This avoids waste when many
+ * tapes are open at once, but not all are active between rewinding and
+ * reading.
+ */
+static void
+ltsInitReadBuffer(LogicalTapeSet *lts, LogicalTape *lt)
+{
+ if (lt->firstBlockNumber != -1L)
+ {
+ Assert(lt->buffer_size > 0);
+ lt->buffer = palloc(lt->buffer_size);
+ }
+
+ /* Read the first block, or reset if tape is empty */
+ lt->nextBlockNumber = lt->firstBlockNumber;
+ lt->pos = 0;
+ lt->nbytes = 0;
+ ltsReadFillBuffer(lts, lt);
+}
+
/*
* Create a set of logical tapes in a temporary underlying file.
*
@@ -560,7 +607,6 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
int worker)
{
LogicalTapeSet *lts;
- LogicalTape *lt;
int i;
/*
@@ -578,29 +624,8 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
lts->nFreeBlocks = 0;
lts->nTapes = ntapes;
- /*
- * Initialize per-tape structs. Note we allocate the I/O buffer and the
- * first block for a tape only when it is first actually written to. This
- * avoids wasting memory space when tuplesort.c overestimates the number
- * of tapes needed.
- */
for (i = 0; i < ntapes; i++)
- {
- lt = <s->tapes[i];
- lt->writing = true;
- lt->frozen = false;
- lt->dirty = false;
- lt->firstBlockNumber = -1L;
- lt->curBlockNumber = -1L;
- lt->nextBlockNumber = -1L;
- lt->offsetBlockNumber = 0L;
- lt->buffer = NULL;
- lt->buffer_size = 0;
- /* palloc() larger than MaxAllocSize would fail */
- lt->max_size = MaxAllocSize;
- lt->pos = 0;
- lt->nbytes = 0;
- }
+ ltsInitTape(<s->tapes[i]);
/*
* Create temp BufFile storage as required.
@@ -821,15 +846,9 @@ LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum, size_t buffer_size)
lt->buffer_size = 0;
if (lt->firstBlockNumber != -1L)
{
- lt->buffer = palloc(buffer_size);
+ /* the buffer is lazily allocated, but set the size here */
lt->buffer_size = buffer_size;
}
-
- /* Read the first block, or reset if tape is empty */
- lt->nextBlockNumber = lt->firstBlockNumber;
- lt->pos = 0;
- lt->nbytes = 0;
- ltsReadFillBuffer(lts, lt);
}
/*
@@ -878,6 +897,9 @@ LogicalTapeRead(LogicalTapeSet *lts, int tapenum,
lt = <s->tapes[tapenum];
Assert(!lt->writing);
+ if (lt->buffer == NULL)
+ ltsInitReadBuffer(lts, lt);
+
while (size > 0)
{
if (lt->pos >= lt->nbytes)
@@ -991,6 +1013,29 @@ LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum, TapeShare *share)
}
}
+/*
+ * Add additional tapes to this tape set. Not intended to be used when any
+ * tapes are frozen.
+ */
+LogicalTapeSet *
+LogicalTapeSetExtend(LogicalTapeSet *lts, int nAdditional)
+{
+ int i;
+ int nTapesOrig = lts->nTapes;
+ Size newSize;
+
+ lts->nTapes += nAdditional;
+ newSize = offsetof(LogicalTapeSet, tapes) +
+ lts->nTapes * sizeof(LogicalTape);
+
+ lts = (LogicalTapeSet *) repalloc(lts, newSize);
+
+ for (i = nTapesOrig; i < lts->nTapes; i++)
+ ltsInitTape(<s->tapes[i]);
+
+ return lts;
+}
+
/*
* Backspace the tape a given number of bytes. (We also support a more
* general seek interface, see below.)
@@ -1015,6 +1060,9 @@ LogicalTapeBackspace(LogicalTapeSet *lts, int tapenum, size_t size)
Assert(lt->frozen);
Assert(lt->buffer_size == BLCKSZ);
+ if (lt->buffer == NULL)
+ ltsInitReadBuffer(lts, lt);
+
/*
* Easy case for seek within current block.
*/
@@ -1087,6 +1135,9 @@ LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
Assert(offset >= 0 && offset <= TapeBlockPayloadSize);
Assert(lt->buffer_size == BLCKSZ);
+ if (lt->buffer == NULL)
+ ltsInitReadBuffer(lts, lt);
+
if (blocknum != lt->curBlockNumber)
{
ltsReadBlock(lts, blocknum, (void *) lt->buffer);
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 73a2ca8c6dd..d70bc048c46 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -226,9 +226,13 @@ typedef enum ExprEvalOp
EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
EEOP_AGG_INIT_TRANS,
+ EEOP_AGG_INIT_TRANS_SPILLED,
EEOP_AGG_STRICT_TRANS_CHECK,
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
EEOP_AGG_PLAIN_TRANS_BYVAL,
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
EEOP_AGG_PLAIN_TRANS,
+ EEOP_AGG_PLAIN_TRANS_SPILLED,
EEOP_AGG_ORDERED_TRANS_DATUM,
EEOP_AGG_ORDERED_TRANS_TUPLE,
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 81fdfa4add3..d6eb2abb60b 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -255,7 +255,7 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
- bool doSort, bool doHash);
+ bool doSort, bool doHash, bool spilled);
extern ExprState *ExecBuildGroupingEqual(TupleDesc ldesc, TupleDesc rdesc,
const TupleTableSlotOps *lops, const TupleTableSlotOps *rops,
int numCols,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 264916f9a92..307987a45ab 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -311,5 +311,8 @@ extern void ExecReScanAgg(AggState *node);
extern Size hash_agg_entry_size(int numAggs, Size tupleWidth,
Size transitionSpace);
+extern void hash_agg_set_limits(double hashentrysize, uint64 input_groups,
+ int used_bits, Size *mem_limit,
+ long *ngroups_limit, int *num_partitions);
#endif /* NODEAGG_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f985453ec32..707a07a2de4 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -244,6 +244,7 @@ extern bool enableFsync;
extern PGDLLIMPORT bool allowSystemTableMods;
extern PGDLLIMPORT int work_mem;
extern PGDLLIMPORT int maintenance_work_mem;
+extern PGDLLIMPORT bool hashagg_mem_overflow;
extern PGDLLIMPORT int max_parallel_maintenance_workers;
extern int VacuumCostPageHit;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5d5b38b8799..a04ea19b112 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2075,13 +2075,32 @@ typedef struct AggState
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
/* these fields are used in AGG_HASHED and AGG_MIXED modes: */
bool table_filled; /* hash table filled yet? */
- int num_hashes;
+ int num_hashes; /* number of hash tables active at once */
+ double hashentrysize; /* estimate revised during execution */
+ struct HashTapeInfo *hash_tapeinfo; /* metadata for spill tapes */
+ struct HashAggSpill *hash_spills; /* HashAggSpill for each hash table,
+ exists only during first pass if spilled */
+ TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
+ bool hash_ever_spilled; /* ever spilled during this execution? */
+ bool hash_spill_mode; /* we hit a limit during the current batch
+ and we must not create new groups */
+ Size hash_alloc_last; /* previous total memory allocation */
+ Size hash_alloc_current; /* current total memory allocation */
+ Size hash_mem_limit; /* limit before spilling hash table */
+ Size hash_mem_peak; /* peak hash table memory usage */
+ long hash_ngroups_current; /* number of groups currently in
+ memory in all hash tables */
+ long hash_ngroups_limit; /* limit before spilling hash table */
+ long hash_disk_used; /* kB of disk space used */
+ int hash_batches_used; /* batches used during entire execution */
+ List *hash_batches; /* hash batches remaining to be processed */
+
AggStatePerHash perhash; /* array of per-hashtable data */
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 34
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 49
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 3d3be197e0e..be592d0fee4 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1663,6 +1663,7 @@ typedef struct AggPath
AggStrategy aggstrategy; /* basic strategy, see nodes.h */
AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
double numGroups; /* estimated number of groups in input */
+ int32 transitionSpace; /* estimated transition state size */
List *groupClause; /* a list of SortGroupClause's */
List *qual; /* quals (HAVING quals), if any */
} AggPath;
@@ -1700,6 +1701,7 @@ typedef struct GroupingSetsPath
AggStrategy aggstrategy; /* basic strategy */
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
+ int32 transitionSpace; /* estimated transition state size */
} GroupingSetsPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 32c0d87f80e..f4183e1efa5 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -813,6 +813,7 @@ typedef struct Agg
Oid *grpOperators; /* equality operators to compare with */
Oid *grpCollations;
long numGroups; /* estimated number of groups in input */
+ int32 transitionSpace; /* estimated transition state size */
Bitmapset *aggParams; /* IDs of Params used in Aggref inputs */
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
List *groupingSets; /* grouping sets to use */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index cb012ba1980..6572dc24699 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -54,6 +54,7 @@ extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
extern PGDLLIMPORT bool enable_hashagg;
+extern PGDLLIMPORT bool enable_hashagg_spill;
extern PGDLLIMPORT bool enable_nestloop;
extern PGDLLIMPORT bool enable_material;
extern PGDLLIMPORT bool enable_mergejoin;
@@ -114,7 +115,7 @@ extern void cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples);
+ double input_tuples, double input_width);
extern void cost_windowagg(Path *path, PlannerInfo *root,
List *windowFuncs, int numPartCols, int numOrderCols,
Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index eab486a6214..c7bda2b0917 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -54,8 +54,8 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree);
+ List *groupingSets, List *chain, double dNumGroups,
+ int32 transitionSpace, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
/*
diff --git a/src/include/utils/logtape.h b/src/include/utils/logtape.h
index 695d2c00ee4..3ebe52239f8 100644
--- a/src/include/utils/logtape.h
+++ b/src/include/utils/logtape.h
@@ -67,6 +67,8 @@ extern void LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum,
extern void LogicalTapeRewindForWrite(LogicalTapeSet *lts, int tapenum);
extern void LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum,
TapeShare *share);
+extern LogicalTapeSet *LogicalTapeSetExtend(LogicalTapeSet *lts,
+ int nAdditional);
extern size_t LogicalTapeBackspace(LogicalTapeSet *lts, int tapenum,
size_t size);
extern void LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index f457b5b150f..7eeeaaa5e4a 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2357,3 +2357,124 @@ explain (costs off)
-> Seq Scan on onek
(8 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+set work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------------
+ GroupAggregate
+ Group Key: ((g % 100000))
+ -> Sort
+ Sort Key: ((g % 100000))
+ -> Function Scan on generate_series g
+(5 rows)
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+create table agg_group_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+set jit_above_cost to default;
+create table agg_group_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_group_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+-- Produce results with hash aggregation
+set enable_hashagg = true;
+set enable_sort = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 100000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+create table agg_hash_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+set jit_above_cost to default;
+create table agg_hash_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_hash_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+set enable_sort = true;
+set work_mem to default;
+-- Compare group aggregation results to hash aggregation results
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+ a | c1 | c2 | c3
+---+----+----+----
+(0 rows)
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_4 except select * from agg_group_4)
+ union all
+(select * from agg_group_4 except select * from agg_hash_4);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_group_4;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
+drop table agg_hash_4;
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index c1f802c88a7..767f60a96c7 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1633,4 +1633,127 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
| 1 | 2
(4 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+-- Produce results with hash aggregation.
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+set enable_sort = true;
+set work_mem to default;
+-- Compare results
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+ g100 | g10 | unnest | c | m
+------+-----+--------+---+---
+(0 rows)
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
-- end
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1de..11c6f50fbfa 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -148,6 +148,68 @@ SELECT count(*) FROM
4
(1 row)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+SET enable_hashagg=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------------
+ Unique
+ -> Sort
+ Sort Key: ((g % 1000))
+ -> Function Scan on generate_series g
+(4 rows)
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_hashagg=TRUE;
+-- Produce results with hash aggregation.
+SET enable_sort=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 1000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_sort=TRUE;
+SET work_mem TO DEFAULT;
+-- Compare results
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb9057..c40bf6c16eb 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -75,6 +75,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
+ enable_hashagg_spill | on
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 3e593f2d615..a4d476c4bb3 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1032,3 +1032,119 @@ select v||'a', case when v||'a' = 'aa' then 1 else 0 end, count(*)
explain (costs off)
select 1 from tenk1
where (hundred, thousand) in (select twothousand, twothousand from onek);
+
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+set work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+
+set jit_above_cost to default;
+
+create table agg_group_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_group_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+-- Produce results with hash aggregation
+
+set enable_hashagg = true;
+set enable_sort = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+
+set jit_above_cost to default;
+
+create table agg_hash_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_hash_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare group aggregation results to hash aggregation results
+
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+
+(select * from agg_hash_4 except select * from agg_group_4)
+ union all
+(select * from agg_group_4 except select * from agg_hash_4);
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_group_4;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
+drop table agg_hash_4;
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 95ac3fb52f6..bf8bce6ed31 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -441,4 +441,103 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
from unnest(array[1,1], array['a','b']) u(i,v)
group by rollup(i, v||'a') order by 1,3;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+-- Produce results with hash aggregation.
+
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare results
+
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
+
-- end
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449e..33102744ebf 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -45,6 +45,68 @@ SELECT count(*) FROM
SELECT count(*) FROM
(SELECT DISTINCT two, four, two FROM tenk1) ss;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+SET enable_hashagg=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_hashagg=TRUE;
+
+-- Produce results with hash aggregation.
+
+SET enable_sort=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_sort=TRUE;
+
+SET work_mem TO DEFAULT;
+
+-- Compare results
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
+
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
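As an aside for reviewers: to make the disk-cost arithmetic added to
cost_agg() above easier to sanity-check, here is a minimal standalone C
sketch of the same formula with made-up inputs. All of the constants
below are assumptions chosen for illustration (they are not what
hash_agg_set_limits() would actually compute), and relation_byte_size()
is approximated as tuples * width without per-tuple overhead.

/* Standalone sketch of the spill cost estimate; compile with: cc sketch.c -lm */
#include <math.h>
#include <stdio.h>

int
main(void)
{
	double		numGroups = 1000000.0;		/* assumed group estimate */
	double		input_tuples = 2000000.0;	/* assumed input rows */
	double		input_width = 32.0;			/* assumed average tuple width */
	double		hashentrysize = 80.0;		/* assumed per-group entry size */
	double		mem_limit = 4.0 * 1024.0 * 1024.0;	/* stand-in for work_mem */
	double		ngroups_limit = mem_limit / hashentrysize;
	double		num_partitions = 32.0;		/* assumed spill fan-out */
	double		pages_written = 0.0;
	double		pages_read = 0.0;
	double		nbatches;

	nbatches = fmax(numGroups * hashentrysize / mem_limit,
					numGroups / ngroups_limit);

	if (nbatches > 1.0)
	{
		/* each level of recursion writes and later re-reads the input */
		double		pages = input_tuples * input_width / 8192.0;	/* BLCKSZ */
		double		depth = ceil(log(nbatches - 1.0) / log(num_partitions));

		pages_written = pages_read = pages * depth;
	}

	printf("nbatches=%.1f pages_written=%.1f pages_read=%.1f\n",
		   nbatches, pages_written, pages_read);
	return 0;
}

With these inputs the sketch comes out to roughly 19 batches and a
partitioning depth of 1, so each input page is expected to be written
and read about once.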
On Wed, 2020-02-12 at 21:51 -0800, Jeff Davis wrote:
The new patch is basically just rebased -- a few other very minor
changes.
I extracted out some minor refactoring of nodeAgg.c that I can commit
separately. That will make the main patch a little easier to review.
Attached.
* split build_hash_table() into two functions
* separated hash calculation from lookup
* changed lookup_hash_entry to return AggStatePerGroup directly instead
of the TupleHashEntryData (which the caller only used to get the
AggStatePerGroup, anyway)
Regards,
Jeff Davis
Attachments:
refactor.patch (text/x-patch)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index b7f49ceddf8..e77413ff4f3 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -263,6 +263,7 @@ static void finalize_partialaggregate(AggState *aggstate,
AggStatePerAgg peragg,
AggStatePerGroup pergroupstate,
Datum *resultVal, bool *resultIsNull);
+static void prepare_hash_slot(AggState *aggstate);
static void prepare_projection_slot(AggState *aggstate,
TupleTableSlot *slot,
int currentSet);
@@ -272,8 +273,9 @@ static void finalize_aggregates(AggState *aggstate,
static TupleTableSlot *project_aggregates(AggState *aggstate);
static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
-static void build_hash_table(AggState *aggstate);
-static TupleHashEntryData *lookup_hash_entry(AggState *aggstate);
+static void build_hash_tables(AggState *aggstate);
+static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
+static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
@@ -1035,6 +1037,32 @@ finalize_partialaggregate(AggState *aggstate,
MemoryContextSwitchTo(oldContext);
}
+/*
+ * Extract the attributes that make up the grouping key into the
+ * hashslot. This is necessary to compute the hash of the grouping key.
+ */
+static void
+prepare_hash_slot(AggState *aggstate)
+{
+ TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ int i;
+
+ /* transfer just the needed columns into hashslot */
+ slot_getsomeattrs(inputslot, perhash->largestGrpColIdx);
+ ExecClearTuple(hashslot);
+
+ for (i = 0; i < perhash->numhashGrpCols; i++)
+ {
+ int varNumber = perhash->hashGrpColIdxInput[i] - 1;
+
+ hashslot->tts_values[i] = inputslot->tts_values[varNumber];
+ hashslot->tts_isnull[i] = inputslot->tts_isnull[varNumber];
+ }
+ ExecStoreVirtualTuple(hashslot);
+}
+
/*
* Prepare to finalize and project based on the specified representative tuple
* slot and grouping set.
@@ -1249,41 +1277,62 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
* they are all reset at the same time).
*/
static void
-build_hash_table(AggState *aggstate)
+build_hash_tables(AggState *aggstate)
{
- MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
- Size additionalsize;
- int i;
+ int setno;
- Assert(aggstate->aggstrategy == AGG_HASHED || aggstate->aggstrategy == AGG_MIXED);
-
- additionalsize = aggstate->numtrans * sizeof(AggStatePerGroupData);
-
- for (i = 0; i < aggstate->num_hashes; ++i)
+ for (setno = 0; setno < aggstate->num_hashes; ++setno)
{
- AggStatePerHash perhash = &aggstate->perhash[i];
+ AggStatePerHash perhash = &aggstate->perhash[setno];
Assert(perhash->aggnode->numGroups > 0);
- if (perhash->hashtable)
- ResetTupleHashTable(perhash->hashtable);
- else
- perhash->hashtable = BuildTupleHashTableExt(&aggstate->ss.ps,
- perhash->hashslot->tts_tupleDescriptor,
- perhash->numCols,
- perhash->hashGrpColIdxHash,
- perhash->eqfuncoids,
- perhash->hashfunctions,
- perhash->aggnode->grpCollations,
- perhash->aggnode->numGroups,
- additionalsize,
- aggstate->ss.ps.state->es_query_cxt,
- aggstate->hashcontext->ecxt_per_tuple_memory,
- tmpmem,
- DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
+ build_hash_table(aggstate, setno, perhash->aggnode->numGroups);
}
}
+/*
+ * Build a single hashtable for this grouping set. Pass the hash memory
+ * context as both metacxt and tablecxt, so that resetting the hashcontext
+ * will free all memory including metadata. That means that we cannot reset
+ * the hash table to empty and reuse it, though (see execGrouping.c).
+ */
+static void
+build_hash_table(AggState *aggstate, int setno, long nbuckets)
+{
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ MemoryContext metacxt = aggstate->ss.ps.state->es_query_cxt;
+ MemoryContext hashcxt = aggstate->hashcontext->ecxt_per_tuple_memory;
+ MemoryContext tmpcxt = aggstate->tmpcontext->ecxt_per_tuple_memory;
+ Size additionalsize;
+
+ Assert(aggstate->aggstrategy == AGG_HASHED ||
+ aggstate->aggstrategy == AGG_MIXED);
+
+ /*
+ * Used to make sure initial hash table allocation does not exceed
+ * work_mem. Note that the estimate does not include space for
+ * pass-by-reference transition data values, nor for the representative
+ * tuple of each group.
+ */
+ additionalsize = aggstate->numtrans * sizeof(AggStatePerGroupData);
+
+ perhash->hashtable = BuildTupleHashTableExt(
+ &aggstate->ss.ps,
+ perhash->hashslot->tts_tupleDescriptor,
+ perhash->numCols,
+ perhash->hashGrpColIdxHash,
+ perhash->eqfuncoids,
+ perhash->hashfunctions,
+ perhash->aggnode->grpCollations,
+ nbuckets,
+ additionalsize,
+ metacxt,
+ hashcxt,
+ tmpcxt,
+ DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
+}
+
/*
* Compute columns that actually need to be stored in hashtable entries. The
* incoming tuples from the child plan node will contain grouping columns,
@@ -1441,33 +1490,20 @@ hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
* set (which the caller must have selected - note that initialize_aggregate
* depends on this).
*
- * When called, CurrentMemoryContext should be the per-query context.
+ * When called, CurrentMemoryContext should be the per-query context. The
+ * already-calculated hash value for the tuple must be specified.
*/
-static TupleHashEntryData *
-lookup_hash_entry(AggState *aggstate)
+static AggStatePerGroup
+lookup_hash_entry(AggState *aggstate, uint32 hash)
{
- TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
TupleTableSlot *hashslot = perhash->hashslot;
TupleHashEntryData *entry;
bool isnew;
- int i;
-
- /* transfer just the needed columns into hashslot */
- slot_getsomeattrs(inputslot, perhash->largestGrpColIdx);
- ExecClearTuple(hashslot);
-
- for (i = 0; i < perhash->numhashGrpCols; i++)
- {
- int varNumber = perhash->hashGrpColIdxInput[i] - 1;
-
- hashslot->tts_values[i] = inputslot->tts_values[varNumber];
- hashslot->tts_isnull[i] = inputslot->tts_isnull[varNumber];
- }
- ExecStoreVirtualTuple(hashslot);
/* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntry(perhash->hashtable, hashslot, &isnew);
+ entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, &isnew,
+ hash);
if (isnew)
{
@@ -1492,7 +1528,7 @@ lookup_hash_entry(AggState *aggstate)
}
}
- return entry;
+ return entry->additional;
}
/*
@@ -1510,8 +1546,13 @@ lookup_hash_entries(AggState *aggstate)
for (setno = 0; setno < numHashes; setno++)
{
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ uint32 hash;
+
select_current_set(aggstate, setno, true);
- pergroup[setno] = lookup_hash_entry(aggstate)->additional;
+ prepare_hash_slot(aggstate);
+ hash = TupleHashTableHash(perhash->hashtable, perhash->hashslot);
+ pergroup[setno] = lookup_hash_entry(aggstate, hash);
}
}
@@ -2478,7 +2519,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->hash_pergroup = pergroups;
find_hash_columns(aggstate);
- build_hash_table(aggstate);
+ build_hash_tables(aggstate);
aggstate->table_filled = false;
}
@@ -3498,7 +3539,7 @@ ExecReScanAgg(AggState *node)
{
ReScanExprContext(node->hashcontext);
/* Rebuild an empty hash table */
- build_hash_table(node);
+ build_hash_tables(node);
node->table_filled = false;
/* iterator will be reset when the table is filled */
}
On Wed, Jan 8, 2020 at 2:38 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
This makes the assumption that all Aggrefs or GroupingFuncs are at the
top of the TargetEntry. That's not true, e.g.:
select 0+sum(a) from foo group by b;
I think find_aggregated_cols() and find_unaggregated_cols() should be
merged into one function that scans the targetlist once, and returns two
Bitmapsets. They're always used together, anyway.
So, I've attached a patch that does what Heikki recommended and gets
both aggregated and unaggregated columns in two different bitmapsets.
I think it works for more cases than the other patch.
I'm not sure it is the ideal interface, but since there aren't many
consumers, it's hard to say.
Also, it needs some formatting/improved naming/etc.
Per Jeff's comment in [1] I started looking into using the scanCols
patch from the thread on extracting scanCols from PlannerInfo [2] to
get the aggregated and unaggregated columns for this patch.
Since we only make one bitmap for scanCols containing all of the
columns that need to be scanned, there is no context about where the
columns came from in the query.
That is, once the bit is set in the bitmapset, we have no way of
knowing if that column was needed for aggregation or if it is filtered
out immediately.
We could solve this by creating multiple bitmaps at the time that we
create the scanCols field -- one for aggregated columns, one for
unaggregated columns, and, potentially more if useful to other
consumers.
The initial problem with this is that we extract scanCols from the
PlannerInfo->simple_rel_array and PlannerInfo->simple_rte_array.
If we wanted more context about where those columns were from in the
query, we would have to either change how we construct the scanCols or
construct them early and add to the bitmap when adding columns to the
simple_rel_array and simple_rte_array (which, I suppose, is the same
thing as changing how we construct scanCols).
This might decentralize the code for the benefit of one consumer.
Also, looping through the simple_rel_array and simple_rte_array a
couple times per query seems like it would add negligible overhead.
I'm more hesitant to add code (most likely involving a walker) to the
codepath everybody uses if only agg will leverage the two distinct
bitmapsets.
Overall, I think it seems like a good idea to leverage scanCols for
determining what columns hashagg needs to spill, but I can't think of
a way of doing it that doesn't seem bad. scanCols are currently just
that -- columns that will need to be scanned.
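For concreteness, here is a rough sketch of the multiple-bitmaps idea
above. The struct and function names are hypothetical (nothing like
this exists in the scanCols patch or in the patch attached here); it is
only meant to show the shape such an interface might take.

/*
 * Hypothetical sketch only: collect scan columns into per-purpose
 * bitmapsets at the point where they are gathered, so that a consumer
 * such as hash agg spilling can tell them apart.  These names do not
 * exist in the scanCols patch or in the attached patch.
 */
#include "postgres.h"
#include "access/attnum.h"
#include "nodes/bitmapset.h"

typedef struct ScanColsByUse
{
	Bitmapset  *aggregated_cols;	/* columns referenced inside Aggrefs */
	Bitmapset  *unaggregated_cols;	/* columns referenced anywhere else */
} ScanColsByUse;

static void
record_scan_col(ScanColsByUse *cols, AttrNumber attno, bool in_aggref)
{
	if (in_aggref)
		cols->aggregated_cols =
			bms_add_member(cols->aggregated_cols, attno);
	else
		cols->unaggregated_cols =
			bms_add_member(cols->unaggregated_cols, attno);
}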
[1]: /messages/by-id/e5566f7def33a9e9fdff337cca32d07155d7b635.camel@j-davis.com
[2]: /messages/by-id/CAAKRu_Yj=Q_ZxiGX+pgstNWMbUJApEJX-imvAEwryCk5SLUebg@mail.gmail.com
--
Melanie Plageman
Attachments:
v1-0001-aggregated-unaggregated-cols-together.patch (application/octet-stream)
From 925f06e1c884d24b4ec2ad517c222bd782c40bdf Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 13 Feb 2020 17:12:43 -0800
Subject: [PATCH v1] Find aggregated and unaggregated columns in same function
---
src/backend/executor/nodeAgg.c | 82 +++++++++++++++++++++++-----------
1 file changed, 57 insertions(+), 25 deletions(-)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index b7f49ceddf..2ba321f279 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -270,8 +270,11 @@ static void finalize_aggregates(AggState *aggstate,
AggStatePerAgg peragg,
AggStatePerGroup pergroup);
static TupleTableSlot *project_aggregates(AggState *aggstate);
-static Bitmapset *find_unaggregated_cols(AggState *aggstate);
-static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
+
+static bool find_aggregated_cols_walker(Node *node, void *context);
+static bool find_unaggregated_cols_walker(Node *node, void *context);
+static void find_cols(AggState *aggstate, Bitmapset **aggregated_colnos, Bitmapset **unaggregated_colnos);
+
static void build_hash_table(AggState *aggstate);
static TupleHashEntryData *lookup_hash_entry(AggState *aggstate);
static void lookup_hash_entries(AggState *aggstate);
@@ -1189,30 +1192,38 @@ project_aggregates(AggState *aggstate)
return NULL;
}
-/*
- * find_unaggregated_cols
- * Construct a bitmapset of the column numbers of un-aggregated Vars
- * appearing in our targetlist and qual (HAVING clause)
- */
-static Bitmapset *
-find_unaggregated_cols(AggState *aggstate)
+
+typedef struct FindColsContext
{
- Agg *node = (Agg *) aggstate->ss.ps.plan;
- Bitmapset *colnos;
-
- colnos = NULL;
- (void) find_unaggregated_cols_walker((Node *) node->plan.targetlist,
- &colnos);
- (void) find_unaggregated_cols_walker((Node *) node->plan.qual,
- &colnos);
- return colnos;
+ Bitmapset *aggregated_colnos;
+ Bitmapset *unaggregated_colnos;
+} FindColsContext;
+
+static bool
+find_aggregated_cols_walker(Node *node, void *context)
+{
+ if (node == NULL)
+ return false;
+
+ FindColsContext *find_cols_context = (FindColsContext *) context;
+
+ if (IsA(node, Var))
+ {
+ Var *var = (Var *) node;
+ find_cols_context->aggregated_colnos = bms_add_member(find_cols_context->aggregated_colnos, var->varattno);
+ return false;
+ }
+ return expression_tree_walker(node, find_aggregated_cols_walker, (void *) find_cols_context);
}
static bool
-find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
+find_unaggregated_cols_walker(Node *node, void *context)
{
if (node == NULL)
return false;
+
+ FindColsContext *find_cols_context = (FindColsContext *) context;
+
if (IsA(node, Var))
{
Var *var = (Var *) node;
@@ -1220,18 +1231,36 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
/* setrefs.c should have set the varno to OUTER_VAR */
Assert(var->varno == OUTER_VAR);
Assert(var->varlevelsup == 0);
- *colnos = bms_add_member(*colnos, var->varattno);
+ /*
+ * Construct a bitmapset of the column numbers of un-aggregated Vars
+ * appearing in our targetlist and qual (HAVING clause)
+ */
+ find_cols_context->unaggregated_colnos = bms_add_member(find_cols_context->unaggregated_colnos, var->varattno);
return false;
}
if (IsA(node, Aggref) ||IsA(node, GroupingFunc))
{
- /* do not descend into aggregate exprs */
- return false;
+ return find_aggregated_cols_walker(node, (void *) find_cols_context);
}
- return expression_tree_walker(node, find_unaggregated_cols_walker,
- (void *) colnos);
+ return expression_tree_walker(node, find_unaggregated_cols_walker, (void *) find_cols_context);
}
+static void
+find_cols(AggState *aggstate, Bitmapset **aggregated_colnos, Bitmapset **unaggregated_colnos)
+{
+ Agg *node = (Agg *) aggstate->ss.ps.plan;
+
+ FindColsContext findColsContext;
+ findColsContext.aggregated_colnos = NULL;
+ findColsContext.unaggregated_colnos = NULL;
+ (void) find_unaggregated_cols_walker((Node *) node->plan.targetlist, &findColsContext);
+ (void) find_unaggregated_cols_walker((Node *) node->plan.qual, &findColsContext);
+ *aggregated_colnos = findColsContext.aggregated_colnos;
+ *unaggregated_colnos = findColsContext.unaggregated_colnos;
+}
+
+
+
/*
* (Re-)initialize the hash table(s) to empty.
*
@@ -1318,8 +1347,11 @@ find_hash_columns(AggState *aggstate)
EState *estate = aggstate->ss.ps.state;
int j;
+ Bitmapset *aggregated_colnos;
+ Bitmapset *unaggregated_colnos;
+ find_cols(aggstate, &aggregated_colnos, &unaggregated_colnos);
/* Find Vars that will be needed in tlist and qual */
- base_colnos = find_unaggregated_cols(aggstate);
+ base_colnos = unaggregated_colnos;
for (j = 0; j < numHashes; ++j)
{
--
2.20.1 (Apple Git-117)
On Wed, 2020-02-12 at 21:51 -0800, Jeff Davis wrote:
On Mon, 2020-02-10 at 15:57 -0800, Jeff Davis wrote:
Attaching latest version (combined logtape changes along with main
HashAgg patch).
I ran a matrix of small performance tests to look for regressions.
I ran some more tests, this time comparing Hash Aggregation to
Sort+Group.
Summary of trends:
group key complexity : favors Hash
group key size : favors Hash
group size : favors Hash
higher work_mem : favors Sort [1]
data set size : favors Sort [1]
number of aggregates : favors Hash [2]
[1]: I have closed the gap a bit with some post-experiment tuning. I have
just begun to analyze this case so I think there is quite a bit more room
for improvement.
[2]: Could use more exploration -- I don't have an explanation.
Data sets:
t20m_1_int4: ~20 million groups of size ~1 (uniform)
t1m_20_int4: ~1 million groups of size ~20 (uniform)
t1k_20k_int4: ~1k groups of size ~20k (uniform)
also, text versions of each of those with collate "C.UTF-8"
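For reference, tables of roughly these shapes can be generated with
generate_series (an illustrative sketch, not necessarily the exact data
used for these runs):

-- ~1 million groups of size ~20, int4 grouping key
create table t1m_20_int4 as
  select (g % 1000000)::int4 as i from generate_series(1, 20000000) g;

-- text variant using the "C.UTF-8" collation
create table t1m_20_text as
  select ((g % 1000000)::text collate "C.UTF-8") as i
  from generate_series(1, 20000000) g;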
Results:
1. A general test to vary the group size, key type, and work_mem.
Query:
select i from $TABLE group by i offset 100000000;
work_mem='4MB'
+----------------+----------+-------------+--------------+
| | sort(ms) | hashagg(ms) | sort/hashagg |
+----------------+----------+-------------+--------------+
| t20m_1_int4 | 11852 | 10640 | 1.11 |
| t1m_20_int4 | 11108 | 8109 | 1.37 |
| t1k_20k_int4 | 8575 | 2732 | 3.14 |
| t20m_1_text | 80463 | 12902 | 6.24 |
| t1m_20_text | 58586 | 9252 | 6.33 |
| t1k_20k_text | 21781 | 5739 | 3.80 |
+----------------+----------+-------------+--------------+
work_mem='32MB'
+----------------+----------+-------------+--------------+
| | sort(ms) | hashagg(ms) | sort/hashagg |
+----------------+----------+-------------+--------------+
| t20m_1_int4 | 9656 | 11702 | 0.83 |
| t1m_20_int4 | 8870 | 9804 | 0.90 |
| t1k_20k_int4 | 6359 | 1852 | 3.43 |
| t20m_1_text | 74266 | 14434 | 5.15 |
| t1m_20_text | 56549 | 10180 | 5.55 |
| t1k_20k_text | 21407 | 3989 | 5.37 |
+----------------+----------+-------------+--------------+
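Results like these can be reproduced by toggling the planner GUCs around
the same query (an illustrative sketch; not necessarily the harness used
here):

set work_mem = '4MB';
set enable_hashagg = off;   -- force the Sort+Group plan
explain analyze select i from t1m_20_int4 group by i offset 100000000;

set enable_hashagg = on;
set enable_sort = off;      -- force the HashAgg plan
explain analyze select i from t1m_20_int4 group by i offset 100000000;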
2. Test group key size
data set:
20m rows, four int4 columns.
Columns a,b,c are all the constant value 1, forcing each
comparison to look at all four columns.
Query: select a,b,c,d from wide group by a,b,c,d offset 100000000;
work_mem='4MB'
Sort : 30852ms
HashAgg : 12343ms
Sort/HashAgg : 2.50
In theory, if the first grouping column is highly selective, then Sort
may have a slight advantage because it can look at only the first
column, while HashAgg needs to look at all 4. But HashAgg only needs to
perform this calculation once and it seems hard enough to show this in
practice that I consider it an edge case. In "normal" cases, it appears
that more grouping columns significantly favors Hash Agg.
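A table matching that description could be built like this (again an
illustrative sketch; the exact definition of "wide", in particular the
distribution of column d, is an assumption):

create table wide as
  select 1::int4 as a, 1::int4 as b, 1::int4 as c, g::int4 as d
  from generate_series(1, 20000000) g;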
3. Test number of aggregates
Data Set: same as for test #2 (group key size).
Query: select d, count(a),sum(b),avg(c),min(d)
from wide group by d offset 100000000;
work_mem='4MB'
Sort : 22373ms
HashAgg : 17338ms
Sort/HashAgg : 1.29
I don't have an explanation of why HashAgg is doing better here. Both
of them are using JIT and essentially doing the same number of
advancements. This could use more exploration, but the effect isn't
major.
4. Test data size
Data set: 400 million rows of four random int8s. Group size of one.
Query: select a from t400m_1_int8 group by a offset 1000000000;
work_mem='32MB'
Sort : 300675ms
HashAgg : 560740ms
Sort/HashAgg : 0.54
I tried increasing the max number of partitions and brought the HashAgg
runtime down to 481985 (using 1024 partitions), which closes the gap to
0.62. That's not too bad for HashAgg considering this is a group size
of one with a simple group key. A bit more tuning might be able to
close the gap further.
Conclusion:
HashAgg is winning in a lot of cases, and this will be an important
improvement for many workloads. Not only is it faster in a lot of
cases, but it's also less risky. When an input has unknown group size,
it's much easier for the planner to choose HashAgg -- a small downside
and a big upside.
Regards,
Jeff Davis
Hi,
I wanted to take a look at this thread and do a review, but it's not
very clear to me if the recent patches posted here are independent or
how exactly they fit together. I see
1) hashagg-20200212-1.patch (2020/02/13 by Jeff)
2) refactor.patch (2020/02/13 by Jeff)
3) v1-0001-aggregated-unaggregated-cols-together.patch (2020/02/14 by
Melanie)
I suppose this also confuses the cfbot - it's probably only testing (3)
as it's the last thing posted here, at least I think that's the case.
And it fails:
nodeAgg.c: In function ‘find_aggregated_cols_walker’:
nodeAgg.c:1208:2: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
FindColsContext *find_cols_context = (FindColsContext *) context;
^
nodeAgg.c: In function ‘find_unaggregated_cols_walker’:
nodeAgg.c:1225:2: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
FindColsContext *find_cols_context = (FindColsContext *) context;
^
cc1: all warnings being treated as errors
<builtin>: recipe for target 'nodeAgg.o' failed
make[3]: *** [nodeAgg.o] Error 1
make[3]: *** Waiting for unfinished jobs....
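FWIW, the error is about declarations appearing after executable
statements in the same block; hoisting the declaration above the first
statement is the usual fix. A sketch against the walker above
(illustrative only, not a replacement patch):

static bool
find_aggregated_cols_walker(Node *node, void *context)
{
	FindColsContext *find_cols_context = (FindColsContext *) context;

	if (node == NULL)
		return false;

	if (IsA(node, Var))
	{
		Var		   *var = (Var *) node;

		find_cols_context->aggregated_colnos =
			bms_add_member(find_cols_context->aggregated_colnos, var->varattno);
		return false;
	}

	return expression_tree_walker(node, find_aggregated_cols_walker,
								  (void *) find_cols_context);
}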
It's probably a good idea to either start a separate thread for patches
that are only loosely related to the main topic, or always post the
whole patch series.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, 2020-02-18 at 19:57 +0100, Tomas Vondra wrote:
Hi,
I wanted to take a look at this thread and do a review, but it's not
very clear to me if the recent patches posted here are independent or
how exactly they fit together. I see
Attached latest version rebased on master.
It's probably a good idea to either start a separate thread for patches
that are only loosely related to the main topic, or always post the
whole patch series.
Will do, sorry for the confusion.
Regards,
Jeff Davis
Attachments:
hashagg-20200218.patch (text/x-patch; charset=UTF-8)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c1128f89ec7..85f559387f9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1751,6 +1751,23 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-hashagg-mem-overflow" xreflabel="hashagg_mem_overflow">
+ <term><varname>hashagg_mem_overflow</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>hashagg_mem_overflow</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ If hash aggregation exceeds <varname>work_mem</varname> at query
+ execution time, and <varname>hashagg_mem_overflow</varname> is set
+ to <literal>on</literal>, continue consuming more memory rather than
+ performing disk-based hash aggregation. The default
+ is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
@@ -4476,6 +4493,24 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-hashagg-spill" xreflabel="enable_hashagg_spill">
+ <term><varname>enable_hashagg_spill</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_hashagg_spill</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the memory usage is expected to
+ exceed <varname>work_mem</varname>. This only affects the planner
+ choice; actual behavior at execution time is dictated by
+ <xref linkend="guc-hashagg-mem-overflow"/>. The default
+ is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
<term><varname>enable_hashjoin</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d901dc4a50e..2923f4ba46d 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -104,6 +104,7 @@ static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_hashagg_info(AggState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
static void show_instrumentation_count(const char *qlabel, int which,
@@ -1882,6 +1883,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Agg:
show_agg_keys(castNode(AggState, planstate), ancestors, es);
show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+ if (es->analyze)
+ show_hashagg_info((AggState *) planstate, es);
if (plan->qual)
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
@@ -2769,6 +2772,55 @@ show_hash_info(HashState *hashstate, ExplainState *es)
}
}
+/*
+ * If EXPLAIN ANALYZE, show information on hash aggregate memory usage and
+ * batches.
+ */
+static void
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+
+ Assert(IsA(aggstate, AggState));
+
+ if (agg->aggstrategy != AGG_HASHED &&
+ agg->aggstrategy != AGG_MIXED)
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(
+ es->str,
+ "Memory Usage: %ldkB",
+ memPeakKb);
+
+ if (aggstate->hash_batches_used > 0)
+ {
+ appendStringInfo(
+ es->str,
+ " Batches: %d Disk: %ldkB",
+ aggstate->hash_batches_used, aggstate->hash_disk_used);
+ }
+
+ appendStringInfo(
+ es->str,
+ "\n");
+ }
+ else
+ {
+ ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
+ if (aggstate->hash_batches_used > 0)
+ {
+ ExplainPropertyInteger("HashAgg Batches", NULL,
+ aggstate->hash_batches_used, es);
+ ExplainPropertyInteger("Disk Usage", "kB",
+ aggstate->hash_disk_used, es);
+ }
+ }
+}
+
/*
* If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
*/
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 121eff97a0c..9dff7990742 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -79,7 +79,8 @@ static void ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash);
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled);
/*
@@ -2927,7 +2928,7 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
*/
ExprState *
ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
- bool doSort, bool doHash)
+ bool doSort, bool doHash, bool spilled)
{
ExprState *state = makeNode(ExprState);
PlanState *parent = &aggstate->ss.ps;
@@ -3158,7 +3159,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (int setno = 0; setno < processGroupingSets; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, false);
+ pertrans, transno, setno, setoff, false,
+ spilled);
setoff++;
}
}
@@ -3177,7 +3179,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (int setno = 0; setno < numHashes; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, true);
+ pertrans, transno, setno, setoff, true,
+ spilled);
setoff++;
}
}
@@ -3227,7 +3230,8 @@ static void
ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash)
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled)
{
int adjust_init_jumpnull = -1;
int adjust_strict_jumpnull = -1;
@@ -3249,7 +3253,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
fcinfo->flinfo->fn_strict &&
pertrans->initValueIsNull)
{
- scratch->opcode = EEOP_AGG_INIT_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_INIT_TRANS_SPILLED : EEOP_AGG_INIT_TRANS;
scratch->d.agg_init_trans.pertrans = pertrans;
scratch->d.agg_init_trans.setno = setno;
scratch->d.agg_init_trans.setoff = setoff;
@@ -3265,7 +3270,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
if (pertrans->numSortCols == 0 &&
fcinfo->flinfo->fn_strict)
{
- scratch->opcode = EEOP_AGG_STRICT_TRANS_CHECK;
+ scratch->opcode = spilled ?
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED : EEOP_AGG_STRICT_TRANS_CHECK;
scratch->d.agg_strict_trans_check.setno = setno;
scratch->d.agg_strict_trans_check.setoff = setoff;
scratch->d.agg_strict_trans_check.transno = transno;
@@ -3282,9 +3288,11 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
/* invoke appropriate transition implementation */
if (pertrans->numSortCols == 0 && pertrans->transtypeByVal)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS_BYVAL;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED : EEOP_AGG_PLAIN_TRANS_BYVAL;
else if (pertrans->numSortCols == 0)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_SPILLED : EEOP_AGG_PLAIN_TRANS;
else if (pertrans->numInputs == 1)
scratch->opcode = EEOP_AGG_ORDERED_TRANS_DATUM;
else
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 35eb8b99f69..e21e0c440ea 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -426,9 +426,13 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
&&CASE_EEOP_AGG_INIT_TRANS,
+ &&CASE_EEOP_AGG_INIT_TRANS_SPILLED,
&&CASE_EEOP_AGG_STRICT_TRANS_CHECK,
+ &&CASE_EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_SPILLED,
&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
&&CASE_EEOP_LAST
@@ -1619,6 +1623,35 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_init_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
+
+ /* If transValue has not yet been initialized, do so now. */
+ if (pergroup->noTransValue)
+ {
+ AggStatePerTrans pertrans = op->d.agg_init_trans.pertrans;
+
+ aggstate->curaggcontext = op->d.agg_init_trans.aggcontext;
+ aggstate->current_set = op->d.agg_init_trans.setno;
+
+ ExecAggInitGroup(aggstate, pertrans, pergroup);
+
+ /* copied trans value from input, done this round */
+ EEO_JUMP(op->d.agg_init_trans.jumpnull);
+ }
+
+ EEO_NEXT();
+ }
/* check that a strict aggregate's input isn't NULL */
EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK)
@@ -1635,6 +1668,24 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_strict_trans_check.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_strict_trans_check.transno];
+
+ if (unlikely(pergroup->transValueIsNull))
+ EEO_JUMP(op->d.agg_strict_trans_check.jumpnull);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1683,6 +1734,51 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ Assert(pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1726,6 +1822,66 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
newVal = FunctionCallInvoke(fcinfo);
+ /*
+ * For pass-by-ref datatype, must copy the new value into
+ * aggcontext and free the prior transValue. But if transfn
+ * returned a pointer to its first input, we don't need to do
+ * anything. Also, if transfn returned a pointer to a R/W
+ * expanded object that is already a child of the aggcontext,
+ * assume we can adopt that value without copying it.
+ */
+ if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+ newVal = ExecAggTransReparent(aggstate, pertrans,
+ newVal, fcinfo->isnull,
+ pergroup->transValue,
+ pergroup->transValueIsNull);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ Assert(!pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
/*
* For pass-by-ref datatype, must copy the new value into
* aggcontext and free the prior transValue. But if transfn
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 85311f2303a..5692af3de4c 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -194,6 +194,29 @@
* transition values. hashcontext is the single context created to support
* all hash tables.
*
+ * Spilling To Disk
+ *
+ * When performing hash aggregation, if the hash table memory exceeds the
+ * limit (see hash_agg_check_limits()), we enter "spill mode". In spill
+ * mode, we advance the transition states only for groups already in the
+ * hash table. For tuples that would need to create a new hash table
+ * entries (and initialize new transition states), we instead spill them to
+ * disk to be processed later. The tuples are spilled in a partitioned
+ * manner, so that subsequent batches are smaller and less likely to exceed
+ * work_mem (if a batch does exceed work_mem, it must be spilled
+ * recursively).
+ *
+ * Spilled data is written to logical tapes. These provide better control
+ * over memory usage, disk space, and the number of files than if we were
+ * to use a BufFile for each spill.
+ *
+ * Note that it's possible for transition states to start small but then
+ * grow very large; for instance in the case of ARRAY_AGG. In such cases,
+ * it's still possible to significantly exceed work_mem. We try to avoid
+ * this situation by estimating what will fit in the available memory, and
+ * imposing a limit on the number of groups separately from the amount of
+ * memory consumed.
+ *
* Transition / Combine function invocation:
*
* For performance reasons transition functions, including combine
@@ -233,12 +256,99 @@
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/datum.h"
+#include "utils/dynahash.h"
#include "utils/expandeddatum.h"
+#include "utils/logtape.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+/*
+ * Control how many partitions are created when spilling HashAgg to
+ * disk.
+ *
+ * HASHAGG_PARTITION_FACTOR is multiplied by the estimated number of
+ * partitions needed such that each partition will fit in memory. The factor
+ * is set higher than one because there's not a high cost to having a few too
+ * many partitions, and it makes it less likely that a partition will need to
+ * be spilled recursively. Another benefit of having more, smaller partitions
+ * is that small hash tables may perform better than large ones due to memory
+ * caching effects.
+ *
+ * We also specify a min and max number of partitions per spill. Too few might
+ * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
+ * many will result in lots of memory wasted buffering the spill files (which
+ * could instead be spent on a larger hash table).
+ *
+ * For reading from tapes, the buffer size must be a multiple of
+ * BLCKSZ. Larger values help when reading from multiple tapes concurrently,
+ * but that doesn't happen in HashAgg, so we simply use BLCKSZ. Writing to a
+ * tape always uses a buffer of size BLCKSZ.
+ */
+#define HASHAGG_PARTITION_FACTOR 1.50
+#define HASHAGG_MIN_PARTITIONS 4
+#define HASHAGG_MAX_PARTITIONS 256
+#define HASHAGG_READ_BUFFER_SIZE BLCKSZ
+#define HASHAGG_WRITE_BUFFER_SIZE BLCKSZ
+
+/*
+ * Track all tapes needed for a HashAgg that spills. We don't know the maximum
+ * number of tapes needed at the start of the algorithm (because it can
+ * recurse), so one tape set is allocated and extended as needed for new
+ * tapes. When a particular tape is already read, rewind it for write mode and
+ * put it in the free list.
+ *
+ * Tapes' buffers can take up substantial memory when many tapes are open at
+ * once. We only need one tape open at a time in read mode (using a buffer
+ * that's a multiple of BLCKSZ); but we need up to HASHAGG_MAX_PARTITIONS
+ * tapes open in write mode (each requiring a buffer of size BLCKSZ).
+ */
+typedef struct HashTapeInfo
+{
+ LogicalTapeSet *tapeset;
+ int ntapes;
+ int *freetapes;
+ int nfreetapes;
+} HashTapeInfo;
+
+/*
+ * Represents partitioned spill data for a single hashtable. Contains the
+ * necessary information to route tuples to the correct partition, and to
+ * transform the spilled data into new batches.
+ *
+ * The high bits are used for partition selection (when recursing, we ignore
+ * the bits that have already been used for partition selection at an earlier
+ * level).
+ */
+typedef struct HashAggSpill
+{
+ HashTapeInfo *tapeinfo; /* borrowed reference to tape info */
+ int npartitions; /* number of partitions */
+ int *partitions; /* spill partition tape numbers */
+ int64 *ntuples; /* number of tuples in each partition */
+ uint32 mask; /* mask to find partition from hash value */
+ int shift; /* after masking, shift by this amount */
+} HashAggSpill;
+
+/*
+ * Represents work to be done for one pass of hash aggregation (with only one
+ * grouping set).
+ *
+ * Also tracks the bits of the hash already used for partition selection by
+ * earlier iterations, so that this batch can use new bits. If all bits have
+ * already been used, no partitioning will be done (any spilled data will go
+ * to a single output tape).
+ */
+typedef struct HashAggBatch
+{
+ int setno; /* grouping set */
+ int used_bits; /* number of bits of hash already used */
+ HashTapeInfo *tapeinfo; /* borrowed reference to tape info */
+ int input_tapenum; /* input partition tape */
+ int64 input_tuples; /* number of tuples in this batch */
+} HashAggBatch;
+
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
static void initialize_phase(AggState *aggstate, int newphase);
static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -263,6 +373,7 @@ static void finalize_partialaggregate(AggState *aggstate,
AggStatePerAgg peragg,
AggStatePerGroup pergroupstate,
Datum *resultVal, bool *resultIsNull);
+static void prepare_hash_slot(AggState *aggstate);
static void prepare_projection_slot(AggState *aggstate,
TupleTableSlot *slot,
int currentSet);
@@ -272,12 +383,40 @@ static void finalize_aggregates(AggState *aggstate,
static TupleTableSlot *project_aggregates(AggState *aggstate);
static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
-static void build_hash_table(AggState *aggstate);
-static TupleHashEntryData *lookup_hash_entry(AggState *aggstate);
+static void build_hash_tables(AggState *aggstate);
+static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
+static void hashagg_recompile_expressions(AggState *aggstate);
+static long hash_choose_num_buckets(AggState *aggstate,
+ long estimated_nbuckets,
+ Size memory);
+static int hash_choose_num_partitions(uint64 input_groups,
+ double hashentrysize,
+ int used_bits,
+ int *log2_npartittions);
+static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
+static bool agg_refill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
+static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
+static void hash_agg_check_limits(AggState *aggstate);
+static void hashagg_finish_initial_spills(AggState *aggstate);
+static void hashagg_reset_spill_state(AggState *aggstate);
+static HashAggBatch *hashagg_batch_new(HashTapeInfo *tapeinfo,
+ int input_tapenum, int setno,
+ int64 input_tuples, int used_bits);
+static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
+static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
+ int used_bits, uint64 input_tuples,
+ double hashentrysize);
+static Size hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot,
+ uint32 hash);
+static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
+ int setno);
+static void hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *dest,
+ int ndest);
+static void hashagg_tapeinfo_release(HashTapeInfo *tapeinfo, int tapenum);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
AggState *aggstate, EState *estate,
@@ -1035,6 +1174,32 @@ finalize_partialaggregate(AggState *aggstate,
MemoryContextSwitchTo(oldContext);
}
+/*
+ * Extract the attributes that make up the grouping key into the
+ * hashslot. This is necessary to compute the hash of the grouping key.
+ */
+static void
+prepare_hash_slot(AggState *aggstate)
+{
+ TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
+ AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
+ TupleTableSlot *hashslot = perhash->hashslot;
+ int i;
+
+ /* transfer just the needed columns into hashslot */
+ slot_getsomeattrs(inputslot, perhash->largestGrpColIdx);
+ ExecClearTuple(hashslot);
+
+ for (i = 0; i < perhash->numhashGrpCols; i++)
+ {
+ int varNumber = perhash->hashGrpColIdxInput[i] - 1;
+
+ hashslot->tts_values[i] = inputslot->tts_values[varNumber];
+ hashslot->tts_isnull[i] = inputslot->tts_isnull[varNumber];
+ }
+ ExecStoreVirtualTuple(hashslot);
+}
+
/*
* Prepare to finalize and project based on the specified representative tuple
* slot and grouping set.
@@ -1233,7 +1398,7 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
}
/*
- * (Re-)initialize the hash table(s) to empty.
+ * Initialize the hash table(s).
*
* To implement hashed aggregation, we need a hashtable that stores a
* representative tuple and an array of AggStatePerGroup structs for each
@@ -1244,44 +1409,84 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
* We have a separate hashtable and associated perhash data structure for each
* grouping set for which we're doing hashing.
*
- * The contents of the hash tables always live in the hashcontext's per-tuple
- * memory context (there is only one of these for all tables together, since
- * they are all reset at the same time).
+ * The hash tables and their contents always live in the hashcontext's
+ * per-tuple memory context (there is only one of these for all tables
+ * together, since they are all reset at the same time).
*/
static void
-build_hash_table(AggState *aggstate)
+build_hash_tables(AggState *aggstate)
{
- MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
- Size additionalsize;
- int i;
-
- Assert(aggstate->aggstrategy == AGG_HASHED || aggstate->aggstrategy == AGG_MIXED);
-
- additionalsize = aggstate->numtrans * sizeof(AggStatePerGroupData);
+ int setno;
- for (i = 0; i < aggstate->num_hashes; ++i)
+ for (setno = 0; setno < aggstate->num_hashes; ++setno)
{
- AggStatePerHash perhash = &aggstate->perhash[i];
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ long nbuckets;
+ Size memory;
Assert(perhash->aggnode->numGroups > 0);
- if (perhash->hashtable)
- ResetTupleHashTable(perhash->hashtable);
- else
- perhash->hashtable = BuildTupleHashTableExt(&aggstate->ss.ps,
- perhash->hashslot->tts_tupleDescriptor,
- perhash->numCols,
- perhash->hashGrpColIdxHash,
- perhash->eqfuncoids,
- perhash->hashfunctions,
- perhash->aggnode->grpCollations,
- perhash->aggnode->numGroups,
- additionalsize,
- aggstate->ss.ps.state->es_query_cxt,
- aggstate->hashcontext->ecxt_per_tuple_memory,
- tmpmem,
- DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
+ memory = aggstate->hash_mem_limit / aggstate->num_hashes;
+
+ /* choose reasonable number of buckets per hashtable */
+ nbuckets = hash_choose_num_buckets(
+ aggstate, perhash->aggnode->numGroups, memory);
+
+ build_hash_table(aggstate, setno, nbuckets);
}
+
+ aggstate->hash_alloc_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_alloc_last = aggstate->hash_alloc_current;
+ aggstate->hash_ngroups_current = 0;
+}
+
+/*
+ * Build a single hashtable for this grouping set. Pass the hash memory
+ * context as both metacxt and tablecxt, so that resetting the hashcontext
+ * will free all memory including metadata. That means that we cannot reset
+ * the hash table to empty and reuse it, though (see execGrouping.c).
+ */
+static void
+build_hash_table(AggState *aggstate, int setno, long nbuckets)
+{
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ MemoryContext metacxt;
+ MemoryContext hashcxt = aggstate->hashcontext->ecxt_per_tuple_memory;
+ MemoryContext tmpcxt = aggstate->tmpcontext->ecxt_per_tuple_memory;
+ Size additionalsize;
+
+ Assert(aggstate->aggstrategy == AGG_HASHED ||
+ aggstate->aggstrategy == AGG_MIXED);
+
+ /*
+ * We don't try to preserve any part of the hash table. Set the metacxt to
+ * hashcxt, which will be reset for each batch.
+ */
+ metacxt = hashcxt;
+
+ /*
+ * Used to make sure initial hash table allocation does not exceed
+ * work_mem. Note that the estimate does not include space for
+ * pass-by-reference transition data values, nor for the representative
+ * tuple of each group.
+ */
+ additionalsize = aggstate->numtrans * sizeof(AggStatePerGroupData);
+
+ perhash->hashtable = BuildTupleHashTableExt(
+ &aggstate->ss.ps,
+ perhash->hashslot->tts_tupleDescriptor,
+ perhash->numCols,
+ perhash->hashGrpColIdxHash,
+ perhash->eqfuncoids,
+ perhash->hashfunctions,
+ perhash->aggnode->grpCollations,
+ nbuckets,
+ additionalsize,
+ metacxt,
+ hashcxt,
+ tmpcxt,
+ DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
}
/*
@@ -1435,6 +1640,233 @@ hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
transitionSpace;
}
+/*
+ * Recompile the expressions for advancing aggregates while hashing. This is
+ * necessary for certain kinds of state changes that affect the resulting
+ * expression. For instance, changing aggstate->hash_ever_spilled or
+ * aggstate->ss.ps.outerops requires recompilation.
+ *
+ * A compiled expression where hash_ever_spilled is true will work even when
+ * hash_spill_mode is false, because it merely introduces additional branches
+ * that are unnecessary when hash_spill_mode is false. That allows us to only
+ * recompile when hash_ever_spilled changes, rather than every time
+ * hash_spill_mode changes.
+ */
+static void
+hashagg_recompile_expressions(AggState *aggstate)
+{
+ AggStatePerPhase phase;
+
+ Assert(aggstate->aggstrategy == AGG_HASHED ||
+ aggstate->aggstrategy == AGG_MIXED);
+
+ if (aggstate->aggstrategy == AGG_HASHED)
+ phase = &aggstate->phases[0];
+ else /* AGG_MIXED */
+ phase = &aggstate->phases[1];
+
+ phase->evaltrans = ExecBuildAggTrans(
+ aggstate, phase,
+ aggstate->aggstrategy == AGG_MIXED ? true : false, /* dosort */
+ true, /* dohash */
+ aggstate->hash_ever_spilled);
+}
+
+/*
+ * Set limits that trigger spilling to avoid exceeding work_mem. Consider the
+ * number of partitions we expect to create (if we do spill).
+ *
+ * There are two limits: a memory limit, and also an ngroups limit. The
+ * ngroups limit becomes important when we expect transition values to grow
+ * substantially larger than the initial value.
+ */
+void
+hash_agg_set_limits(double hashentrysize, uint64 input_groups, int used_bits,
+ Size *mem_limit, long *ngroups_limit, int *num_partitions)
+{
+ int npartitions;
+ Size partition_mem;
+
+ /* no attempt to obey work_mem */
+ if (hashagg_mem_overflow)
+ {
+ *mem_limit = SIZE_MAX;
+ *ngroups_limit = LONG_MAX;
+ return;
+ }
+
+ /* if not expected to spill, use all of work_mem */
+ if (input_groups * hashentrysize < work_mem * 1024L)
+ {
+ *mem_limit = work_mem * 1024L;
+ *ngroups_limit = *mem_limit / hashentrysize;
+ return;
+ }
+
+ /*
+ * Calculate expected memory requirements for spilling, which is the size
+ * of the buffers needed for all the tapes that need to be open at
+ * once. Then, subtract that from the memory available for holding hash
+ * tables.
+ */
+ npartitions = hash_choose_num_partitions(input_groups,
+ hashentrysize,
+ used_bits,
+ NULL);
+ if (num_partitions != NULL)
+ *num_partitions = npartitions;
+
+ partition_mem =
+ HASHAGG_READ_BUFFER_SIZE +
+ HASHAGG_WRITE_BUFFER_SIZE * npartitions;
+
+ /*
+ * Don't set the limit below 3/4 of work_mem. In that case, we are at the
+ * minimum number of partitions, so we aren't going to dramatically exceed
+ * work mem anyway.
+ */
+ if (work_mem * 1024L > 4 * partition_mem)
+ *mem_limit = work_mem * 1024L - partition_mem;
+ else
+ *mem_limit = work_mem * 1024L * 0.75;
+
+ if (*mem_limit > hashentrysize)
+ *ngroups_limit = *mem_limit / hashentrysize;
+ else
+ *ngroups_limit = 1;
+}
+
+/*
+ * hash_agg_check_limits
+ *
+ * After adding a new group to the hash table, check whether we need to enter
+ * spill mode. Allocations may happen without adding new groups (for instance,
+ * if the transition state size grows), so this check is imperfect.
+ *
+ * Memory usage is tracked by how much is allocated to the underlying memory
+ * context, not individual chunks. This is more accurate because it accounts
+ * for all memory in the context, and also accounts for fragmentation and
+ * other forms of overhead and waste that can be difficult to estimate. It's
+ * also cheaper because we don't have to track each chunk.
+ *
+ * When memory is first allocated to a memory context, it is not actually
+ * used. So when the next allocation happens, we consider the
+ * previously-allocated amount to be the memory currently used.
+ */
+static void
+hash_agg_check_limits(AggState *aggstate)
+{
+ Size allocation;
+
+ /*
+ * Even if already in spill mode, it's possible for memory usage to grow,
+ * and we should still track it for the purposes of EXPLAIN ANALYZE.
+ */
+ allocation = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+
+ /* has allocation grown since the last observation? */
+ if (allocation > aggstate->hash_alloc_current)
+ {
+ aggstate->hash_alloc_last = aggstate->hash_alloc_current;
+ aggstate->hash_alloc_current = allocation;
+ }
+
+ if (aggstate->hash_alloc_last > aggstate->hash_mem_peak)
+ aggstate->hash_mem_peak = aggstate->hash_alloc_last;
+
+ /*
+ * Don't spill unless there's at least one group in the hash table so we
+ * can be sure to make progress even in edge cases.
+ */
+ if (aggstate->hash_ngroups_current > 0 &&
+ (aggstate->hash_alloc_last > aggstate->hash_mem_limit ||
+ aggstate->hash_ngroups_current > aggstate->hash_ngroups_limit))
+ {
+ aggstate->hash_spill_mode = true;
+
+ if (!aggstate->hash_ever_spilled)
+ {
+ aggstate->hash_ever_spilled = true;
+ aggstate->hash_spills = palloc0(
+ sizeof(HashAggSpill) * aggstate->num_hashes);
+ aggstate->hash_tapeinfo = palloc0(sizeof(HashTapeInfo));
+ hashagg_recompile_expressions(aggstate);
+ }
+ }
+}
+
+/*
+ * Choose a reasonable number of buckets for the initial hash table size.
+ */
+static long
+hash_choose_num_buckets(AggState *aggstate, long ngroups, Size memory)
+{
+ long max_nbuckets;
+
+ max_nbuckets = memory / aggstate->hashentrysize;
+
+ /*
+ * Leave room for slop to avoid a case where the initial hash table size
+ * exceeds the memory limit (though that may still happen in edge cases).
+ */
+ max_nbuckets *= 0.75;
+
+ return ngroups > max_nbuckets ? max_nbuckets : ngroups;
+}
+
+/*
+ * Determine the number of partitions to create when spilling, which will
+ * always be a power of two. If log2_npartitions is non-NULL, set
+ * *log2_npartitions to the log2() of the number of partitions.
+ */
+static int
+hash_choose_num_partitions(uint64 input_groups, double hashentrysize,
+ int used_bits, int *log2_npartitions)
+{
+ Size mem_wanted;
+ int partition_limit;
+ int npartitions;
+ int partition_bits;
+
+ /*
+ * Avoid creating so many partitions that the memory requirements of the
+ * open partition files are greater than 1/4 of work_mem.
+ */
+ partition_limit =
+ (work_mem * 1024L * 0.25 - HASHAGG_READ_BUFFER_SIZE) /
+ HASHAGG_WRITE_BUFFER_SIZE;
+
+ /* pessimistically estimate that each input tuple creates a new group */
+ mem_wanted = HASHAGG_PARTITION_FACTOR * input_groups * hashentrysize;
+
+ /* make enough partitions so that each one is likely to fit in memory */
+ npartitions = 1 + (mem_wanted / (work_mem * 1024L));
+
+ if (npartitions > partition_limit)
+ npartitions = partition_limit;
+
+ if (npartitions < HASHAGG_MIN_PARTITIONS)
+ npartitions = HASHAGG_MIN_PARTITIONS;
+ if (npartitions > HASHAGG_MAX_PARTITIONS)
+ npartitions = HASHAGG_MAX_PARTITIONS;
+
+ /* ceil(log2(npartitions)) */
+ partition_bits = my_log2(npartitions);
+
+ /* make sure that we don't exhaust the hash bits */
+ if (partition_bits + used_bits >= 32)
+ partition_bits = 32 - used_bits;
+
+ if (log2_npartitions != NULL)
+ *log2_npartitions = partition_bits;
+
+ /* number of partitions will be a power of two */
+ npartitions = 1L << partition_bits;
+
+ return npartitions;
+}
+
/*
* Find or create a hashtable entry for the tuple group containing the current
* tuple (already set in tmpcontext's outertuple slot), in the current grouping
@@ -1442,37 +1874,39 @@ hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
* depends on this).
*
* When called, CurrentMemoryContext should be the per-query context.
+ *
+ * If the hash table is at the memory limit, then only find existing hashtable
+ * entries; don't create new ones. If a tuple's group is not already present
+ * in the hash table for the current grouping set, return NULL and the caller
+ * will spill it to disk.
*/
-static TupleHashEntryData *
-lookup_hash_entry(AggState *aggstate)
+static AggStatePerGroup
+lookup_hash_entry(AggState *aggstate, uint32 hash)
{
- TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
TupleTableSlot *hashslot = perhash->hashslot;
TupleHashEntryData *entry;
- bool isnew;
- int i;
+ bool isnew = false;
+ bool *p_isnew;
- /* transfer just the needed columns into hashslot */
- slot_getsomeattrs(inputslot, perhash->largestGrpColIdx);
- ExecClearTuple(hashslot);
-
- for (i = 0; i < perhash->numhashGrpCols; i++)
- {
- int varNumber = perhash->hashGrpColIdxInput[i] - 1;
-
- hashslot->tts_values[i] = inputslot->tts_values[varNumber];
- hashslot->tts_isnull[i] = inputslot->tts_isnull[varNumber];
- }
- ExecStoreVirtualTuple(hashslot);
+ /* if hash table already spilled, don't create new entries */
+ p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;
/* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntry(perhash->hashtable, hashslot, &isnew);
+ entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, p_isnew,
+ hash);
+
+ if (entry == NULL)
+ return NULL;
if (isnew)
{
- AggStatePerGroup pergroup;
- int transno;
+ AggStatePerGroup pergroup;
+ int transno;
+
+ aggstate->hash_ngroups_current++;
+ if (!hashagg_mem_overflow)
+ hash_agg_check_limits(aggstate);
pergroup = (AggStatePerGroup)
MemoryContextAlloc(perhash->hashtable->tablecxt,
@@ -1492,7 +1926,7 @@ lookup_hash_entry(AggState *aggstate)
}
}
- return entry;
+ return entry->additional;
}
/*
@@ -1500,18 +1934,51 @@ lookup_hash_entry(AggState *aggstate)
* returning an array of pergroup pointers suitable for advance_aggregates.
*
* Be aware that lookup_hash_entry can reset the tmpcontext.
+ *
+ * Some entries may be left NULL if we have reached the limit and have begun
+ * to spill. The same tuple will belong to different groups for each set, so
+ * may match a group already in memory for one set and match a group not in
+ * memory for another set. If we have begun to spill and a tuple doesn't match
+ * a group in memory for a particular set, it will be spilled.
+ *
+ * NB: It's possible to spill the same tuple for several different grouping
+ * sets. This may seem wasteful, but it's actually a trade-off: if we spill
+ * the tuple multiple times for multiple grouping sets, it can be partitioned
+ * for each grouping set, making the refilling of the hash table very
+ * efficient.
*/
static void
lookup_hash_entries(AggState *aggstate)
{
- int numHashes = aggstate->num_hashes;
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
int setno;
- for (setno = 0; setno < numHashes; setno++)
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
{
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ uint32 hash;
+
select_current_set(aggstate, setno, true);
- pergroup[setno] = lookup_hash_entry(aggstate)->additional;
+ prepare_hash_slot(aggstate);
+ hash = TupleHashTableHash(perhash->hashtable, perhash->hashslot);
+ pergroup[setno] = lookup_hash_entry(aggstate, hash);
+
+ /* check to see if we need to spill the tuple for this grouping set */
+ if (pergroup[setno] == NULL)
+ {
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ TupleTableSlot *slot = aggstate->tmpcontext->ecxt_outertuple;
+
+ if (spill->partitions == NULL)
+ hashagg_spill_init(spill, aggstate->hash_tapeinfo, 0,
+ perhash->aggnode->numGroups,
+ aggstate->hashentrysize);
+
+ hashagg_spill_tuple(spill, slot, hash);
+
+ aggstate->hash_disk_used = LogicalTapeSetBlocks(
+ aggstate->hash_tapeinfo->tapeset) * (BLCKSZ / 1024);
+ }
}
}
@@ -1834,6 +2301,12 @@ agg_retrieve_direct(AggState *aggstate)
if (TupIsNull(outerslot))
{
/* no more outer-plan tuples available */
+
+ /* if we built hash tables, finalize any spills */
+ if (aggstate->aggstrategy == AGG_MIXED &&
+ aggstate->current_phase == 1)
+ hashagg_finish_initial_spills(aggstate);
+
if (hasGroupingSets)
{
aggstate->input_done = true;
@@ -1936,6 +2409,9 @@ agg_fill_hash_table(AggState *aggstate)
ResetExprContext(aggstate->tmpcontext);
}
+ /* finalize spills, if any */
+ hashagg_finish_initial_spills(aggstate);
+
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
@@ -1943,11 +2419,193 @@ agg_fill_hash_table(AggState *aggstate)
&aggstate->perhash[0].hashiter);
}
+/*
+ * If any data was spilled during hash aggregation, reset the hash table and
+ * reprocess one batch of spilled data. After reprocessing a batch, the hash
+ * table will again contain data, ready to be consumed by
+ * agg_retrieve_hash_table_in_memory().
+ *
+ * Should only be called after all in memory hash table entries have been
+ * consumed.
+ *
+ * Return false when input is exhausted and there's no more work to be done;
+ * otherwise return true.
+ */
+static bool
+agg_refill_hash_table(AggState *aggstate)
+{
+ HashAggBatch *batch;
+ HashAggSpill spill;
+ HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
+ long nbuckets;
+
+ if (aggstate->hash_batches == NIL)
+ return false;
+
+ spill.npartitions = 0;
+ spill.partitions = NULL;
+ /*
+ * Each spill file contains spilled data for only a single grouping
+ * set. We want to ignore all others, which is done by setting the other
+ * pergroups to NULL.
+ */
+ memset(aggstate->all_pergroups, 0,
+ sizeof(AggStatePerGroup) *
+ (aggstate->maxsets + aggstate->num_hashes));
+
+ batch = linitial(aggstate->hash_batches);
+ aggstate->hash_batches = list_delete_first(aggstate->hash_batches);
+
+ hash_agg_set_limits(aggstate->hashentrysize, batch->input_tuples,
+ batch->used_bits, &aggstate->hash_mem_limit,
+ &aggstate->hash_ngroups_limit, NULL);
+
+ /*
+ * Free memory and rebuild a single hash table for this batch's grouping
+ * set. Estimate the number of groups to be the number of input tuples in
+ * this batch.
+ */
+ ReScanExprContext(aggstate->hashcontext);
+
+ nbuckets = hash_choose_num_buckets(
+ aggstate, batch->input_tuples, aggstate->hash_mem_limit);
+ build_hash_table(aggstate, batch->setno, nbuckets);
+ aggstate->hash_alloc_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_alloc_last = aggstate->hash_alloc_current;
+ aggstate->hash_ngroups_current = 0;
+
+ Assert(aggstate->current_phase == 0);
+
+ if (aggstate->phase->aggstrategy == AGG_MIXED)
+ {
+ aggstate->current_phase = 1;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+ }
+
+ /*
+ * The first pass (agg_fill_hash_table) reads whatever kind of slot comes
+ * from the outer plan, and considers the slot fixed. But spilled tuples
+ * are always MinimalTuples, so if that's different from the outer plan we
+ * need to change it and recompile the aggregate expressions.
+ */
+ if (aggstate->ss.ps.outerops != &TTSOpsMinimalTuple)
+ {
+ aggstate->ss.ps.outerops = &TTSOpsMinimalTuple;
+ hashagg_recompile_expressions(aggstate);
+ }
+
+ LogicalTapeRewindForRead(tapeinfo->tapeset, batch->input_tapenum,
+ HASHAGG_READ_BUFFER_SIZE);
+ for (;;) {
+ TupleTableSlot *slot = aggstate->hash_spill_slot;
+ MinimalTuple tuple;
+ uint32 hash;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tuple = hashagg_batch_read(batch, &hash);
+ if (tuple == NULL)
+ break;
+
+ ExecStoreMinimalTuple(tuple, slot, true);
+ aggstate->tmpcontext->ecxt_outertuple = slot;
+
+ select_current_set(aggstate, batch->setno, true);
+ prepare_hash_slot(aggstate);
+ aggstate->hash_pergroup[batch->setno] = lookup_hash_entry(aggstate, hash);
+
+ /* if there's no memory for a new group, spill */
+ if (aggstate->hash_pergroup[batch->setno] == NULL)
+ {
+ /*
+ * Estimate the number of groups for this batch as the total
+ * number of tuples in its input file. Although that's a worst
+ * case, it's not bad here for two reasons: (1) overestimating
+ * is better than underestimating; and (2) we've already
+ * scanned the relation once, so it's likely that we've
+ * already finalized many of the common values.
+ */
+ if (spill.partitions == NULL)
+ hashagg_spill_init(&spill, tapeinfo, batch->used_bits,
+ batch->input_tuples,
+ aggstate->hashentrysize);
+
+ hashagg_spill_tuple(&spill, slot, hash);
+
+ aggstate->hash_disk_used = LogicalTapeSetBlocks(
+ aggstate->hash_tapeinfo->tapeset) * (BLCKSZ / 1024);
+ }
+
+ /* Advance the aggregates (or combine functions) */
+ advance_aggregates(aggstate);
+
+ /*
+ * Reset per-input-tuple context after each tuple, but note that the
+ * hash lookups do this too
+ */
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ hashagg_tapeinfo_release(tapeinfo, batch->input_tapenum);
+
+ aggstate->current_phase = 0;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+
+ /* update hashentrysize estimate based on contents */
+ if (aggstate->hash_ngroups_current > 0)
+ {
+ aggstate->hashentrysize = (double)aggstate->hash_alloc_last /
+ (double)aggstate->hash_ngroups_current;
+ }
+
+ hashagg_spill_finish(aggstate, &spill, batch->setno);
+ aggstate->hash_spill_mode = false;
+
+ pfree(batch);
+
+ /* Initialize to walk the first hash table */
+ select_current_set(aggstate, 0, true);
+ ResetTupleHashIterator(aggstate->perhash[0].hashtable,
+ &aggstate->perhash[0].hashiter);
+
+ return true;
+}
+
/*
* ExecAgg for hashed case: retrieving groups from hash table
+ *
+ * After exhausting in-memory tuples, also try refilling the hash table using
+ * previously-spilled tuples. Only returns NULL after all in-memory and
+ * spilled tuples are exhausted.
*/
static TupleTableSlot *
agg_retrieve_hash_table(AggState *aggstate)
+{
+ TupleTableSlot *result = NULL;
+
+ while (result == NULL)
+ {
+ result = agg_retrieve_hash_table_in_memory(aggstate);
+ if (result == NULL)
+ {
+ if (!agg_refill_hash_table(aggstate))
+ {
+ aggstate->agg_done = true;
+ break;
+ }
+ }
+ }
+
+ return result;
+}
+
+/*
+ * Retrieve the groups from the in-memory hash tables without considering any
+ * spilled tuples.
+ */
+static TupleTableSlot *
+agg_retrieve_hash_table_in_memory(AggState *aggstate)
{
ExprContext *econtext;
AggStatePerAgg peragg;
@@ -1976,7 +2634,7 @@ agg_retrieve_hash_table(AggState *aggstate)
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
- while (!aggstate->agg_done)
+ for (;;)
{
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -2007,8 +2665,6 @@ agg_retrieve_hash_table(AggState *aggstate)
}
else
{
- /* No more hashtables, so done */
- aggstate->agg_done = true;
return NULL;
}
}
@@ -2065,6 +2721,296 @@ agg_retrieve_hash_table(AggState *aggstate)
return NULL;
}
+/*
+ * Assign unused tapes to spill partitions, extending the tape set if
+ * necessary.
+ */
+static void
+hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *partitions,
+ int npartitions)
+{
+ int partidx = 0;
+
+ /* use free tapes if available */
+ while (partidx < npartitions && tapeinfo->nfreetapes > 0)
+ partitions[partidx++] = tapeinfo->freetapes[--tapeinfo->nfreetapes];
+
+ if (tapeinfo->tapeset == NULL)
+ tapeinfo->tapeset = LogicalTapeSetCreate(npartitions, NULL, NULL, -1);
+ else if (partidx < npartitions)
+ {
+ tapeinfo->tapeset = LogicalTapeSetExtend(
+ tapeinfo->tapeset, npartitions - partidx);
+ }
+
+ while (partidx < npartitions)
+ partitions[partidx++] = tapeinfo->ntapes++;
+}
+
+/*
+ * After a tape has already been written to and then read, this function
+ * rewinds it for writing and adds it to the free list.
+ */
+static void
+hashagg_tapeinfo_release(HashTapeInfo *tapeinfo, int tapenum)
+{
+ LogicalTapeRewindForWrite(tapeinfo->tapeset, tapenum);
+ if (tapeinfo->freetapes == NULL)
+ tapeinfo->freetapes = palloc(sizeof(int));
+ else
+ tapeinfo->freetapes = repalloc(
+ tapeinfo->freetapes, sizeof(int) * (tapeinfo->nfreetapes + 1));
+ tapeinfo->freetapes[tapeinfo->nfreetapes++] = tapenum;
+}
+
+/*
+ * hashagg_spill_init
+ *
+ * Called after we determined that spilling is necessary. Chooses the number
+ * of partitions to create, and initializes them.
+ */
+static void
+hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo, int used_bits,
+ uint64 input_groups, double hashentrysize)
+{
+ int npartitions;
+ int partition_bits;
+
+ npartitions = hash_choose_num_partitions(
+ input_groups, hashentrysize, used_bits, &partition_bits);
+
+ spill->partitions = palloc0(sizeof(int) * npartitions);
+ spill->ntuples = palloc0(sizeof(int64) * npartitions);
+
+ hashagg_tapeinfo_assign(tapeinfo, spill->partitions, npartitions);
+
+ spill->tapeinfo = tapeinfo;
+ spill->shift = 32 - used_bits - partition_bits;
+ spill->mask = (npartitions - 1) << spill->shift;
+ spill->npartitions = npartitions;
+}
+
+/*
+ * hashagg_spill_tuple
+ *
+ * No room for new groups in the hash table. Save for later in the appropriate
+ * partition.
+ */
+static Size
+hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot, uint32 hash)
+{
+ LogicalTapeSet *tapeset = spill->tapeinfo->tapeset;
+ int partition;
+ MinimalTuple tuple;
+ int tapenum;
+ Size total_written = 0;
+ bool shouldFree;
+
+ Assert(spill->partitions != NULL);
+
+ /* may contain unnecessary attributes, consider projecting? */
+ tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+ partition = (hash & spill->mask) >> spill->shift;
+ spill->ntuples[partition]++;
+
+ tapenum = spill->partitions[partition];
+
+ LogicalTapeWrite(tapeset, tapenum, (void *) &hash, sizeof(uint32));
+ total_written += sizeof(uint32);
+
+ LogicalTapeWrite(tapeset, tapenum, (void *) tuple, tuple->t_len);
+ total_written += tuple->t_len;
+
+ if (shouldFree)
+ pfree(tuple);
+
+ return total_written;
+}
+
+/*
+ * hashagg_batch_new
+ *
+ * Construct a HashAggBatch item, which represents one iteration of HashAgg to
+ * be done. Should be called in the aggregate's memory context.
+ */
+static HashAggBatch *
+hashagg_batch_new(HashTapeInfo *tapeinfo, int tapenum, int setno,
+ int64 input_tuples, int used_bits)
+{
+ HashAggBatch *batch = palloc0(sizeof(HashAggBatch));
+
+ batch->setno = setno;
+ batch->used_bits = used_bits;
+ batch->tapeinfo = tapeinfo;
+ batch->input_tapenum = tapenum;
+ batch->input_tuples = input_tuples;
+
+ return batch;
+}
+
+/*
+ * hashagg_batch_read
+ * read the next tuple from a batch file. Return NULL if no more.
+ */
+static MinimalTuple
+hashagg_batch_read(HashAggBatch *batch, uint32 *hashp)
+{
+ LogicalTapeSet *tapeset = batch->tapeinfo->tapeset;
+ int tapenum = batch->input_tapenum;
+ MinimalTuple tuple;
+ uint32 t_len;
+ size_t nread;
+ uint32 hash;
+
+ nread = LogicalTapeRead(tapeset, tapenum, &hash, sizeof(uint32));
+ if (nread == 0)
+ return NULL;
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, sizeof(uint32), nread)));
+ if (hashp != NULL)
+ *hashp = hash;
+
+ nread = LogicalTapeRead(tapeset, tapenum, &t_len, sizeof(t_len));
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, sizeof(uint32), nread)));
+
+ tuple = (MinimalTuple) palloc(t_len);
+ tuple->t_len = t_len;
+
+ nread = LogicalTapeRead(tapeset, tapenum,
+ (void *)((char *)tuple + sizeof(uint32)),
+ t_len - sizeof(uint32));
+ if (nread != t_len - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, t_len - sizeof(uint32), nread)));
+
+ return tuple;
+}
+
+/*
+ * hashagg_finish_initial_spills
+ *
+ * After a HashAggBatch has been processed, it may have spilled tuples to
+ * disk. If so, turn the spilled partitions into new batches that must later
+ * be executed.
+ */
+static void
+hashagg_finish_initial_spills(AggState *aggstate)
+{
+ int setno;
+
+ if (aggstate->hash_spills == NULL)
+ return;
+
+ /* update hashentrysize estimate based on contents */
+ Assert(aggstate->hash_ngroups_current > 0);
+ aggstate->hashentrysize = (double)aggstate->hash_alloc_last /
+ (double)aggstate->hash_ngroups_current;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hashagg_spill_finish(aggstate, &aggstate->hash_spills[setno], setno);
+
+ aggstate->hash_spill_mode = false;
+
+ /*
+ * We're not processing tuples from outer plan any more; only processing
+ * batches of spilled tuples. The initial spill structures are no longer
+ * needed.
+ */
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+}
+
+/*
+ * hashagg_spill_finish
+ *
+ * Transform spill partitions into new batches.
+ */
+static void
+hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno)
+{
+ int i;
+ int used_bits = 32 - spill->shift;
+
+ if (spill->npartitions == 0)
+ return; /* didn't spill */
+
+ for (i = 0; i < spill->npartitions; i++)
+ {
+ int tapenum = spill->partitions[i];
+ MemoryContext oldContext;
+ HashAggBatch *new_batch;
+
+ oldContext = MemoryContextSwitchTo(aggstate->ss.ps.state->es_query_cxt);
+ new_batch = hashagg_batch_new(aggstate->hash_tapeinfo,
+ tapenum, setno, spill->ntuples[i],
+ used_bits);
+ aggstate->hash_batches = lcons(new_batch, aggstate->hash_batches);
+ aggstate->hash_batches_used++;
+ MemoryContextSwitchTo(oldContext);
+ }
+
+ pfree(spill->ntuples);
+ pfree(spill->partitions);
+}
+
+/*
+ * Free resources related to a spilled HashAgg.
+ */
+static void
+hashagg_reset_spill_state(AggState *aggstate)
+{
+ ListCell *lc;
+
+ /* free spills from initial pass */
+ if (aggstate->hash_spills != NULL)
+ {
+ int setno;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ {
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ if (spill->ntuples != NULL)
+ pfree(spill->ntuples);
+ if (spill->partitions != NULL)
+ pfree(spill->partitions);
+ }
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+ }
+
+ /* free batches */
+ foreach(lc, aggstate->hash_batches)
+ {
+ HashAggBatch *batch = (HashAggBatch*) lfirst(lc);
+ pfree(batch);
+ }
+ list_free(aggstate->hash_batches);
+ aggstate->hash_batches = NIL;
+
+ /* close tape set */
+ if (aggstate->hash_tapeinfo != NULL)
+ {
+ HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
+ if (tapeinfo->tapeset != NULL)
+ LogicalTapeSetClose(tapeinfo->tapeset);
+ if (tapeinfo->freetapes != NULL)
+ pfree(tapeinfo->freetapes);
+ pfree(tapeinfo);
+ aggstate->hash_tapeinfo = NULL;
+ }
+}
+
+
/* -----------------
* ExecInitAgg
*
@@ -2249,6 +3195,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->ss.ps.outeropsfixed = false;
}
+ if (use_hashing)
+ aggstate->hash_spill_slot = ExecInitExtraTupleSlot(estate, scanDesc,
+ &TTSOpsMinimalTuple);
+
/*
* Initialize result type, slot and projection.
*/
@@ -2474,11 +3424,24 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
*/
if (use_hashing)
{
+ Plan *outerplan = outerPlan(node);
+ long totalGroups = 0;
+ int i;
+
/* this is an array of pointers, not structures */
aggstate->hash_pergroup = pergroups;
+ aggstate->hashentrysize = hash_agg_entry_size(
+ aggstate->numtrans, outerplan->plan_width, node->transitionSpace);
+
+ for (i = 0; i < aggstate->num_hashes; i++)
+ totalGroups += aggstate->perhash[i].aggnode->numGroups;
+
+ hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
+ &aggstate->hash_mem_limit,
+ &aggstate->hash_ngroups_limit, NULL);
find_hash_columns(aggstate);
- build_hash_table(aggstate);
+ build_hash_tables(aggstate);
aggstate->table_filled = false;
}
@@ -2884,7 +3847,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
else
Assert(false);
- phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash);
+ phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash,
+ false);
}
@@ -3379,6 +4343,8 @@ ExecEndAgg(AggState *node)
if (node->sort_out)
tuplesort_end(node->sort_out);
+ hashagg_reset_spill_state(node);
+
for (transno = 0; transno < node->numtrans; transno++)
{
AggStatePerTrans pertrans = &node->pertrans[transno];
@@ -3434,12 +4400,13 @@ ExecReScanAgg(AggState *node)
return;
/*
- * If we do have the hash table, and the subplan does not have any
- * parameter changes, and none of our own parameter changes affect
- * input expressions of the aggregated functions, then we can just
- * rescan the existing hash table; no need to build it again.
+ * If we do have the hash table, and it never spilled, and the subplan
+ * does not have any parameter changes, and none of our own parameter
+ * changes affect input expressions of the aggregated functions, then
+ * we can just rescan the existing hash table; no need to build it
+ * again.
*/
- if (outerPlan->chgParam == NULL &&
+ if (outerPlan->chgParam == NULL && !node->hash_ever_spilled &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
ResetTupleHashIterator(node->perhash[0].hashtable,
@@ -3496,11 +4463,33 @@ ExecReScanAgg(AggState *node)
*/
if (node->aggstrategy == AGG_HASHED || node->aggstrategy == AGG_MIXED)
{
+ const TupleTableSlotOps *outerops = ExecGetResultSlotOps(
+ outerPlanState(&node->ss), &node->ss.ps.outeropsfixed);
+
+ hashagg_reset_spill_state(node);
+
+ node->hash_ever_spilled = false;
+ node->hash_spill_mode = false;
+ node->hash_alloc_last = 0;
+ node->hash_alloc_current = 0;
+ node->hash_ngroups_current = 0;
+
+ /* reset stats */
+ node->hash_mem_peak = 0;
+ node->hash_disk_used = 0;
+ node->hash_batches_used = 0;
+
ReScanExprContext(node->hashcontext);
/* Rebuild an empty hash table */
- build_hash_table(node);
+ build_hash_tables(node);
node->table_filled = false;
/* iterator will be reset when the table is filled */
+
+ if (node->ss.ps.outerops != outerops)
+ {
+ node->ss.ps.outerops = outerops;
+ hashagg_recompile_expressions(node);
+ }
}
if (node->aggstrategy != AGG_HASHED)
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index cea0d6fa5ce..7246fc2b33f 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -2047,6 +2047,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_INIT_TRANS:
+ case EEOP_AGG_INIT_TRANS_SPILLED:
{
AggStatePerTrans pertrans;
@@ -2056,6 +2057,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_allpergroupsp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_setoff,
v_transno;
@@ -2082,11 +2084,32 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_init_trans.setoff);
v_transno = l_int32_const(op->d.agg_init_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_notransvalue = l_bb_before_v(
+ opblocks[opno + 1], "op.%d.check_notransvalue", opno);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(
+ b, v_pergroup_allaggs, TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[opno + 1],
+ b_check_notransvalue);
+
+ LLVMPositionBuilderAtEnd(b, b_check_notransvalue);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_notransvalue =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_NOTRANSVALUE,
@@ -2143,6 +2166,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_STRICT_TRANS_CHECK:
+ case EEOP_AGG_STRICT_TRANS_CHECK_SPILLED:
{
LLVMValueRef v_setoff,
v_transno;
@@ -2152,6 +2176,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_transnull;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
int jumpnull = op->d.agg_strict_trans_check.jumpnull;
@@ -2171,11 +2196,32 @@ llvm_compile_expr(ExprState *state)
l_int32_const(op->d.agg_strict_trans_check.setoff);
v_transno =
l_int32_const(op->d.agg_strict_trans_check.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_transnull = l_bb_before_v(
+ opblocks[opno + 1], "op.%d.check_transnull", opno);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[jumpnull],
+ b_check_transnull);
+
+ LLVMPositionBuilderAtEnd(b, b_check_transnull);
+ }
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_transnull =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_TRANSVALUEISNULL,
@@ -2191,7 +2237,9 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_PLAIN_TRANS_BYVAL:
+ case EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED:
case EEOP_AGG_PLAIN_TRANS:
+ case EEOP_AGG_PLAIN_TRANS_SPILLED:
{
AggState *aggstate;
AggStatePerTrans pertrans;
@@ -2217,6 +2265,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_pertransp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_retval;
@@ -2244,10 +2293,33 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_trans.setoff);
v_transno = l_int32_const(op->d.agg_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED ||
+ opcode == EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_advance_transval = l_bb_before_v(
+ opblocks[opno + 1], "op.%d.advance_transval", opno);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[opno + 1],
+ b_advance_transval);
+
+ LLVMPositionBuilderAtEnd(b, b_advance_transval);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_fcinfo = l_ptr_const(fcinfo,
l_ptr(StructFunctionCallInfoData));
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b5a0033721f..8d58780bf6a 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -77,6 +77,7 @@
#include "access/htup_details.h"
#include "access/tsmapi.h"
#include "executor/executor.h"
+#include "executor/nodeAgg.h"
#include "executor/nodeHash.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -128,6 +129,7 @@ bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
bool enable_hashagg = true;
+bool enable_hashagg_spill = true;
bool enable_nestloop = true;
bool enable_material = true;
bool enable_mergejoin = true;
@@ -2153,7 +2155,7 @@ cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples)
+ double input_tuples, double input_width)
{
double output_tuples;
Cost startup_cost;
@@ -2219,21 +2221,88 @@ cost_agg(Path *path, PlannerInfo *root,
total_cost += aggcosts->finalCost.per_tuple * numGroups;
total_cost += cpu_tuple_cost * numGroups;
output_tuples = numGroups;
+
+ /*
+ * We don't need to compute the disk costs of hash aggregation here,
+ * because the planner does not choose hash aggregation for grouping
+ * sets that it doesn't expect to fit in memory.
+ */
}
else
{
+ double pages_written = 0.0;
+ double pages_read = 0.0;
+ double hashentrysize;
+ double nbatches;
+ Size mem_limit;
+ long ngroups_limit;
+ int num_partitions;
+
/* must be AGG_HASHED */
startup_cost = input_total_cost;
if (!enable_hashagg)
startup_cost += disable_cost;
startup_cost += aggcosts->transCost.startup;
startup_cost += aggcosts->transCost.per_tuple * input_tuples;
+ /* cost of computing hash value */
startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;
startup_cost += aggcosts->finalCost.startup;
+
total_cost = startup_cost;
total_cost += aggcosts->finalCost.per_tuple * numGroups;
+ /* cost of retrieving from hash table */
total_cost += cpu_tuple_cost * numGroups;
output_tuples = numGroups;
+
+ /*
+ * Estimate number of batches based on the computed limits. If less
+ * than or equal to one, all groups are expected to fit in memory;
+ * otherwise we expect to spill.
+ */
+ hashentrysize = hash_agg_entry_size(
+ aggcosts->numAggs, input_width, aggcosts->transitionSpace);
+ hash_agg_set_limits(hashentrysize, numGroups, 0, &mem_limit,
+ &ngroups_limit, &num_partitions);
+
+ nbatches = Max((numGroups * hashentrysize) / mem_limit,
+ numGroups / ngroups_limit);
+
+ /*
+ * Estimate number of pages read and written. For each level of
+ * recursion, a tuple must be written and then later read.
+ */
+ if (!hashagg_mem_overflow && nbatches > 1.0)
+ {
+ double depth;
+ double pages;
+
+ pages = relation_byte_size(input_tuples, input_width) / BLCKSZ;
+
+ /*
+ * The number of partitions can change at different levels of
+ * recursion; but for the purposes of this calculation assume it
+ * stays constant.
+ */
+ depth = ceil( log(nbatches - 1) / log(num_partitions) );
+ pages_written = pages_read = pages * depth;
+ }
+
+ /*
+ * Add the disk costs of hash aggregation that spills to disk.
+ *
+ * Groups that go into the hash table stay in memory until finalized,
+ * so spilling and reprocessing tuples doesn't incur additional
+ * invocations of transCost or finalCost. Furthermore, the computed
+ * hash value is stored with the spilled tuples, so we don't incur
+ * extra invocations of the hash function.
+ *
+ * Hash Agg begins returning tuples after the first batch is
+ * complete. Accrue writes (spilled tuples) to startup_cost and to
+ * total_cost; accrue reads only to total_cost.
+ */
+ startup_cost += pages_written * random_page_cost;
+ total_cost += pages_written * random_page_cost;
+ total_cost += pages_read * seq_page_cost;
}
/*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e048d200bb4..090919e39a0 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1644,6 +1644,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
NIL,
NIL,
best_path->path.rows,
+ 0,
subplan);
}
else
@@ -2096,6 +2097,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
NIL,
NIL,
best_path->numGroups,
+ best_path->transitionSpace,
subplan);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -2257,6 +2259,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
NIL,
rollup->numGroups,
+ best_path->transitionSpace,
sort_plan);
/*
@@ -2295,6 +2298,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
chain,
rollup->numGroups,
+ best_path->transitionSpace,
subplan);
/* Copy cost data from Path to Plan */
@@ -6192,8 +6196,8 @@ Agg *
make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree)
+ List *groupingSets, List *chain, double dNumGroups,
+ int32 transitionSpace, Plan *lefttree)
{
Agg *node = makeNode(Agg);
Plan *plan = &node->plan;
@@ -6209,6 +6213,7 @@ make_agg(List *tlist, List *qual,
node->grpOperators = grpOperators;
node->grpCollations = grpCollations;
node->numGroups = numGroups;
+ node->transitionSpace = transitionSpace;
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
node->groupingSets = groupingSets;
node->chain = chain;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b44efd6314c..913ad9335e5 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6528,7 +6528,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
* were unable to sort above, then we'd better generate a Path, so
* that we at least have one.
*/
- if (hashaggtablesize < work_mem * 1024L ||
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L ||
grouped_rel->pathlist == NIL)
{
/*
@@ -6561,7 +6562,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
agg_final_costs,
dNumGroups);
- if (hashaggtablesize < work_mem * 1024L)
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6830,7 +6832,7 @@ create_partial_grouping_paths(PlannerInfo *root,
* Tentatively produce a partial HashAgg Path, depending on if it
* looks as if the hash table will fit in work_mem.
*/
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_total_path != NULL)
{
add_path(partially_grouped_rel, (Path *)
@@ -6857,7 +6859,7 @@ create_partial_grouping_paths(PlannerInfo *root,
dNumPartialPartialGroups);
/* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_partial_path != NULL)
{
add_partial_path(partially_grouped_rel, (Path *)
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 1a23e18970d..951aed80e7a 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1072,7 +1072,7 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
numGroupCols, dNumGroups,
NIL,
input_path->startup_cost, input_path->total_cost,
- input_path->rows);
+ input_path->rows, input_path->pathtarget->width);
/*
* Now for the sorted case. Note that the input is *always* unsorted,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e6d08aede56..8ba8122ee2f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1704,7 +1704,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
NIL,
subpath->startup_cost,
subpath->total_cost,
- rel->rows);
+ rel->rows,
+ subpath->pathtarget->width);
}
if (sjinfo->semi_can_btree && sjinfo->semi_can_hash)
@@ -2949,6 +2950,7 @@ create_agg_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->aggsplit = aggsplit;
pathnode->numGroups = numGroups;
+ pathnode->transitionSpace = aggcosts ? aggcosts->transitionSpace : 0;
pathnode->groupClause = groupClause;
pathnode->qual = qual;
@@ -2957,7 +2959,7 @@ create_agg_path(PlannerInfo *root,
list_length(groupClause), numGroups,
qual,
subpath->startup_cost, subpath->total_cost,
- subpath->rows);
+ subpath->rows, subpath->pathtarget->width);
/* add tlist eval cost for each output row */
pathnode->path.startup_cost += target->cost.startup;
@@ -3036,6 +3038,7 @@ create_groupingsets_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->rollups = rollups;
pathnode->qual = having_qual;
+ pathnode->transitionSpace = agg_costs ? agg_costs->transitionSpace : 0;
Assert(rollups != NIL);
Assert(aggstrategy != AGG_PLAIN || list_length(rollups) == 1);
@@ -3067,7 +3070,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
subpath->startup_cost,
subpath->total_cost,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
is_first = false;
if (!rollup->is_hashed)
is_first_sort = false;
@@ -3090,7 +3094,8 @@ create_groupingsets_path(PlannerInfo *root,
rollup->numGroups,
having_qual,
0.0, 0.0,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
if (!rollup->is_hashed)
is_first_sort = false;
}
@@ -3115,7 +3120,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
sort_path.startup_cost,
sort_path.total_cost,
- sort_path.rows);
+ sort_path.rows,
+ subpath->pathtarget->width);
}
pathnode->path.total_cost += agg_path.total_cost;
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb196444198..1151b807418 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -120,6 +120,7 @@ bool enableFsync = true;
bool allowSystemTableMods = false;
int work_mem = 1024;
int maintenance_work_mem = 16384;
+bool hashagg_mem_overflow = false;
int max_parallel_maintenance_workers = 2;
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8228e1f3903..ed6737a8ac9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -998,6 +998,26 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_hashagg_spill", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of hashed aggregation plans that are expected to exceed work_mem."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_hashagg_spill,
+ true,
+ NULL, NULL, NULL
+ },
+ {
+ {"hashagg_mem_overflow", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables hashed aggregation to overflow work_mem at execution time."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &hashagg_mem_overflow,
+ false,
+ NULL, NULL, NULL
+ },
{
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of materialization."),
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index fd7624c2312..5f9059aabd2 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -201,6 +201,7 @@ static long ltsGetFreeBlock(LogicalTapeSet *lts);
static void ltsReleaseBlock(LogicalTapeSet *lts, long blocknum);
static void ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
SharedFileSet *fileset);
+static void ltsInitTape(LogicalTape *lt);
static void ltsInitReadBuffer(LogicalTapeSet *lts, LogicalTape *lt);
@@ -536,6 +537,30 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
lts->nHoleBlocks = lts->nBlocksAllocated - nphysicalblocks;
}
+/*
+ * Initialize per-tape struct. Note we allocate the I/O buffer and the first
+ * block for a tape only when it is first actually written to. This avoids
+ * wasting memory space when tuplesort.c overestimates the number of tapes
+ * needed.
+ */
+static void
+ltsInitTape(LogicalTape *lt)
+{
+ lt->writing = true;
+ lt->frozen = false;
+ lt->dirty = false;
+ lt->firstBlockNumber = -1L;
+ lt->curBlockNumber = -1L;
+ lt->nextBlockNumber = -1L;
+ lt->offsetBlockNumber = 0L;
+ lt->buffer = NULL;
+ lt->buffer_size = 0;
+ /* palloc() larger than MaxAllocSize would fail */
+ lt->max_size = MaxAllocSize;
+ lt->pos = 0;
+ lt->nbytes = 0;
+}
+
/*
* Lazily allocate and initialize the read buffer. This avoids waste when many
* tapes are open at once, but not all are active between rewinding and
@@ -582,7 +607,6 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
int worker)
{
LogicalTapeSet *lts;
- LogicalTape *lt;
int i;
/*
@@ -600,29 +624,8 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
lts->nFreeBlocks = 0;
lts->nTapes = ntapes;
- /*
- * Initialize per-tape structs. Note we allocate the I/O buffer and the
- * first block for a tape only when it is first actually written to. This
- * avoids wasting memory space when tuplesort.c overestimates the number
- * of tapes needed.
- */
for (i = 0; i < ntapes; i++)
- {
- lt = &lts->tapes[i];
- lt->writing = true;
- lt->frozen = false;
- lt->dirty = false;
- lt->firstBlockNumber = -1L;
- lt->curBlockNumber = -1L;
- lt->nextBlockNumber = -1L;
- lt->offsetBlockNumber = 0L;
- lt->buffer = NULL;
- lt->buffer_size = 0;
- /* palloc() larger than MaxAllocSize would fail */
- lt->max_size = MaxAllocSize;
- lt->pos = 0;
- lt->nbytes = 0;
- }
+ ltsInitTape(&lts->tapes[i]);
/*
* Create temp BufFile storage as required.
@@ -1010,6 +1013,29 @@ LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum, TapeShare *share)
}
}
+/*
+ * Add additional tapes to this tape set. Not intended to be used when any
+ * tapes are frozen.
+ */
+LogicalTapeSet *
+LogicalTapeSetExtend(LogicalTapeSet *lts, int nAdditional)
+{
+ int i;
+ int nTapesOrig = lts->nTapes;
+ Size newSize;
+
+ lts->nTapes += nAdditional;
+ newSize = offsetof(LogicalTapeSet, tapes) +
+ lts->nTapes * sizeof(LogicalTape);
+
+ lts = (LogicalTapeSet *) repalloc(lts, newSize);
+
+ for (i = nTapesOrig; i < lts->nTapes; i++)
+ ltsInitTape(&lts->tapes[i]);
+
+ return lts;
+}
+
/*
* Backspace the tape a given number of bytes. (We also support a more
* general seek interface, see below.)
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 73a2ca8c6dd..d70bc048c46 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -226,9 +226,13 @@ typedef enum ExprEvalOp
EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
EEOP_AGG_INIT_TRANS,
+ EEOP_AGG_INIT_TRANS_SPILLED,
EEOP_AGG_STRICT_TRANS_CHECK,
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
EEOP_AGG_PLAIN_TRANS_BYVAL,
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
EEOP_AGG_PLAIN_TRANS,
+ EEOP_AGG_PLAIN_TRANS_SPILLED,
EEOP_AGG_ORDERED_TRANS_DATUM,
EEOP_AGG_ORDERED_TRANS_TUPLE,
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 81fdfa4add3..d6eb2abb60b 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -255,7 +255,7 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
- bool doSort, bool doHash);
+ bool doSort, bool doHash, bool spilled);
extern ExprState *ExecBuildGroupingEqual(TupleDesc ldesc, TupleDesc rdesc,
const TupleTableSlotOps *lops, const TupleTableSlotOps *rops,
int numCols,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 264916f9a92..307987a45ab 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -311,5 +311,8 @@ extern void ExecReScanAgg(AggState *node);
extern Size hash_agg_entry_size(int numAggs, Size tupleWidth,
Size transitionSpace);
+extern void hash_agg_set_limits(double hashentrysize, uint64 input_groups,
+ int used_bits, Size *mem_limit,
+ long *ngroups_limit, int *num_partitions);
#endif /* NODEAGG_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f985453ec32..707a07a2de4 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -244,6 +244,7 @@ extern bool enableFsync;
extern PGDLLIMPORT bool allowSystemTableMods;
extern PGDLLIMPORT int work_mem;
extern PGDLLIMPORT int maintenance_work_mem;
+extern PGDLLIMPORT bool hashagg_mem_overflow;
extern PGDLLIMPORT int max_parallel_maintenance_workers;
extern int VacuumCostPageHit;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cd3ddf781f1..19b9cef42f6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2078,13 +2078,32 @@ typedef struct AggState
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
/* these fields are used in AGG_HASHED and AGG_MIXED modes: */
bool table_filled; /* hash table filled yet? */
- int num_hashes;
+ int num_hashes; /* number of hash tables active at once */
+ double hashentrysize; /* estimate revised during execution */
+ struct HashTapeInfo *hash_tapeinfo; /* metadata for spill tapes */
+ struct HashAggSpill *hash_spills; /* HashAggSpill for each hash table,
+ exists only during first pass if spilled */
+ TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
+ bool hash_ever_spilled; /* ever spilled during this execution? */
+ bool hash_spill_mode; /* we hit a limit during the current batch
+ and we must not create new groups */
+ Size hash_alloc_last; /* previous total memory allocation */
+ Size hash_alloc_current; /* current total memory allocation */
+ Size hash_mem_limit; /* limit before spilling hash table */
+ Size hash_mem_peak; /* peak hash table memory usage */
+ long hash_ngroups_current; /* number of groups currently in
+ memory in all hash tables */
+ long hash_ngroups_limit; /* limit before spilling hash table */
+ long hash_disk_used; /* kB of disk space used */
+ int hash_batches_used; /* batches used during entire execution */
+ List *hash_batches; /* hash batches remaining to be processed */
+
AggStatePerHash perhash; /* array of per-hashtable data */
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 34
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 49
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 3d3be197e0e..be592d0fee4 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1663,6 +1663,7 @@ typedef struct AggPath
AggStrategy aggstrategy; /* basic strategy, see nodes.h */
AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
double numGroups; /* estimated number of groups in input */
+ int32 transitionSpace; /* estimated transition state size */
List *groupClause; /* a list of SortGroupClause's */
List *qual; /* quals (HAVING quals), if any */
} AggPath;
@@ -1700,6 +1701,7 @@ typedef struct GroupingSetsPath
AggStrategy aggstrategy; /* basic strategy */
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
+ int32 transitionSpace; /* estimated transition state size */
} GroupingSetsPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 32c0d87f80e..f4183e1efa5 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -813,6 +813,7 @@ typedef struct Agg
Oid *grpOperators; /* equality operators to compare with */
Oid *grpCollations;
long numGroups; /* estimated number of groups in input */
+ int32 transitionSpace; /* estimated transition state size */
Bitmapset *aggParams; /* IDs of Params used in Aggref inputs */
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
List *groupingSets; /* grouping sets to use */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index cb012ba1980..6572dc24699 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -54,6 +54,7 @@ extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
extern PGDLLIMPORT bool enable_hashagg;
+extern PGDLLIMPORT bool enable_hashagg_spill;
extern PGDLLIMPORT bool enable_nestloop;
extern PGDLLIMPORT bool enable_material;
extern PGDLLIMPORT bool enable_mergejoin;
@@ -114,7 +115,7 @@ extern void cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples);
+ double input_tuples, double input_width);
extern void cost_windowagg(Path *path, PlannerInfo *root,
List *windowFuncs, int numPartCols, int numOrderCols,
Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index eab486a6214..c7bda2b0917 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -54,8 +54,8 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree);
+ List *groupingSets, List *chain, double dNumGroups,
+ int32 transitionSpace, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
/*
diff --git a/src/include/utils/logtape.h b/src/include/utils/logtape.h
index 695d2c00ee4..3ebe52239f8 100644
--- a/src/include/utils/logtape.h
+++ b/src/include/utils/logtape.h
@@ -67,6 +67,8 @@ extern void LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum,
extern void LogicalTapeRewindForWrite(LogicalTapeSet *lts, int tapenum);
extern void LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum,
TapeShare *share);
+extern LogicalTapeSet *LogicalTapeSetExtend(LogicalTapeSet *lts,
+ int nAdditional);
extern size_t LogicalTapeBackspace(LogicalTapeSet *lts, int tapenum,
size_t size);
extern void LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index f457b5b150f..7eeeaaa5e4a 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2357,3 +2357,124 @@ explain (costs off)
-> Seq Scan on onek
(8 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+set work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------------
+ GroupAggregate
+ Group Key: ((g % 100000))
+ -> Sort
+ Sort Key: ((g % 100000))
+ -> Function Scan on generate_series g
+(5 rows)
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+create table agg_group_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+set jit_above_cost to default;
+create table agg_group_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_group_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+-- Produce results with hash aggregation
+set enable_hashagg = true;
+set enable_sort = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 100000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+create table agg_hash_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+set jit_above_cost to default;
+create table agg_hash_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_hash_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+set enable_sort = true;
+set work_mem to default;
+-- Compare group aggregation results to hash aggregation results
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+ a | c1 | c2 | c3
+---+----+----+----
+(0 rows)
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_4 except select * from agg_group_4)
+ union all
+(select * from agg_group_4 except select * from agg_hash_4);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_group_4;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
+drop table agg_hash_4;
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index c1f802c88a7..767f60a96c7 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1633,4 +1633,127 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
| 1 | 2
(4 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+-- Produce results with hash aggregation.
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+set enable_sort = true;
+set work_mem to default;
+-- Compare results
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+ g100 | g10 | unnest | c | m
+------+-----+--------+---+---
+(0 rows)
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
-- end
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1de..11c6f50fbfa 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -148,6 +148,68 @@ SELECT count(*) FROM
4
(1 row)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+SET enable_hashagg=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------------
+ Unique
+ -> Sort
+ Sort Key: ((g % 1000))
+ -> Function Scan on generate_series g
+(4 rows)
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_hashagg=TRUE;
+-- Produce results with hash aggregation.
+SET enable_sort=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 1000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_sort=TRUE;
+SET work_mem TO DEFAULT;
+-- Compare results
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+(SELECT * FROM distinct_hash_2 EXCEPT SELECT * FROM distinct_group_2)
+ UNION ALL
+(SELECT * FROM distinct_group_2 EXCEPT SELECT * FROM distinct_hash_2);
+ text
+------
+(0 rows)
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb9057..c40bf6c16eb 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -75,6 +75,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
+ enable_hashagg_spill | on
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 3e593f2d615..a4d476c4bb3 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1032,3 +1032,119 @@ select v||'a', case when v||'a' = 'aa' then 1 else 0 end, count(*)
explain (costs off)
select 1 from tenk1
where (hundred, thousand) in (select twothousand, twothousand from onek);
+
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+set work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+
+set jit_above_cost to default;
+
+create table agg_group_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_group_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+-- Produce results with hash aggregation
+
+set enable_hashagg = true;
+set enable_sort = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+
+set jit_above_cost to default;
+
+create table agg_hash_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_hash_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare group aggregation results to hash aggregation results
+
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+
+(select * from agg_hash_4 except select * from agg_group_4)
+ union all
+(select * from agg_group_4 except select * from agg_hash_4);
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_group_4;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
+drop table agg_hash_4;
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 95ac3fb52f6..bf8bce6ed31 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -441,4 +441,103 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
from unnest(array[1,1], array['a','b']) u(i,v)
group by rollup(i, v||'a') order by 1,3;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+-- Produce results with hash aggregation.
+
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare results
+
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
+
-- end
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449e..33102744ebf 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -45,6 +45,68 @@ SELECT count(*) FROM
SELECT count(*) FROM
(SELECT DISTINCT two, four, two FROM tenk1) ss;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+SET enable_hashagg=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_hashagg=TRUE;
+
+-- Produce results with hash aggregation.
+
+SET enable_sort=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_sort=TRUE;
+
+SET work_mem TO DEFAULT;
+
+-- Compare results
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+(SELECT * FROM distinct_hash_2 EXCEPT SELECT * FROM distinct_group_2)
+ UNION ALL
+(SELECT * FROM distinct_group_2 EXCEPT SELECT * FROM distinct_hash_2);
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
+
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
On Tue, Feb 18, 2020 at 10:57 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
Hi,
I wanted to take a look at this thread and do a review, but it's not
very clear to me if the recent patches posted here are independent or
how exactly they fit together. I see:

1) hashagg-20200212-1.patch (2020/02/13 by Jeff)
2) refactor.patch (2020/02/13 by Jeff)
3) v1-0001-aggregated-unaggregated-cols-together.patch (2020/02/14 by Melanie)

I suppose this also confuses the cfbot - it's probably only testing (3)
as it's the last thing posted here, at least I think that's the case.

And it fails:
nodeAgg.c: In function ‘find_aggregated_cols_walker’:
nodeAgg.c:1208:2: error: ISO C90 forbids mixed declarations and code
[-Werror=declaration-after-statement]
FindColsContext *find_cols_context = (FindColsContext *) context;
^
nodeAgg.c: In function ‘find_unaggregated_cols_walker’:
nodeAgg.c:1225:2: error: ISO C90 forbids mixed declarations and code
[-Werror=declaration-after-statement]
FindColsContext *find_cols_context = (FindColsContext *) context;
^
cc1: all warnings being treated as errors
<builtin>: recipe for target 'nodeAgg.o' failed
make[3]: *** [nodeAgg.o] Error 1
make[3]: *** Waiting for unfinished jobs....
Oops! Sorry, I would fix the code that those compiler warnings are
complaining about, but that would confuse the cfbot even more. So I'll let
Jeff decide what he wants to do with the patch (e.g. include it in his
overall patch or exclude it for now). Anyway, it is trivial to move those
declarations up, were he to decide to include it.
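
For anyone unfamiliar with -Wdeclaration-after-statement, the fix really is
mechanical. A minimal sketch of the pattern (the surrounding statement is
illustrative only, not copied from the patch):

/* rejected under ISO C90: a declaration after a statement */
if (node == NULL)
    return false;
FindColsContext *find_cols_context = (FindColsContext *) context;

/* accepted: hoist the declaration to the top of the block */
FindColsContext *find_cols_context = (FindColsContext *) context;

if (node == NULL)
    return false;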
--
Melanie Plageman
Hi,
I've started reviewing the 20200218 version of the patch. In general it
seems fine, but I have a couple minor comments and two crashes.
1) explain.c currently does this:
I wonder if we could show something for plain explain (without analyze).
At least the initial estimate of partitions, etc. I know not showing
those details until after execution is what e.g. sort does, but I find
it a bit annoying.
A related comment is that maybe this should report also the initial
number of partitions, not just the total number. With just the total
it's impossible to say if there were any repartitions, etc.
2) The ExecBuildAggTrans comment should probably explain "spilled".
3) I wonder if we need to invent new opcodes? Wouldn't it be simpler to
just add a new flag to the agg_* structs instead? I haven't tried hacking
this, so maybe it's a silly idea.
4) lookup_hash_entries says
/* check to see if we need to spill the tuple for this grouping set */
But that seems bogus, because AFAIK we can't spill tuples for grouping
sets. So maybe this should say just "grouping"?
5) Assert(nbuckets > 0);
I was curious what happens in case of extreme skew, when a lot of rows (or
all of them) consistently fall into a single partition. So I did this:
create table t (a int, b real);
insert into t select i, random()
from generate_series(-2000000000, 2000000000) s(i)
where mod(hashint4(i), 16384) = 0;
analyze t;
set work_mem = '64kB';
set max_parallel_workers_per_gather = 0;
set enable_sort = 0;
explain select a, sum(b) from t group by a;
QUERY PLAN
---------------------------------------------------------------
HashAggregate (cost=23864.26..31088.52 rows=244631 width=8)
Group Key: a
-> Seq Scan on t (cost=0.00..3529.31 rows=244631 width=8)
(3 rows)
This however quickly fails on this assert in BuildTupleHashTableExt (see
backtrace1.txt):
Assert(nbuckets > 0);
The value is computed in hash_choose_num_buckets, and there seem to be
no protections against returning bogus values like 0. So maybe we should
return
Max(nbuckets, 1024)
or something like that, similarly to hash join. OTOH maybe it's simply
due to agg_refill_hash_table() passing bogus values to the function?
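
A minimal sketch of the suggested clamp, assuming it goes at the end of
hash_choose_num_buckets() (the exact placement is an assumption):

    /* guard against a bogus (e.g. zero) bucket count, like hash join does */
    nbuckets = Max(nbuckets, 1024);

    return nbuckets;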
6) Another thing that occurred to me was what happens to grouping sets,
which we can't spill to disk. So I did this:
create table t2 (a int, b int, c int);
-- run repeatedly, until there are about 20M rows in t2 (1GB)
with tx as (select array_agg(a) as a, array_agg(b) as b
from (select a, b from t order by random()) foo),
ty as (select array_agg(a) AS a
from (select a from t order by random()) foo)
insert into t2 select unnest(tx.a), unnest(ty.a), unnest(tx.b)
from tx, ty;
analyze t2;
This produces a table with two independent columns, skewed the same as
the column t.a. I don't know which of these actually matters, considering
grouping sets don't spill, so maybe the independence is sufficient and
the skew may be irrelevant?
And then do this:
set work_mem = '200MB';
set max_parallel_workers_per_gather = 0;
set enable_sort = 0;
explain select a, b, sum(c) from t2 group by cube (a,b);
QUERY PLAN
---------------------------------------------------------------------
MixedAggregate (cost=0.00..833064.27 rows=2756495 width=16)
Hash Key: a, b
Hash Key: a
Hash Key: b
Group Key: ()
-> Seq Scan on t2 (cost=0.00..350484.44 rows=22750744 width=12)
(6 rows)
which fails with segfault at execution time:
tuplehash_start_iterate (tb=0x18, iter=iter@entry=0x2349340)
870 for (i = 0; i < tb->size; i++)
(gdb) bt
#0 tuplehash_start_iterate (tb=0x18, iter=iter@entry=0x2349340)
#1 0x0000000000654e49 in agg_retrieve_hash_table_in_memory ...
That's not surprising, because the 0x18 pointer is obviously bogus. I guess
this is simply an offset of 0x18 (24 bytes) added to a NULL pointer?
Disabling hashagg spill (setting both GUCs to off) makes no difference,
but on master it fails like this:
ERROR: out of memory
DETAIL: Failed on request of size 3221225472 in memory context "ExecutorState".
which is annoying, but expected with an under-estimate and hashagg. And
much better than just crashing the whole cluster.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Feb 19, 2020 at 08:16:36PM +0100, Tomas Vondra wrote:
4) lookup_hash_entries says
/* check to see if we need to spill the tuple for this grouping set */
But that seems bogus, because AFAIK we can't spill tuples for grouping
sets. So maybe this should say just "grouping"?
As I see it, it does traverse all the hash sets for each tuple, filling the
hash table and spilling the tuple if needed.
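
Condensed from lookup_hash_entries() in the patch (declarations, spill
initialization, and the disk-usage bookkeeping are omitted here):

for (setno = 0; setno < aggstate->num_hashes; setno++)
{
    select_current_set(aggstate, setno, true);
    prepare_hash_slot(aggstate);
    hash = TupleHashTableHash(perhash->hashtable, perhash->hashslot);
    pergroup[setno] = lookup_hash_entry(aggstate, hash);

    /* no room for a new group in this set's hash table: spill the tuple */
    if (pergroup[setno] == NULL)
        hashagg_spill_tuple(&aggstate->hash_spills[setno],
                            aggstate->tmpcontext->ecxt_outertuple, hash);
}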
The segfault is probably related to this and to MixedAggregate; I'm looking
into it.
--
Adam Lee
On Wed, Feb 19, 2020 at 08:16:36PM +0100, Tomas Vondra wrote:
5) Assert(nbuckets > 0);
...
This however quickly fails on this assert in BuildTupleHashTableExt (see
backtrace1.txt):

Assert(nbuckets > 0);
The value is computed in hash_choose_num_buckets, and there seem to be
no protections against returning bogus values like 0. So maybe we should
return

Max(nbuckets, 1024)

or something like that, similarly to hash join. OTOH maybe it's simply
due to agg_refill_hash_table() passing bogus values to the function?

6) Another thing that occurred to me was what happens to grouping sets,
which we can't spill to disk. So I did this:

create table t2 (a int, b int, c int);
-- run repeatedly, until there are about 20M rows in t2 (1GB)
with tx as (select array_agg(a) as a, array_agg(b) as b
from (select a, b from t order by random()) foo),
ty as (select array_agg(a) AS a
from (select a from t order by random()) foo)
insert into t2 select unnest(tx.a), unnest(ty.a), unnest(tx.b)
from tx, ty;

analyze t2;
...which fails with segfault at execution time:
tuplehash_start_iterate (tb=0x18, iter=iter@entry=0x2349340)
870 for (i = 0; i < tb->size; i++)
(gdb) bt
#0 tuplehash_start_iterate (tb=0x18, iter=iter@entry=0x2349340)
#1 0x0000000000654e49 in agg_retrieve_hash_table_in_memory ...

That's not surprising, because the 0x18 pointer is obviously bogus. I guess
this is simply an offset of 0x18 (24 bytes) added to a NULL pointer?
I did some investigation. Did you disable assertions when this panic
happened? If so, it's the same issue as "5) nbuckets == 0": a zero size is
passed to the allocator when creating the hash table that ends up at 0x18.
Sorry, my testing environment is acting up right now, so I haven't
reproduced it yet.
--
Adam Lee
On Wed, 2020-02-19 at 20:16 +0100, Tomas Vondra wrote:
1) explain.c currently does this:
I wonder if we could show something for plain explain (without analyze).
At least the initial estimate of partitions, etc. I know not showing
those details until after execution is what e.g. sort does, but I find
it a bit annoying.
Looks like you meant to include some example explain output, but I
think I understand what you mean. I'll look into it.
2) The ExecBuildAggTrans comment should probably explain "spilled".
Done.
3) I wonder if we need to invent new opcodes? Wouldn't it be simpler to
just add a new flag to the agg_* structs instead? I haven't tried hacking
this, so maybe it's a silly idea.
There was a reason I didn't do it this way, but I'm trying to remember
why. I'll look into this, also.
4) lookup_hash_entries says

/* check to see if we need to spill the tuple for this grouping set */

But that seems bogus, because AFAIK we can't spill tuples for grouping
sets. So maybe this should say just "grouping"?
Yes, we can spill tuples for grouping sets. Unfortunately, my tests (which
covered this case previously) don't seem to be exercising that path well
now. I am going to improve my tests, too.
5) Assert(nbuckets > 0);
I did not repro this issue, but I did set a floor of 256 buckets.
which fails with segfault at execution time:
Fixed. I was resetting the hash table context without setting the
pointers to NULL.
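
The relevant bit in the new version is roughly this: reset the hash context
and then clear the now-dangling table pointers before rebuilding, as in
agg_refill_hash_table():

ReScanExprContext(aggstate->hashcontext);
for (setno = 0; setno < aggstate->num_hashes; setno++)
    aggstate->perhash[setno].hashtable = NULL;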
Thanks! I attached a new, rebased version. The fixes are quick fixes
for now and I will revisit them after I improve my test cases (which
might find more issues).
Regards,
Jeff Davis
Attachments:
hashagg-20200220.patch (text/x-patch; charset=UTF-8)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c1128f89ec7..85f559387f9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1751,6 +1751,23 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-hashagg-mem-overflow" xreflabel="hashagg_mem_overflow">
+ <term><varname>hashagg_mem_overflow</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>hashagg_mem_overflow</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ If hash aggregation exceeds <varname>work_mem</varname> at query
+ execution time, and <varname>hashagg_mem_overflow</varname> is set
+ to <literal>on</literal>, continue consuming more memory rather than
+ performing disk-based hash aggregation. The default
+ is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
@@ -4476,6 +4493,24 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-hashagg-spill" xreflabel="enable_hashagg_spill">
+ <term><varname>enable_hashagg_spill</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_hashagg_spill</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the memory usage is expected to
+ exceed <varname>work_mem</varname>. This only affects the planner
+ choice; actual behavior at execution time is dictated by
+ <xref linkend="guc-hashagg-mem-overflow"/>. The default
+ is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
<term><varname>enable_hashjoin</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d901dc4a50e..2923f4ba46d 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -104,6 +104,7 @@ static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_hashagg_info(AggState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
static void show_instrumentation_count(const char *qlabel, int which,
@@ -1882,6 +1883,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Agg:
show_agg_keys(castNode(AggState, planstate), ancestors, es);
show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+ if (es->analyze)
+ show_hashagg_info((AggState *) planstate, es);
if (plan->qual)
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
@@ -2769,6 +2772,55 @@ show_hash_info(HashState *hashstate, ExplainState *es)
}
}
+/*
+ * If EXPLAIN ANALYZE, show information on hash aggregate memory usage and
+ * batches.
+ */
+static void
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+
+ Assert(IsA(aggstate, AggState));
+
+ if (agg->aggstrategy != AGG_HASHED &&
+ agg->aggstrategy != AGG_MIXED)
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(
+ es->str,
+ "Memory Usage: %ldkB",
+ memPeakKb);
+
+ if (aggstate->hash_batches_used > 0)
+ {
+ appendStringInfo(
+ es->str,
+ " Batches: %d Disk: %ldkB",
+ aggstate->hash_batches_used, aggstate->hash_disk_used);
+ }
+
+ appendStringInfo(
+ es->str,
+ "\n");
+ }
+ else
+ {
+ ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
+ if (aggstate->hash_batches_used > 0)
+ {
+ ExplainPropertyInteger("HashAgg Batches", NULL,
+ aggstate->hash_batches_used, es);
+ ExplainPropertyInteger("Disk Usage", "kB",
+ aggstate->hash_disk_used, es);
+ }
+ }
+}
+
/*
* If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
*/
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 121eff97a0c..5236d6f3935 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -79,7 +79,8 @@ static void ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash);
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled);
/*
@@ -2924,10 +2925,13 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
* check for filters, evaluate aggregate input, check that that input is not
* NULL for a strict transition function, and then finally invoke the
* transition for each of the concurrently computed grouping sets.
+ *
+ * If "spilled" is true, the generated code will take into account the
+ * possibility that a Hash Aggregation has spilled to disk.
*/
ExprState *
ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
- bool doSort, bool doHash)
+ bool doSort, bool doHash, bool spilled)
{
ExprState *state = makeNode(ExprState);
PlanState *parent = &aggstate->ss.ps;
@@ -3158,7 +3162,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (int setno = 0; setno < processGroupingSets; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, false);
+ pertrans, transno, setno, setoff, false,
+ spilled);
setoff++;
}
}
@@ -3177,7 +3182,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (int setno = 0; setno < numHashes; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, true);
+ pertrans, transno, setno, setoff, true,
+ spilled);
setoff++;
}
}
@@ -3227,7 +3233,8 @@ static void
ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash)
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled)
{
int adjust_init_jumpnull = -1;
int adjust_strict_jumpnull = -1;
@@ -3249,7 +3256,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
fcinfo->flinfo->fn_strict &&
pertrans->initValueIsNull)
{
- scratch->opcode = EEOP_AGG_INIT_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_INIT_TRANS_SPILLED : EEOP_AGG_INIT_TRANS;
scratch->d.agg_init_trans.pertrans = pertrans;
scratch->d.agg_init_trans.setno = setno;
scratch->d.agg_init_trans.setoff = setoff;
@@ -3265,7 +3273,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
if (pertrans->numSortCols == 0 &&
fcinfo->flinfo->fn_strict)
{
- scratch->opcode = EEOP_AGG_STRICT_TRANS_CHECK;
+ scratch->opcode = spilled ?
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED : EEOP_AGG_STRICT_TRANS_CHECK;
scratch->d.agg_strict_trans_check.setno = setno;
scratch->d.agg_strict_trans_check.setoff = setoff;
scratch->d.agg_strict_trans_check.transno = transno;
@@ -3282,9 +3291,11 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
/* invoke appropriate transition implementation */
if (pertrans->numSortCols == 0 && pertrans->transtypeByVal)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS_BYVAL;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED : EEOP_AGG_PLAIN_TRANS_BYVAL;
else if (pertrans->numSortCols == 0)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_SPILLED : EEOP_AGG_PLAIN_TRANS;
else if (pertrans->numInputs == 1)
scratch->opcode = EEOP_AGG_ORDERED_TRANS_DATUM;
else
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 35eb8b99f69..e21e0c440ea 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -426,9 +426,13 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
&&CASE_EEOP_AGG_INIT_TRANS,
+ &&CASE_EEOP_AGG_INIT_TRANS_SPILLED,
&&CASE_EEOP_AGG_STRICT_TRANS_CHECK,
+ &&CASE_EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_SPILLED,
&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
&&CASE_EEOP_LAST
@@ -1619,6 +1623,35 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_init_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
+
+ /* If transValue has not yet been initialized, do so now. */
+ if (pergroup->noTransValue)
+ {
+ AggStatePerTrans pertrans = op->d.agg_init_trans.pertrans;
+
+ aggstate->curaggcontext = op->d.agg_init_trans.aggcontext;
+ aggstate->current_set = op->d.agg_init_trans.setno;
+
+ ExecAggInitGroup(aggstate, pertrans, pergroup);
+
+ /* copied trans value from input, done this round */
+ EEO_JUMP(op->d.agg_init_trans.jumpnull);
+ }
+
+ EEO_NEXT();
+ }
/* check that a strict aggregate's input isn't NULL */
EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK)
@@ -1635,6 +1668,24 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_strict_trans_check.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_strict_trans_check.transno];
+
+ if (unlikely(pergroup->transValueIsNull))
+ EEO_JUMP(op->d.agg_strict_trans_check.jumpnull);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1683,6 +1734,51 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ Assert(pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1726,6 +1822,66 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
newVal = FunctionCallInvoke(fcinfo);
+ /*
+ * For pass-by-ref datatype, must copy the new value into
+ * aggcontext and free the prior transValue. But if transfn
+ * returned a pointer to its first input, we don't need to do
+ * anything. Also, if transfn returned a pointer to a R/W
+ * expanded object that is already a child of the aggcontext,
+ * assume we can adopt that value without copying it.
+ */
+ if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+ newVal = ExecAggTransReparent(aggstate, pertrans,
+ newVal, fcinfo->isnull,
+ pergroup->transValue,
+ pergroup->transValueIsNull);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ Assert(!pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
/*
* For pass-by-ref datatype, must copy the new value into
* aggcontext and free the prior transValue. But if transfn
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 2e9a21bf400..517a2649f7e 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -194,6 +194,29 @@
* transition values. hashcontext is the single context created to support
* all hash tables.
*
+ * Spilling To Disk
+ *
+ * When performing hash aggregation, if the hash table memory exceeds the
+ * limit (see hash_agg_check_limits()), we enter "spill mode". In spill
+ * mode, we advance the transition states only for groups already in the
+ * hash table. For tuples that would need to create a new hash table
+ * entries (and initialize new transition states), we instead spill them to
+ * disk to be processed later. The tuples are spilled in a partitioned
+ * manner, so that subsequent batches are smaller and less likely to exceed
+ * work_mem (if a batch does exceed work_mem, it must be spilled
+ * recursively).
+ *
+ * Spilled data is written to logical tapes. These provide better control
+ * over memory usage, disk space, and the number of files than if we were
+ * to use a BufFile for each spill.
+ *
+ * Note that it's possible for transition states to start small but then
+ * grow very large; for instance in the case of ARRAY_AGG. In such cases,
+ * it's still possible to significantly exceed work_mem. We try to avoid
+ * this situation by estimating what will fit in the available memory, and
+ * imposing a limit on the number of groups separately from the amount of
+ * memory consumed.
+ *
* Transition / Combine function invocation:
*
* For performance reasons transition functions, including combine
@@ -233,12 +256,100 @@
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/datum.h"
+#include "utils/dynahash.h"
#include "utils/expandeddatum.h"
+#include "utils/logtape.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+/*
+ * Control how many partitions are created when spilling HashAgg to
+ * disk.
+ *
+ * HASHAGG_PARTITION_FACTOR is multiplied by the estimated number of
+ * partitions needed such that each partition will fit in memory. The factor
+ * is set higher than one because there's not a high cost to having a few too
+ * many partitions, and it makes it less likely that a partition will need to
+ * be spilled recursively. Another benefit of having more, smaller partitions
+ * is that small hash tables may perform better than large ones due to memory
+ * caching effects.
+ *
+ * We also specify a min and max number of partitions per spill. Too few might
+ * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
+ * many will result in lots of memory wasted buffering the spill files (which
+ * could instead be spent on a larger hash table).
+ *
+ * For reading from tapes, the buffer size must be a multiple of
+ * BLCKSZ. Larger values help when reading from multiple tapes concurrently,
+ * but that doesn't happen in HashAgg, so we simply use BLCKSZ. Writing to a
+ * tape always uses a buffer of size BLCKSZ.
+ */
+#define HASHAGG_PARTITION_FACTOR 1.50
+#define HASHAGG_MIN_PARTITIONS 4
+#define HASHAGG_MAX_PARTITIONS 256
+#define HASHAGG_MIN_BUCKETS 256
+#define HASHAGG_READ_BUFFER_SIZE BLCKSZ
+#define HASHAGG_WRITE_BUFFER_SIZE BLCKSZ
+
+/*
+ * Track all tapes needed for a HashAgg that spills. We don't know the maximum
+ * number of tapes needed at the start of the algorithm (because it can
+ * recurse), so one tape set is allocated and extended as needed for new
+ * tapes. When a particular tape is already read, rewind it for write mode and
+ * put it in the free list.
+ *
+ * Tapes' buffers can take up substantial memory when many tapes are open at
+ * once. We only need one tape open at a time in read mode (using a buffer
+ * that's a multiple of BLCKSZ); but we need up to HASHAGG_MAX_PARTITIONS
+ * tapes open in write mode (each requiring a buffer of size BLCKSZ).
+ */
+typedef struct HashTapeInfo
+{
+ LogicalTapeSet *tapeset;
+ int ntapes;
+ int *freetapes;
+ int nfreetapes;
+} HashTapeInfo;
+
+/*
+ * Represents partitioned spill data for a single hashtable. Contains the
+ * necessary information to route tuples to the correct partition, and to
+ * transform the spilled data into new batches.
+ *
+ * The high bits are used for partition selection (when recursing, we ignore
+ * the bits that have already been used for partition selection at an earlier
+ * level).
+ */
+typedef struct HashAggSpill
+{
+ HashTapeInfo *tapeinfo; /* borrowed reference to tape info */
+ int npartitions; /* number of partitions */
+ int *partitions; /* spill partition tape numbers */
+ int64 *ntuples; /* number of tuples in each partition */
+ uint32 mask; /* mask to find partition from hash value */
+ int shift; /* after masking, shift by this amount */
+} HashAggSpill;
+
+/*
+ * Represents work to be done for one pass of hash aggregation (with only one
+ * grouping set).
+ *
+ * Also tracks the bits of the hash already used for partition selection by
+ * earlier iterations, so that this batch can use new bits. If all bits have
+ * already been used, no partitioning will be done (any spilled data will go
+ * to a single output tape).
+ */
+typedef struct HashAggBatch
+{
+ int setno; /* grouping set */
+ int used_bits; /* number of bits of hash already used */
+ HashTapeInfo *tapeinfo; /* borrowed reference to tape info */
+ int input_tapenum; /* input partition tape */
+ int64 input_tuples; /* number of tuples in this batch */
+} HashAggBatch;
+
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
static void initialize_phase(AggState *aggstate, int newphase);
static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -275,11 +386,38 @@ static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
static void build_hash_tables(AggState *aggstate);
static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
+static void hashagg_recompile_expressions(AggState *aggstate);
+static long hash_choose_num_buckets(AggState *aggstate,
+ long estimated_nbuckets,
+ Size memory);
+static int hash_choose_num_partitions(uint64 input_groups,
+ double hashentrysize,
+ int used_bits,
+ int *log2_npartitions);
static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
+static bool agg_refill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
+static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
+static void hash_agg_check_limits(AggState *aggstate);
+static void hashagg_finish_initial_spills(AggState *aggstate);
+static void hashagg_reset_spill_state(AggState *aggstate);
+static HashAggBatch *hashagg_batch_new(HashTapeInfo *tapeinfo,
+ int input_tapenum, int setno,
+ int64 input_tuples, int used_bits);
+static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
+static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
+ int used_bits, uint64 input_tuples,
+ double hashentrysize);
+static Size hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot,
+ uint32 hash);
+static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
+ int setno);
+static void hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *dest,
+ int ndest);
+static void hashagg_tapeinfo_release(HashTapeInfo *tapeinfo, int tapenum);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
AggState *aggstate, EState *estate,
@@ -1261,7 +1399,7 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
}
/*
- * (Re-)initialize the hash table(s) to empty.
+ * Initialize the hash table(s).
*
* To implement hashed aggregation, we need a hashtable that stores a
* representative tuple and an array of AggStatePerGroup structs for each
@@ -1272,9 +1410,9 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
* We have a separate hashtable and associated perhash data structure for each
* grouping set for which we're doing hashing.
*
- * The contents of the hash tables always live in the hashcontext's per-tuple
- * memory context (there is only one of these for all tables together, since
- * they are all reset at the same time).
+ * The hash tables and their contents always live in the hashcontext's
+ * per-tuple memory context (there is only one of these for all tables
+ * together, since they are all reset at the same time).
*/
static void
build_hash_tables(AggState *aggstate)
@@ -1284,11 +1422,24 @@ build_hash_tables(AggState *aggstate)
for (setno = 0; setno < aggstate->num_hashes; ++setno)
{
AggStatePerHash perhash = &aggstate->perhash[setno];
+ long nbuckets;
+ Size memory;
Assert(perhash->aggnode->numGroups > 0);
- build_hash_table(aggstate, setno, perhash->aggnode->numGroups);
+ memory = aggstate->hash_mem_limit / aggstate->num_hashes;
+
+ /* choose reasonable number of buckets per hashtable */
+ nbuckets = hash_choose_num_buckets(
+ aggstate, perhash->aggnode->numGroups, memory);
+
+ build_hash_table(aggstate, setno, nbuckets);
}
+
+ aggstate->hash_alloc_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_alloc_last = aggstate->hash_alloc_current;
+ aggstate->hash_ngroups_current = 0;
}
/*
@@ -1298,7 +1449,7 @@ static void
build_hash_table(AggState *aggstate, int setno, long nbuckets)
{
AggStatePerHash perhash = &aggstate->perhash[setno];
- MemoryContext metacxt = aggstate->ss.ps.state->es_query_cxt;
+ MemoryContext metacxt;
MemoryContext hashcxt = aggstate->hashcontext->ecxt_per_tuple_memory;
MemoryContext tmpcxt = aggstate->tmpcontext->ecxt_per_tuple_memory;
Size additionalsize;
@@ -1306,6 +1457,12 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
Assert(aggstate->aggstrategy == AGG_HASHED ||
aggstate->aggstrategy == AGG_MIXED);
+ /*
+ * We don't try to preserve any part of the hash table. Set the metacxt to
+ * hashcxt, which will be reset for each batch.
+ */
+ metacxt = hashcxt;
+
/*
* Used to make sure initial hash table allocation does not exceed
* work_mem. Note that the estimate does not include space for
@@ -1481,14 +1638,250 @@ hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
transitionSpace;
}
+/*
+ * Recompile the expressions for advancing aggregates while hashing. This is
+ * necessary for certain kinds of state changes that affect the resulting
+ * expression. For instance, changing aggstate->hash_ever_spilled or
+ * aggstate->ss.ps.outerops requires recompilation.
+ *
+ * A compiled expression where hash_ever_spilled is true will work even when
+ * hash_spill_mode is false, because it merely introduces additional branches
+ * that are unnecessary when hash_spill_mode is false. That allows us to only
+ * recompile when hash_ever_spilled changes, rather than every time
+ * hash_spill_mode changes.
+ */
+static void
+hashagg_recompile_expressions(AggState *aggstate)
+{
+ AggStatePerPhase phase;
+
+ Assert(aggstate->aggstrategy == AGG_HASHED ||
+ aggstate->aggstrategy == AGG_MIXED);
+
+ if (aggstate->aggstrategy == AGG_HASHED)
+ phase = &aggstate->phases[0];
+ else /* AGG_MIXED */
+ phase = &aggstate->phases[1];
+
+ phase->evaltrans = ExecBuildAggTrans(
+ aggstate, phase,
+ aggstate->aggstrategy == AGG_MIXED ? true : false, /* dosort */
+ true, /* dohash */
+ aggstate->hash_ever_spilled);
+}
+
+/*
+ * Set limits that trigger spilling to avoid exceeding work_mem. Consider the
+ * number of partitions we expect to create (if we do spill).
+ *
+ * There are two limits: a memory limit, and also an ngroups limit. The
+ * ngroups limit becomes important when we expect transition values to grow
+ * substantially larger than the initial value.
+ */
+void
+hash_agg_set_limits(double hashentrysize, uint64 input_groups, int used_bits,
+ Size *mem_limit, long *ngroups_limit, int *num_partitions)
+{
+ int npartitions;
+ Size partition_mem;
+
+ /* no attempt to obey work_mem */
+ if (hashagg_mem_overflow)
+ {
+ *mem_limit = SIZE_MAX;
+ *ngroups_limit = LONG_MAX;
+ return;
+ }
+
+ /* if not expected to spill, use all of work_mem */
+ if (input_groups * hashentrysize < work_mem * 1024L)
+ {
+ *mem_limit = work_mem * 1024L;
+ *ngroups_limit = *mem_limit / hashentrysize;
+ return;
+ }
+
+ /*
+ * Calculate expected memory requirements for spilling, which is the size
+ * of the buffers needed for all the tapes that need to be open at
+ * once. Then, subtract that from the memory available for holding hash
+ * tables.
+ */
+ npartitions = hash_choose_num_partitions(input_groups,
+ hashentrysize,
+ used_bits,
+ NULL);
+ if (num_partitions != NULL)
+ *num_partitions = npartitions;
+
+ partition_mem =
+ HASHAGG_READ_BUFFER_SIZE +
+ HASHAGG_WRITE_BUFFER_SIZE * npartitions;
+
+ /*
+ * Don't set the limit below 3/4 of work_mem. In that case, we are at the
+ * minimum number of partitions, so we aren't going to dramatically exceed
+ * work mem anyway.
+ */
+ if (work_mem * 1024L > 4 * partition_mem)
+ *mem_limit = work_mem * 1024L - partition_mem;
+ else
+ *mem_limit = work_mem * 1024L * 0.75;
+
+ if (*mem_limit > hashentrysize)
+ *ngroups_limit = *mem_limit / hashentrysize;
+ else
+ *ngroups_limit = 1;
+}
+
+/*
+ * hash_agg_check_limits
+ *
+ * After adding a new group to the hash table, check whether we need to enter
+ * spill mode. Allocations may happen without adding new groups (for instance,
+ * if the transition state size grows), so this check is imperfect.
+ *
+ * Memory usage is tracked by how much is allocated to the underlying memory
+ * context, not individual chunks. This is more accurate because it accounts
+ * for all memory in the context, and also accounts for fragmentation and
+ * other forms of overhead and waste that can be difficult to estimate. It's
+ * also cheaper because we don't have to track each chunk.
+ *
+ * When memory is first allocated to a memory context, it is not actually
+ * used. So when the next allocation happens, we consider the
+ * previously-allocated amount to be the memory currently used.
+ */
+static void
+hash_agg_check_limits(AggState *aggstate)
+{
+ Size allocation;
+
+ /*
+ * Even if already in spill mode, it's possible for memory usage to grow,
+ * and we should still track it for the purposes of EXPLAIN ANALYZE.
+ */
+ allocation = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+
+ /* has allocation grown since the last observation? */
+ if (allocation > aggstate->hash_alloc_current)
+ {
+ aggstate->hash_alloc_last = aggstate->hash_alloc_current;
+ aggstate->hash_alloc_current = allocation;
+ }
+
+ if (aggstate->hash_alloc_last > aggstate->hash_mem_peak)
+ aggstate->hash_mem_peak = aggstate->hash_alloc_last;
+
+ /*
+ * Don't spill unless there's at least one group in the hash table so we
+ * can be sure to make progress even in edge cases.
+ */
+ if (aggstate->hash_ngroups_current > 0 &&
+ (aggstate->hash_alloc_last > aggstate->hash_mem_limit ||
+ aggstate->hash_ngroups_current > aggstate->hash_ngroups_limit))
+ {
+ aggstate->hash_spill_mode = true;
+
+ if (!aggstate->hash_ever_spilled)
+ {
+ aggstate->hash_ever_spilled = true;
+ aggstate->hash_spills = palloc0(
+ sizeof(HashAggSpill) * aggstate->num_hashes);
+ aggstate->hash_tapeinfo = palloc0(sizeof(HashTapeInfo));
+ hashagg_recompile_expressions(aggstate);
+ }
+ }
+}
+
+/*
+ * Choose a reasonable number of buckets for the initial hash table size.
+ */
+static long
+hash_choose_num_buckets(AggState *aggstate, long ngroups, Size memory)
+{
+ long max_nbuckets;
+ long nbuckets = ngroups;
+
+ max_nbuckets = memory / aggstate->hashentrysize;
+
+ /*
+ * Leave room for slop to avoid a case where the initial hash table size
+ * exceeds the memory limit (though that may still happen in edge cases).
+ */
+ max_nbuckets *= 0.75;
+
+ if (nbuckets > max_nbuckets)
+ nbuckets = max_nbuckets;
+ if (nbuckets < HASHAGG_MIN_BUCKETS)
+ nbuckets = HASHAGG_MIN_BUCKETS;
+ return nbuckets;
+}
+
+/*
+ * Determine the number of partitions to create when spilling, which will
+ * always be a power of two. If log2_npartitions is non-NULL, set
+ * *log2_npartitions to the log2() of the number of partitions.
+ */
+static int
+hash_choose_num_partitions(uint64 input_groups, double hashentrysize,
+ int used_bits, int *log2_npartitions)
+{
+ Size mem_wanted;
+ int partition_limit;
+ int npartitions;
+ int partition_bits;
+
+ /*
+ * Avoid creating so many partitions that the memory requirements of the
+ * open partition files are greater than 1/4 of work_mem.
+ */
+ partition_limit =
+ (work_mem * 1024L * 0.25 - HASHAGG_READ_BUFFER_SIZE) /
+ HASHAGG_WRITE_BUFFER_SIZE;
+
+ /* pessimistically estimate that each input tuple creates a new group */
+ mem_wanted = HASHAGG_PARTITION_FACTOR * input_groups * hashentrysize;
+
+ /* make enough partitions so that each one is likely to fit in memory */
+ npartitions = 1 + (mem_wanted / (work_mem * 1024L));
+
+ if (npartitions > partition_limit)
+ npartitions = partition_limit;
+
+ if (npartitions < HASHAGG_MIN_PARTITIONS)
+ npartitions = HASHAGG_MIN_PARTITIONS;
+ if (npartitions > HASHAGG_MAX_PARTITIONS)
+ npartitions = HASHAGG_MAX_PARTITIONS;
+
+ /* ceil(log2(npartitions)) */
+ partition_bits = my_log2(npartitions);
+
+ /* make sure that we don't exhaust the hash bits */
+ if (partition_bits + used_bits >= 32)
+ partition_bits = 32 - used_bits;
+
+ if (log2_npartitions != NULL)
+ *log2_npartitions = partition_bits;
+
+ /* number of partitions will be a power of two */
+ npartitions = 1L << partition_bits;
+
+ return npartitions;
+}
+
/*
* Find or create a hashtable entry for the tuple group containing the current
* tuple (already set in tmpcontext's outertuple slot), in the current grouping
* set (which the caller must have selected - note that initialize_aggregate
* depends on this).
*
- * When called, CurrentMemoryContext should be the per-query context. The
- * already-calculated hash value for the tuple must be specified.
+ * When called, CurrentMemoryContext should be the per-query context.
+ *
+ * If the hash table is at the memory limit, then only find existing hashtable
+ * entries; don't create new ones. If a tuple's group is not already present
+ * in the hash table for the current grouping set, return NULL and the caller
+ * will spill it to disk.
*/
static AggStatePerGroup
lookup_hash_entry(AggState *aggstate, uint32 hash)
@@ -1496,16 +1889,27 @@ lookup_hash_entry(AggState *aggstate, uint32 hash)
AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
TupleTableSlot *hashslot = perhash->hashslot;
TupleHashEntryData *entry;
- bool isnew;
+ bool isnew = false;
+ bool *p_isnew;
+
+ /* if hash table already spilled, don't create new entries */
+ p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;
/* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, &isnew,
+ entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, p_isnew,
hash);
+ if (entry == NULL)
+ return NULL;
+
if (isnew)
{
- AggStatePerGroup pergroup;
- int transno;
+ AggStatePerGroup pergroup;
+ int transno;
+
+ aggstate->hash_ngroups_current++;
+ if (!hashagg_mem_overflow)
+ hash_agg_check_limits(aggstate);
pergroup = (AggStatePerGroup)
MemoryContextAlloc(perhash->hashtable->tablecxt,
@@ -1533,23 +1937,51 @@ lookup_hash_entry(AggState *aggstate, uint32 hash)
* returning an array of pergroup pointers suitable for advance_aggregates.
*
* Be aware that lookup_hash_entry can reset the tmpcontext.
+ *
+ * Some entries may be left NULL if we have reached the limit and have begun
+ * to spill. The same tuple will belong to different groups for each set, so
+ * may match a group already in memory for one set and match a group not in
+ * memory for another set. If we have begun to spill and a tuple doesn't match
+ * a group in memory for a particular set, it will be spilled.
+ *
+ * NB: It's possible to spill the same tuple for several different grouping
+ * sets. This may seem wasteful, but it's actually a trade-off: if we spill
+ * the tuple multiple times for multiple grouping sets, it can be partitioned
+ * for each grouping set, making the refilling of the hash table very
+ * efficient.
*/
static void
lookup_hash_entries(AggState *aggstate)
{
- int numHashes = aggstate->num_hashes;
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
int setno;
- for (setno = 0; setno < numHashes; setno++)
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
{
- AggStatePerHash perhash = &aggstate->perhash[setno];
+ AggStatePerHash perhash = &aggstate->perhash[setno];
uint32 hash;
select_current_set(aggstate, setno, true);
prepare_hash_slot(aggstate);
hash = TupleHashTableHash(perhash->hashtable, perhash->hashslot);
pergroup[setno] = lookup_hash_entry(aggstate, hash);
+
+ /* check to see if we need to spill the tuple for this grouping set */
+ if (pergroup[setno] == NULL)
+ {
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ TupleTableSlot *slot = aggstate->tmpcontext->ecxt_outertuple;
+
+ if (spill->partitions == NULL)
+ hashagg_spill_init(spill, aggstate->hash_tapeinfo, 0,
+ perhash->aggnode->numGroups,
+ aggstate->hashentrysize);
+
+ hashagg_spill_tuple(spill, slot, hash);
+
+ aggstate->hash_disk_used = LogicalTapeSetBlocks(
+ aggstate->hash_tapeinfo->tapeset) * (BLCKSZ / 1024);
+ }
}
}
@@ -1872,6 +2304,12 @@ agg_retrieve_direct(AggState *aggstate)
if (TupIsNull(outerslot))
{
/* no more outer-plan tuples available */
+
+ /* if we built hash tables, finalize any spills */
+ if (aggstate->aggstrategy == AGG_MIXED &&
+ aggstate->current_phase == 1)
+ hashagg_finish_initial_spills(aggstate);
+
if (hasGroupingSets)
{
aggstate->input_done = true;
@@ -1974,6 +2412,9 @@ agg_fill_hash_table(AggState *aggstate)
ResetExprContext(aggstate->tmpcontext);
}
+ /* finalize spills, if any */
+ hashagg_finish_initial_spills(aggstate);
+
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
@@ -1981,11 +2422,196 @@ agg_fill_hash_table(AggState *aggstate)
&aggstate->perhash[0].hashiter);
}
+/*
+ * If any data was spilled during hash aggregation, reset the hash table and
+ * reprocess one batch of spilled data. After reprocessing a batch, the hash
+ * table will again contain data, ready to be consumed by
+ * agg_retrieve_hash_table_in_memory().
+ *
+ * Should only be called after all in memory hash table entries have been
+ * consumed.
+ *
+ * Return false when input is exhausted and there's no more work to be done;
+ * otherwise return true.
+ */
+static bool
+agg_refill_hash_table(AggState *aggstate)
+{
+ HashAggBatch *batch;
+ HashAggSpill spill;
+ HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
+ long nbuckets;
+ int setno;
+
+ if (aggstate->hash_batches == NIL)
+ return false;
+
+ spill.npartitions = 0;
+ spill.partitions = NULL;
+ /*
+ * Each spill file contains spilled data for only a single grouping
+ * set. We want to ignore all others, which is done by setting the other
+ * pergroups to NULL.
+ */
+ memset(aggstate->all_pergroups, 0,
+ sizeof(AggStatePerGroup) *
+ (aggstate->maxsets + aggstate->num_hashes));
+
+ batch = linitial(aggstate->hash_batches);
+ aggstate->hash_batches = list_delete_first(aggstate->hash_batches);
+
+ hash_agg_set_limits(aggstate->hashentrysize, batch->input_tuples,
+ batch->used_bits, &aggstate->hash_mem_limit,
+ &aggstate->hash_ngroups_limit, NULL);
+
+ /*
+ * Free memory and rebuild a single hash table for this batch's grouping
+ * set. Estimate the number of groups to be the number of input tuples in
+ * this batch.
+ */
+ ReScanExprContext(aggstate->hashcontext);
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ aggstate->perhash[setno].hashtable = NULL;
+
+ nbuckets = hash_choose_num_buckets(
+ aggstate, batch->input_tuples, aggstate->hash_mem_limit);
+ build_hash_table(aggstate, batch->setno, nbuckets);
+ aggstate->hash_alloc_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_alloc_last = aggstate->hash_alloc_current;
+ aggstate->hash_ngroups_current = 0;
+
+ Assert(aggstate->current_phase == 0);
+
+ if (aggstate->phase->aggstrategy == AGG_MIXED)
+ {
+ aggstate->current_phase = 1;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+ }
+
+ /*
+ * The first pass (agg_fill_hash_table) reads whatever kind of slot comes
+ * from the outer plan, and considers the slot fixed. But spilled tuples
+ * are always MinimalTuples, so if that's different from the outer plan we
+ * need to change it and recompile the aggregate expressions.
+ */
+ if (aggstate->ss.ps.outerops != &TTSOpsMinimalTuple)
+ {
+ aggstate->ss.ps.outerops = &TTSOpsMinimalTuple;
+ hashagg_recompile_expressions(aggstate);
+ }
+
+ LogicalTapeRewindForRead(tapeinfo->tapeset, batch->input_tapenum,
+ HASHAGG_READ_BUFFER_SIZE);
+ for (;;) {
+ TupleTableSlot *slot = aggstate->hash_spill_slot;
+ MinimalTuple tuple;
+ uint32 hash;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tuple = hashagg_batch_read(batch, &hash);
+ if (tuple == NULL)
+ break;
+
+ ExecStoreMinimalTuple(tuple, slot, true);
+ aggstate->tmpcontext->ecxt_outertuple = slot;
+
+ select_current_set(aggstate, batch->setno, true);
+ prepare_hash_slot(aggstate);
+ aggstate->hash_pergroup[batch->setno] = lookup_hash_entry(aggstate, hash);
+
+ /* if there's no memory for a new group, spill */
+ if (aggstate->hash_pergroup[batch->setno] == NULL)
+ {
+ /*
+ * Estimate the number of groups for this batch as the total
+ * number of tuples in its input file. Although that's a worst
+ * case, it's not bad here for two reasons: (1) overestimating
+ * is better than underestimating; and (2) we've already
+ * scanned the relation once, so it's likely that we've
+ * already finalized many of the common values.
+ */
+ if (spill.partitions == NULL)
+ hashagg_spill_init(&spill, tapeinfo, batch->used_bits,
+ batch->input_tuples,
+ aggstate->hashentrysize);
+
+ hashagg_spill_tuple(&spill, slot, hash);
+
+ aggstate->hash_disk_used = LogicalTapeSetBlocks(
+ aggstate->hash_tapeinfo->tapeset) * (BLCKSZ / 1024);
+ }
+
+ /* Advance the aggregates (or combine functions) */
+ advance_aggregates(aggstate);
+
+ /*
+ * Reset per-input-tuple context after each tuple, but note that the
+ * hash lookups do this too
+ */
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ hashagg_tapeinfo_release(tapeinfo, batch->input_tapenum);
+
+ aggstate->current_phase = 0;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+
+ /* update hashentrysize estimate based on contents */
+ if (aggstate->hash_ngroups_current > 0)
+ {
+ aggstate->hashentrysize = (double)aggstate->hash_alloc_last /
+ (double)aggstate->hash_ngroups_current;
+ }
+
+ hashagg_spill_finish(aggstate, &spill, batch->setno);
+ aggstate->hash_spill_mode = false;
+
+ pfree(batch);
+
+ /* Initialize to walk the first hash table */
+ select_current_set(aggstate, 0, true);
+ ResetTupleHashIterator(aggstate->perhash[0].hashtable,
+ &aggstate->perhash[0].hashiter);
+
+ return true;
+}
+
/*
* ExecAgg for hashed case: retrieving groups from hash table
+ *
+ * After exhausting in-memory tuples, also try refilling the hash table using
+ * previously-spilled tuples. Only returns NULL after all in-memory and
+ * spilled tuples are exhausted.
*/
static TupleTableSlot *
agg_retrieve_hash_table(AggState *aggstate)
+{
+ TupleTableSlot *result = NULL;
+
+ while (result == NULL)
+ {
+ result = agg_retrieve_hash_table_in_memory(aggstate);
+ if (result == NULL)
+ {
+ if (!agg_refill_hash_table(aggstate))
+ {
+ aggstate->agg_done = true;
+ break;
+ }
+ }
+ }
+
+ return result;
+}
+
+/*
+ * Retrieve the groups from the in-memory hash tables without considering any
+ * spilled tuples.
+ */
+static TupleTableSlot *
+agg_retrieve_hash_table_in_memory(AggState *aggstate)
{
ExprContext *econtext;
AggStatePerAgg peragg;
@@ -2014,7 +2640,7 @@ agg_retrieve_hash_table(AggState *aggstate)
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
- while (!aggstate->agg_done)
+ for (;;)
{
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -2039,14 +2665,15 @@ agg_retrieve_hash_table(AggState *aggstate)
perhash = &aggstate->perhash[aggstate->current_set];
+ if (perhash->hashtable == NULL)
+ return NULL;
+
ResetTupleHashIterator(perhash->hashtable, &perhash->hashiter);
continue;
}
else
{
- /* No more hashtables, so done */
- aggstate->agg_done = true;
return NULL;
}
}
@@ -2103,6 +2730,296 @@ agg_retrieve_hash_table(AggState *aggstate)
return NULL;
}
+/*
+ * Assign unused tapes to spill partitions, extending the tape set if
+ * necessary.
+ */
+static void
+hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *partitions,
+ int npartitions)
+{
+ int partidx = 0;
+
+ /* use free tapes if available */
+ while (partidx < npartitions && tapeinfo->nfreetapes > 0)
+ partitions[partidx++] = tapeinfo->freetapes[--tapeinfo->nfreetapes];
+
+ if (tapeinfo->tapeset == NULL)
+ tapeinfo->tapeset = LogicalTapeSetCreate(npartitions, NULL, NULL, -1);
+ else if (partidx < npartitions)
+ {
+ tapeinfo->tapeset = LogicalTapeSetExtend(
+ tapeinfo->tapeset, npartitions - partidx);
+ }
+
+ while (partidx < npartitions)
+ partitions[partidx++] = tapeinfo->ntapes++;
+}
+
+/*
+ * After a tape has already been written to and then read, this function
+ * rewinds it for writing and adds it to the free list.
+ */
+static void
+hashagg_tapeinfo_release(HashTapeInfo *tapeinfo, int tapenum)
+{
+ LogicalTapeRewindForWrite(tapeinfo->tapeset, tapenum);
+ if (tapeinfo->freetapes == NULL)
+ tapeinfo->freetapes = palloc(sizeof(int));
+ else
+ tapeinfo->freetapes = repalloc(
+ tapeinfo->freetapes, sizeof(int) * (tapeinfo->nfreetapes + 1));
+ tapeinfo->freetapes[tapeinfo->nfreetapes++] = tapenum;
+}
+
+/*
+ * hashagg_spill_init
+ *
+ * Called after we determined that spilling is necessary. Chooses the number
+ * of partitions to create, and initializes them.
+ */
+static void
+hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo, int used_bits,
+ uint64 input_groups, double hashentrysize)
+{
+ int npartitions;
+ int partition_bits;
+
+ npartitions = hash_choose_num_partitions(
+ input_groups, hashentrysize, used_bits, &partition_bits);
+
+ spill->partitions = palloc0(sizeof(int) * npartitions);
+ spill->ntuples = palloc0(sizeof(int64) * npartitions);
+
+ hashagg_tapeinfo_assign(tapeinfo, spill->partitions, npartitions);
+
+ spill->tapeinfo = tapeinfo;
+ spill->shift = 32 - used_bits - partition_bits;
+ spill->mask = (npartitions - 1) << spill->shift;
+ spill->npartitions = npartitions;
+}
+
+/*
+ * hashagg_spill_tuple
+ *
+ * No room for new groups in the hash table. Save for later in the appropriate
+ * partition.
+ */
+static Size
+hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot, uint32 hash)
+{
+ LogicalTapeSet *tapeset = spill->tapeinfo->tapeset;
+ int partition;
+ MinimalTuple tuple;
+ int tapenum;
+ int total_written = 0;
+ bool shouldFree;
+
+ Assert(spill->partitions != NULL);
+
+ /* may contain unnecessary attributes, consider projecting? */
+ tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+ partition = (hash & spill->mask) >> spill->shift;
+ spill->ntuples[partition]++;
+
+ tapenum = spill->partitions[partition];
+
+ LogicalTapeWrite(tapeset, tapenum, (void *) &hash, sizeof(uint32));
+ total_written += sizeof(uint32);
+
+ LogicalTapeWrite(tapeset, tapenum, (void *) tuple, tuple->t_len);
+ total_written += tuple->t_len;
+
+ if (shouldFree)
+ pfree(tuple);
+
+ return total_written;
+}
+
+/*
+ * hashagg_batch_new
+ *
+ * Construct a HashAggBatch item, which represents one iteration of HashAgg to
+ * be done. Should be called in the aggregate's memory context.
+ */
+static HashAggBatch *
+hashagg_batch_new(HashTapeInfo *tapeinfo, int tapenum, int setno,
+ int64 input_tuples, int used_bits)
+{
+ HashAggBatch *batch = palloc0(sizeof(HashAggBatch));
+
+ batch->setno = setno;
+ batch->used_bits = used_bits;
+ batch->tapeinfo = tapeinfo;
+ batch->input_tapenum = tapenum;
+ batch->input_tuples = input_tuples;
+
+ return batch;
+}
+
+/*
+ * read_spilled_tuple
+ * read the next tuple from a batch file. Return NULL if no more.
+ */
+static MinimalTuple
+hashagg_batch_read(HashAggBatch *batch, uint32 *hashp)
+{
+ LogicalTapeSet *tapeset = batch->tapeinfo->tapeset;
+ int tapenum = batch->input_tapenum;
+ MinimalTuple tuple;
+ uint32 t_len;
+ size_t nread;
+ uint32 hash;
+
+ nread = LogicalTapeRead(tapeset, tapenum, &hash, sizeof(uint32));
+ if (nread == 0)
+ return NULL;
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, sizeof(uint32), nread)));
+ if (hashp != NULL)
+ *hashp = hash;
+
+ nread = LogicalTapeRead(tapeset, tapenum, &t_len, sizeof(t_len));
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, sizeof(uint32), nread)));
+
+ tuple = (MinimalTuple) palloc(t_len);
+ tuple->t_len = t_len;
+
+ nread = LogicalTapeRead(tapeset, tapenum,
+ (void *)((char *)tuple + sizeof(uint32)),
+ t_len - sizeof(uint32));
+ if (nread != t_len - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, t_len - sizeof(uint32), nread)));
+
+ return tuple;
+}
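
A self-contained sketch of the spill-file record layout that hashagg_spill_tuple and hashagg_batch_read agree on: a uint32 hash followed by a length-prefixed tuple, where the length is the tuple's own first field, so the reader fetches the length and then only the remaining bytes. FakeTuple and the FILE-based "tape" below are stand-ins invented for the sketch, not the real MinimalTuple or logtape API.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>

    /* stand-in for MinimalTuple: the first field is the record's total length */
    typedef struct FakeTuple
    {
        uint32_t    t_len;
        char        data[];
    } FakeTuple;

    int
    main(void)
    {
        const char *payload = "grouping key + other columns";
        uint32_t    hash = 0xCAFEF00Du;
        uint32_t    t_len = (uint32_t) (sizeof(uint32_t) + strlen(payload) + 1);
        FakeTuple  *tup = malloc(t_len);
        FILE       *tape = tmpfile();

        tup->t_len = t_len;
        memcpy(tup->data, payload, strlen(payload) + 1);

        /* spill: write the hash, then the whole tuple */
        fwrite(&hash, sizeof(uint32_t), 1, tape);
        fwrite(tup, tup->t_len, 1, tape);

        /* batch read: hash, then length, then the rest of the tuple */
        rewind(tape);
        uint32_t    rhash, rlen;
        fread(&rhash, sizeof(uint32_t), 1, tape);
        fread(&rlen, sizeof(uint32_t), 1, tape);

        FakeTuple  *rtup = malloc(rlen);
        rtup->t_len = rlen;
        fread((char *) rtup + sizeof(uint32_t), rlen - sizeof(uint32_t), 1, tape);

        printf("hash=%08x len=%u payload=\"%s\"\n", rhash, rtup->t_len, rtup->data);
        free(tup);
        free(rtup);
        fclose(tape);
        return 0;
    }
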
+
+/*
+ * hashagg_finish_initial_spills
+ *
+ * After a HashAggBatch has been processed, it may have spilled tuples to
+ * disk. If so, turn the spilled partitions into new batches that must later
+ * be executed.
+ */
+static void
+hashagg_finish_initial_spills(AggState *aggstate)
+{
+ int setno;
+
+ if (aggstate->hash_spills == NULL)
+ return;
+
+ /* update hashentrysize estimate based on contents */
+ Assert(aggstate->hash_ngroups_current > 0);
+ aggstate->hashentrysize = (double)aggstate->hash_alloc_last /
+ (double)aggstate->hash_ngroups_current;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hashagg_spill_finish(aggstate, &aggstate->hash_spills[setno], setno);
+
+ aggstate->hash_spill_mode = false;
+
+ /*
+ * We're not processing tuples from outer plan any more; only processing
+ * batches of spilled tuples. The initial spill structures are no longer
+ * needed.
+ */
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+}
+
+/*
+ * hashagg_spill_finish
+ *
+ * Transform spill partitions into new batches.
+ */
+static void
+hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno)
+{
+ int i;
+ int used_bits = 32 - spill->shift;
+
+ if (spill->npartitions == 0)
+ return; /* didn't spill */
+
+ for (i = 0; i < spill->npartitions; i++)
+ {
+ int tapenum = spill->partitions[i];
+ MemoryContext oldContext;
+ HashAggBatch *new_batch;
+
+ oldContext = MemoryContextSwitchTo(aggstate->ss.ps.state->es_query_cxt);
+ new_batch = hashagg_batch_new(aggstate->hash_tapeinfo,
+ tapenum, setno, spill->ntuples[i],
+ used_bits);
+ aggstate->hash_batches = lcons(new_batch, aggstate->hash_batches);
+ aggstate->hash_batches_used++;
+ MemoryContextSwitchTo(oldContext);
+ }
+
+ pfree(spill->ntuples);
+ pfree(spill->partitions);
+}
+
+/*
+ * Free resources related to a spilled HashAgg.
+ */
+static void
+hashagg_reset_spill_state(AggState *aggstate)
+{
+ ListCell *lc;
+
+ /* free spills from initial pass */
+ if (aggstate->hash_spills != NULL)
+ {
+ int setno;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ {
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ if (spill->ntuples != NULL)
+ pfree(spill->ntuples);
+ if (spill->partitions != NULL)
+ pfree(spill->partitions);
+ }
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+ }
+
+ /* free batches */
+ foreach(lc, aggstate->hash_batches)
+ {
+ HashAggBatch *batch = (HashAggBatch*) lfirst(lc);
+ pfree(batch);
+ }
+ list_free(aggstate->hash_batches);
+ aggstate->hash_batches = NIL;
+
+ /* close tape set */
+ if (aggstate->hash_tapeinfo != NULL)
+ {
+ HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
+ if (tapeinfo->tapeset != NULL)
+ LogicalTapeSetClose(tapeinfo->tapeset);
+ if (tapeinfo->freetapes != NULL)
+ pfree(tapeinfo->freetapes);
+ pfree(tapeinfo);
+ aggstate->hash_tapeinfo = NULL;
+ }
+}
+
+
/* -----------------
* ExecInitAgg
*
@@ -2287,6 +3204,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->ss.ps.outeropsfixed = false;
}
+ if (use_hashing)
+ aggstate->hash_spill_slot = ExecInitExtraTupleSlot(estate, scanDesc,
+ &TTSOpsMinimalTuple);
+
/*
* Initialize result type, slot and projection.
*/
@@ -2512,9 +3433,22 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
*/
if (use_hashing)
{
+ Plan *outerplan = outerPlan(node);
+ long totalGroups = 0;
+ int i;
+
/* this is an array of pointers, not structures */
aggstate->hash_pergroup = pergroups;
+ aggstate->hashentrysize = hash_agg_entry_size(
+ aggstate->numtrans, outerplan->plan_width, node->transitionSpace);
+
+ for (i = 0; i < aggstate->num_hashes; i++)
+ totalGroups += aggstate->perhash[i].aggnode->numGroups;
+
+ hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
+ &aggstate->hash_mem_limit,
+ &aggstate->hash_ngroups_limit, NULL);
find_hash_columns(aggstate);
build_hash_tables(aggstate);
aggstate->table_filled = false;
@@ -2922,7 +3856,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
else
Assert(false);
- phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash);
+ phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash,
+ false);
}
@@ -3417,6 +4352,8 @@ ExecEndAgg(AggState *node)
if (node->sort_out)
tuplesort_end(node->sort_out);
+ hashagg_reset_spill_state(node);
+
for (transno = 0; transno < node->numtrans; transno++)
{
AggStatePerTrans pertrans = &node->pertrans[transno];
@@ -3472,12 +4409,13 @@ ExecReScanAgg(AggState *node)
return;
/*
- * If we do have the hash table, and the subplan does not have any
- * parameter changes, and none of our own parameter changes affect
- * input expressions of the aggregated functions, then we can just
- * rescan the existing hash table; no need to build it again.
+ * If we do have the hash table, and it never spilled, and the subplan
+ * does not have any parameter changes, and none of our own parameter
+ * changes affect input expressions of the aggregated functions, then
+ * we can just rescan the existing hash table; no need to build it
+ * again.
*/
- if (outerPlan->chgParam == NULL &&
+ if (outerPlan->chgParam == NULL && !node->hash_ever_spilled &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
ResetTupleHashIterator(node->perhash[0].hashtable,
@@ -3534,11 +4472,33 @@ ExecReScanAgg(AggState *node)
*/
if (node->aggstrategy == AGG_HASHED || node->aggstrategy == AGG_MIXED)
{
+ const TupleTableSlotOps *outerops = ExecGetResultSlotOps(
+ outerPlanState(&node->ss), &node->ss.ps.outeropsfixed);
+
+ hashagg_reset_spill_state(node);
+
+ node->hash_ever_spilled = false;
+ node->hash_spill_mode = false;
+ node->hash_alloc_last = 0;
+ node->hash_alloc_current = 0;
+ node->hash_ngroups_current = 0;
+
+ /* reset stats */
+ node->hash_mem_peak = 0;
+ node->hash_disk_used = 0;
+ node->hash_batches_used = 0;
+
ReScanExprContext(node->hashcontext);
/* Rebuild an empty hash table */
build_hash_tables(node);
node->table_filled = false;
/* iterator will be reset when the table is filled */
+
+ if (node->ss.ps.outerops != outerops)
+ {
+ node->ss.ps.outerops = outerops;
+ hashagg_recompile_expressions(node);
+ }
}
if (node->aggstrategy != AGG_HASHED)
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index cea0d6fa5ce..7246fc2b33f 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -2047,6 +2047,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_INIT_TRANS:
+ case EEOP_AGG_INIT_TRANS_SPILLED:
{
AggStatePerTrans pertrans;
@@ -2056,6 +2057,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_allpergroupsp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_setoff,
v_transno;
@@ -2082,11 +2084,32 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_init_trans.setoff);
v_transno = l_int32_const(op->d.agg_init_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_notransvalue = l_bb_before_v(
+ opblocks[opno + 1], "op.%d.check_notransvalue", opno);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(
+ b, v_pergroup_allaggs, TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[opno + 1],
+ b_check_notransvalue);
+
+ LLVMPositionBuilderAtEnd(b, b_check_notransvalue);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_notransvalue =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_NOTRANSVALUE,
@@ -2143,6 +2166,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_STRICT_TRANS_CHECK:
+ case EEOP_AGG_STRICT_TRANS_CHECK_SPILLED:
{
LLVMValueRef v_setoff,
v_transno;
@@ -2152,6 +2176,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_transnull;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
int jumpnull = op->d.agg_strict_trans_check.jumpnull;
@@ -2171,11 +2196,32 @@ llvm_compile_expr(ExprState *state)
l_int32_const(op->d.agg_strict_trans_check.setoff);
v_transno =
l_int32_const(op->d.agg_strict_trans_check.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_transnull = l_bb_before_v(
+ opblocks[opno + 1], "op.%d.check_transnull", opno);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[jumpnull],
+ b_check_transnull);
+
+ LLVMPositionBuilderAtEnd(b, b_check_transnull);
+ }
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_transnull =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_TRANSVALUEISNULL,
@@ -2191,7 +2237,9 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_PLAIN_TRANS_BYVAL:
+ case EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED:
case EEOP_AGG_PLAIN_TRANS:
+ case EEOP_AGG_PLAIN_TRANS_SPILLED:
{
AggState *aggstate;
AggStatePerTrans pertrans;
@@ -2217,6 +2265,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_pertransp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_retval;
@@ -2244,10 +2293,33 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_trans.setoff);
v_transno = l_int32_const(op->d.agg_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED ||
+ opcode == EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_advance_transval = l_bb_before_v(
+ opblocks[opno + 1], "op.%d.advance_transval", opno);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[opno + 1],
+ b_advance_transval);
+
+ LLVMPositionBuilderAtEnd(b, b_advance_transval);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_fcinfo = l_ptr_const(fcinfo,
l_ptr(StructFunctionCallInfoData));
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b5a0033721f..8d58780bf6a 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -77,6 +77,7 @@
#include "access/htup_details.h"
#include "access/tsmapi.h"
#include "executor/executor.h"
+#include "executor/nodeAgg.h"
#include "executor/nodeHash.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -128,6 +129,7 @@ bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
bool enable_hashagg = true;
+bool enable_hashagg_spill = true;
bool enable_nestloop = true;
bool enable_material = true;
bool enable_mergejoin = true;
@@ -2153,7 +2155,7 @@ cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples)
+ double input_tuples, double input_width)
{
double output_tuples;
Cost startup_cost;
@@ -2219,21 +2221,88 @@ cost_agg(Path *path, PlannerInfo *root,
total_cost += aggcosts->finalCost.per_tuple * numGroups;
total_cost += cpu_tuple_cost * numGroups;
output_tuples = numGroups;
+
+ /*
+ * We don't need to compute the disk costs of hash aggregation here,
+ * because the planner does not choose hash aggregation for grouping
+ * sets that it doesn't expect to fit in memory.
+ */
}
else
{
+ double pages_written = 0.0;
+ double pages_read = 0.0;
+ double hashentrysize;
+ double nbatches;
+ Size mem_limit;
+ long ngroups_limit;
+ int num_partitions;
+
/* must be AGG_HASHED */
startup_cost = input_total_cost;
if (!enable_hashagg)
startup_cost += disable_cost;
startup_cost += aggcosts->transCost.startup;
startup_cost += aggcosts->transCost.per_tuple * input_tuples;
+ /* cost of computing hash value */
startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;
startup_cost += aggcosts->finalCost.startup;
+
total_cost = startup_cost;
total_cost += aggcosts->finalCost.per_tuple * numGroups;
+ /* cost of retrieving from hash table */
total_cost += cpu_tuple_cost * numGroups;
output_tuples = numGroups;
+
+ /*
+ * Estimate number of batches based on the computed limits. If less
+ * than or equal to one, all groups are expected to fit in memory;
+ * otherwise we expect to spill.
+ */
+ hashentrysize = hash_agg_entry_size(
+ aggcosts->numAggs, input_width, aggcosts->transitionSpace);
+ hash_agg_set_limits(hashentrysize, numGroups, 0, &mem_limit,
+ &ngroups_limit, &num_partitions);
+
+ nbatches = Max( (numGroups * hashentrysize) / mem_limit,
+ numGroups / ngroups_limit );
+
+ /*
+ * Estimate number of pages read and written. For each level of
+ * recursion, a tuple must be written and then later read.
+ */
+ if (!hashagg_mem_overflow && nbatches > 1.0)
+ {
+ double depth;
+ double pages;
+
+ pages = relation_byte_size(input_tuples, input_width) / BLCKSZ;
+
+ /*
+ * The number of partitions can change at different levels of
+ * recursion; but for the purposes of this calculation assume it
+ * stays constant.
+ */
+ depth = ceil( log(nbatches - 1) / log(num_partitions) );
+ pages_written = pages_read = pages * depth;
+ }
+
+ /*
+ * Add the disk costs of hash aggregation that spills to disk.
+ *
+ * Groups that go into the hash table stay in memory until finalized,
+ * so spilling and reprocessing tuples doesn't incur additional
+ * invocations of transCost or finalCost. Furthermore, the computed
+ * hash value is stored with the spilled tuples, so we don't incur
+ * extra invocations of the hash function.
+ *
+ * Hash Agg begins returning tuples after the first batch is
+ * complete. Accrue writes (spilled tuples) to startup_cost and to
+ * total_cost; accrue reads only to total_cost.
+ */
+ startup_cost += pages_written * random_page_cost;
+ total_cost += pages_written * random_page_cost;
+ total_cost += pages_read * seq_page_cost;
}
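
To get a feel for what the formulas above imply, here is a small standalone calculation; every input number is invented for illustration and is not something the planner would actually compute.

    #include <stdio.h>
    #include <math.h>

    int
    main(void)
    {
        double  numGroups = 1000000.0;          /* estimated groups */
        double  hashentrysize = 100.0;          /* bytes per group entry */
        double  mem_limit = 4.0 * 1024 * 1024;  /* memory for the hash table */
        double  ngroups_limit = 40000.0;        /* group-count limit */
        int     num_partitions = 4;             /* spill partitions per level */
        double  input_pages = 20000.0;          /* pages of input */

        double  nbatches = fmax((numGroups * hashentrysize) / mem_limit,
                                numGroups / ngroups_limit);
        double  depth = ceil(log(nbatches - 1) / log(num_partitions));

        printf("nbatches = %.0f, recursion depth = %.0f\n", nbatches, depth);
        printf("pages written = pages read = %.0f\n", input_pages * depth);
        return 0;
    }
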
/*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e048d200bb4..090919e39a0 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1644,6 +1644,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
NIL,
NIL,
best_path->path.rows,
+ 0,
subplan);
}
else
@@ -2096,6 +2097,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
NIL,
NIL,
best_path->numGroups,
+ best_path->transitionSpace,
subplan);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -2257,6 +2259,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
NIL,
rollup->numGroups,
+ best_path->transitionSpace,
sort_plan);
/*
@@ -2295,6 +2298,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
chain,
rollup->numGroups,
+ best_path->transitionSpace,
subplan);
/* Copy cost data from Path to Plan */
@@ -6192,8 +6196,8 @@ Agg *
make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree)
+ List *groupingSets, List *chain, double dNumGroups,
+ int32 transitionSpace, Plan *lefttree)
{
Agg *node = makeNode(Agg);
Plan *plan = &node->plan;
@@ -6209,6 +6213,7 @@ make_agg(List *tlist, List *qual,
node->grpOperators = grpOperators;
node->grpCollations = grpCollations;
node->numGroups = numGroups;
+ node->transitionSpace = transitionSpace;
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
node->groupingSets = groupingSets;
node->chain = chain;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b44efd6314c..913ad9335e5 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6528,7 +6528,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
* were unable to sort above, then we'd better generate a Path, so
* that we at least have one.
*/
- if (hashaggtablesize < work_mem * 1024L ||
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L ||
grouped_rel->pathlist == NIL)
{
/*
@@ -6561,7 +6562,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
agg_final_costs,
dNumGroups);
- if (hashaggtablesize < work_mem * 1024L)
+ if (enable_hashagg_spill ||
+ hashaggtablesize < work_mem * 1024L)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6830,7 +6832,7 @@ create_partial_grouping_paths(PlannerInfo *root,
* Tentatively produce a partial HashAgg Path, depending on if it
* looks as if the hash table will fit in work_mem.
*/
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_total_path != NULL)
{
add_path(partially_grouped_rel, (Path *)
@@ -6857,7 +6859,7 @@ create_partial_grouping_paths(PlannerInfo *root,
dNumPartialPartialGroups);
/* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_spill || hashaggtablesize < work_mem * 1024L) &&
cheapest_partial_path != NULL)
{
add_partial_path(partially_grouped_rel, (Path *)
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 1a23e18970d..951aed80e7a 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1072,7 +1072,7 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
numGroupCols, dNumGroups,
NIL,
input_path->startup_cost, input_path->total_cost,
- input_path->rows);
+ input_path->rows, input_path->pathtarget->width);
/*
* Now for the sorted case. Note that the input is *always* unsorted,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e6d08aede56..8ba8122ee2f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1704,7 +1704,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
NIL,
subpath->startup_cost,
subpath->total_cost,
- rel->rows);
+ rel->rows,
+ subpath->pathtarget->width);
}
if (sjinfo->semi_can_btree && sjinfo->semi_can_hash)
@@ -2949,6 +2950,7 @@ create_agg_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->aggsplit = aggsplit;
pathnode->numGroups = numGroups;
+ pathnode->transitionSpace = aggcosts ? aggcosts->transitionSpace : 0;
pathnode->groupClause = groupClause;
pathnode->qual = qual;
@@ -2957,7 +2959,7 @@ create_agg_path(PlannerInfo *root,
list_length(groupClause), numGroups,
qual,
subpath->startup_cost, subpath->total_cost,
- subpath->rows);
+ subpath->rows, subpath->pathtarget->width);
/* add tlist eval cost for each output row */
pathnode->path.startup_cost += target->cost.startup;
@@ -3036,6 +3038,7 @@ create_groupingsets_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->rollups = rollups;
pathnode->qual = having_qual;
+ pathnode->transitionSpace = agg_costs ? agg_costs->transitionSpace : 0;
Assert(rollups != NIL);
Assert(aggstrategy != AGG_PLAIN || list_length(rollups) == 1);
@@ -3067,7 +3070,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
subpath->startup_cost,
subpath->total_cost,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
is_first = false;
if (!rollup->is_hashed)
is_first_sort = false;
@@ -3090,7 +3094,8 @@ create_groupingsets_path(PlannerInfo *root,
rollup->numGroups,
having_qual,
0.0, 0.0,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
if (!rollup->is_hashed)
is_first_sort = false;
}
@@ -3115,7 +3120,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
sort_path.startup_cost,
sort_path.total_cost,
- sort_path.rows);
+ sort_path.rows,
+ subpath->pathtarget->width);
}
pathnode->path.total_cost += agg_path.total_cost;
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb196444198..1151b807418 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -120,6 +120,7 @@ bool enableFsync = true;
bool allowSystemTableMods = false;
int work_mem = 1024;
int maintenance_work_mem = 16384;
+bool hashagg_mem_overflow = false;
int max_parallel_maintenance_workers = 2;
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8228e1f3903..ed6737a8ac9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -998,6 +998,26 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_hashagg_spill", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of hashed aggregation plans that are expected to exceed work_mem."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_hashagg_spill,
+ true,
+ NULL, NULL, NULL
+ },
+ {
+ {"hashagg_mem_overflow", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables hashed aggregation to overflow work_mem at execution time."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &hashagg_mem_overflow,
+ false,
+ NULL, NULL, NULL
+ },
{
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of materialization."),
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 4f78b55fbaf..36104a73a75 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -201,6 +201,7 @@ static long ltsGetFreeBlock(LogicalTapeSet *lts);
static void ltsReleaseBlock(LogicalTapeSet *lts, long blocknum);
static void ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
SharedFileSet *fileset);
+static void ltsInitTape(LogicalTape *lt);
static void ltsInitReadBuffer(LogicalTapeSet *lts, LogicalTape *lt);
@@ -536,6 +537,30 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
lts->nHoleBlocks = lts->nBlocksAllocated - nphysicalblocks;
}
+/*
+ * Initialize per-tape struct. Note we allocate the I/O buffer and the first
+ * block for a tape only when it is first actually written to. This avoids
+ * wasting memory space when tuplesort.c overestimates the number of tapes
+ * needed.
+ */
+static void
+ltsInitTape(LogicalTape *lt)
+{
+ lt->writing = true;
+ lt->frozen = false;
+ lt->dirty = false;
+ lt->firstBlockNumber = -1L;
+ lt->curBlockNumber = -1L;
+ lt->nextBlockNumber = -1L;
+ lt->offsetBlockNumber = 0L;
+ lt->buffer = NULL;
+ lt->buffer_size = 0;
+ /* palloc() larger than MaxAllocSize would fail */
+ lt->max_size = MaxAllocSize;
+ lt->pos = 0;
+ lt->nbytes = 0;
+}
+
/*
* Lazily allocate and initialize the read buffer. This avoids waste when many
* tapes are open at once, but not all are active between rewinding and
@@ -579,7 +604,6 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
int worker)
{
LogicalTapeSet *lts;
- LogicalTape *lt;
int i;
/*
@@ -597,29 +621,8 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
lts->nFreeBlocks = 0;
lts->nTapes = ntapes;
- /*
- * Initialize per-tape structs. Note we allocate the I/O buffer and the
- * first block for a tape only when it is first actually written to. This
- * avoids wasting memory space when tuplesort.c overestimates the number
- * of tapes needed.
- */
for (i = 0; i < ntapes; i++)
- {
- lt = <s->tapes[i];
- lt->writing = true;
- lt->frozen = false;
- lt->dirty = false;
- lt->firstBlockNumber = -1L;
- lt->curBlockNumber = -1L;
- lt->nextBlockNumber = -1L;
- lt->offsetBlockNumber = 0L;
- lt->buffer = NULL;
- lt->buffer_size = 0;
- /* palloc() larger than MaxAllocSize would fail */
- lt->max_size = MaxAllocSize;
- lt->pos = 0;
- lt->nbytes = 0;
- }
+ ltsInitTape(<s->tapes[i]);
/*
* Create temp BufFile storage as required.
@@ -1004,6 +1007,29 @@ LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum, TapeShare *share)
}
}
+/*
+ * Add additional tapes to this tape set. Not intended to be used when any
+ * tapes are frozen.
+ */
+LogicalTapeSet *
+LogicalTapeSetExtend(LogicalTapeSet *lts, int nAdditional)
+{
+ int i;
+ int nTapesOrig = lts->nTapes;
+ Size newSize;
+
+ lts->nTapes += nAdditional;
+ newSize = offsetof(LogicalTapeSet, tapes) +
+ lts->nTapes * sizeof(LogicalTape);
+
+ lts = (LogicalTapeSet *) repalloc(lts, newSize);
+
+ for (i = nTapesOrig; i < lts->nTapes; i++)
+ ltsInitTape(<s->tapes[i]);
+
+ return lts;
+}
+
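
A miniature of the pattern LogicalTapeSetExtend relies on: the tape array is a flexible array member at the end of the set, so growing it means reallocating the whole struct, and callers must switch to the returned pointer. The TapeSet type and values here are invented for the sketch.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stddef.h>

    typedef struct Tape { long first_block; } Tape;

    typedef struct TapeSet
    {
        int     ntapes;
        Tape    tapes[];        /* flexible array member */
    } TapeSet;

    static TapeSet *
    tapeset_extend(TapeSet *ts, int nadditional)
    {
        int     newcount = ts->ntapes + nadditional;

        /* the struct may move; the caller must use the returned pointer */
        ts = realloc(ts, offsetof(TapeSet, tapes) + newcount * sizeof(Tape));
        for (int i = ts->ntapes; i < newcount; i++)
            ts->tapes[i].first_block = -1;
        ts->ntapes = newcount;
        return ts;
    }

    int
    main(void)
    {
        TapeSet *ts = malloc(offsetof(TapeSet, tapes) + 2 * sizeof(Tape));

        ts->ntapes = 2;
        ts->tapes[0].first_block = ts->tapes[1].first_block = -1;
        ts = tapeset_extend(ts, 3);
        printf("ntapes = %d\n", ts->ntapes);
        free(ts);
        return 0;
    }
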
/*
* Backspace the tape a given number of bytes. (We also support a more
* general seek interface, see below.)
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 73a2ca8c6dd..d70bc048c46 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -226,9 +226,13 @@ typedef enum ExprEvalOp
EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
EEOP_AGG_INIT_TRANS,
+ EEOP_AGG_INIT_TRANS_SPILLED,
EEOP_AGG_STRICT_TRANS_CHECK,
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
EEOP_AGG_PLAIN_TRANS_BYVAL,
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
EEOP_AGG_PLAIN_TRANS,
+ EEOP_AGG_PLAIN_TRANS_SPILLED,
EEOP_AGG_ORDERED_TRANS_DATUM,
EEOP_AGG_ORDERED_TRANS_TUPLE,
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 81fdfa4add3..d6eb2abb60b 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -255,7 +255,7 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
- bool doSort, bool doHash);
+ bool doSort, bool doHash, bool spilled);
extern ExprState *ExecBuildGroupingEqual(TupleDesc ldesc, TupleDesc rdesc,
const TupleTableSlotOps *lops, const TupleTableSlotOps *rops,
int numCols,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 264916f9a92..307987a45ab 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -311,5 +311,8 @@ extern void ExecReScanAgg(AggState *node);
extern Size hash_agg_entry_size(int numAggs, Size tupleWidth,
Size transitionSpace);
+extern void hash_agg_set_limits(double hashentrysize, uint64 input_groups,
+ int used_bits, Size *mem_limit,
+ long *ngroups_limit, int *num_partitions);
#endif /* NODEAGG_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f985453ec32..707a07a2de4 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -244,6 +244,7 @@ extern bool enableFsync;
extern PGDLLIMPORT bool allowSystemTableMods;
extern PGDLLIMPORT int work_mem;
extern PGDLLIMPORT int maintenance_work_mem;
+extern PGDLLIMPORT bool hashagg_mem_overflow;
extern PGDLLIMPORT int max_parallel_maintenance_workers;
extern int VacuumCostPageHit;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cd3ddf781f1..19b9cef42f6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2078,13 +2078,32 @@ typedef struct AggState
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
/* these fields are used in AGG_HASHED and AGG_MIXED modes: */
bool table_filled; /* hash table filled yet? */
- int num_hashes;
+ int num_hashes; /* number of hash tables active at once */
+ double hashentrysize; /* estimate revised during execution */
+ struct HashTapeInfo *hash_tapeinfo; /* metadata for spill tapes */
+ struct HashAggSpill *hash_spills; /* HashAggSpill for each hash table,
+ exists only during first pass if spilled */
+ TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
+ bool hash_ever_spilled; /* ever spilled during this execution? */
+ bool hash_spill_mode; /* we hit a limit during the current batch
+ and we must not create new groups */
+ Size hash_alloc_last; /* previous total memory allocation */
+ Size hash_alloc_current; /* current total memory allocation */
+ Size hash_mem_limit; /* limit before spilling hash table */
+ Size hash_mem_peak; /* peak hash table memory usage */
+ long hash_ngroups_current; /* number of groups currently in
+ memory in all hash tables */
+ long hash_ngroups_limit; /* limit before spilling hash table */
+ long hash_disk_used; /* kB of disk space used */
+ int hash_batches_used; /* batches used during entire execution */
+ List *hash_batches; /* hash batches remaining to be processed */
+
AggStatePerHash perhash; /* array of per-hashtable data */
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 34
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 49
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 3d3be197e0e..be592d0fee4 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1663,6 +1663,7 @@ typedef struct AggPath
AggStrategy aggstrategy; /* basic strategy, see nodes.h */
AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
double numGroups; /* estimated number of groups in input */
+ int32 transitionSpace; /* estimated transition state size */
List *groupClause; /* a list of SortGroupClause's */
List *qual; /* quals (HAVING quals), if any */
} AggPath;
@@ -1700,6 +1701,7 @@ typedef struct GroupingSetsPath
AggStrategy aggstrategy; /* basic strategy */
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
+ int32 transitionSpace; /* estimated transition state size */
} GroupingSetsPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 32c0d87f80e..f4183e1efa5 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -813,6 +813,7 @@ typedef struct Agg
Oid *grpOperators; /* equality operators to compare with */
Oid *grpCollations;
long numGroups; /* estimated number of groups in input */
+ int32 transitionSpace; /* estimated transition state size */
Bitmapset *aggParams; /* IDs of Params used in Aggref inputs */
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
List *groupingSets; /* grouping sets to use */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index cb012ba1980..6572dc24699 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -54,6 +54,7 @@ extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
extern PGDLLIMPORT bool enable_hashagg;
+extern PGDLLIMPORT bool enable_hashagg_spill;
extern PGDLLIMPORT bool enable_nestloop;
extern PGDLLIMPORT bool enable_material;
extern PGDLLIMPORT bool enable_mergejoin;
@@ -114,7 +115,7 @@ extern void cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples);
+ double input_tuples, double input_width);
extern void cost_windowagg(Path *path, PlannerInfo *root,
List *windowFuncs, int numPartCols, int numOrderCols,
Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index eab486a6214..c7bda2b0917 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -54,8 +54,8 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree);
+ List *groupingSets, List *chain, double dNumGroups,
+ int32 transitionSpace, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
/*
diff --git a/src/include/utils/logtape.h b/src/include/utils/logtape.h
index 695d2c00ee4..3ebe52239f8 100644
--- a/src/include/utils/logtape.h
+++ b/src/include/utils/logtape.h
@@ -67,6 +67,8 @@ extern void LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum,
extern void LogicalTapeRewindForWrite(LogicalTapeSet *lts, int tapenum);
extern void LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum,
TapeShare *share);
+extern LogicalTapeSet *LogicalTapeSetExtend(LogicalTapeSet *lts,
+ int nAdditional);
extern size_t LogicalTapeBackspace(LogicalTapeSet *lts, int tapenum,
size_t size);
extern void LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index f457b5b150f..7eeeaaa5e4a 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2357,3 +2357,124 @@ explain (costs off)
-> Seq Scan on onek
(8 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+set work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------------
+ GroupAggregate
+ Group Key: ((g % 100000))
+ -> Sort
+ Sort Key: ((g % 100000))
+ -> Function Scan on generate_series g
+(5 rows)
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+create table agg_group_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+set jit_above_cost to default;
+create table agg_group_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_group_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+-- Produce results with hash aggregation
+set enable_hashagg = true;
+set enable_sort = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 100000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+create table agg_hash_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+set jit_above_cost to default;
+create table agg_hash_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_hash_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+set enable_sort = true;
+set work_mem to default;
+-- Compare group aggregation results to hash aggregation results
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+ a | c1 | c2 | c3
+---+----+----+----
+(0 rows)
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_4 except select * from agg_group_4)
+ union all
+(select * from agg_group_4 except select * from agg_hash_4);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_group_4;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
+drop table agg_hash_4;
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index c1f802c88a7..767f60a96c7 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1633,4 +1633,127 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
| 1 | 2
(4 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+-- Produce results with hash aggregation.
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+set enable_sort = true;
+set work_mem to default;
+-- Compare results
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+ g100 | g10 | unnest | c | m
+------+-----+--------+---+---
+(0 rows)
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
-- end
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1de..11c6f50fbfa 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -148,6 +148,68 @@ SELECT count(*) FROM
4
(1 row)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+SET enable_hashagg=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------------
+ Unique
+ -> Sort
+ Sort Key: ((g % 1000))
+ -> Function Scan on generate_series g
+(4 rows)
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_hashagg=TRUE;
+-- Produce results with hash aggregation.
+SET enable_sort=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 1000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_sort=TRUE;
+SET work_mem TO DEFAULT;
+-- Compare results
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb9057..c40bf6c16eb 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -75,6 +75,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
+ enable_hashagg_spill | on
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 3e593f2d615..a4d476c4bb3 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1032,3 +1032,119 @@ select v||'a', case when v||'a' = 'aa' then 1 else 0 end, count(*)
explain (costs off)
select 1 from tenk1
where (hundred, thousand) in (select twothousand, twothousand from onek);
+
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+set work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+
+set jit_above_cost to default;
+
+create table agg_group_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_group_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+-- Produce results with hash aggregation
+
+set enable_hashagg = true;
+set enable_sort = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+
+set jit_above_cost to default;
+
+create table agg_hash_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_hash_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare group aggregation results to hash aggregation results
+
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+
+(select * from agg_hash_4 except select * from agg_group_4)
+ union all
+(select * from agg_group_4 except select * from agg_hash_4);
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_group_4;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
+drop table agg_hash_4;
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 95ac3fb52f6..bf8bce6ed31 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -441,4 +441,103 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
from unnest(array[1,1], array['a','b']) u(i,v)
group by rollup(i, v||'a') order by 1,3;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+-- Produce results with hash aggregation.
+
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare results
+
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
+
-- end
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449e..33102744ebf 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -45,6 +45,68 @@ SELECT count(*) FROM
SELECT count(*) FROM
(SELECT DISTINCT two, four, two FROM tenk1) ss;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+SET enable_hashagg=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_hashagg=TRUE;
+
+-- Produce results with hash aggregation.
+
+SET enable_sort=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_sort=TRUE;
+
+SET work_mem TO DEFAULT;
+
+-- Compare results
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
+
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
Hi,
On 2020-02-19 20:16:36 +0100, Tomas Vondra wrote:
3) I wonder if we need to invent new opcodes? Wouldn't it be simpler to
just add a new flag to the agg_* structs instead? I haven't tried hacking
this, so maybe it's a silly idea.
New opcodes don't really cost that much - it's a jump table based
dispatch already (yes, it increases the table size slightly, but not by
much). But adding branches inside opcode implementation does add cost -
and we're already bottlenecked by stalls.
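
To illustrate the trade-off: in a jump-table interpreter, a separate *_SPILLED opcode only adds a dispatch-table entry, while folding the spill check into the shared handler would put a branch on every execution. The toy interpreter below is invented for the illustration and is not the actual executor code.

    #include <stdio.h>
    #include <stdbool.h>

    typedef enum { OP_TRANS, OP_TRANS_SPILLED, OP_DONE } OpCode;

    static void
    run(const OpCode *program, long *acc, bool group_in_memory)
    {
        for (int i = 0;; i++)
        {
            switch (program[i])         /* compiles to a jump table */
            {
                case OP_TRANS:
                    *acc += 1;          /* no spill-related branch at all */
                    break;
                case OP_TRANS_SPILLED:
                    if (group_in_memory)    /* only this variant pays the check */
                        *acc += 1;
                    break;
                case OP_DONE:
                    return;
            }
        }
    }

    int
    main(void)
    {
        OpCode  fast[] = {OP_TRANS, OP_DONE};
        OpCode  spilled[] = {OP_TRANS_SPILLED, OP_DONE};
        long    acc = 0;

        run(fast, &acc, true);
        run(spilled, &acc, false);
        printf("acc = %ld\n", acc);
        return 0;
    }
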
I assume code duplication is your primary concern here?
If so, I think patch 0008 in
/messages/by-id/20191023163849.sosqbfs5yenocez3@alap3.anarazel.de
would improve the situation. I'll try to rebase that onto master.
I'd also like to apply something like 0013 from that thread, I find the
whole curperagg, select_current_set, curaggcontext logic confusing as
hell. I'd so far planned to put this on the backburner until this patch
has been committed, to avoid breaking it. But perhaps that's not the
right call?
Greetings,
Andres Freund
On Fri, 2020-02-21 at 12:22 -0800, Andres Freund wrote:
I'd also like to apply something like 0013 from that thread, I find the
whole curperagg, select_current_set, curaggcontext logic confusing as
hell. I'd so far planned to put this on the backburner until this patch
has been committed, to avoid breaking it. But perhaps that's not the
right call?
At least for now, I appreciate you holding off on those a bit.
Regards,
Jeff Davis
Hi,
On 2020-02-22 09:55:26 -0800, Jeff Davis wrote:
On Fri, 2020-02-21 at 12:22 -0800, Andres Freund wrote:
I'd also like to apply something like 0013 from that thread, I find the
whole curperagg, select_current_set, curaggcontext logic confusing as
hell. I'd so far planned to put this on the backburner until this patch
has been committed, to avoid breaking it. But perhaps that's not the
right call?

At least for now, I appreciate you holding off on those a bit.
Both patches, or just 0013? Seems the earlier one might make adding the
new opcodes less verbose?
Greetings,
Andres Freund
On Sat, 2020-02-22 at 10:00 -0800, Andres Freund wrote:
Both patches, or just 0013? Seems the earlier one might make the
addition of the opcodes you add less verbose?
Just 0013, thank you. 0008 looks like it will simplify things.
Regards,
Jeff Davis
On Wed, 2020-02-19 at 20:16 +0100, Tomas Vondra wrote:
5) Assert(nbuckets > 0);
...
6) Another thing that occurred to me was what happens to grouping sets,
which we can't spill to disk. So I did this:
...
which fails with segfault at execution time:
The biggest problem was that my grouping sets test was not testing
multiple hash tables spilling, so a couple bugs crept in. I fixed them,
thank you.
To fix the tests, I also had to fix the GUCs and the way the planner
uses them with my patch. In master, grouping sets are planned by
generating a path that tries to do as many grouping sets with hashing
as possible (limited by work_mem). But with my patch, the notion of
fitting hash tables in work_mem is not necessarily important. If we
ignore work_mem during path generation entirely (and only consider it
during costing and execution), it will change quite a few plans and
undermine the concept of mixed aggregates. That may be a good
thing to do eventually as a simplification, but for now it seems like
too much, so I am preserving the notion of trying to fit hash tables in
work_mem to create mixed aggregates.
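For illustration only (this sketch is not part of the patch; the query
and settings here are made up), the effect can be seen by varying
work_mem around a grouping-sets query and comparing the plans:

SET work_mem = '64kB';
EXPLAIN (COSTS OFF)
SELECT a, b, count(*)
FROM (SELECT g % 10 AS a, g % 100 AS b FROM generate_series(0, 9999) g) s
GROUP BY GROUPING SETS ((a), (b), (a, b));
SET work_mem TO DEFAULT;

With a small work_mem, path generation only hashes as many grouping
sets as it believes will fit and sorts the rest (a mixed aggregate);
with a larger work_mem it can hash all of them.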
But that created the testing problem: I need a reliable way to get
grouping sets with several hash tables in memory that are all spilling,
but the planner is trying to avoid exactly that. So, I am introducing a
new GUC called enable_groupingsets_hash_disk (better name suggestions
welcome), defaulting it to "off" (and turned on during the test).
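As a rough sketch of the kind of query the test needs (the real test is
in the attached patch; this example is made up), the point is to get
several hash tables built at once under a tiny work_mem, with the new
GUC turned on so the planner doesn't avoid that shape:

SET enable_groupingsets_hash_disk = on;
SET work_mem = '64kB';
SELECT a, b, count(*)
FROM (SELECT g % 1000 AS a, g % 100 AS b FROM generate_series(0, 99999) g) s
GROUP BY GROUPING SETS ((a), (b), (a, b));
SET work_mem TO DEFAULT;
SET enable_groupingsets_hash_disk TO DEFAULT;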
Additionally, I removed the other GUCs I introduced in earlier versions
of this patch. They were basically designed around the idea of
reverting to the previous hash aggregation behavior if desired (by
setting enable_hashagg_spill=false and hashagg_mem_overflow=true). That
makes some sense, but it was already covered pretty well by existing
GUCs. If you want to use HashAgg without spilling, just set work_mem
higher; and if you want to prevent the planner from choosing HashAgg at
all, set enable_hashagg=false. So I just got rid of
enable_hashagg_spill and hashagg_mem_overflow.
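In other words, using only existing GUCs (the values here are
arbitrary):

-- keep HashAgg from spilling by giving it enough memory:
SET work_mem = '1GB';
-- or keep the planner from picking HashAgg at all:
SET enable_hashagg = off;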
I didn't forget about your explain-related suggestions. I'll address
them in the next patch.
Regards,
Jeff Davis
Attachments:
hashagg-20200222.patch (text/x-patch)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c1128f89ec7..edfec0362e1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4476,6 +4476,24 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-groupingsets-hash-disk" xreflabel="enable_groupingsets_hash_disk">
+ <term><varname>enable_groupingsets_hash_disk</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_groupingsets_hash_disk</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation for
+ grouping sets when the size of the hash tables is expected to exceed
+ <varname>work_mem</varname>. See <xref
+ linkend="queries-grouping-sets"/>. Note that this setting only
+ affects the chosen plan; execution time may still require using
+ disk-based hash aggregation. The default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
<term><varname>enable_hashjoin</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d901dc4a50e..2923f4ba46d 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -104,6 +104,7 @@ static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_hashagg_info(AggState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
static void show_instrumentation_count(const char *qlabel, int which,
@@ -1882,6 +1883,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Agg:
show_agg_keys(castNode(AggState, planstate), ancestors, es);
show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+ if (es->analyze)
+ show_hashagg_info((AggState *) planstate, es);
if (plan->qual)
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
@@ -2769,6 +2772,55 @@ show_hash_info(HashState *hashstate, ExplainState *es)
}
}
+/*
+ * If EXPLAIN ANALYZE, show information on hash aggregate memory usage and
+ * batches.
+ */
+static void
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+
+ Assert(IsA(aggstate, AggState));
+
+ if (agg->aggstrategy != AGG_HASHED &&
+ agg->aggstrategy != AGG_MIXED)
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(
+ es->str,
+ "Memory Usage: %ldkB",
+ memPeakKb);
+
+ if (aggstate->hash_batches_used > 0)
+ {
+ appendStringInfo(
+ es->str,
+ " Batches: %d Disk: %ldkB",
+ aggstate->hash_batches_used, aggstate->hash_disk_used);
+ }
+
+ appendStringInfo(
+ es->str,
+ "\n");
+ }
+ else
+ {
+ ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
+ if (aggstate->hash_batches_used > 0)
+ {
+ ExplainPropertyInteger("HashAgg Batches", NULL,
+ aggstate->hash_batches_used, es);
+ ExplainPropertyInteger("Disk Usage", "kB",
+ aggstate->hash_disk_used, es);
+ }
+ }
+}
+
/*
* If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
*/
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 121eff97a0c..5236d6f3935 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -79,7 +79,8 @@ static void ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash);
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled);
/*
@@ -2924,10 +2925,13 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
* check for filters, evaluate aggregate input, check that that input is not
* NULL for a strict transition function, and then finally invoke the
* transition for each of the concurrently computed grouping sets.
+ *
+ * If "spilled" is true, the generated code will take into account the
+ * possibility that a Hash Aggregation has spilled to disk.
*/
ExprState *
ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
- bool doSort, bool doHash)
+ bool doSort, bool doHash, bool spilled)
{
ExprState *state = makeNode(ExprState);
PlanState *parent = &aggstate->ss.ps;
@@ -3158,7 +3162,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (int setno = 0; setno < processGroupingSets; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, false);
+ pertrans, transno, setno, setoff, false,
+ spilled);
setoff++;
}
}
@@ -3177,7 +3182,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (int setno = 0; setno < numHashes; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, true);
+ pertrans, transno, setno, setoff, true,
+ spilled);
setoff++;
}
}
@@ -3227,7 +3233,8 @@ static void
ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash)
+ int transno, int setno, int setoff, bool ishash,
+ bool spilled)
{
int adjust_init_jumpnull = -1;
int adjust_strict_jumpnull = -1;
@@ -3249,7 +3256,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
fcinfo->flinfo->fn_strict &&
pertrans->initValueIsNull)
{
- scratch->opcode = EEOP_AGG_INIT_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_INIT_TRANS_SPILLED : EEOP_AGG_INIT_TRANS;
scratch->d.agg_init_trans.pertrans = pertrans;
scratch->d.agg_init_trans.setno = setno;
scratch->d.agg_init_trans.setoff = setoff;
@@ -3265,7 +3273,8 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
if (pertrans->numSortCols == 0 &&
fcinfo->flinfo->fn_strict)
{
- scratch->opcode = EEOP_AGG_STRICT_TRANS_CHECK;
+ scratch->opcode = spilled ?
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED : EEOP_AGG_STRICT_TRANS_CHECK;
scratch->d.agg_strict_trans_check.setno = setno;
scratch->d.agg_strict_trans_check.setoff = setoff;
scratch->d.agg_strict_trans_check.transno = transno;
@@ -3282,9 +3291,11 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
/* invoke appropriate transition implementation */
if (pertrans->numSortCols == 0 && pertrans->transtypeByVal)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS_BYVAL;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED : EEOP_AGG_PLAIN_TRANS_BYVAL;
else if (pertrans->numSortCols == 0)
- scratch->opcode = EEOP_AGG_PLAIN_TRANS;
+ scratch->opcode = spilled ?
+ EEOP_AGG_PLAIN_TRANS_SPILLED : EEOP_AGG_PLAIN_TRANS;
else if (pertrans->numInputs == 1)
scratch->opcode = EEOP_AGG_ORDERED_TRANS_DATUM;
else
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 35eb8b99f69..e21e0c440ea 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -426,9 +426,13 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
&&CASE_EEOP_AGG_INIT_TRANS,
+ &&CASE_EEOP_AGG_INIT_TRANS_SPILLED,
&&CASE_EEOP_AGG_STRICT_TRANS_CHECK,
+ &&CASE_EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
&&CASE_EEOP_AGG_PLAIN_TRANS,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_SPILLED,
&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
&&CASE_EEOP_LAST
@@ -1619,6 +1623,35 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_init_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
+
+ /* If transValue has not yet been initialized, do so now. */
+ if (pergroup->noTransValue)
+ {
+ AggStatePerTrans pertrans = op->d.agg_init_trans.pertrans;
+
+ aggstate->curaggcontext = op->d.agg_init_trans.aggcontext;
+ aggstate->current_set = op->d.agg_init_trans.setno;
+
+ ExecAggInitGroup(aggstate, pertrans, pergroup);
+
+ /* copied trans value from input, done this round */
+ EEO_JUMP(op->d.agg_init_trans.jumpnull);
+ }
+
+ EEO_NEXT();
+ }
/* check that a strict aggregate's input isn't NULL */
EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK)
@@ -1635,6 +1668,24 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_strict_trans_check.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_strict_trans_check.transno];
+
+ if (unlikely(pergroup->transValueIsNull))
+ EEO_JUMP(op->d.agg_strict_trans_check.jumpnull);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1683,6 +1734,51 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+ Assert(pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
/*
* Evaluate aggregate transition / combine function that has a
@@ -1726,6 +1822,66 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
newVal = FunctionCallInvoke(fcinfo);
+ /*
+ * For pass-by-ref datatype, must copy the new value into
+ * aggcontext and free the prior transValue. But if transfn
+ * returned a pointer to its first input, we don't need to do
+ * anything. Also, if transfn returned a pointer to a R/W
+ * expanded object that is already a child of the aggcontext,
+ * assume we can adopt that value without copying it.
+ */
+ if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+ newVal = ExecAggTransReparent(aggstate, pertrans,
+ newVal, fcinfo->isnull,
+ pergroup->transValue,
+ pergroup->transValueIsNull);
+
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+
+ MemoryContextSwitchTo(oldContext);
+
+ EEO_NEXT();
+ }
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerTrans pertrans;
+ AggStatePerGroup pergroup;
+ AggStatePerGroup pergroup_allaggs;
+ FunctionCallInfo fcinfo;
+ MemoryContext oldContext;
+ Datum newVal;
+
+ pertrans = op->d.agg_trans.pertrans;
+
+ pergroup_allaggs = aggstate->all_pergroups[op->d.agg_trans.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_NEXT();
+
+ pergroup = &pergroup_allaggs[op->d.agg_trans.transno];
+
+ Assert(!pertrans->transtypeByVal);
+
+ fcinfo = pertrans->transfn_fcinfo;
+
+ /* cf. select_current_set() */
+ aggstate->curaggcontext = op->d.agg_trans.aggcontext;
+ aggstate->current_set = op->d.agg_trans.setno;
+
+ /* set up aggstate->curpertrans for AggGetAggref() */
+ aggstate->curpertrans = pertrans;
+
+ /* invoke transition function in per-tuple context */
+ oldContext = MemoryContextSwitchTo(aggstate->tmpcontext->ecxt_per_tuple_memory);
+
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+
+ newVal = FunctionCallInvoke(fcinfo);
+
/*
* For pass-by-ref datatype, must copy the new value into
* aggcontext and free the prior transValue. But if transfn
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index a99b4a60754..944ef42dcb9 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -194,6 +194,29 @@
* transition values. hashcontext is the single context created to support
* all hash tables.
*
+ * Spilling To Disk
+ *
+ * When performing hash aggregation, if the hash table memory exceeds the
+ * limit (see hash_agg_check_limits()), we enter "spill mode". In spill
+ * mode, we advance the transition states only for groups already in the
+ * hash table. For tuples that would need to create a new hash table
+ * entries (and initialize new transition states), we instead spill them to
+ * disk to be processed later. The tuples are spilled in a partitioned
+ * manner, so that subsequent batches are smaller and less likely to exceed
+ * work_mem (if a batch does exceed work_mem, it must be spilled
+ * recursively).
+ *
+ * Spilled data is written to logical tapes. These provide better control
+ * over memory usage, disk space, and the number of files than if we were
+ * to use a BufFile for each spill.
+ *
+ * Note that it's possible for transition states to start small but then
+ * grow very large; for instance in the case of ARRAY_AGG. In such cases,
+ * it's still possible to significantly exceed work_mem. We try to avoid
+ * this situation by estimating what will fit in the available memory, and
+ * imposing a limit on the number of groups separately from the amount of
+ * memory consumed.
+ *
* Transition / Combine function invocation:
*
* For performance reasons transition functions, including combine
@@ -233,12 +256,100 @@
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/datum.h"
+#include "utils/dynahash.h"
#include "utils/expandeddatum.h"
+#include "utils/logtape.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+/*
+ * Control how many partitions are created when spilling HashAgg to
+ * disk.
+ *
+ * HASHAGG_PARTITION_FACTOR is multiplied by the estimated number of
+ * partitions needed such that each partition will fit in memory. The factor
+ * is set higher than one because there's not a high cost to having a few too
+ * many partitions, and it makes it less likely that a partition will need to
+ * be spilled recursively. Another benefit of having more, smaller partitions
+ * is that small hash tables may perform better than large ones due to memory
+ * caching effects.
+ *
+ * We also specify a min and max number of partitions per spill. Too few might
+ * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
+ * many will result in lots of memory wasted buffering the spill files (which
+ * could instead be spent on a larger hash table).
+ *
+ * For reading from tapes, the buffer size must be a multiple of
+ * BLCKSZ. Larger values help when reading from multiple tapes concurrently,
+ * but that doesn't happen in HashAgg, so we simply use BLCKSZ. Writing to a
+ * tape always uses a buffer of size BLCKSZ.
+ */
+#define HASHAGG_PARTITION_FACTOR 1.50
+#define HASHAGG_MIN_PARTITIONS 4
+#define HASHAGG_MAX_PARTITIONS 256
+#define HASHAGG_MIN_BUCKETS 256
+#define HASHAGG_READ_BUFFER_SIZE BLCKSZ
+#define HASHAGG_WRITE_BUFFER_SIZE BLCKSZ
+
+/*
+ * Track all tapes needed for a HashAgg that spills. We don't know the maximum
+ * number of tapes needed at the start of the algorithm (because it can
+ * recurse), so one tape set is allocated and extended as needed for new
+ * tapes. When a particular tape is already read, rewind it for write mode and
+ * put it in the free list.
+ *
+ * Tapes' buffers can take up substantial memory when many tapes are open at
+ * once. We only need one tape open at a time in read mode (using a buffer
+ * that's a multiple of BLCKSZ); but we need up to HASHAGG_MAX_PARTITIONS
+ * tapes open in write mode (each requiring a buffer of size BLCKSZ).
+ */
+typedef struct HashTapeInfo
+{
+ LogicalTapeSet *tapeset;
+ int ntapes;
+ int *freetapes;
+ int nfreetapes;
+} HashTapeInfo;
+
+/*
+ * Represents partitioned spill data for a single hashtable. Contains the
+ * necessary information to route tuples to the correct partition, and to
+ * transform the spilled data into new batches.
+ *
+ * The high bits are used for partition selection (when recursing, we ignore
+ * the bits that have already been used for partition selection at an earlier
+ * level).
+ */
+typedef struct HashAggSpill
+{
+ HashTapeInfo *tapeinfo; /* borrowed reference to tape info */
+ int npartitions; /* number of partitions */
+ int *partitions; /* spill partition tape numbers */
+ int64 *ntuples; /* number of tuples in each partition */
+ uint32 mask; /* mask to find partition from hash value */
+ int shift; /* after masking, shift by this amount */
+} HashAggSpill;
+
+/*
+ * Represents work to be done for one pass of hash aggregation (with only one
+ * grouping set).
+ *
+ * Also tracks the bits of the hash already used for partition selection by
+ * earlier iterations, so that this batch can use new bits. If all bits have
+ * already been used, no partitioning will be done (any spilled data will go
+ * to a single output tape).
+ */
+typedef struct HashAggBatch
+{
+ int setno; /* grouping set */
+ int used_bits; /* number of bits of hash already used */
+ HashTapeInfo *tapeinfo; /* borrowed reference to tape info */
+ int input_tapenum; /* input partition tape */
+ int64 input_tuples; /* number of tuples in this batch */
+} HashAggBatch;
+
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
static void initialize_phase(AggState *aggstate, int newphase);
static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -275,11 +386,38 @@ static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
static void build_hash_tables(AggState *aggstate);
static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
+static void hashagg_recompile_expressions(AggState *aggstate);
+static long hash_choose_num_buckets(AggState *aggstate,
+ long estimated_nbuckets,
+ Size memory);
+static int hash_choose_num_partitions(uint64 input_groups,
+ double hashentrysize,
+ int used_bits,
+ int *log2_npartitions);
static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
+static bool agg_refill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
+static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
+static void hash_agg_check_limits(AggState *aggstate);
+static void hashagg_finish_initial_spills(AggState *aggstate);
+static void hashagg_reset_spill_state(AggState *aggstate);
+static HashAggBatch *hashagg_batch_new(HashTapeInfo *tapeinfo,
+ int input_tapenum, int setno,
+ int64 input_tuples, int used_bits);
+static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
+static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
+ int used_bits, uint64 input_tuples,
+ double hashentrysize);
+static Size hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot,
+ uint32 hash);
+static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
+ int setno);
+static void hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *dest,
+ int ndest);
+static void hashagg_tapeinfo_release(HashTapeInfo *tapeinfo, int tapenum);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
AggState *aggstate, EState *estate,
@@ -1261,7 +1399,7 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
}
/*
- * (Re-)initialize the hash table(s) to empty.
+ * Initialize the hash table(s).
*
* To implement hashed aggregation, we need a hashtable that stores a
* representative tuple and an array of AggStatePerGroup structs for each
@@ -1272,9 +1410,9 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
* We have a separate hashtable and associated perhash data structure for each
* grouping set for which we're doing hashing.
*
- * The contents of the hash tables always live in the hashcontext's per-tuple
- * memory context (there is only one of these for all tables together, since
- * they are all reset at the same time).
+ * The hash tables and their contents always live in the hashcontext's
+ * per-tuple memory context (there is only one of these for all tables
+ * together, since they are all reset at the same time).
*/
static void
build_hash_tables(AggState *aggstate)
@@ -1284,14 +1422,24 @@ build_hash_tables(AggState *aggstate)
for (setno = 0; setno < aggstate->num_hashes; ++setno)
{
AggStatePerHash perhash = &aggstate->perhash[setno];
+ long nbuckets;
+ Size memory;
Assert(perhash->aggnode->numGroups > 0);
- if (perhash->hashtable)
- ResetTupleHashTable(perhash->hashtable);
- else
- build_hash_table(aggstate, setno, perhash->aggnode->numGroups);
+ memory = aggstate->hash_mem_limit / aggstate->num_hashes;
+
+ /* choose reasonable number of buckets per hashtable */
+ nbuckets = hash_choose_num_buckets(
+ aggstate, perhash->aggnode->numGroups, memory);
+
+ build_hash_table(aggstate, setno, nbuckets);
}
+
+ aggstate->hash_alloc_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_alloc_last = aggstate->hash_alloc_current;
+ aggstate->hash_ngroups_current = 0;
}
/*
@@ -1301,7 +1449,7 @@ static void
build_hash_table(AggState *aggstate, int setno, long nbuckets)
{
AggStatePerHash perhash = &aggstate->perhash[setno];
- MemoryContext metacxt = aggstate->ss.ps.state->es_query_cxt;
+ MemoryContext metacxt;
MemoryContext hashcxt = aggstate->hashcontext->ecxt_per_tuple_memory;
MemoryContext tmpcxt = aggstate->tmpcontext->ecxt_per_tuple_memory;
Size additionalsize;
@@ -1309,6 +1457,12 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
Assert(aggstate->aggstrategy == AGG_HASHED ||
aggstate->aggstrategy == AGG_MIXED);
+ /*
+ * We don't try to preserve any part of the hash table. Set the metacxt to
+ * hashcxt, which will be reset for each batch.
+ */
+ metacxt = hashcxt;
+
/*
* Used to make sure initial hash table allocation does not exceed
* work_mem. Note that the estimate does not include space for
@@ -1484,14 +1638,242 @@ hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
transitionSpace;
}
+/*
+ * Recompile the expressions for advancing aggregates while hashing. This is
+ * necessary for certain kinds of state changes that affect the resulting
+ * expression. For instance, changing aggstate->hash_ever_spilled or
+ * aggstate->ss.ps.outerops requires recompilation.
+ *
+ * A compiled expression where hash_ever_spilled is true will work even when
+ * hash_spill_mode is false, because it merely introduces additional branches
+ * that are unnecessary when hash_spill_mode is false. That allows us to only
+ * recompile when hash_ever_spilled changes, rather than every time
+ * hash_spill_mode changes.
+ */
+static void
+hashagg_recompile_expressions(AggState *aggstate)
+{
+ AggStatePerPhase phase;
+
+ Assert(aggstate->aggstrategy == AGG_HASHED ||
+ aggstate->aggstrategy == AGG_MIXED);
+
+ if (aggstate->aggstrategy == AGG_HASHED)
+ phase = &aggstate->phases[0];
+ else /* AGG_MIXED */
+ phase = &aggstate->phases[1];
+
+ phase->evaltrans = ExecBuildAggTrans(
+ aggstate, phase,
+ aggstate->aggstrategy == AGG_MIXED ? true : false, /* dosort */
+ true, /* dohash */
+ aggstate->hash_ever_spilled);
+}
+
+/*
+ * Set limits that trigger spilling to avoid exceeding work_mem. Consider the
+ * number of partitions we expect to create (if we do spill).
+ *
+ * There are two limits: a memory limit, and also an ngroups limit. The
+ * ngroups limit becomes important when we expect transition values to grow
+ * substantially larger than the initial value.
+ */
+void
+hash_agg_set_limits(double hashentrysize, uint64 input_groups, int used_bits,
+ Size *mem_limit, long *ngroups_limit, int *num_partitions)
+{
+ int npartitions;
+ Size partition_mem;
+
+ /* if not expected to spill, use all of work_mem */
+ if (input_groups * hashentrysize < work_mem * 1024L)
+ {
+ *mem_limit = work_mem * 1024L;
+ *ngroups_limit = *mem_limit / hashentrysize;
+ return;
+ }
+
+ /*
+ * Calculate expected memory requirements for spilling, which is the size
+ * of the buffers needed for all the tapes that need to be open at
+ * once. Then, subtract that from the memory available for holding hash
+ * tables.
+ */
+ npartitions = hash_choose_num_partitions(input_groups,
+ hashentrysize,
+ used_bits,
+ NULL);
+ if (num_partitions != NULL)
+ *num_partitions = npartitions;
+
+ partition_mem =
+ HASHAGG_READ_BUFFER_SIZE +
+ HASHAGG_WRITE_BUFFER_SIZE * npartitions;
+
+ /*
+ * Don't set the limit below 3/4 of work_mem. In that case, we are at the
+ * minimum number of partitions, so we aren't going to dramatically exceed
+ * work mem anyway.
+ */
+ if (work_mem * 1024L > 4 * partition_mem)
+ *mem_limit = work_mem * 1024L - partition_mem;
+ else
+ *mem_limit = work_mem * 1024L * 0.75;
+
+ if (*mem_limit > hashentrysize)
+ *ngroups_limit = *mem_limit / hashentrysize;
+ else
+ *ngroups_limit = 1;
+}
+
+/*
+ * hash_agg_check_limits
+ *
+ * After adding a new group to the hash table, check whether we need to enter
+ * spill mode. Allocations may happen without adding new groups (for instance,
+ * if the transition state size grows), so this check is imperfect.
+ *
+ * Memory usage is tracked by how much is allocated to the underlying memory
+ * context, not individual chunks. This is more accurate because it accounts
+ * for all memory in the context, and also accounts for fragmentation and
+ * other forms of overhead and waste that can be difficult to estimate. It's
+ * also cheaper because we don't have to track each chunk.
+ *
+ * When memory is first allocated to a memory context, it is not actually
+ * used. So when the next allocation happens, we consider the
+ * previously-allocated amount to be the memory currently used.
+ */
+static void
+hash_agg_check_limits(AggState *aggstate)
+{
+ Size allocation;
+
+ /*
+ * Even if already in spill mode, it's possible for memory usage to grow,
+ * and we should still track it for the purposes of EXPLAIN ANALYZE.
+ */
+ allocation = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+
+ /* has allocation grown since the last observation? */
+ if (allocation > aggstate->hash_alloc_current)
+ {
+ aggstate->hash_alloc_last = aggstate->hash_alloc_current;
+ aggstate->hash_alloc_current = allocation;
+ }
+
+ if (aggstate->hash_alloc_last > aggstate->hash_mem_peak)
+ aggstate->hash_mem_peak = aggstate->hash_alloc_last;
+
+ /*
+ * Don't spill unless there's at least one group in the hash table so we
+ * can be sure to make progress even in edge cases.
+ */
+ if (aggstate->hash_ngroups_current > 0 &&
+ (aggstate->hash_alloc_last > aggstate->hash_mem_limit ||
+ aggstate->hash_ngroups_current > aggstate->hash_ngroups_limit))
+ {
+ aggstate->hash_spill_mode = true;
+
+ if (!aggstate->hash_ever_spilled)
+ {
+ aggstate->hash_ever_spilled = true;
+ aggstate->hash_spills = palloc0(
+ sizeof(HashAggSpill) * aggstate->num_hashes);
+ aggstate->hash_tapeinfo = palloc0(sizeof(HashTapeInfo));
+ hashagg_recompile_expressions(aggstate);
+ }
+ }
+}
+
+/*
+ * Choose a reasonable number of buckets for the initial hash table size.
+ */
+static long
+hash_choose_num_buckets(AggState *aggstate, long ngroups, Size memory)
+{
+ long max_nbuckets;
+ long nbuckets = ngroups;
+
+ max_nbuckets = memory / aggstate->hashentrysize;
+
+ /*
+ * Leave room for slop to avoid a case where the initial hash table size
+ * exceeds the memory limit (though that may still happen in edge cases).
+ */
+ max_nbuckets *= 0.75;
+
+ if (nbuckets > max_nbuckets)
+ nbuckets = max_nbuckets;
+ if (nbuckets < HASHAGG_MIN_BUCKETS)
+ nbuckets = HASHAGG_MIN_BUCKETS;
+ return nbuckets;
+}
+
+/*
+ * Determine the number of partitions to create when spilling, which will
+ * always be a power of two. If log2_npartitions is non-NULL, set
+ * *log2_npartitions to the log2() of the number of partitions.
+ */
+static int
+hash_choose_num_partitions(uint64 input_groups, double hashentrysize,
+ int used_bits, int *log2_npartitions)
+{
+ Size mem_wanted;
+ int partition_limit;
+ int npartitions;
+ int partition_bits;
+
+ /*
+ * Avoid creating so many partitions that the memory requirements of the
+ * open partition files are greater than 1/4 of work_mem.
+ */
+ partition_limit =
+ (work_mem * 1024L * 0.25 - HASHAGG_READ_BUFFER_SIZE) /
+ HASHAGG_WRITE_BUFFER_SIZE;
+
+ /* pessimistically estimate that each input tuple creates a new group */
+ mem_wanted = HASHAGG_PARTITION_FACTOR * input_groups * hashentrysize;
+
+ /* make enough partitions so that each one is likely to fit in memory */
+ npartitions = 1 + (mem_wanted / (work_mem * 1024L));
+
+ if (npartitions > partition_limit)
+ npartitions = partition_limit;
+
+ if (npartitions < HASHAGG_MIN_PARTITIONS)
+ npartitions = HASHAGG_MIN_PARTITIONS;
+ if (npartitions > HASHAGG_MAX_PARTITIONS)
+ npartitions = HASHAGG_MAX_PARTITIONS;
+
+ /* ceil(log2(npartitions)) */
+ partition_bits = my_log2(npartitions);
+
+ /* make sure that we don't exhaust the hash bits */
+ if (partition_bits + used_bits >= 32)
+ partition_bits = 32 - used_bits;
+
+ if (log2_npartitions != NULL)
+ *log2_npartitions = partition_bits;
+
+ /* number of partitions will be a power of two */
+ npartitions = 1L << partition_bits;
+
+ return npartitions;
+}
+
/*
* Find or create a hashtable entry for the tuple group containing the current
* tuple (already set in tmpcontext's outertuple slot), in the current grouping
* set (which the caller must have selected - note that initialize_aggregate
* depends on this).
*
- * When called, CurrentMemoryContext should be the per-query context. The
- * already-calculated hash value for the tuple must be specified.
+ * When called, CurrentMemoryContext should be the per-query context.
+ *
+ * If the hash table is at the memory limit, then only find existing hashtable
+ * entries; don't create new ones. If a tuple's group is not already present
+ * in the hash table for the current grouping set, return NULL and the caller
+ * will spill it to disk.
*/
static AggStatePerGroup
lookup_hash_entry(AggState *aggstate, uint32 hash)
@@ -1499,16 +1881,26 @@ lookup_hash_entry(AggState *aggstate, uint32 hash)
AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
TupleTableSlot *hashslot = perhash->hashslot;
TupleHashEntryData *entry;
- bool isnew;
+ bool isnew = false;
+ bool *p_isnew;
+
+ /* if hash table already spilled, don't create new entries */
+ p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;
/* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, &isnew,
+ entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, p_isnew,
hash);
+ if (entry == NULL)
+ return NULL;
+
if (isnew)
{
- AggStatePerGroup pergroup;
- int transno;
+ AggStatePerGroup pergroup;
+ int transno;
+
+ aggstate->hash_ngroups_current++;
+ hash_agg_check_limits(aggstate);
pergroup = (AggStatePerGroup)
MemoryContextAlloc(perhash->hashtable->tablecxt,
@@ -1536,23 +1928,51 @@ lookup_hash_entry(AggState *aggstate, uint32 hash)
* returning an array of pergroup pointers suitable for advance_aggregates.
*
* Be aware that lookup_hash_entry can reset the tmpcontext.
+ *
+ * Some entries may be left NULL if we have reached the limit and have begun
+ * to spill. The same tuple will belong to different groups for each set, so
+ * may match a group already in memory for one set and match a group not in
+ * memory for another set. If we have begun to spill and a tuple doesn't match
+ * a group in memory for a particular set, it will be spilled.
+ *
+ * NB: It's possible to spill the same tuple for several different grouping
+ * sets. This may seem wasteful, but it's actually a trade-off: if we spill
+ * the tuple multiple times for multiple grouping sets, it can be partitioned
+ * for each grouping set, making the refilling of the hash table very
+ * efficient.
*/
static void
lookup_hash_entries(AggState *aggstate)
{
- int numHashes = aggstate->num_hashes;
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
int setno;
- for (setno = 0; setno < numHashes; setno++)
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
{
- AggStatePerHash perhash = &aggstate->perhash[setno];
+ AggStatePerHash perhash = &aggstate->perhash[setno];
uint32 hash;
select_current_set(aggstate, setno, true);
prepare_hash_slot(aggstate);
hash = TupleHashTableHash(perhash->hashtable, perhash->hashslot);
pergroup[setno] = lookup_hash_entry(aggstate, hash);
+
+ /* check to see if we need to spill the tuple for this grouping set */
+ if (pergroup[setno] == NULL)
+ {
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ TupleTableSlot *slot = aggstate->tmpcontext->ecxt_outertuple;
+
+ if (spill->partitions == NULL)
+ hashagg_spill_init(spill, aggstate->hash_tapeinfo, 0,
+ perhash->aggnode->numGroups,
+ aggstate->hashentrysize);
+
+ hashagg_spill_tuple(spill, slot, hash);
+
+ aggstate->hash_disk_used = LogicalTapeSetBlocks(
+ aggstate->hash_tapeinfo->tapeset) * (BLCKSZ / 1024);
+ }
}
}
@@ -1875,6 +2295,12 @@ agg_retrieve_direct(AggState *aggstate)
if (TupIsNull(outerslot))
{
/* no more outer-plan tuples available */
+
+ /* if we built hash tables, finalize any spills */
+ if (aggstate->aggstrategy == AGG_MIXED &&
+ aggstate->current_phase == 1)
+ hashagg_finish_initial_spills(aggstate);
+
if (hasGroupingSets)
{
aggstate->input_done = true;
@@ -1977,6 +2403,9 @@ agg_fill_hash_table(AggState *aggstate)
ResetExprContext(aggstate->tmpcontext);
}
+ /* finalize spills, if any */
+ hashagg_finish_initial_spills(aggstate);
+
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
@@ -1984,11 +2413,196 @@ agg_fill_hash_table(AggState *aggstate)
&aggstate->perhash[0].hashiter);
}
+/*
+ * If any data was spilled during hash aggregation, reset the hash table and
+ * reprocess one batch of spilled data. After reprocessing a batch, the hash
+ * table will again contain data, ready to be consumed by
+ * agg_retrieve_hash_table_in_memory().
+ *
+ * Should only be called after all in memory hash table entries have been
+ * consumed.
+ *
+ * Return false when input is exhausted and there's no more work to be done;
+ * otherwise return true.
+ */
+static bool
+agg_refill_hash_table(AggState *aggstate)
+{
+ HashAggBatch *batch;
+ HashAggSpill spill;
+ HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
+ long nbuckets;
+ int setno;
+
+ if (aggstate->hash_batches == NIL)
+ return false;
+
+ spill.npartitions = 0;
+ spill.partitions = NULL;
+ /*
+ * Each spill file contains spilled data for only a single grouping
+ * set. We want to ignore all others, which is done by setting the other
+ * pergroups to NULL.
+ */
+ memset(aggstate->all_pergroups, 0,
+ sizeof(AggStatePerGroup) *
+ (aggstate->maxsets + aggstate->num_hashes));
+
+ batch = linitial(aggstate->hash_batches);
+ aggstate->hash_batches = list_delete_first(aggstate->hash_batches);
+
+ hash_agg_set_limits(aggstate->hashentrysize, batch->input_tuples,
+ batch->used_bits, &aggstate->hash_mem_limit,
+ &aggstate->hash_ngroups_limit, NULL);
+
+ /*
+ * Free memory and rebuild a single hash table for this batch's grouping
+ * set. Estimate the number of groups to be the number of input tuples in
+ * this batch.
+ */
+ ReScanExprContext(aggstate->hashcontext);
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ aggstate->perhash[setno].hashtable = NULL;
+
+ nbuckets = hash_choose_num_buckets(
+ aggstate, batch->input_tuples, aggstate->hash_mem_limit);
+ build_hash_table(aggstate, batch->setno, nbuckets);
+ aggstate->hash_alloc_current = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ aggstate->hash_alloc_last = aggstate->hash_alloc_current;
+ aggstate->hash_ngroups_current = 0;
+
+ Assert(aggstate->current_phase == 0);
+
+ if (aggstate->phase->aggstrategy == AGG_MIXED)
+ {
+ aggstate->current_phase = 1;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+ }
+
+ /*
+ * The first pass (agg_fill_hash_table) reads whatever kind of slot comes
+ * from the outer plan, and considers the slot fixed. But spilled tuples
+ * are always MinimalTuples, so if that's different from the outer plan we
+ * need to change it and recompile the aggregate expressions.
+ */
+ if (aggstate->ss.ps.outerops != &TTSOpsMinimalTuple)
+ {
+ aggstate->ss.ps.outerops = &TTSOpsMinimalTuple;
+ hashagg_recompile_expressions(aggstate);
+ }
+
+ LogicalTapeRewindForRead(tapeinfo->tapeset, batch->input_tapenum,
+ HASHAGG_READ_BUFFER_SIZE);
+ for (;;) {
+ TupleTableSlot *slot = aggstate->hash_spill_slot;
+ MinimalTuple tuple;
+ uint32 hash;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tuple = hashagg_batch_read(batch, &hash);
+ if (tuple == NULL)
+ break;
+
+ ExecStoreMinimalTuple(tuple, slot, true);
+ aggstate->tmpcontext->ecxt_outertuple = slot;
+
+ select_current_set(aggstate, batch->setno, true);
+ prepare_hash_slot(aggstate);
+ aggstate->hash_pergroup[batch->setno] = lookup_hash_entry(aggstate, hash);
+
+ /* if there's no memory for a new group, spill */
+ if (aggstate->hash_pergroup[batch->setno] == NULL)
+ {
+ /*
+ * Estimate the number of groups for this batch as the total
+ * number of tuples in its input file. Although that's a worst
+ * case, it's not bad here for two reasons: (1) overestimating
+ * is better than underestimating; and (2) we've already
+ * scanned the relation once, so it's likely that we've
+ * already finalized many of the common values.
+ */
+ if (spill.partitions == NULL)
+ hashagg_spill_init(&spill, tapeinfo, batch->used_bits,
+ batch->input_tuples,
+ aggstate->hashentrysize);
+
+ hashagg_spill_tuple(&spill, slot, hash);
+
+ aggstate->hash_disk_used = LogicalTapeSetBlocks(
+ aggstate->hash_tapeinfo->tapeset) * (BLCKSZ / 1024);
+ }
+
+ /* Advance the aggregates (or combine functions) */
+ advance_aggregates(aggstate);
+
+ /*
+ * Reset per-input-tuple context after each tuple, but note that the
+ * hash lookups do this too
+ */
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ hashagg_tapeinfo_release(tapeinfo, batch->input_tapenum);
+
+ aggstate->current_phase = 0;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+
+ /* update hashentrysize estimate based on contents */
+ if (aggstate->hash_ngroups_current > 0)
+ {
+ aggstate->hashentrysize = (double)aggstate->hash_alloc_last /
+ (double)aggstate->hash_ngroups_current;
+ }
+
+ hashagg_spill_finish(aggstate, &spill, batch->setno);
+ aggstate->hash_spill_mode = false;
+
+ /* Initialize to walk the first hash table */
+ select_current_set(aggstate, batch->setno, true);
+ ResetTupleHashIterator(aggstate->perhash[batch->setno].hashtable,
+ &aggstate->perhash[batch->setno].hashiter);
+
+ pfree(batch);
+
+ return true;
+}
+
/*
* ExecAgg for hashed case: retrieving groups from hash table
+ *
+ * After exhausting in-memory tuples, also try refilling the hash table using
+ * previously-spilled tuples. Only returns NULL after all in-memory and
+ * spilled tuples are exhausted.
*/
static TupleTableSlot *
agg_retrieve_hash_table(AggState *aggstate)
+{
+ TupleTableSlot *result = NULL;
+
+ while (result == NULL)
+ {
+ result = agg_retrieve_hash_table_in_memory(aggstate);
+ if (result == NULL)
+ {
+ if (!agg_refill_hash_table(aggstate))
+ {
+ aggstate->agg_done = true;
+ break;
+ }
+ }
+ }
+
+ return result;
+}
+
+/*
+ * Retrieve the groups from the in-memory hash tables without considering any
+ * spilled tuples.
+ */
+static TupleTableSlot *
+agg_retrieve_hash_table_in_memory(AggState *aggstate)
{
ExprContext *econtext;
AggStatePerAgg peragg;
@@ -2017,7 +2631,7 @@ agg_retrieve_hash_table(AggState *aggstate)
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
- while (!aggstate->agg_done)
+ for (;;)
{
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -2042,14 +2656,15 @@ agg_retrieve_hash_table(AggState *aggstate)
perhash = &aggstate->perhash[aggstate->current_set];
+ if (perhash->hashtable == NULL)
+ return NULL;
+
ResetTupleHashIterator(perhash->hashtable, &perhash->hashiter);
continue;
}
else
{
- /* No more hashtables, so done */
- aggstate->agg_done = true;
return NULL;
}
}
@@ -2106,6 +2721,296 @@ agg_retrieve_hash_table(AggState *aggstate)
return NULL;
}
+/*
+ * Assign unused tapes to spill partitions, extending the tape set if
+ * necessary.
+ */
+static void
+hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *partitions,
+ int npartitions)
+{
+ int partidx = 0;
+
+ /* use free tapes if available */
+ while (partidx < npartitions && tapeinfo->nfreetapes > 0)
+ partitions[partidx++] = tapeinfo->freetapes[--tapeinfo->nfreetapes];
+
+ if (tapeinfo->tapeset == NULL)
+ tapeinfo->tapeset = LogicalTapeSetCreate(npartitions, NULL, NULL, -1);
+ else if (partidx < npartitions)
+ {
+ tapeinfo->tapeset = LogicalTapeSetExtend(
+ tapeinfo->tapeset, npartitions - partidx);
+ }
+
+ while (partidx < npartitions)
+ partitions[partidx++] = tapeinfo->ntapes++;
+}
+
+/*
+ * After a tape has already been written to and then read, this function
+ * rewinds it for writing and adds it to the free list.
+ */
+static void
+hashagg_tapeinfo_release(HashTapeInfo *tapeinfo, int tapenum)
+{
+ LogicalTapeRewindForWrite(tapeinfo->tapeset, tapenum);
+ if (tapeinfo->freetapes == NULL)
+ tapeinfo->freetapes = palloc(sizeof(int));
+ else
+ tapeinfo->freetapes = repalloc(
+ tapeinfo->freetapes, sizeof(int) * (tapeinfo->nfreetapes + 1));
+ tapeinfo->freetapes[tapeinfo->nfreetapes++] = tapenum;
+}
+
+/*
+ * hashagg_spill_init
+ *
+ * Called after we determined that spilling is necessary. Chooses the number
+ * of partitions to create, and initializes them.
+ */
+static void
+hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo, int used_bits,
+ uint64 input_groups, double hashentrysize)
+{
+ int npartitions;
+ int partition_bits;
+
+ npartitions = hash_choose_num_partitions(
+ input_groups, hashentrysize, used_bits, &partition_bits);
+
+ spill->partitions = palloc0(sizeof(int) * npartitions);
+ spill->ntuples = palloc0(sizeof(int64) * npartitions);
+
+ hashagg_tapeinfo_assign(tapeinfo, spill->partitions, npartitions);
+
+ spill->tapeinfo = tapeinfo;
+ spill->shift = 32 - used_bits - partition_bits;
+ spill->mask = (npartitions - 1) << spill->shift;
+ spill->npartitions = npartitions;
+}
+
+/*
+ * hashagg_spill_tuple
+ *
+ * No room for new groups in the hash table. Save for later in the appropriate
+ * partition.
+ */
+static Size
+hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot, uint32 hash)
+{
+ LogicalTapeSet *tapeset = spill->tapeinfo->tapeset;
+ int partition;
+ MinimalTuple tuple;
+ int tapenum;
+ int total_written = 0;
+ bool shouldFree;
+
+ Assert(spill->partitions != NULL);
+
+ /* may contain unnecessary attributes, consider projecting? */
+ tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+ partition = (hash & spill->mask) >> spill->shift;
+ spill->ntuples[partition]++;
+
+ tapenum = spill->partitions[partition];
+
+ LogicalTapeWrite(tapeset, tapenum, (void *) &hash, sizeof(uint32));
+ total_written += sizeof(uint32);
+
+ LogicalTapeWrite(tapeset, tapenum, (void *) tuple, tuple->t_len);
+ total_written += tuple->t_len;
+
+ if (shouldFree)
+ pfree(tuple);
+
+ return total_written;
+}
+
+/*
+ * hashagg_batch_new
+ *
+ * Construct a HashAggBatch item, which represents one iteration of HashAgg to
+ * be done. Should be called in the aggregate's memory context.
+ */
+static HashAggBatch *
+hashagg_batch_new(HashTapeInfo *tapeinfo, int tapenum, int setno,
+ int64 input_tuples, int used_bits)
+{
+ HashAggBatch *batch = palloc0(sizeof(HashAggBatch));
+
+ batch->setno = setno;
+ batch->used_bits = used_bits;
+ batch->tapeinfo = tapeinfo;
+ batch->input_tapenum = tapenum;
+ batch->input_tuples = input_tuples;
+
+ return batch;
+}
+
+/*
+ * hashagg_batch_read
+ * read the next tuple from a batch file. Return NULL if no more.
+ */
+static MinimalTuple
+hashagg_batch_read(HashAggBatch *batch, uint32 *hashp)
+{
+ LogicalTapeSet *tapeset = batch->tapeinfo->tapeset;
+ int tapenum = batch->input_tapenum;
+ MinimalTuple tuple;
+ uint32 t_len;
+ size_t nread;
+ uint32 hash;
+
+ nread = LogicalTapeRead(tapeset, tapenum, &hash, sizeof(uint32));
+ if (nread == 0)
+ return NULL;
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, sizeof(uint32), nread)));
+ if (hashp != NULL)
+ *hashp = hash;
+
+ nread = LogicalTapeRead(tapeset, tapenum, &t_len, sizeof(t_len));
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, sizeof(uint32), nread)));
+
+ tuple = (MinimalTuple) palloc(t_len);
+ tuple->t_len = t_len;
+
+ nread = LogicalTapeRead(tapeset, tapenum,
+ (void *)((char *)tuple + sizeof(uint32)),
+ t_len - sizeof(uint32));
+ if (nread != t_len - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, t_len - sizeof(uint32), nread)));
+
+ return tuple;
+}
+
+/*
+ * hashagg_finish_initial_spills
+ *
+ * After a HashAggBatch has been processed, it may have spilled tuples to
+ * disk. If so, turn the spilled partitions into new batches that must later
+ * be executed.
+ */
+static void
+hashagg_finish_initial_spills(AggState *aggstate)
+{
+ int setno;
+
+ if (aggstate->hash_spills == NULL)
+ return;
+
+ /* update hashentrysize estimate based on contents */
+ Assert(aggstate->hash_ngroups_current > 0);
+ aggstate->hashentrysize = (double)aggstate->hash_alloc_last /
+ (double)aggstate->hash_ngroups_current;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ hashagg_spill_finish(aggstate, &aggstate->hash_spills[setno], setno);
+
+ aggstate->hash_spill_mode = false;
+
+ /*
+ * We're not processing tuples from outer plan any more; only processing
+ * batches of spilled tuples. The initial spill structures are no longer
+ * needed.
+ */
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+}
+
+/*
+ * hashagg_spill_finish
+ *
+ * Transform spill partitions into new batches.
+ */
+static void
+hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno)
+{
+ int i;
+ int used_bits = 32 - spill->shift;
+
+ if (spill->npartitions == 0)
+ return; /* didn't spill */
+
+ for (i = 0; i < spill->npartitions; i++)
+ {
+ int tapenum = spill->partitions[i];
+ MemoryContext oldContext;
+ HashAggBatch *new_batch;
+
+ oldContext = MemoryContextSwitchTo(aggstate->ss.ps.state->es_query_cxt);
+ new_batch = hashagg_batch_new(aggstate->hash_tapeinfo,
+ tapenum, setno, spill->ntuples[i],
+ used_bits);
+ aggstate->hash_batches = lcons(new_batch, aggstate->hash_batches);
+ aggstate->hash_batches_used++;
+ MemoryContextSwitchTo(oldContext);
+ }
+
+ pfree(spill->ntuples);
+ pfree(spill->partitions);
+}
+
+/*
+ * Free resources related to a spilled HashAgg.
+ */
+static void
+hashagg_reset_spill_state(AggState *aggstate)
+{
+ ListCell *lc;
+
+ /* free spills from initial pass */
+ if (aggstate->hash_spills != NULL)
+ {
+ int setno;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ {
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ if (spill->ntuples != NULL)
+ pfree(spill->ntuples);
+ if (spill->partitions != NULL)
+ pfree(spill->partitions);
+ }
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+ }
+
+ /* free batches */
+ foreach(lc, aggstate->hash_batches)
+ {
+ HashAggBatch *batch = (HashAggBatch*) lfirst(lc);
+ pfree(batch);
+ }
+ list_free(aggstate->hash_batches);
+ aggstate->hash_batches = NIL;
+
+ /* close tape set */
+ if (aggstate->hash_tapeinfo != NULL)
+ {
+ HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
+ if (tapeinfo->tapeset != NULL)
+ LogicalTapeSetClose(tapeinfo->tapeset);
+ if (tapeinfo->freetapes != NULL)
+ pfree(tapeinfo->freetapes);
+ pfree(tapeinfo);
+ aggstate->hash_tapeinfo = NULL;
+ }
+}
+
+
/* -----------------
* ExecInitAgg
*
@@ -2290,6 +3195,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->ss.ps.outeropsfixed = false;
}
+ if (use_hashing)
+ aggstate->hash_spill_slot = ExecInitExtraTupleSlot(estate, scanDesc,
+ &TTSOpsMinimalTuple);
+
/*
* Initialize result type, slot and projection.
*/
@@ -2515,9 +3424,22 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
*/
if (use_hashing)
{
+ Plan *outerplan = outerPlan(node);
+ long totalGroups = 0;
+ int i;
+
/* this is an array of pointers, not structures */
aggstate->hash_pergroup = pergroups;
+ aggstate->hashentrysize = hash_agg_entry_size(
+ aggstate->numtrans, outerplan->plan_width, node->transitionSpace);
+
+ for (i = 0; i < aggstate->num_hashes; i++)
+ totalGroups += aggstate->perhash[i].aggnode->numGroups;
+
+ hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
+ &aggstate->hash_mem_limit,
+ &aggstate->hash_ngroups_limit, NULL);
find_hash_columns(aggstate);
build_hash_tables(aggstate);
aggstate->table_filled = false;
@@ -2925,7 +3847,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
else
Assert(false);
- phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash);
+ phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash,
+ false);
}
@@ -3420,6 +4343,8 @@ ExecEndAgg(AggState *node)
if (node->sort_out)
tuplesort_end(node->sort_out);
+ hashagg_reset_spill_state(node);
+
for (transno = 0; transno < node->numtrans; transno++)
{
AggStatePerTrans pertrans = &node->pertrans[transno];
@@ -3475,12 +4400,13 @@ ExecReScanAgg(AggState *node)
return;
/*
- * If we do have the hash table, and the subplan does not have any
- * parameter changes, and none of our own parameter changes affect
- * input expressions of the aggregated functions, then we can just
- * rescan the existing hash table; no need to build it again.
+ * If we do have the hash table, and it never spilled, and the subplan
+ * does not have any parameter changes, and none of our own parameter
+ * changes affect input expressions of the aggregated functions, then
+ * we can just rescan the existing hash table; no need to build it
+ * again.
*/
- if (outerPlan->chgParam == NULL &&
+ if (outerPlan->chgParam == NULL && !node->hash_ever_spilled &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
ResetTupleHashIterator(node->perhash[0].hashtable,
@@ -3537,11 +4463,33 @@ ExecReScanAgg(AggState *node)
*/
if (node->aggstrategy == AGG_HASHED || node->aggstrategy == AGG_MIXED)
{
+ const TupleTableSlotOps *outerops = ExecGetResultSlotOps(
+ outerPlanState(&node->ss), &node->ss.ps.outeropsfixed);
+
+ hashagg_reset_spill_state(node);
+
+ node->hash_ever_spilled = false;
+ node->hash_spill_mode = false;
+ node->hash_alloc_last = 0;
+ node->hash_alloc_current = 0;
+ node->hash_ngroups_current = 0;
+
+ /* reset stats */
+ node->hash_mem_peak = 0;
+ node->hash_disk_used = 0;
+ node->hash_batches_used = 0;
+
ReScanExprContext(node->hashcontext);
/* Rebuild an empty hash table */
build_hash_tables(node);
node->table_filled = false;
/* iterator will be reset when the table is filled */
+
+ if (node->ss.ps.outerops != outerops)
+ {
+ node->ss.ps.outerops = outerops;
+ hashagg_recompile_expressions(node);
+ }
}
if (node->aggstrategy != AGG_HASHED)
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index cea0d6fa5ce..71d4034b03b 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -2047,6 +2047,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_INIT_TRANS:
+ case EEOP_AGG_INIT_TRANS_SPILLED:
{
AggStatePerTrans pertrans;
@@ -2056,6 +2057,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_allpergroupsp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_setoff,
v_transno;
@@ -2082,11 +2084,32 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_init_trans.setoff);
v_transno = l_int32_const(op->d.agg_init_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_INIT_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_notransvalue = l_bb_before_v(
+ opblocks[opno + 1], "op.%d.check_notransvalue", opno);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(
+ b, v_pergroup_allaggs, TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[opno + 1],
+ b_check_notransvalue);
+
+ LLVMPositionBuilderAtEnd(b, b_check_notransvalue);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_notransvalue =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_NOTRANSVALUE,
@@ -2143,6 +2166,7 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_STRICT_TRANS_CHECK:
+ case EEOP_AGG_STRICT_TRANS_CHECK_SPILLED:
{
LLVMValueRef v_setoff,
v_transno;
@@ -2152,6 +2176,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_transnull;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
int jumpnull = op->d.agg_strict_trans_check.jumpnull;
@@ -2171,11 +2196,32 @@ llvm_compile_expr(ExprState *state)
l_int32_const(op->d.agg_strict_trans_check.setoff);
v_transno =
l_int32_const(op->d.agg_strict_trans_check.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_STRICT_TRANS_CHECK_SPILLED)
+ {
+ LLVMBasicBlockRef b_check_transnull = l_bb_before_v(
+ opblocks[opno + 1], "op.%d.check_transnull", opno);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[jumpnull],
+ b_check_transnull);
+
+ LLVMPositionBuilderAtEnd(b, b_check_transnull);
+ }
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_transnull =
l_load_struct_gep(b, v_pergroupp,
FIELDNO_AGGSTATEPERGROUPDATA_TRANSVALUEISNULL,
@@ -2191,7 +2237,9 @@ llvm_compile_expr(ExprState *state)
}
case EEOP_AGG_PLAIN_TRANS_BYVAL:
+ case EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED:
case EEOP_AGG_PLAIN_TRANS:
+ case EEOP_AGG_PLAIN_TRANS_SPILLED:
{
AggState *aggstate;
AggStatePerTrans pertrans;
@@ -2217,6 +2265,7 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_pertransp;
LLVMValueRef v_pergroupp;
+ LLVMValueRef v_pergroup_allaggs;
LLVMValueRef v_retval;
@@ -2244,10 +2293,33 @@ llvm_compile_expr(ExprState *state)
"aggstate.all_pergroups");
v_setoff = l_int32_const(op->d.agg_trans.setoff);
v_transno = l_int32_const(op->d.agg_trans.transno);
- v_pergroupp =
- LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
- &v_transno, 1, "");
+ v_pergroup_allaggs = l_load_gep1(b, v_allpergroupsp, v_setoff, "");
+
+ /*
+ * When no tuples at all have spilled, we avoid adding this
+ * extra branch. But after some tuples have spilled, this
+ * branch is necessary, so we recompile the expression
+ * using a new opcode.
+ */
+ if (opcode == EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED ||
+ opcode == EEOP_AGG_PLAIN_TRANS_SPILLED)
+ {
+ LLVMBasicBlockRef b_advance_transval = l_bb_before_v(
+ opblocks[opno + 1], "op.%d.advance_transval", opno);
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(b, v_pergroup_allaggs,
+ TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[opno + 1],
+ b_advance_transval);
+
+ LLVMPositionBuilderAtEnd(b, b_advance_transval);
+ }
+
+ v_pergroupp = LLVMBuildGEP(b, v_pergroup_allaggs, &v_transno, 1, "");
v_fcinfo = l_ptr_const(fcinfo,
l_ptr(StructFunctionCallInfoData));
@@ -2312,7 +2384,8 @@ llvm_compile_expr(ExprState *state)
* child of the aggcontext, assume we can adopt that value
* without copying it.
*/
- if (opcode == EEOP_AGG_PLAIN_TRANS)
+ if (opcode == EEOP_AGG_PLAIN_TRANS ||
+ opcode == EEOP_AGG_PLAIN_TRANS_SPILLED)
{
LLVMBasicBlockRef b_call;
LLVMBasicBlockRef b_nocall;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b5a0033721f..724e4448e9a 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -77,6 +77,7 @@
#include "access/htup_details.h"
#include "access/tsmapi.h"
#include "executor/executor.h"
+#include "executor/nodeAgg.h"
#include "executor/nodeHash.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -128,6 +129,7 @@ bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
bool enable_hashagg = true;
+bool enable_groupingsets_hash_disk = true;
bool enable_nestloop = true;
bool enable_material = true;
bool enable_mergejoin = true;
@@ -2153,7 +2155,7 @@ cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples)
+ double input_tuples, double input_width)
{
double output_tuples;
Cost startup_cost;
@@ -2219,21 +2221,88 @@ cost_agg(Path *path, PlannerInfo *root,
total_cost += aggcosts->finalCost.per_tuple * numGroups;
total_cost += cpu_tuple_cost * numGroups;
output_tuples = numGroups;
+
+ /*
+ * We don't need to compute the disk costs of hash aggregation here,
+ * because the planner does not choose hash aggregation for grouping
+ * sets that it doesn't expect to fit in memory.
+ */
}
else
{
+ double pages_written = 0.0;
+ double pages_read = 0.0;
+ double hashentrysize;
+ double nbatches;
+ Size mem_limit;
+ long ngroups_limit;
+ int num_partitions;
+
/* must be AGG_HASHED */
startup_cost = input_total_cost;
if (!enable_hashagg)
startup_cost += disable_cost;
startup_cost += aggcosts->transCost.startup;
startup_cost += aggcosts->transCost.per_tuple * input_tuples;
+ /* cost of computing hash value */
startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;
startup_cost += aggcosts->finalCost.startup;
+
total_cost = startup_cost;
total_cost += aggcosts->finalCost.per_tuple * numGroups;
+ /* cost of retrieving from hash table */
total_cost += cpu_tuple_cost * numGroups;
output_tuples = numGroups;
+
+ /*
+ * Estimate number of batches based on the computed limits. If less
+ * than or equal to one, all groups are expected to fit in memory;
+ * otherwise we expect to spill.
+ */
+ hashentrysize = hash_agg_entry_size(
+ aggcosts->numAggs, input_width, aggcosts->transitionSpace);
+ hash_agg_set_limits(hashentrysize, numGroups, 0, &mem_limit,
+ &ngroups_limit, &num_partitions);
+
+ nbatches = Max( (numGroups * hashentrysize) / mem_limit,
+ numGroups / ngroups_limit );
+
+ /*
+ * Estimate number of pages read and written. For each level of
+ * recursion, a tuple must be written and then later read.
+ */
+ if (nbatches > 1.0)
+ {
+ double depth;
+ double pages;
+
+ pages = relation_byte_size(input_tuples, input_width) / BLCKSZ;
+
+ /*
+ * The number of partitions can change at different levels of
+ * recursion; but for the purposes of this calculation assume it
+ * stays constant.
+ */
+ depth = ceil( log(nbatches - 1) / log(num_partitions) );
+ pages_written = pages_read = pages * depth;
+ }
+
+ /*
+ * Add the disk costs of hash aggregation that spills to disk.
+ *
+ * Groups that go into the hash table stay in memory until finalized,
+ * so spilling and reprocessing tuples doesn't incur additional
+ * invocations of transCost or finalCost. Furthermore, the computed
+ * hash value is stored with the spilled tuples, so we don't incur
+ * extra invocations of the hash function.
+ *
+ * Hash Agg begins returning tuples after the first batch is
+ * complete. Accrue writes (spilled tuples) to startup_cost and to
+ * total_cost; accrue reads only to total_cost.
+ */
+ startup_cost += pages_written * random_page_cost;
+ total_cost += pages_written * random_page_cost;
+ total_cost += pages_read * seq_page_cost;
}
/*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e048d200bb4..090919e39a0 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1644,6 +1644,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
NIL,
NIL,
best_path->path.rows,
+ 0,
subplan);
}
else
@@ -2096,6 +2097,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
NIL,
NIL,
best_path->numGroups,
+ best_path->transitionSpace,
subplan);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -2257,6 +2259,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
NIL,
rollup->numGroups,
+ best_path->transitionSpace,
sort_plan);
/*
@@ -2295,6 +2298,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
chain,
rollup->numGroups,
+ best_path->transitionSpace,
subplan);
/* Copy cost data from Path to Plan */
@@ -6192,8 +6196,8 @@ Agg *
make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree)
+ List *groupingSets, List *chain, double dNumGroups,
+ int32 transitionSpace, Plan *lefttree)
{
Agg *node = makeNode(Agg);
Plan *plan = &node->plan;
@@ -6209,6 +6213,7 @@ make_agg(List *tlist, List *qual,
node->grpOperators = grpOperators;
node->grpCollations = grpCollations;
node->numGroups = numGroups;
+ node->transitionSpace = transitionSpace;
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
node->groupingSets = groupingSets;
node->chain = chain;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b44efd6314c..8c5b2d06301 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4258,11 +4258,12 @@ consider_groupingsets_paths(PlannerInfo *root,
dNumGroups - exclude_groups);
/*
- * gd->rollups is empty if we have only unsortable columns to work
- * with. Override work_mem in that case; otherwise, we'll rely on the
- * sorted-input case to generate usable mixed paths.
+ * If we have sortable columns to work with (gd->rollups is non-empty)
+ * and enable_groupingsets_hash_disk is disabled, don't generate
+ * hash-based paths that will exceed work_mem.
*/
- if (hashsize > work_mem * 1024L && gd->rollups)
+ if (!enable_groupingsets_hash_disk &&
+ hashsize > work_mem * 1024L && gd->rollups)
return; /* nope, won't fit */
/*
@@ -6505,8 +6506,6 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
if (can_hash)
{
- double hashaggtablesize;
-
if (parse->groupingSets)
{
/*
@@ -6518,34 +6517,20 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
}
else
{
- hashaggtablesize = estimate_hashagg_tablesize(cheapest_path,
- agg_costs,
- dNumGroups);
-
/*
- * Provided that the estimated size of the hashtable does not
- * exceed work_mem, we'll generate a HashAgg Path, although if we
- * were unable to sort above, then we'd better generate a Path, so
- * that we at least have one.
+ * We just need an Agg over the cheapest-total input path,
+ * since input order won't matter.
*/
- if (hashaggtablesize < work_mem * 1024L ||
- grouped_rel->pathlist == NIL)
- {
- /*
- * We just need an Agg over the cheapest-total input path,
- * since input order won't matter.
- */
- add_path(grouped_rel, (Path *)
- create_agg_path(root, grouped_rel,
- cheapest_path,
- grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_SIMPLE,
- parse->groupClause,
- havingQual,
- agg_costs,
- dNumGroups));
- }
+ add_path(grouped_rel, (Path *)
+ create_agg_path(root, grouped_rel,
+ cheapest_path,
+ grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_SIMPLE,
+ parse->groupClause,
+ havingQual,
+ agg_costs,
+ dNumGroups));
}
/*
@@ -6557,22 +6542,17 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
{
Path *path = partially_grouped_rel->cheapest_total_path;
- hashaggtablesize = estimate_hashagg_tablesize(path,
- agg_final_costs,
- dNumGroups);
-
- if (hashaggtablesize < work_mem * 1024L)
- add_path(grouped_rel, (Path *)
- create_agg_path(root,
- grouped_rel,
- path,
- grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_FINAL_DESERIAL,
- parse->groupClause,
- havingQual,
- agg_final_costs,
- dNumGroups));
+ add_path(grouped_rel, (Path *)
+ create_agg_path(root,
+ grouped_rel,
+ path,
+ grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_FINAL_DESERIAL,
+ parse->groupClause,
+ havingQual,
+ agg_final_costs,
+ dNumGroups));
}
}
@@ -6816,22 +6796,10 @@ create_partial_grouping_paths(PlannerInfo *root,
if (can_hash && cheapest_total_path != NULL)
{
- double hashaggtablesize;
-
/* Checked above */
Assert(parse->hasAggs || parse->groupClause);
- hashaggtablesize =
- estimate_hashagg_tablesize(cheapest_total_path,
- agg_partial_costs,
- dNumPartialGroups);
-
- /*
- * Tentatively produce a partial HashAgg Path, depending on if it
- * looks as if the hash table will fit in work_mem.
- */
- if (hashaggtablesize < work_mem * 1024L &&
- cheapest_total_path != NULL)
+ if (cheapest_total_path != NULL)
{
add_path(partially_grouped_rel, (Path *)
create_agg_path(root,
@@ -6849,16 +6817,8 @@ create_partial_grouping_paths(PlannerInfo *root,
if (can_hash && cheapest_partial_path != NULL)
{
- double hashaggtablesize;
-
- hashaggtablesize =
- estimate_hashagg_tablesize(cheapest_partial_path,
- agg_partial_costs,
- dNumPartialPartialGroups);
-
/* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
- cheapest_partial_path != NULL)
+ if (cheapest_partial_path != NULL)
{
add_partial_path(partially_grouped_rel, (Path *)
create_agg_path(root,
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 1a23e18970d..951aed80e7a 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1072,7 +1072,7 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
numGroupCols, dNumGroups,
NIL,
input_path->startup_cost, input_path->total_cost,
- input_path->rows);
+ input_path->rows, input_path->pathtarget->width);
/*
* Now for the sorted case. Note that the input is *always* unsorted,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e6d08aede56..8ba8122ee2f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1704,7 +1704,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
NIL,
subpath->startup_cost,
subpath->total_cost,
- rel->rows);
+ rel->rows,
+ subpath->pathtarget->width);
}
if (sjinfo->semi_can_btree && sjinfo->semi_can_hash)
@@ -2949,6 +2950,7 @@ create_agg_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->aggsplit = aggsplit;
pathnode->numGroups = numGroups;
+ pathnode->transitionSpace = aggcosts ? aggcosts->transitionSpace : 0;
pathnode->groupClause = groupClause;
pathnode->qual = qual;
@@ -2957,7 +2959,7 @@ create_agg_path(PlannerInfo *root,
list_length(groupClause), numGroups,
qual,
subpath->startup_cost, subpath->total_cost,
- subpath->rows);
+ subpath->rows, subpath->pathtarget->width);
/* add tlist eval cost for each output row */
pathnode->path.startup_cost += target->cost.startup;
@@ -3036,6 +3038,7 @@ create_groupingsets_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->rollups = rollups;
pathnode->qual = having_qual;
+ pathnode->transitionSpace = agg_costs ? agg_costs->transitionSpace : 0;
Assert(rollups != NIL);
Assert(aggstrategy != AGG_PLAIN || list_length(rollups) == 1);
@@ -3067,7 +3070,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
subpath->startup_cost,
subpath->total_cost,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
is_first = false;
if (!rollup->is_hashed)
is_first_sort = false;
@@ -3090,7 +3094,8 @@ create_groupingsets_path(PlannerInfo *root,
rollup->numGroups,
having_qual,
0.0, 0.0,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
if (!rollup->is_hashed)
is_first_sort = false;
}
@@ -3115,7 +3120,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
sort_path.startup_cost,
sort_path.total_cost,
- sort_path.rows);
+ sort_path.rows,
+ subpath->pathtarget->width);
}
pathnode->path.total_cost += agg_path.total_cost;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8228e1f3903..270e220fbc9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -998,6 +998,16 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_groupingsets_hash_disk", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of hashed aggregation plans for groupingsets when the total size of the hash tables is expected to exceed work_mem."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_groupingsets_hash_disk,
+ false,
+ NULL, NULL, NULL
+ },
{
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of materialization."),
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 4f78b55fbaf..36104a73a75 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -201,6 +201,7 @@ static long ltsGetFreeBlock(LogicalTapeSet *lts);
static void ltsReleaseBlock(LogicalTapeSet *lts, long blocknum);
static void ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
SharedFileSet *fileset);
+static void ltsInitTape(LogicalTape *lt);
static void ltsInitReadBuffer(LogicalTapeSet *lts, LogicalTape *lt);
@@ -536,6 +537,30 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
lts->nHoleBlocks = lts->nBlocksAllocated - nphysicalblocks;
}
+/*
+ * Initialize per-tape struct. Note we allocate the I/O buffer and the first
+ * block for a tape only when it is first actually written to. This avoids
+ * wasting memory space when tuplesort.c overestimates the number of tapes
+ * needed.
+ */
+static void
+ltsInitTape(LogicalTape *lt)
+{
+ lt->writing = true;
+ lt->frozen = false;
+ lt->dirty = false;
+ lt->firstBlockNumber = -1L;
+ lt->curBlockNumber = -1L;
+ lt->nextBlockNumber = -1L;
+ lt->offsetBlockNumber = 0L;
+ lt->buffer = NULL;
+ lt->buffer_size = 0;
+ /* palloc() larger than MaxAllocSize would fail */
+ lt->max_size = MaxAllocSize;
+ lt->pos = 0;
+ lt->nbytes = 0;
+}
+
/*
* Lazily allocate and initialize the read buffer. This avoids waste when many
* tapes are open at once, but not all are active between rewinding and
@@ -579,7 +604,6 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
int worker)
{
LogicalTapeSet *lts;
- LogicalTape *lt;
int i;
/*
@@ -597,29 +621,8 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
lts->nFreeBlocks = 0;
lts->nTapes = ntapes;
- /*
- * Initialize per-tape structs. Note we allocate the I/O buffer and the
- * first block for a tape only when it is first actually written to. This
- * avoids wasting memory space when tuplesort.c overestimates the number
- * of tapes needed.
- */
for (i = 0; i < ntapes; i++)
- {
- lt = &lts->tapes[i];
- lt->writing = true;
- lt->frozen = false;
- lt->dirty = false;
- lt->firstBlockNumber = -1L;
- lt->curBlockNumber = -1L;
- lt->nextBlockNumber = -1L;
- lt->offsetBlockNumber = 0L;
- lt->buffer = NULL;
- lt->buffer_size = 0;
- /* palloc() larger than MaxAllocSize would fail */
- lt->max_size = MaxAllocSize;
- lt->pos = 0;
- lt->nbytes = 0;
- }
+ ltsInitTape(&lts->tapes[i]);
/*
* Create temp BufFile storage as required.
@@ -1004,6 +1007,29 @@ LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum, TapeShare *share)
}
}
+/*
+ * Add additional tapes to this tape set. Not intended to be used when any
+ * tapes are frozen.
+ */
+LogicalTapeSet *
+LogicalTapeSetExtend(LogicalTapeSet *lts, int nAdditional)
+{
+ int i;
+ int nTapesOrig = lts->nTapes;
+ Size newSize;
+
+ lts->nTapes += nAdditional;
+ newSize = offsetof(LogicalTapeSet, tapes) +
+ lts->nTapes * sizeof(LogicalTape);
+
+ lts = (LogicalTapeSet *) repalloc(lts, newSize);
+
+ for (i = nTapesOrig; i < lts->nTapes; i++)
+ ltsInitTape(&lts->tapes[i]);
+
+ return lts;
+}
+
/*
* Backspace the tape a given number of bytes. (We also support a more
* general seek interface, see below.)
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 73a2ca8c6dd..d70bc048c46 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -226,9 +226,13 @@ typedef enum ExprEvalOp
EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
EEOP_AGG_INIT_TRANS,
+ EEOP_AGG_INIT_TRANS_SPILLED,
EEOP_AGG_STRICT_TRANS_CHECK,
+ EEOP_AGG_STRICT_TRANS_CHECK_SPILLED,
EEOP_AGG_PLAIN_TRANS_BYVAL,
+ EEOP_AGG_PLAIN_TRANS_BYVAL_SPILLED,
EEOP_AGG_PLAIN_TRANS,
+ EEOP_AGG_PLAIN_TRANS_SPILLED,
EEOP_AGG_ORDERED_TRANS_DATUM,
EEOP_AGG_ORDERED_TRANS_TUPLE,
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 81fdfa4add3..d6eb2abb60b 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -255,7 +255,7 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
- bool doSort, bool doHash);
+ bool doSort, bool doHash, bool spilled);
extern ExprState *ExecBuildGroupingEqual(TupleDesc ldesc, TupleDesc rdesc,
const TupleTableSlotOps *lops, const TupleTableSlotOps *rops,
int numCols,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 264916f9a92..307987a45ab 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -311,5 +311,8 @@ extern void ExecReScanAgg(AggState *node);
extern Size hash_agg_entry_size(int numAggs, Size tupleWidth,
Size transitionSpace);
+extern void hash_agg_set_limits(double hashentrysize, uint64 input_groups,
+ int used_bits, Size *mem_limit,
+ long *ngroups_limit, int *num_partitions);
#endif /* NODEAGG_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cd3ddf781f1..19b9cef42f6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2078,13 +2078,32 @@ typedef struct AggState
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
/* these fields are used in AGG_HASHED and AGG_MIXED modes: */
bool table_filled; /* hash table filled yet? */
- int num_hashes;
+ int num_hashes; /* number of hash tables active at once */
+ double hashentrysize; /* estimate revised during execution */
+ struct HashTapeInfo *hash_tapeinfo; /* metadata for spill tapes */
+ struct HashAggSpill *hash_spills; /* HashAggSpill for each hash table,
+ exists only during first pass if spilled */
+ TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
+ bool hash_ever_spilled; /* ever spilled during this execution? */
+ bool hash_spill_mode; /* we hit a limit during the current batch
+ and we must not create new groups */
+ Size hash_alloc_last; /* previous total memory allocation */
+ Size hash_alloc_current; /* current total memory allocation */
+ Size hash_mem_limit; /* limit before spilling hash table */
+ Size hash_mem_peak; /* peak hash table memory usage */
+ long hash_ngroups_current; /* number of groups currently in
+ memory in all hash tables */
+ long hash_ngroups_limit; /* limit before spilling hash table */
+ long hash_disk_used; /* kB of disk space used */
+ int hash_batches_used; /* batches used during entire execution */
+ List *hash_batches; /* hash batches remaining to be processed */
+
AggStatePerHash perhash; /* array of per-hashtable data */
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 34
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 49
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 3d3be197e0e..be592d0fee4 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1663,6 +1663,7 @@ typedef struct AggPath
AggStrategy aggstrategy; /* basic strategy, see nodes.h */
AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
double numGroups; /* estimated number of groups in input */
+ int32 transitionSpace; /* estimated transition state size */
List *groupClause; /* a list of SortGroupClause's */
List *qual; /* quals (HAVING quals), if any */
} AggPath;
@@ -1700,6 +1701,7 @@ typedef struct GroupingSetsPath
AggStrategy aggstrategy; /* basic strategy */
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
+ int32 transitionSpace; /* estimated transition state size */
} GroupingSetsPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 32c0d87f80e..f4183e1efa5 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -813,6 +813,7 @@ typedef struct Agg
Oid *grpOperators; /* equality operators to compare with */
Oid *grpCollations;
long numGroups; /* estimated number of groups in input */
+ int32 transitionSpace; /* estimated transition state size */
Bitmapset *aggParams; /* IDs of Params used in Aggref inputs */
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
List *groupingSets; /* grouping sets to use */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index cb012ba1980..5a0fbebd895 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -54,6 +54,7 @@ extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
extern PGDLLIMPORT bool enable_hashagg;
+extern PGDLLIMPORT bool enable_groupingsets_hash_disk;
extern PGDLLIMPORT bool enable_nestloop;
extern PGDLLIMPORT bool enable_material;
extern PGDLLIMPORT bool enable_mergejoin;
@@ -114,7 +115,7 @@ extern void cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples);
+ double input_tuples, double input_width);
extern void cost_windowagg(Path *path, PlannerInfo *root,
List *windowFuncs, int numPartCols, int numOrderCols,
Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index eab486a6214..c7bda2b0917 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -54,8 +54,8 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree);
+ List *groupingSets, List *chain, double dNumGroups,
+ int32 transitionSpace, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
/*
diff --git a/src/include/utils/logtape.h b/src/include/utils/logtape.h
index 695d2c00ee4..3ebe52239f8 100644
--- a/src/include/utils/logtape.h
+++ b/src/include/utils/logtape.h
@@ -67,6 +67,8 @@ extern void LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum,
extern void LogicalTapeRewindForWrite(LogicalTapeSet *lts, int tapenum);
extern void LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum,
TapeShare *share);
+extern LogicalTapeSet *LogicalTapeSetExtend(LogicalTapeSet *lts,
+ int nAdditional);
extern size_t LogicalTapeBackspace(LogicalTapeSet *lts, int tapenum,
size_t size);
extern void LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index f457b5b150f..7eeeaaa5e4a 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2357,3 +2357,124 @@ explain (costs off)
-> Seq Scan on onek
(8 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+set work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------------
+ GroupAggregate
+ Group Key: ((g % 100000))
+ -> Sort
+ Sort Key: ((g % 100000))
+ -> Function Scan on generate_series g
+(5 rows)
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+create table agg_group_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+set jit_above_cost to default;
+create table agg_group_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_group_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+-- Produce results with hash aggregation
+set enable_hashagg = true;
+set enable_sort = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 100000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+create table agg_hash_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+set jit_above_cost to default;
+create table agg_hash_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_hash_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+set enable_sort = true;
+set work_mem to default;
+-- Compare group aggregation results to hash aggregation results
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+ a | c1 | c2 | c3
+---+----+----+----
+(0 rows)
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_4 except select * from agg_group_4)
+ union all
+(select * from agg_group_4 except select * from agg_hash_4);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_group_4;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
+drop table agg_hash_4;
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index c1f802c88a7..dbe5140b558 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1633,4 +1633,126 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
| 1 | 2
(4 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low
+-- and turning on enable_groupingsets_hash_disk.
+--
+SET enable_groupingsets_hash_disk = true;
+SET work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+-- Produce results with hash aggregation.
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------
+ MixedAggregate
+ Hash Key: (g.g % 1000), (g.g % 100), (g.g % 10)
+ Hash Key: (g.g % 1000), (g.g % 100)
+ Hash Key: (g.g % 1000)
+ Hash Key: (g.g % 100), (g.g % 10)
+ Hash Key: (g.g % 100)
+ Hash Key: (g.g % 10), (g.g % 1000)
+ Hash Key: (g.g % 10)
+ Group Key: ()
+ -> Function Scan on generate_series g
+(10 rows)
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+set enable_sort = true;
+set work_mem to default;
+-- Compare results
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+ g100 | g10 | unnest | c | m
+------+-----+--------+---+---
+(0 rows)
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
+SET enable_groupingsets_hash_disk TO DEFAULT;
-- end
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1de..11c6f50fbfa 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -148,6 +148,68 @@ SELECT count(*) FROM
4
(1 row)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+SET enable_hashagg=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------------
+ Unique
+ -> Sort
+ Sort Key: ((g % 1000))
+ -> Function Scan on generate_series g
+(4 rows)
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_hashagg=TRUE;
+-- Produce results with hash aggregation.
+SET enable_sort=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 1000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_sort=TRUE;
+SET work_mem TO DEFAULT;
+-- Compare results
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb9057..147486c2fc3 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -74,6 +74,7 @@ select name, setting from pg_settings where name like 'enable%';
--------------------------------+---------
enable_bitmapscan | on
enable_gathermerge | on
+ enable_groupingsets_hash_disk | off
enable_hashagg | on
enable_hashjoin | on
enable_indexonlyscan | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 3e593f2d615..a4d476c4bb3 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1032,3 +1032,119 @@ select v||'a', case when v||'a' = 'aa' then 1 else 0 end, count(*)
explain (costs off)
select 1 from tenk1
where (hundred, thousand) in (select twothousand, twothousand from onek);
+
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+set work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+
+set jit_above_cost to default;
+
+create table agg_group_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_group_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+-- Produce results with hash aggregation
+
+set enable_hashagg = true;
+set enable_sort = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+
+set jit_above_cost to default;
+
+create table agg_hash_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_hash_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare group aggregation results to hash aggregation results
+
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+
+(select * from agg_hash_4 except select * from agg_group_4)
+ union all
+(select * from agg_group_4 except select * from agg_hash_4);
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_group_4;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
+drop table agg_hash_4;
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 95ac3fb52f6..478f49ecab5 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -441,4 +441,107 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
from unnest(array[1,1], array['a','b']) u(i,v)
group by rollup(i, v||'a') order by 1,3;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low
+-- and turning on enable_groupingsets_hash_disk.
+--
+
+SET enable_groupingsets_hash_disk = true;
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+-- Produce results with hash aggregation.
+
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare results
+
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
+
+SET enable_groupingsets_hash_disk TO DEFAULT;
+
-- end
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449e..33102744ebf 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -45,6 +45,68 @@ SELECT count(*) FROM
SELECT count(*) FROM
(SELECT DISTINCT two, four, two FROM tenk1) ss;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+SET enable_hashagg=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_hashagg=TRUE;
+
+-- Produce results with hash aggregation.
+
+SET enable_sort=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_sort=TRUE;
+
+SET work_mem TO DEFAULT;
+
+-- Compare results
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
+
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
On Thu, Feb 20, 2020 at 04:56:38PM -0800, Jeff Davis wrote:
On Wed, 2020-02-19 at 20:16 +0100, Tomas Vondra wrote:
1) explain.c currently does this:
I wonder if we could show something for plain explain (without
analyze).
At least the initial estimate of partitions, etc. I know not showing
those details until after execution is what e.g. sort does, but I
find it a bit annoying.

Looks like you meant to include some example explain output, but I
think I understand what you mean. I'll look into it.
Oh, right. What I wanted to include is this code snippet:
if (es->analyze)
show_hashagg_info((AggState *) planstate, es);
but I forgot to do the copy-paste.
2) The ExecBuildAggTrans comment should probably explain "spilled".
Done.
3) I wonder if we need to invent new opcodes? Wouldn't it be simpler
to just add a new flag to the agg_* structs instead? I haven't tried
hacking this, so maybe it's a silly idea.

There was a reason I didn't do it this way, but I'm trying to remember
why. I'll look into this, also.

4) lookup_hash_entries says
/* check to see if we need to spill the tuple for this grouping set */

But that seems bogus, because AFAIK we can't spill tuples for grouping
sets. So maybe this should say just "grouping"?

Yes, we can spill tuples for grouping sets. Unfortunately, I think my
tests (which covered this case previously) don't seem to be exercising
that path well now. I am going to improve my tests, too.

5) Assert(nbuckets > 0);
I did not repro this issue, but I did set a floor of 256 buckets.
Hmmm. I can reproduce it reliably (on the patch from 2020/02/18) but it
seems to only happen when the table is large enough. For me, doing
insert into t select * from t;
until the table has ~7.8M rows does the trick. I can't reproduce it on
the current patch, so ensuring there are at least 256 buckets seems to
have helped. If I add an elog() to print nbuckets at the beginning of
hash_choose_num_buckets, I see it starts as 0 from time to time (and
then gets tweaked to 256).
I suppose this is due to how the input data is generated, i.e. all hash
values should fall to the first batch, so all other batches should be
empty. But in agg_refill_hash_table we use the number of input tuples as
a starting point for the number of buckets, which is how we get nbuckets = 0.
I think enforcing nbuckets to be at least 256 is OK.
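
For reference, a minimal standalone sketch of the kind of clamp being
discussed is below; the function name and signature are my own for
illustration and need not match the patch's actual bucket-choosing code:

#include <stddef.h>

/*
 * Illustrative sketch only: derive a bucket count from the estimated
 * number of groups, capped by the available memory and floored at 256
 * so that an empty (or tiny) batch can never yield nbuckets = 0.
 */
static long
choose_nbuckets_sketch(double hashentrysize, long input_groups, size_t memory)
{
    long max_nbuckets = (long) (memory / hashentrysize);
    long nbuckets = input_groups;

    if (max_nbuckets > 0 && nbuckets > max_nbuckets)
        nbuckets = max_nbuckets;
    if (nbuckets < 256)
        nbuckets = 256;

    return nbuckets;
}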
which fails with segfault at execution time:
Fixed. I was resetting the hash table context without setting the
pointers to NULL.
Yep, can confirm it's no longer crashing for me.
Thanks! I attached a new, rebased version. The fixes are quick fixes
for now and I will revisit them after I improve my test cases (which
might find more issues).
OK, sounds good.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-02-22 11:02:16 -0800, Jeff Davis wrote:
On Sat, 2020-02-22 at 10:00 -0800, Andres Freund wrote:
Both patches, or just 0013? Seems the earlier one might make the
addition of the opcodes you add less verbose?

Just 0013, thank you. 0008 looks like it will simplify things.
Pushed 0008.
On Mon, 2020-02-24 at 15:29 -0800, Andres Freund wrote:
On 2020-02-22 11:02:16 -0800, Jeff Davis wrote:
On Sat, 2020-02-22 at 10:00 -0800, Andres Freund wrote:
Both patches, or just 0013? Seems the earlier one might make the
addition of the opcodes you add less verbose?

Just 0013, thank you. 0008 looks like it will simplify things.
Pushed 0008.
Rebased on your change. This simplified the JIT and interpretation code
quite a bit.
Also:
* caching the compiled expressions so I can switch between the variants
cheaply
* added "Planned Partitions" to explain output
* included tape buffers in the "Memory Used" output
* Simplified the way I try to track memory usage and trigger spilling.
* Reset hash tables always rather than rebuilding them from scratch.
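
For the new explain lines, a hypothetical text-format fragment follows:
the plan shape is copied from the aggregates regression test in the
patch, the numbers are invented, and the property lines follow the
format strings in show_hashagg_info ("Planned Partitions" appears
whenever costs are shown; the Memory Usage/Batches/Disk line only under
EXPLAIN ANALYZE):

 HashAggregate
   Group Key: (g % 100000)
   Planned Partitions: 4
   Memory Usage: 4096kB Batches: 5 Disk: 9888kB
   -> Function Scan on generate_series g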
I will do another round of performance tests and see if anything
changed from last time.
Regards,
Jeff Davis
Attachments:
hashagg-20200226.patch (text/x-patch)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c1128f89ec7..edfec0362e1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4476,6 +4476,24 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-groupingsets-hash-disk" xreflabel="enable_groupingsets_hash_disk">
+ <term><varname>enable_groupingsets_hash_disk</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_groupingsets_hash_disk</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation for
+ grouping sets when the size of the hash tables is expected to exceed
+ <varname>work_mem</varname>. See <xref
+ linkend="queries-grouping-sets"/>. Note that this setting only
+ affects the chosen plan; execution time may still require using
+ disk-based hash aggregation. The default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
<term><varname>enable_hashjoin</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d901dc4a50e..70196ea48d0 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -104,6 +104,7 @@ static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_hashagg_info(AggState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
static void show_instrumentation_count(const char *qlabel, int which,
@@ -1882,6 +1883,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Agg:
show_agg_keys(castNode(AggState, planstate), ancestors, es);
show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+ show_hashagg_info((AggState *) planstate, es);
if (plan->qual)
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
@@ -2769,6 +2771,67 @@ show_hash_info(HashState *hashstate, ExplainState *es)
}
}
+/*
+ * If EXPLAIN ANALYZE, show information on hash aggregate memory usage and
+ * batches.
+ */
+static void
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+
+ Assert(IsA(aggstate, AggState));
+
+ if (agg->aggstrategy != AGG_HASHED &&
+ agg->aggstrategy != AGG_MIXED)
+ return;
+
+ if (es->costs)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(
+ es->str,
+ "Planned Partitions: %d\n",
+ aggstate->hash_planned_partitions);
+ }
+
+ if (!es->analyze)
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(
+ es->str,
+ "Memory Usage: %ldkB",
+ memPeakKb);
+
+ if (aggstate->hash_batches_used > 0)
+ {
+ appendStringInfo(
+ es->str,
+ " Batches: %d Disk: %ldkB",
+ aggstate->hash_batches_used, aggstate->hash_disk_used);
+ }
+
+ appendStringInfo(
+ es->str,
+ "\n");
+ }
+ else
+ {
+ ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
+ if (aggstate->hash_batches_used > 0)
+ {
+ ExplainPropertyInteger("HashAgg Batches", NULL,
+ aggstate->hash_batches_used, es);
+ ExplainPropertyInteger("Disk Usage", "kB",
+ aggstate->hash_disk_used, es);
+ }
+ }
+}
+
/*
* If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
*/
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 91aa386fa61..8c5ead93d68 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -79,7 +79,8 @@ static void ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash);
+ int transno, int setno, int setoff, bool ishash,
+ bool nullcheck);
/*
@@ -2924,10 +2925,13 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
* check for filters, evaluate aggregate input, check that that input is not
* NULL for a strict transition function, and then finally invoke the
* transition for each of the concurrently computed grouping sets.
+ *
+ * If nullcheck is true, the generated code will check for a NULL pointer to
+ * the array of AggStatePerGroup, and skip evaluation if so.
*/
ExprState *
ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
- bool doSort, bool doHash)
+ bool doSort, bool doHash, bool nullcheck)
{
ExprState *state = makeNode(ExprState);
PlanState *parent = &aggstate->ss.ps;
@@ -3158,7 +3162,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (int setno = 0; setno < processGroupingSets; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, false);
+ pertrans, transno, setno, setoff, false,
+ nullcheck);
setoff++;
}
}
@@ -3177,7 +3182,8 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
for (int setno = 0; setno < numHashes; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, true);
+ pertrans, transno, setno, setoff, true,
+ nullcheck);
setoff++;
}
}
@@ -3227,15 +3233,28 @@ static void
ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash)
+ int transno, int setno, int setoff, bool ishash,
+ bool nullcheck)
{
ExprContext *aggcontext;
+ int adjust_jumpnull = -1;
if (ishash)
aggcontext = aggstate->hashcontext;
else
aggcontext = aggstate->aggcontexts[setno];
+ /* add check for NULL pointer? */
+ if (nullcheck)
+ {
+ scratch->opcode = EEOP_AGG_PLAIN_PERGROUP_NULLCHECK;
+ scratch->d.agg_plain_pergroup_nullcheck.setoff = setoff;
+ /* adjust later */
+ scratch->d.agg_plain_pergroup_nullcheck.jumpnull = -1;
+ ExprEvalPushStep(state, scratch);
+ adjust_jumpnull = state->steps_len - 1;
+ }
+
/*
* Determine appropriate transition implementation.
*
@@ -3303,6 +3322,16 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
scratch->d.agg_trans.transno = transno;
scratch->d.agg_trans.aggcontext = aggcontext;
ExprEvalPushStep(state, scratch);
+
+ /* fix up jumpnull */
+ if (adjust_jumpnull != -1)
+ {
+ ExprEvalStep *as = &state->steps[adjust_jumpnull];
+
+ Assert(as->opcode == EEOP_AGG_PLAIN_PERGROUP_NULLCHECK);
+ Assert(as->d.agg_plain_pergroup_nullcheck.jumpnull == -1);
+ as->d.agg_plain_pergroup_nullcheck.jumpnull = state->steps_len;
+ }
}
/*
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index eafd4849002..298fdfcb1f6 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -435,6 +435,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_DESERIALIZE,
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
+ &&CASE_EEOP_AGG_PLAIN_PERGROUP_NULLCHECK,
&&CASE_EEOP_AGG_PLAIN_TRANS_INIT_STRICT_BYVAL,
&&CASE_EEOP_AGG_PLAIN_TRANS_STRICT_BYVAL,
&&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL,
@@ -1603,6 +1604,24 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ /*
+ * If a hash aggregate is in spilled mode, this tuple may have a
+ * per-group state for some grouping sets and not others. If there's
+ * no per-group state, then skip this grouping set.
+ */
+
+ EEO_CASE(EEOP_AGG_PLAIN_PERGROUP_NULLCHECK)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerGroup pergroup_allaggs = aggstate->all_pergroups
+ [op->d.agg_plain_pergroup_nullcheck.setoff];
+
+ if (pergroup_allaggs == NULL)
+ EEO_JUMP(op->d.agg_plain_pergroup_nullcheck.jumpnull);
+
+ EEO_NEXT();
+ }
+
/*
* Different types of aggregate transition functions are implemented
* as different types of steps, to avoid incurring unnecessary
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 13c21ffe9a3..fec001034f5 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -194,6 +194,29 @@
* transition values. hashcontext is the single context created to support
* all hash tables.
*
+ * Spilling To Disk
+ *
+ * When performing hash aggregation, if the hash table memory exceeds the
+ * limit (see hash_agg_check_limits()), we enter "spill mode". In spill
+ * mode, we advance the transition states only for groups already in the
+ * hash table. For tuples that would need to create new hash table
+ * entries (and initialize new transition states), we instead spill them to
+ * disk to be processed later. The tuples are spilled in a partitioned
+ * manner, so that subsequent batches are smaller and less likely to exceed
+ * work_mem (if a batch does exceed work_mem, it must be spilled
+ * recursively).
+ *
+ * Spilled data is written to logical tapes. These provide better control
+ * over memory usage, disk space, and the number of files than if we were
+ * to use a BufFile for each spill.
+ *
+ * Note that it's possible for transition states to start small but then
+ * grow very large; for instance in the case of ARRAY_AGG. In such cases,
+ * it's still possible to significantly exceed work_mem. We try to avoid
+ * this situation by estimating what will fit in the available memory, and
+ * imposing a limit on the number of groups separately from the amount of
+ * memory consumed.
+ *
* Transition / Combine function invocation:
*
* For performance reasons transition functions, including combine
@@ -233,12 +256,100 @@
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/datum.h"
+#include "utils/dynahash.h"
#include "utils/expandeddatum.h"
+#include "utils/logtape.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+/*
+ * Control how many partitions are created when spilling HashAgg to
+ * disk.
+ *
+ * HASHAGG_PARTITION_FACTOR is multiplied by the estimated number of
+ * partitions needed such that each partition will fit in memory. The factor
+ * is set higher than one because there's not a high cost to having a few too
+ * many partitions, and it makes it less likely that a partition will need to
+ * be spilled recursively. Another benefit of having more, smaller partitions
+ * is that small hash tables may perform better than large ones due to memory
+ * caching effects.
+ *
+ * We also specify a min and max number of partitions per spill. Too few might
+ * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
+ * many will result in lots of memory wasted buffering the spill files (which
+ * could instead be spent on a larger hash table).
+ *
+ * For reading from tapes, the buffer size must be a multiple of
+ * BLCKSZ. Larger values help when reading from multiple tapes concurrently,
+ * but that doesn't happen in HashAgg, so we simply use BLCKSZ. Writing to a
+ * tape always uses a buffer of size BLCKSZ.
+ */
+#define HASHAGG_PARTITION_FACTOR 1.50
+#define HASHAGG_MIN_PARTITIONS 4
+#define HASHAGG_MAX_PARTITIONS 256
+#define HASHAGG_MIN_BUCKETS 256
+#define HASHAGG_READ_BUFFER_SIZE BLCKSZ
+#define HASHAGG_WRITE_BUFFER_SIZE BLCKSZ
+
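[Note, not part of the diff: with the default 8kB BLCKSZ, the cap of 256
partitions bounds the write-buffer overhead of a single spill at
256 * 8kB = 2MB (plus one 8kB read buffer); at the minimum of 4 partitions the
overhead is only 40kB.]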
+/*
+ * Track all tapes needed for a HashAgg that spills. We don't know the maximum
+ * number of tapes needed at the start of the algorithm (because it can
+ * recurse), so one tape set is allocated and extended as needed for new
+ * tapes. Once a particular tape has been fully read, it is rewound for
+ * writing and put on the free list so that it can be reused.
+ *
+ * Tapes' buffers can take up substantial memory when many tapes are open at
+ * once. We only need one tape open at a time in read mode (using a buffer
+ * that's a multiple of BLCKSZ); but we need up to HASHAGG_MAX_PARTITIONS
+ * tapes open in write mode (each requiring a buffer of size BLCKSZ).
+ */
+typedef struct HashTapeInfo
+{
+ LogicalTapeSet *tapeset;
+ int ntapes;
+ int *freetapes;
+ int nfreetapes;
+} HashTapeInfo;
+
+/*
+ * Represents partitioned spill data for a single hashtable. Contains the
+ * necessary information to route tuples to the correct partition, and to
+ * transform the spilled data into new batches.
+ *
+ * The high bits are used for partition selection (when recursing, we ignore
+ * the bits that have already been used for partition selection at an earlier
+ * level).
+ */
+typedef struct HashAggSpill
+{
+ HashTapeInfo *tapeinfo; /* borrowed reference to tape info */
+ int npartitions; /* number of partitions */
+ int *partitions; /* spill partition tape numbers */
+ int64 *ntuples; /* number of tuples in each partition */
+ uint32 mask; /* mask to find partition from hash value */
+ int shift; /* after masking, shift by this amount */
+} HashAggSpill;
+
+/*
+ * Represents work to be done for one pass of hash aggregation (with only one
+ * grouping set).
+ *
+ * Also tracks the bits of the hash already used for partition selection by
+ * earlier iterations, so that this batch can use new bits. If all bits have
+ * already been used, no partitioning will be done (any spilled data will go
+ * to a single output tape).
+ */
+typedef struct HashAggBatch
+{
+ int setno; /* grouping set */
+ int used_bits; /* number of bits of hash already used */
+ HashTapeInfo *tapeinfo; /* borrowed reference to tape info */
+ int input_tapenum; /* input partition tape */
+ int64 input_tuples; /* number of tuples in this batch */
+} HashAggBatch;
+
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
static void initialize_phase(AggState *aggstate, int newphase);
static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -275,11 +386,41 @@ static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
static void build_hash_tables(AggState *aggstate);
static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
+static void hashagg_recompile_expressions(AggState *aggstate, bool nullcheck,
+ bool minslot);
+static long hash_choose_num_buckets(AggState *aggstate,
+ long estimated_nbuckets,
+ Size memory);
+static int hash_choose_num_partitions(uint64 input_groups,
+ double hashentrysize,
+ int used_bits,
+ int *log2_npartitions);
static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
+static bool agg_refill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
+static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
+static void hash_agg_check_limits(AggState *aggstate);
+static void hash_agg_update_metrics(AggState *aggstate, bool from_tape,
+ int npartitions);
+static void hashagg_finish_initial_spills(AggState *aggstate);
+static void hashagg_reset_spill_state(AggState *aggstate);
+static HashAggBatch *hashagg_batch_new(HashTapeInfo *tapeinfo,
+ int input_tapenum, int setno,
+ int64 input_tuples, int used_bits);
+static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
+static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
+ int used_bits, uint64 input_tuples,
+ double hashentrysize);
+static Size hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot,
+ uint32 hash);
+static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
+ int setno);
+static void hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *dest,
+ int ndest);
+static void hashagg_tapeinfo_release(HashTapeInfo *tapeinfo, int tapenum);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
AggState *aggstate, EState *estate,
@@ -1264,7 +1405,7 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
}
/*
- * (Re-)initialize the hash table(s) to empty.
+ * (Re-)initialize the hash table(s).
*
* To implement hashed aggregation, we need a hashtable that stores a
* representative tuple and an array of AggStatePerGroup structs for each
@@ -1275,9 +1416,9 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
* We have a separate hashtable and associated perhash data structure for each
* grouping set for which we're doing hashing.
*
- * The contents of the hash tables always live in the hashcontext's per-tuple
- * memory context (there is only one of these for all tables together, since
- * they are all reset at the same time).
+ * The hash tables and their contents always live in the hashcontext's
+ * per-tuple memory context (there is only one of these for all tables
+ * together, since they are all reset at the same time).
*/
static void
build_hash_tables(AggState *aggstate)
@@ -1287,14 +1428,27 @@ build_hash_tables(AggState *aggstate)
for (setno = 0; setno < aggstate->num_hashes; ++setno)
{
AggStatePerHash perhash = &aggstate->perhash[setno];
+ long nbuckets;
+ Size memory;
+
+ if (perhash->hashtable != NULL)
+ {
+ ResetTupleHashTable(perhash->hashtable);
+ continue;
+ }
Assert(perhash->aggnode->numGroups > 0);
- if (perhash->hashtable)
- ResetTupleHashTable(perhash->hashtable);
- else
- build_hash_table(aggstate, setno, perhash->aggnode->numGroups);
+ memory = aggstate->hash_mem_limit / aggstate->num_hashes;
+
+ /* choose reasonable number of buckets per hashtable */
+ nbuckets = hash_choose_num_buckets(
+ aggstate, perhash->aggnode->numGroups, memory);
+
+ build_hash_table(aggstate, setno, nbuckets);
}
+
+ aggstate->hash_ngroups_current = 0;
}
/*
@@ -1487,14 +1641,309 @@ hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
transitionSpace;
}
+/*
+ * hashagg_recompile_expressions()
+ *
+ * Identifies the right phase, compiles the right expression given the
+ * arguments, and then sets phase->evaltrans to that expression.
+ *
+ * Different versions of the compiled expression are needed depending on
+ * whether hash aggregation has spilled or not, and whether it's reading from
+ * the outer plan or a tape. Before spilling to disk, the expression reads
+ * from the outer plan (using a fixed slot) and does not need to perform a
+ * NULL check. After HashAgg begins to spill, new groups will not be created
+ * in the hash table, and the AggStatePerGroup array may be NULL; therefore we
+ * need to add a null pointer check to the expression. Then, when reading
+ * spilled data from a tape, we need to change the outer slot type to be a
+ * minimal tuple slot if that's different from the outer plan's slot type.
+ *
+ * It would be wasteful to recompile every time, so the first time this
+ * function is called (when entering spill mode), it compiles the three
+ * remaining variations of the expression, and caches them.
+ */
+static void
+hashagg_recompile_expressions(AggState *aggstate, bool nullcheck, bool minslot)
+{
+ AggStatePerPhase phase;
+
+ Assert(aggstate->aggstrategy == AGG_HASHED ||
+ aggstate->aggstrategy == AGG_MIXED);
+
+ if (aggstate->aggstrategy == AGG_HASHED)
+ phase = &aggstate->phases[0];
+ else /* AGG_MIXED */
+ phase = &aggstate->phases[1];
+
+ /* should have been created in ExecInitAgg */
+ Assert(phase->evaltrans_outerslot != NULL);
+
+ /* if not already done, compile expressions and cache them */
+ if (phase->evaltrans_nullcheck_outerslot == NULL)
+ {
+ const TupleTableSlotOps *outerops;
+ const TupleTableSlotOps *minimalops = &TTSOpsMinimalTuple;
+ const TupleTableSlotOps *ops = aggstate->ss.ps.outerops;
+ bool dohash = true;
+ bool dosort;
+
+ Assert(phase->evaltrans_minslot == NULL);
+ Assert(phase->evaltrans_nullcheck_minslot == NULL);
+
+ dosort = aggstate->aggstrategy == AGG_MIXED ? true : false;
+
+ outerops = ExecGetResultSlotOps(outerPlanState(&aggstate->ss), NULL);
+
+ /* temporarily change the outerops while compiling the expression */
+ aggstate->ss.ps.outerops = outerops;
+ phase->evaltrans_nullcheck_outerslot = ExecBuildAggTrans(
+ aggstate, phase, dosort, dohash, true);
+ aggstate->ss.ps.outerops = ops;
+
+ if (outerops == minimalops)
+ {
+ phase->evaltrans_minslot =
+ phase->evaltrans_outerslot;
+ phase->evaltrans_nullcheck_minslot =
+ phase->evaltrans_nullcheck_outerslot;
+ }
+ else
+ {
+ aggstate->ss.ps.outerops = minimalops;
+ phase->evaltrans_minslot = ExecBuildAggTrans(
+ aggstate, phase, dosort, dohash, false);
+ phase->evaltrans_nullcheck_minslot = ExecBuildAggTrans(
+ aggstate, phase, dosort, dohash, true);
+ aggstate->ss.ps.outerops = ops;
+ }
+ }
+
+ Assert(phase->evaltrans_outerslot != NULL);
+ Assert(phase->evaltrans_nullcheck_outerslot != NULL);
+ Assert(phase->evaltrans_minslot != NULL);
+ Assert(phase->evaltrans_nullcheck_minslot != NULL);
+
+ if (!nullcheck && !minslot)
+ phase->evaltrans = phase->evaltrans_outerslot;
+ else if (!nullcheck && minslot)
+ phase->evaltrans = phase->evaltrans_minslot;
+ else if (nullcheck && !minslot)
+ phase->evaltrans = phase->evaltrans_nullcheck_outerslot;
+ else /* nullcheck && minslot */
+ phase->evaltrans = phase->evaltrans_nullcheck_minslot;
+}
+
+/*
+ * Set limits that trigger spilling to avoid exceeding work_mem. Consider the
+ * number of partitions we expect to create (if we do spill).
+ *
+ * There are two limits: a memory limit, and also an ngroups limit. The
+ * ngroups limit becomes important when we expect transition values to grow
+ * substantially larger than the initial value.
+ */
+void
+hash_agg_set_limits(double hashentrysize, uint64 input_groups, int used_bits,
+ Size *mem_limit, long *ngroups_limit, int *num_partitions)
+{
+ int npartitions;
+ Size partition_mem;
+
+ /* if not expected to spill, use all of work_mem */
+ if (input_groups * hashentrysize < work_mem * 1024L)
+ {
+ *mem_limit = work_mem * 1024L;
+ *ngroups_limit = *mem_limit / hashentrysize;
+ return;
+ }
+
+ /*
+ * Calculate expected memory requirements for spilling, which is the size
+ * of the buffers needed for all the tapes that need to be open at
+ * once. Then, subtract that from the memory available for holding hash
+ * tables.
+ */
+ npartitions = hash_choose_num_partitions(input_groups,
+ hashentrysize,
+ used_bits,
+ NULL);
+ if (num_partitions != NULL)
+ *num_partitions = npartitions;
+
+ partition_mem =
+ HASHAGG_READ_BUFFER_SIZE +
+ HASHAGG_WRITE_BUFFER_SIZE * npartitions;
+
+ /*
+ * Don't set the limit below 3/4 of work_mem. In that case, we are at the
+ * minimum number of partitions, so we aren't going to dramatically exceed
+ * work mem anyway.
+ */
+ if (work_mem * 1024L > 4 * partition_mem)
+ *mem_limit = work_mem * 1024L - partition_mem;
+ else
+ *mem_limit = work_mem * 1024L * 0.75;
+
+ if (*mem_limit > hashentrysize)
+ *ngroups_limit = *mem_limit / hashentrysize;
+ else
+ *ngroups_limit = 1;
+}
+
+/*
+ * hash_agg_check_limits
+ *
+ * After adding a new group to the hash table, check whether we need to enter
+ * spill mode. Allocations may happen without adding new groups (for instance,
+ * if the transition state size grows), so this check is imperfect.
+ */
+static void
+hash_agg_check_limits(AggState *aggstate)
+{
+ long ngroups = aggstate->hash_ngroups_current;
+ Size hash_mem = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+
+ /*
+ * Don't spill unless there's at least one group in the hash table so we
+ * can be sure to make progress even in edge cases.
+ */
+ if (aggstate->hash_ngroups_current > 0 &&
+ (hash_mem > aggstate->hash_mem_limit ||
+ ngroups > aggstate->hash_ngroups_limit))
+ {
+ aggstate->hash_spill_mode = true;
+ hashagg_recompile_expressions(aggstate, true,
+ aggstate->table_filled);
+
+ if (!aggstate->hash_ever_spilled)
+ {
+ aggstate->hash_ever_spilled = true;
+ aggstate->hash_spills = palloc0(
+ sizeof(HashAggSpill) * aggstate->num_hashes);
+ aggstate->hash_tapeinfo = palloc0(sizeof(HashTapeInfo));
+ }
+ }
+}
+
+/*
+ * Update metrics after filling the hash table.
+ *
+ * If reading from the outer plan, from_tape should be false; if reading from
+ * another tape, from_tape should be true.
+ */
+static void
+hash_agg_update_metrics(AggState *aggstate, bool from_tape, int npartitions)
+{
+ Size hash_mem = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+ Size partition_mem = 0;
+
+ /* update hashentrysize estimate based on contents */
+ if (aggstate->hash_ngroups_current > 0)
+ {
+ aggstate->hashentrysize =
+ hash_mem / (double)aggstate->hash_ngroups_current;
+ }
+
+ /*
+ * Calculate peak memory usage, which includes memory for partition tapes'
+ * read/write buffers.
+ */
+ if (from_tape)
+ partition_mem += HASHAGG_READ_BUFFER_SIZE;
+ partition_mem += npartitions * HASHAGG_WRITE_BUFFER_SIZE;
+
+ if (hash_mem + partition_mem > aggstate->hash_mem_peak)
+ aggstate->hash_mem_peak = hash_mem + partition_mem;
+}
+
+/*
+ * Choose a reasonable number of buckets for the initial hash table size.
+ */
+static long
+hash_choose_num_buckets(AggState *aggstate, long ngroups, Size memory)
+{
+ long max_nbuckets;
+ long nbuckets = ngroups;
+
+ max_nbuckets = memory / aggstate->hashentrysize;
+
+ /*
+ * Leave room for slop to avoid a case where the initial hash table size
+ * exceeds the memory limit (though that may still happen in edge cases).
+ */
+ max_nbuckets *= 0.75;
+
+ if (nbuckets > max_nbuckets)
+ nbuckets = max_nbuckets;
+ if (nbuckets < HASHAGG_MIN_BUCKETS)
+ nbuckets = HASHAGG_MIN_BUCKETS;
+ return nbuckets;
+}
+
+/*
+ * Determine the number of partitions to create when spilling, which will
+ * always be a power of two. If log2_npartitions is non-NULL, set
+ * *log2_npartitions to the log2() of the number of partitions.
+ */
+static int
+hash_choose_num_partitions(uint64 input_groups, double hashentrysize,
+ int used_bits, int *log2_npartitions)
+{
+ Size mem_wanted;
+ int partition_limit;
+ int npartitions;
+ int partition_bits;
+
+ /*
+ * Avoid creating so many partitions that the memory requirements of the
+ * open partition files are greater than 1/4 of work_mem.
+ */
+ partition_limit =
+ (work_mem * 1024L * 0.25 - HASHAGG_READ_BUFFER_SIZE) /
+ HASHAGG_WRITE_BUFFER_SIZE;
+
+ /* pessimistically estimate that each input tuple creates a new group */
+ mem_wanted = HASHAGG_PARTITION_FACTOR * input_groups * hashentrysize;
+
+ /* make enough partitions so that each one is likely to fit in memory */
+ npartitions = 1 + (mem_wanted / (work_mem * 1024L));
+
+ if (npartitions > partition_limit)
+ npartitions = partition_limit;
+
+ if (npartitions < HASHAGG_MIN_PARTITIONS)
+ npartitions = HASHAGG_MIN_PARTITIONS;
+ if (npartitions > HASHAGG_MAX_PARTITIONS)
+ npartitions = HASHAGG_MAX_PARTITIONS;
+
+ /* ceil(log2(npartitions)) */
+ partition_bits = my_log2(npartitions);
+
+ /* make sure that we don't exhaust the hash bits */
+ if (partition_bits + used_bits >= 32)
+ partition_bits = 32 - used_bits;
+
+ if (log2_npartitions != NULL)
+ *log2_npartitions = partition_bits;
+
+ /* number of partitions will be a power of two */
+ npartitions = 1L << partition_bits;
+
+ return npartitions;
+}
+
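[Worked example, not part of the diff, assuming the default 8kB BLCKSZ: with
work_mem = 4MB, hashentrysize = 64 bytes and an estimated 1M input groups, the
groups clearly won't fit (1M * 64 = 64MB), so hash_agg_set_limits() takes the
spill path. hash_choose_num_partitions() wants about 1.5 * 64MB = 96MB split
into partitions of roughly work_mem each, which rounds up to the next power of
two: 32 partitions (partition_bits = 5). partition_mem is then one read buffer
plus 32 write buffers = 264kB, leaving mem_limit = 4MB - 264kB = 3832kB and
ngroups_limit = mem_limit / 64 = 61312.]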
/*
* Find or create a hashtable entry for the tuple group containing the current
* tuple (already set in tmpcontext's outertuple slot), in the current grouping
* set (which the caller must have selected - note that initialize_aggregate
* depends on this).
*
- * When called, CurrentMemoryContext should be the per-query context. The
- * already-calculated hash value for the tuple must be specified.
+ * When called, CurrentMemoryContext should be the per-query context.
+ *
+ * If the hash table is at the memory limit, then only find existing hashtable
+ * entries; don't create new ones. If a tuple's group is not already present
+ * in the hash table for the current grouping set, return NULL and the caller
+ * will spill it to disk.
*/
static AggStatePerGroup
lookup_hash_entry(AggState *aggstate, uint32 hash)
@@ -1502,16 +1951,26 @@ lookup_hash_entry(AggState *aggstate, uint32 hash)
AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
TupleTableSlot *hashslot = perhash->hashslot;
TupleHashEntryData *entry;
- bool isnew;
+ bool isnew = false;
+ bool *p_isnew;
+
+ /* if hash table already spilled, don't create new entries */
+ p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;
/* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, &isnew,
+ entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, p_isnew,
hash);
+ if (entry == NULL)
+ return NULL;
+
if (isnew)
{
- AggStatePerGroup pergroup;
- int transno;
+ AggStatePerGroup pergroup;
+ int transno;
+
+ aggstate->hash_ngroups_current++;
+ hash_agg_check_limits(aggstate);
pergroup = (AggStatePerGroup)
MemoryContextAlloc(perhash->hashtable->tablecxt,
@@ -1539,23 +1998,51 @@ lookup_hash_entry(AggState *aggstate, uint32 hash)
* returning an array of pergroup pointers suitable for advance_aggregates.
*
* Be aware that lookup_hash_entry can reset the tmpcontext.
+ *
+ * Some entries may be left NULL if we have reached the limit and have begun
+ * to spill. The same tuple will belong to a different group for each grouping
+ * set, so it may find its group already in memory for one set and not for
+ * another. Once we have begun to spill, any tuple that doesn't match an
+ * in-memory group for a particular set is spilled for that set.
+ *
+ * NB: It's possible to spill the same tuple for several different grouping
+ * sets. This may seem wasteful, but it's actually a trade-off: if we spill
+ * the tuple multiple times for multiple grouping sets, it can be partitioned
+ * for each grouping set, making the refilling of the hash table very
+ * efficient.
*/
static void
lookup_hash_entries(AggState *aggstate)
{
- int numHashes = aggstate->num_hashes;
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
int setno;
- for (setno = 0; setno < numHashes; setno++)
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
{
- AggStatePerHash perhash = &aggstate->perhash[setno];
+ AggStatePerHash perhash = &aggstate->perhash[setno];
uint32 hash;
select_current_set(aggstate, setno, true);
prepare_hash_slot(aggstate);
hash = TupleHashTableHash(perhash->hashtable, perhash->hashslot);
pergroup[setno] = lookup_hash_entry(aggstate, hash);
+
+ /* check to see if we need to spill the tuple for this grouping set */
+ if (pergroup[setno] == NULL)
+ {
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ TupleTableSlot *slot = aggstate->tmpcontext->ecxt_outertuple;
+
+ if (spill->partitions == NULL)
+ hashagg_spill_init(spill, aggstate->hash_tapeinfo, 0,
+ perhash->aggnode->numGroups,
+ aggstate->hashentrysize);
+
+ hashagg_spill_tuple(spill, slot, hash);
+
+ aggstate->hash_disk_used = LogicalTapeSetBlocks(
+ aggstate->hash_tapeinfo->tapeset) * (BLCKSZ / 1024);
+ }
}
}
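[Example of the NB above, not part of the diff: with GROUPING SETS ((a), (a, b)),
a tuple may find its (a) group already in the hash table while its (a, b) group
is not there. It is then spilled only for the (a, b) set, and its (a) transition
state is still advanced in memory.]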
@@ -1878,6 +2365,12 @@ agg_retrieve_direct(AggState *aggstate)
if (TupIsNull(outerslot))
{
/* no more outer-plan tuples available */
+
+ /* if we built hash tables, finalize any spills */
+ if (aggstate->aggstrategy == AGG_MIXED &&
+ aggstate->current_phase == 1)
+ hashagg_finish_initial_spills(aggstate);
+
if (hasGroupingSets)
{
aggstate->input_done = true;
@@ -1980,6 +2473,10 @@ agg_fill_hash_table(AggState *aggstate)
ResetExprContext(aggstate->tmpcontext);
}
+ /* finalize spills, if any */
+ hashagg_finish_initial_spills(aggstate);
+
+ aggstate->input_done = true;
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
@@ -1987,11 +2484,183 @@ agg_fill_hash_table(AggState *aggstate)
&aggstate->perhash[0].hashiter);
}
+/*
+ * If any data was spilled during hash aggregation, reset the hash table and
+ * reprocess one batch of spilled data. After reprocessing a batch, the hash
+ * table will again contain data, ready to be consumed by
+ * agg_retrieve_hash_table_in_memory().
+ *
+ * Should only be called after all in-memory hash table entries have been
+ * consumed.
+ *
+ * Return false when input is exhausted and there's no more work to be done;
+ * otherwise return true.
+ */
+static bool
+agg_refill_hash_table(AggState *aggstate)
+{
+ HashAggBatch *batch;
+ HashAggSpill spill;
+ HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
+ long nbuckets;
+ int setno;
+
+ if (aggstate->hash_batches == NIL)
+ return false;
+
+ spill.npartitions = 0;
+ spill.partitions = NULL;
+ /*
+ * Each spill file contains spilled data for only a single grouping
+ * set. We want to ignore all others, which is done by setting the other
+ * pergroups to NULL.
+ */
+ memset(aggstate->all_pergroups, 0,
+ sizeof(AggStatePerGroup) *
+ (aggstate->maxsets + aggstate->num_hashes));
+
+ batch = linitial(aggstate->hash_batches);
+ aggstate->hash_batches = list_delete_first(aggstate->hash_batches);
+
+ hash_agg_set_limits(aggstate->hashentrysize, batch->input_tuples,
+ batch->used_bits, &aggstate->hash_mem_limit,
+ &aggstate->hash_ngroups_limit, NULL);
+
+ /* free memory and reset hash tables */
+ ReScanExprContext(aggstate->hashcontext);
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ ResetTupleHashTable(aggstate->perhash[setno].hashtable);
+
+ /* build a single new hashtable for this grouping set */
+ nbuckets = hash_choose_num_buckets(
+ aggstate, batch->input_tuples, aggstate->hash_mem_limit);
+ build_hash_table(aggstate, batch->setno, nbuckets);
+ aggstate->hash_ngroups_current = 0;
+
+ Assert(aggstate->current_phase == 0);
+
+ if (aggstate->phase->aggstrategy == AGG_MIXED)
+ {
+ aggstate->current_phase = 1;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+ }
+
+ /*
+ * The first pass (agg_fill_hash_table()) reads whatever kind of slot comes
+ * from the outer plan, and considers the slot fixed. But spilled tuples
+ * are always MinimalTuples, so we need to recompile the aggregate
+ * expressions.
+ *
+ * We still need the NULL check, because we are only processing one
+ * grouping set at a time and the rest will be NULL.
+ */
+ hashagg_recompile_expressions(aggstate, true, true);
+
+ LogicalTapeRewindForRead(tapeinfo->tapeset, batch->input_tapenum,
+ HASHAGG_READ_BUFFER_SIZE);
+ for (;;) {
+ TupleTableSlot *slot = aggstate->hash_spill_slot;
+ MinimalTuple tuple;
+ uint32 hash;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tuple = hashagg_batch_read(batch, &hash);
+ if (tuple == NULL)
+ break;
+
+ ExecStoreMinimalTuple(tuple, slot, true);
+ aggstate->tmpcontext->ecxt_outertuple = slot;
+
+ select_current_set(aggstate, batch->setno, true);
+ prepare_hash_slot(aggstate);
+ aggstate->hash_pergroup[batch->setno] = lookup_hash_entry(aggstate, hash);
+
+ /* if there's no memory for a new group, spill */
+ if (aggstate->hash_pergroup[batch->setno] == NULL)
+ {
+ /*
+ * Estimate the number of groups for this batch as the total
+ * number of tuples in its input file. Although that's a worst
+ * case, it's not bad here for two reasons: (1) overestimating
+ * is better than underestimating; and (2) we've already
+ * scanned the relation once, so it's likely that we've
+ * already finalized many of the common values.
+ */
+ if (spill.partitions == NULL)
+ hashagg_spill_init(&spill, tapeinfo, batch->used_bits,
+ batch->input_tuples,
+ aggstate->hashentrysize);
+
+ hashagg_spill_tuple(&spill, slot, hash);
+
+ aggstate->hash_disk_used = LogicalTapeSetBlocks(
+ aggstate->hash_tapeinfo->tapeset) * (BLCKSZ / 1024);
+ }
+
+ /* Advance the aggregates (or combine functions) */
+ advance_aggregates(aggstate);
+
+ /*
+ * Reset per-input-tuple context after each tuple, but note that the
+ * hash lookups do this too
+ */
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ hashagg_tapeinfo_release(tapeinfo, batch->input_tapenum);
+
+ aggstate->current_phase = 0;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+
+ hash_agg_update_metrics(aggstate, true, spill.npartitions);
+ hashagg_spill_finish(aggstate, &spill, batch->setno);
+ aggstate->hash_spill_mode = false;
+
+ /* Initialize to walk the first hash table */
+ select_current_set(aggstate, batch->setno, true);
+ ResetTupleHashIterator(aggstate->perhash[batch->setno].hashtable,
+ &aggstate->perhash[batch->setno].hashiter);
+
+ pfree(batch);
+
+ return true;
+}
+
/*
* ExecAgg for hashed case: retrieving groups from hash table
+ *
+ * After exhausting in-memory tuples, also try refilling the hash table using
+ * previously-spilled tuples. Only returns NULL after all in-memory and
+ * spilled tuples are exhausted.
*/
static TupleTableSlot *
agg_retrieve_hash_table(AggState *aggstate)
+{
+ TupleTableSlot *result = NULL;
+
+ while (result == NULL)
+ {
+ result = agg_retrieve_hash_table_in_memory(aggstate);
+ if (result == NULL)
+ {
+ if (!agg_refill_hash_table(aggstate))
+ {
+ aggstate->agg_done = true;
+ break;
+ }
+ }
+ }
+
+ return result;
+}
+
+/*
+ * Retrieve the groups from the in-memory hash tables without considering any
+ * spilled tuples.
+ */
+static TupleTableSlot *
+agg_retrieve_hash_table_in_memory(AggState *aggstate)
{
ExprContext *econtext;
AggStatePerAgg peragg;
@@ -2020,7 +2689,7 @@ agg_retrieve_hash_table(AggState *aggstate)
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
- while (!aggstate->agg_done)
+ for (;;)
{
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -2051,8 +2720,6 @@ agg_retrieve_hash_table(AggState *aggstate)
}
else
{
- /* No more hashtables, so done */
- aggstate->agg_done = true;
return NULL;
}
}
@@ -2109,6 +2776,297 @@ agg_retrieve_hash_table(AggState *aggstate)
return NULL;
}
+/*
+ * Assign unused tapes to spill partitions, extending the tape set if
+ * necessary.
+ */
+static void
+hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *partitions,
+ int npartitions)
+{
+ int partidx = 0;
+
+ /* use free tapes if available */
+ while (partidx < npartitions && tapeinfo->nfreetapes > 0)
+ partitions[partidx++] = tapeinfo->freetapes[--tapeinfo->nfreetapes];
+
+ if (tapeinfo->tapeset == NULL)
+ tapeinfo->tapeset = LogicalTapeSetCreate(npartitions, NULL, NULL, -1);
+ else if (partidx < npartitions)
+ {
+ tapeinfo->tapeset = LogicalTapeSetExtend(
+ tapeinfo->tapeset, npartitions - partidx);
+ }
+
+ while (partidx < npartitions)
+ partitions[partidx++] = tapeinfo->ntapes++;
+}
+
+/*
+ * After a tape has already been written to and then read, this function
+ * rewinds it for writing and adds it to the free list.
+ */
+static void
+hashagg_tapeinfo_release(HashTapeInfo *tapeinfo, int tapenum)
+{
+ LogicalTapeRewindForWrite(tapeinfo->tapeset, tapenum);
+ if (tapeinfo->freetapes == NULL)
+ tapeinfo->freetapes = palloc(sizeof(int));
+ else
+ tapeinfo->freetapes = repalloc(
+ tapeinfo->freetapes, sizeof(int) * (tapeinfo->nfreetapes + 1));
+ tapeinfo->freetapes[tapeinfo->nfreetapes++] = tapenum;
+}
+
+/*
+ * hashagg_spill_init
+ *
+ * Called after we determined that spilling is necessary. Chooses the number
+ * of partitions to create, and initializes them.
+ */
+static void
+hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo, int used_bits,
+ uint64 input_groups, double hashentrysize)
+{
+ int npartitions;
+ int partition_bits;
+
+ npartitions = hash_choose_num_partitions(
+ input_groups, hashentrysize, used_bits, &partition_bits);
+
+ spill->partitions = palloc0(sizeof(int) * npartitions);
+ spill->ntuples = palloc0(sizeof(int64) * npartitions);
+
+ hashagg_tapeinfo_assign(tapeinfo, spill->partitions, npartitions);
+
+ spill->tapeinfo = tapeinfo;
+ spill->shift = 32 - used_bits - partition_bits;
+ spill->mask = (npartitions - 1) << spill->shift;
+ spill->npartitions = npartitions;
+}
+
+/*
+ * hashagg_spill_tuple
+ *
+ * No room for new groups in the hash table. Save for later in the appropriate
+ * partition.
+ */
+static Size
+hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot, uint32 hash)
+{
+ LogicalTapeSet *tapeset = spill->tapeinfo->tapeset;
+ int partition;
+ MinimalTuple tuple;
+ int tapenum;
+ int total_written = 0;
+ bool shouldFree;
+
+ Assert(spill->partitions != NULL);
+
+ /* may contain unnecessary attributes, consider projecting? */
+ tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+ partition = (hash & spill->mask) >> spill->shift;
+ spill->ntuples[partition]++;
+
+ tapenum = spill->partitions[partition];
+
+ LogicalTapeWrite(tapeset, tapenum, (void *) &hash, sizeof(uint32));
+ total_written += sizeof(uint32);
+
+ LogicalTapeWrite(tapeset, tapenum, (void *) tuple, tuple->t_len);
+ total_written += tuple->t_len;
+
+ if (shouldFree)
+ pfree(tuple);
+
+ return total_written;
+}
+
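[Illustration, not part of the diff: on the first spill used_bits = 0, so with
32 partitions we get partition_bits = 5, shift = 27 and mask = 0xF8000000. A
tuple whose hash is 0xDEADBEEF lands in partition
(0xDEADBEEF & 0xF8000000) >> 27 = 27. If that partition later spills
recursively, used_bits becomes 5 and the next bits of the hash are used for
partition selection.]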
+/*
+ * hashagg_batch_new
+ *
+ * Construct a HashAggBatch item, which represents one iteration of HashAgg to
+ * be done. Should be called in the aggregate's memory context.
+ */
+static HashAggBatch *
+hashagg_batch_new(HashTapeInfo *tapeinfo, int tapenum, int setno,
+ int64 input_tuples, int used_bits)
+{
+ HashAggBatch *batch = palloc0(sizeof(HashAggBatch));
+
+ batch->setno = setno;
+ batch->used_bits = used_bits;
+ batch->tapeinfo = tapeinfo;
+ batch->input_tapenum = tapenum;
+ batch->input_tuples = input_tuples;
+
+ return batch;
+}
+
+/*
+ * hashagg_batch_read
+ * read the next tuple from a batch's input tape. Return NULL if no more.
+ */
+static MinimalTuple
+hashagg_batch_read(HashAggBatch *batch, uint32 *hashp)
+{
+ LogicalTapeSet *tapeset = batch->tapeinfo->tapeset;
+ int tapenum = batch->input_tapenum;
+ MinimalTuple tuple;
+ uint32 t_len;
+ size_t nread;
+ uint32 hash;
+
+ nread = LogicalTapeRead(tapeset, tapenum, &hash, sizeof(uint32));
+ if (nread == 0)
+ return NULL;
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, sizeof(uint32), nread)));
+ if (hashp != NULL)
+ *hashp = hash;
+
+ nread = LogicalTapeRead(tapeset, tapenum, &t_len, sizeof(t_len));
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, sizeof(uint32), nread)));
+
+ tuple = (MinimalTuple) palloc(t_len);
+ tuple->t_len = t_len;
+
+ nread = LogicalTapeRead(tapeset, tapenum,
+ (void *)((char *)tuple + sizeof(uint32)),
+ t_len - sizeof(uint32));
+ if (nread != t_len - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, t_len - sizeof(uint32), nread)));
+
+ return tuple;
+}
+
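[Note, not part of the diff: the on-tape record format is a 4-byte hash
followed by the MinimalTuple written verbatim, whose own first 4 bytes are
t_len. That is why the reader above pulls t_len first and then reads the
remaining t_len - sizeof(uint32) bytes into the freshly palloc'd tuple.]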
+/*
+ * hashagg_finish_initial_spills
+ *
+ * After a HashAggBatch has been processed, it may have spilled tuples to
+ * disk. If so, turn the spilled partitions into new batches that must later
+ * be executed.
+ */
+static void
+hashagg_finish_initial_spills(AggState *aggstate)
+{
+ int setno;
+ int total_npartitions = 0;
+
+ if (aggstate->hash_spills == NULL)
+ return;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ {
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ total_npartitions += spill->npartitions;
+ hashagg_spill_finish(aggstate, spill, setno);
+ }
+
+ hash_agg_update_metrics(aggstate, false, total_npartitions);
+ aggstate->hash_spill_mode = false;
+
+ /*
+ * We're not processing tuples from outer plan any more; only processing
+ * batches of spilled tuples. The initial spill structures are no longer
+ * needed.
+ */
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+}
+
+/*
+ * hashagg_spill_finish
+ *
+ * Transform spill partitions into new batches.
+ */
+static void
+hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno)
+{
+ int i;
+ int used_bits = 32 - spill->shift;
+
+ if (spill->npartitions == 0)
+ return; /* didn't spill */
+
+ for (i = 0; i < spill->npartitions; i++)
+ {
+ int tapenum = spill->partitions[i];
+ MemoryContext oldContext;
+ HashAggBatch *new_batch;
+
+ oldContext = MemoryContextSwitchTo(aggstate->ss.ps.state->es_query_cxt);
+ new_batch = hashagg_batch_new(aggstate->hash_tapeinfo,
+ tapenum, setno, spill->ntuples[i],
+ used_bits);
+ aggstate->hash_batches = lcons(new_batch, aggstate->hash_batches);
+ aggstate->hash_batches_used++;
+ MemoryContextSwitchTo(oldContext);
+ }
+
+ pfree(spill->ntuples);
+ pfree(spill->partitions);
+}
+
+/*
+ * Free resources related to a spilled HashAgg.
+ */
+static void
+hashagg_reset_spill_state(AggState *aggstate)
+{
+ ListCell *lc;
+
+ /* free spills from initial pass */
+ if (aggstate->hash_spills != NULL)
+ {
+ int setno;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ {
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ if (spill->ntuples != NULL)
+ pfree(spill->ntuples);
+ if (spill->partitions != NULL)
+ pfree(spill->partitions);
+ }
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+ }
+
+ /* free batches */
+ foreach(lc, aggstate->hash_batches)
+ {
+ HashAggBatch *batch = (HashAggBatch*) lfirst(lc);
+ pfree(batch);
+ }
+ list_free(aggstate->hash_batches);
+ aggstate->hash_batches = NIL;
+
+ /* close tape set */
+ if (aggstate->hash_tapeinfo != NULL)
+ {
+ HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
+ if (tapeinfo->tapeset != NULL)
+ LogicalTapeSetClose(tapeinfo->tapeset);
+ if (tapeinfo->freetapes != NULL)
+ pfree(tapeinfo->freetapes);
+ pfree(tapeinfo);
+ aggstate->hash_tapeinfo = NULL;
+ }
+}
+
+
/* -----------------
* ExecInitAgg
*
@@ -2293,6 +3251,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->ss.ps.outeropsfixed = false;
}
+ if (use_hashing)
+ aggstate->hash_spill_slot = ExecInitExtraTupleSlot(estate, scanDesc,
+ &TTSOpsMinimalTuple);
+
/*
* Initialize result type, slot and projection.
*/
@@ -2518,9 +3480,23 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
*/
if (use_hashing)
{
+ Plan *outerplan = outerPlan(node);
+ long totalGroups = 0;
+ int i;
+
/* this is an array of pointers, not structures */
aggstate->hash_pergroup = pergroups;
+ aggstate->hashentrysize = hash_agg_entry_size(
+ aggstate->numtrans, outerplan->plan_width, node->transitionSpace);
+
+ for (i = 0; i < aggstate->num_hashes; i++)
+ totalGroups += aggstate->perhash[i].aggnode->numGroups;
+
+ hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
+ &aggstate->hash_mem_limit,
+ &aggstate->hash_ngroups_limit,
+ &aggstate->hash_planned_partitions);
find_hash_columns(aggstate);
build_hash_tables(aggstate);
aggstate->table_filled = false;
@@ -2928,8 +3904,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
else
Assert(false);
- phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash);
-
+ phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash,
+ false);
+ if (dohash)
+ phase->evaltrans_outerslot = phase->evaltrans;
}
return aggstate;
@@ -3423,6 +4401,8 @@ ExecEndAgg(AggState *node)
if (node->sort_out)
tuplesort_end(node->sort_out);
+ hashagg_reset_spill_state(node);
+
for (transno = 0; transno < node->numtrans; transno++)
{
AggStatePerTrans pertrans = &node->pertrans[transno];
@@ -3478,12 +4458,13 @@ ExecReScanAgg(AggState *node)
return;
/*
- * If we do have the hash table, and the subplan does not have any
- * parameter changes, and none of our own parameter changes affect
- * input expressions of the aggregated functions, then we can just
- * rescan the existing hash table; no need to build it again.
+ * If we do have the hash table, and it never spilled, and the subplan
+ * does not have any parameter changes, and none of our own parameter
+ * changes affect input expressions of the aggregated functions, then
+ * we can just rescan the existing hash table; no need to build it
+ * again.
*/
- if (outerPlan->chgParam == NULL &&
+ if (outerPlan->chgParam == NULL && !node->hash_ever_spilled &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
ResetTupleHashIterator(node->perhash[0].hashtable,
@@ -3540,11 +4521,24 @@ ExecReScanAgg(AggState *node)
*/
if (node->aggstrategy == AGG_HASHED || node->aggstrategy == AGG_MIXED)
{
+ hashagg_reset_spill_state(node);
+
+ node->hash_ever_spilled = false;
+ node->hash_spill_mode = false;
+ node->hash_ngroups_current = 0;
+
+ /* reset stats */
+ node->hash_mem_peak = 0;
+ node->hash_disk_used = 0;
+ node->hash_batches_used = 0;
+
ReScanExprContext(node->hashcontext);
/* Rebuild an empty hash table */
build_hash_tables(node);
node->table_filled = false;
/* iterator will be reset when the table is filled */
+
+ hashagg_recompile_expressions(node, false, false);
}
if (node->aggstrategy != AGG_HASHED)
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index dc16b399327..b855e739571 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -2046,6 +2046,45 @@ llvm_compile_expr(ExprState *state)
break;
}
+ case EEOP_AGG_PLAIN_PERGROUP_NULLCHECK:
+ {
+ int jumpnull;
+ LLVMValueRef v_aggstatep;
+ LLVMValueRef v_allpergroupsp;
+ LLVMValueRef v_pergroup_allaggs;
+ LLVMValueRef v_setoff;
+
+ jumpnull = op->d.agg_plain_pergroup_nullcheck.jumpnull;
+
+ /*
+ * pergroup_allaggs = aggstate->all_pergroups
+ * [op->d.agg_plain_pergroup_nullcheck.setoff];
+ */
+ v_aggstatep = LLVMBuildBitCast(
+ b, v_parent, l_ptr(StructAggState), "");
+
+ v_allpergroupsp = l_load_struct_gep(
+ b, v_aggstatep,
+ FIELDNO_AGGSTATE_ALL_PERGROUPS,
+ "aggstate.all_pergroups");
+
+ v_setoff = l_int32_const(
+ op->d.agg_plain_pergroup_nullcheck.setoff);
+
+ v_pergroup_allaggs = l_load_gep1(
+ b, v_allpergroupsp, v_setoff, "");
+
+ LLVMBuildCondBr(
+ b,
+ LLVMBuildICmp(b, LLVMIntEQ,
+ LLVMBuildPtrToInt(
+ b, v_pergroup_allaggs, TypeSizeT, ""),
+ l_sizet_const(0), ""),
+ opblocks[jumpnull],
+ opblocks[opno + 1]);
+ break;
+ }
+
case EEOP_AGG_PLAIN_TRANS_INIT_STRICT_BYVAL:
case EEOP_AGG_PLAIN_TRANS_STRICT_BYVAL:
case EEOP_AGG_PLAIN_TRANS_BYVAL:
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b5a0033721f..724e4448e9a 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -77,6 +77,7 @@
#include "access/htup_details.h"
#include "access/tsmapi.h"
#include "executor/executor.h"
+#include "executor/nodeAgg.h"
#include "executor/nodeHash.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -128,6 +129,7 @@ bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
bool enable_hashagg = true;
+bool enable_groupingsets_hash_disk = true;
bool enable_nestloop = true;
bool enable_material = true;
bool enable_mergejoin = true;
@@ -2153,7 +2155,7 @@ cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples)
+ double input_tuples, double input_width)
{
double output_tuples;
Cost startup_cost;
@@ -2219,21 +2221,88 @@ cost_agg(Path *path, PlannerInfo *root,
total_cost += aggcosts->finalCost.per_tuple * numGroups;
total_cost += cpu_tuple_cost * numGroups;
output_tuples = numGroups;
+
+ /*
+ * We don't need to compute the disk costs of hash aggregation here,
+ * because the planner does not choose hash aggregation for grouping
+ * sets that it doesn't expect to fit in memory.
+ */
}
else
{
+ double pages_written = 0.0;
+ double pages_read = 0.0;
+ double hashentrysize;
+ double nbatches;
+ Size mem_limit;
+ long ngroups_limit;
+ int num_partitions;
+
/* must be AGG_HASHED */
startup_cost = input_total_cost;
if (!enable_hashagg)
startup_cost += disable_cost;
startup_cost += aggcosts->transCost.startup;
startup_cost += aggcosts->transCost.per_tuple * input_tuples;
+ /* cost of computing hash value */
startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;
startup_cost += aggcosts->finalCost.startup;
+
total_cost = startup_cost;
total_cost += aggcosts->finalCost.per_tuple * numGroups;
+ /* cost of retrieving from hash table */
total_cost += cpu_tuple_cost * numGroups;
output_tuples = numGroups;
+
+ /*
+ * Estimate number of batches based on the computed limits. If less
+ * than or equal to one, all groups are expected to fit in memory;
+ * otherwise we expect to spill.
+ */
+ hashentrysize = hash_agg_entry_size(
+ aggcosts->numAggs, input_width, aggcosts->transitionSpace);
+ hash_agg_set_limits(hashentrysize, numGroups, 0, &mem_limit,
+ &ngroups_limit, &num_partitions);
+
+ nbatches = Max( (numGroups * hashentrysize) / mem_limit,
+ numGroups / ngroups_limit );
+
+ /*
+ * Estimate number of pages read and written. For each level of
+ * recursion, a tuple must be written and then later read.
+ */
+ if (nbatches > 1.0)
+ {
+ double depth;
+ double pages;
+
+ pages = relation_byte_size(input_tuples, input_width) / BLCKSZ;
+
+ /*
+ * The number of partitions can change at different levels of
+ * recursion; but for the purposes of this calculation assume it
+ * stays constant.
+ */
+ depth = ceil( log(nbatches - 1) / log(num_partitions) );
+ pages_written = pages_read = pages * depth;
+ }
+
+ /*
+ * Add the disk costs of hash aggregation that spills to disk.
+ *
+ * Groups that go into the hash table stay in memory until finalized,
+ * so spilling and reprocessing tuples doesn't incur additional
+ * invocations of transCost or finalCost. Furthermore, the computed
+ * hash value is stored with the spilled tuples, so we don't incur
+ * extra invocations of the hash function.
+ *
+ * Hash Agg begins returning tuples after the first batch is
+ * complete. Accrue writes (spilled tuples) to startup_cost and to
+ * total_cost; accrue reads only to total_cost.
+ */
+ startup_cost += pages_written * random_page_cost;
+ total_cost += pages_written * random_page_cost;
+ total_cost += pages_read * seq_page_cost;
}
/*
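[Continuing the worked example from nodeAgg.c, not part of the diff: with
numGroups = 1M, hashentrysize = 64 and the limits computed there
(mem_limit = 3832kB, ngroups_limit = 61312, num_partitions = 32), nbatches
comes out around 16, so depth = ceil(log(15)/log(32)) = 1: we expect a single
level of spilling, with each input page written and read back once.]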
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e048d200bb4..090919e39a0 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1644,6 +1644,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
NIL,
NIL,
best_path->path.rows,
+ 0,
subplan);
}
else
@@ -2096,6 +2097,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
NIL,
NIL,
best_path->numGroups,
+ best_path->transitionSpace,
subplan);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -2257,6 +2259,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
NIL,
rollup->numGroups,
+ best_path->transitionSpace,
sort_plan);
/*
@@ -2295,6 +2298,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
rollup->gsets,
chain,
rollup->numGroups,
+ best_path->transitionSpace,
subplan);
/* Copy cost data from Path to Plan */
@@ -6192,8 +6196,8 @@ Agg *
make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree)
+ List *groupingSets, List *chain, double dNumGroups,
+ int32 transitionSpace, Plan *lefttree)
{
Agg *node = makeNode(Agg);
Plan *plan = &node->plan;
@@ -6209,6 +6213,7 @@ make_agg(List *tlist, List *qual,
node->grpOperators = grpOperators;
node->grpCollations = grpCollations;
node->numGroups = numGroups;
+ node->transitionSpace = transitionSpace;
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
node->groupingSets = groupingSets;
node->chain = chain;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b44efd6314c..8c5b2d06301 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4258,11 +4258,12 @@ consider_groupingsets_paths(PlannerInfo *root,
dNumGroups - exclude_groups);
/*
- * gd->rollups is empty if we have only unsortable columns to work
- * with. Override work_mem in that case; otherwise, we'll rely on the
- * sorted-input case to generate usable mixed paths.
+ * If we have sortable columns to work with (gd->rollups is non-empty)
+ * and enable_groupingsets_hash_disk is disabled, don't generate
+ * hash-based paths that will exceed work_mem.
*/
- if (hashsize > work_mem * 1024L && gd->rollups)
+ if (!enable_groupingsets_hash_disk &&
+ hashsize > work_mem * 1024L && gd->rollups)
return; /* nope, won't fit */
/*
@@ -6505,8 +6506,6 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
if (can_hash)
{
- double hashaggtablesize;
-
if (parse->groupingSets)
{
/*
@@ -6518,34 +6517,20 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
}
else
{
- hashaggtablesize = estimate_hashagg_tablesize(cheapest_path,
- agg_costs,
- dNumGroups);
-
/*
- * Provided that the estimated size of the hashtable does not
- * exceed work_mem, we'll generate a HashAgg Path, although if we
- * were unable to sort above, then we'd better generate a Path, so
- * that we at least have one.
+ * We just need an Agg over the cheapest-total input path,
+ * since input order won't matter.
*/
- if (hashaggtablesize < work_mem * 1024L ||
- grouped_rel->pathlist == NIL)
- {
- /*
- * We just need an Agg over the cheapest-total input path,
- * since input order won't matter.
- */
- add_path(grouped_rel, (Path *)
- create_agg_path(root, grouped_rel,
- cheapest_path,
- grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_SIMPLE,
- parse->groupClause,
- havingQual,
- agg_costs,
- dNumGroups));
- }
+ add_path(grouped_rel, (Path *)
+ create_agg_path(root, grouped_rel,
+ cheapest_path,
+ grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_SIMPLE,
+ parse->groupClause,
+ havingQual,
+ agg_costs,
+ dNumGroups));
}
/*
@@ -6557,22 +6542,17 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
{
Path *path = partially_grouped_rel->cheapest_total_path;
- hashaggtablesize = estimate_hashagg_tablesize(path,
- agg_final_costs,
- dNumGroups);
-
- if (hashaggtablesize < work_mem * 1024L)
- add_path(grouped_rel, (Path *)
- create_agg_path(root,
- grouped_rel,
- path,
- grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_FINAL_DESERIAL,
- parse->groupClause,
- havingQual,
- agg_final_costs,
- dNumGroups));
+ add_path(grouped_rel, (Path *)
+ create_agg_path(root,
+ grouped_rel,
+ path,
+ grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_FINAL_DESERIAL,
+ parse->groupClause,
+ havingQual,
+ agg_final_costs,
+ dNumGroups));
}
}
@@ -6816,22 +6796,10 @@ create_partial_grouping_paths(PlannerInfo *root,
if (can_hash && cheapest_total_path != NULL)
{
- double hashaggtablesize;
-
/* Checked above */
Assert(parse->hasAggs || parse->groupClause);
- hashaggtablesize =
- estimate_hashagg_tablesize(cheapest_total_path,
- agg_partial_costs,
- dNumPartialGroups);
-
- /*
- * Tentatively produce a partial HashAgg Path, depending on if it
- * looks as if the hash table will fit in work_mem.
- */
- if (hashaggtablesize < work_mem * 1024L &&
- cheapest_total_path != NULL)
+ if (cheapest_total_path != NULL)
{
add_path(partially_grouped_rel, (Path *)
create_agg_path(root,
@@ -6849,16 +6817,8 @@ create_partial_grouping_paths(PlannerInfo *root,
if (can_hash && cheapest_partial_path != NULL)
{
- double hashaggtablesize;
-
- hashaggtablesize =
- estimate_hashagg_tablesize(cheapest_partial_path,
- agg_partial_costs,
- dNumPartialPartialGroups);
-
/* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
- cheapest_partial_path != NULL)
+ if (cheapest_partial_path != NULL)
{
add_partial_path(partially_grouped_rel, (Path *)
create_agg_path(root,
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 1a23e18970d..951aed80e7a 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1072,7 +1072,7 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
numGroupCols, dNumGroups,
NIL,
input_path->startup_cost, input_path->total_cost,
- input_path->rows);
+ input_path->rows, input_path->pathtarget->width);
/*
* Now for the sorted case. Note that the input is *always* unsorted,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e6d08aede56..8ba8122ee2f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1704,7 +1704,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
NIL,
subpath->startup_cost,
subpath->total_cost,
- rel->rows);
+ rel->rows,
+ subpath->pathtarget->width);
}
if (sjinfo->semi_can_btree && sjinfo->semi_can_hash)
@@ -2949,6 +2950,7 @@ create_agg_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->aggsplit = aggsplit;
pathnode->numGroups = numGroups;
+ pathnode->transitionSpace = aggcosts ? aggcosts->transitionSpace : 0;
pathnode->groupClause = groupClause;
pathnode->qual = qual;
@@ -2957,7 +2959,7 @@ create_agg_path(PlannerInfo *root,
list_length(groupClause), numGroups,
qual,
subpath->startup_cost, subpath->total_cost,
- subpath->rows);
+ subpath->rows, subpath->pathtarget->width);
/* add tlist eval cost for each output row */
pathnode->path.startup_cost += target->cost.startup;
@@ -3036,6 +3038,7 @@ create_groupingsets_path(PlannerInfo *root,
pathnode->aggstrategy = aggstrategy;
pathnode->rollups = rollups;
pathnode->qual = having_qual;
+ pathnode->transitionSpace = agg_costs ? agg_costs->transitionSpace : 0;
Assert(rollups != NIL);
Assert(aggstrategy != AGG_PLAIN || list_length(rollups) == 1);
@@ -3067,7 +3070,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
subpath->startup_cost,
subpath->total_cost,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
is_first = false;
if (!rollup->is_hashed)
is_first_sort = false;
@@ -3090,7 +3094,8 @@ create_groupingsets_path(PlannerInfo *root,
rollup->numGroups,
having_qual,
0.0, 0.0,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
if (!rollup->is_hashed)
is_first_sort = false;
}
@@ -3115,7 +3120,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
sort_path.startup_cost,
sort_path.total_cost,
- sort_path.rows);
+ sort_path.rows,
+ subpath->pathtarget->width);
}
pathnode->path.total_cost += agg_path.total_cost;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 464f264d9a2..d88a3bbaa1c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -998,6 +998,16 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_groupingsets_hash_disk", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of hashed aggregation plans for groupingsets when the total size of the hash tables is expected to exceed work_mem."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_groupingsets_hash_disk,
+ false,
+ NULL, NULL, NULL
+ },
{
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of materialization."),
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 4f78b55fbaf..36104a73a75 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -201,6 +201,7 @@ static long ltsGetFreeBlock(LogicalTapeSet *lts);
static void ltsReleaseBlock(LogicalTapeSet *lts, long blocknum);
static void ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
SharedFileSet *fileset);
+static void ltsInitTape(LogicalTape *lt);
static void ltsInitReadBuffer(LogicalTapeSet *lts, LogicalTape *lt);
@@ -536,6 +537,30 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
lts->nHoleBlocks = lts->nBlocksAllocated - nphysicalblocks;
}
+/*
+ * Initialize per-tape struct. Note we allocate the I/O buffer and the first
+ * block for a tape only when it is first actually written to. This avoids
+ * wasting memory space when tuplesort.c overestimates the number of tapes
+ * needed.
+ */
+static void
+ltsInitTape(LogicalTape *lt)
+{
+ lt->writing = true;
+ lt->frozen = false;
+ lt->dirty = false;
+ lt->firstBlockNumber = -1L;
+ lt->curBlockNumber = -1L;
+ lt->nextBlockNumber = -1L;
+ lt->offsetBlockNumber = 0L;
+ lt->buffer = NULL;
+ lt->buffer_size = 0;
+ /* palloc() larger than MaxAllocSize would fail */
+ lt->max_size = MaxAllocSize;
+ lt->pos = 0;
+ lt->nbytes = 0;
+}
+
/*
* Lazily allocate and initialize the read buffer. This avoids waste when many
* tapes are open at once, but not all are active between rewinding and
@@ -579,7 +604,6 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
int worker)
{
LogicalTapeSet *lts;
- LogicalTape *lt;
int i;
/*
@@ -597,29 +621,8 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
lts->nFreeBlocks = 0;
lts->nTapes = ntapes;
- /*
- * Initialize per-tape structs. Note we allocate the I/O buffer and the
- * first block for a tape only when it is first actually written to. This
- * avoids wasting memory space when tuplesort.c overestimates the number
- * of tapes needed.
- */
for (i = 0; i < ntapes; i++)
- {
- lt = <s->tapes[i];
- lt->writing = true;
- lt->frozen = false;
- lt->dirty = false;
- lt->firstBlockNumber = -1L;
- lt->curBlockNumber = -1L;
- lt->nextBlockNumber = -1L;
- lt->offsetBlockNumber = 0L;
- lt->buffer = NULL;
- lt->buffer_size = 0;
- /* palloc() larger than MaxAllocSize would fail */
- lt->max_size = MaxAllocSize;
- lt->pos = 0;
- lt->nbytes = 0;
- }
+ ltsInitTape(<s->tapes[i]);
/*
* Create temp BufFile storage as required.
@@ -1004,6 +1007,29 @@ LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum, TapeShare *share)
}
}
+/*
+ * Add additional tapes to this tape set. Not intended to be used when any
+ * tapes are frozen.
+ */
+LogicalTapeSet *
+LogicalTapeSetExtend(LogicalTapeSet *lts, int nAdditional)
+{
+ int i;
+ int nTapesOrig = lts->nTapes;
+ Size newSize;
+
+ lts->nTapes += nAdditional;
+ newSize = offsetof(LogicalTapeSet, tapes) +
+ lts->nTapes * sizeof(LogicalTape);
+
+ lts = (LogicalTapeSet *) repalloc(lts, newSize);
+
+ for (i = nTapesOrig; i < lts->nTapes; i++)
+ ltsInitTape(<s->tapes[i]);
+
+ return lts;
+}
+
/*
* Backspace the tape a given number of bytes. (We also support a more
* general seek interface, see below.)
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 8bbf6621da0..dbe8649a576 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -225,6 +225,7 @@ typedef enum ExprEvalOp
EEOP_AGG_DESERIALIZE,
EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
+ EEOP_AGG_PLAIN_PERGROUP_NULLCHECK,
EEOP_AGG_PLAIN_TRANS_INIT_STRICT_BYVAL,
EEOP_AGG_PLAIN_TRANS_STRICT_BYVAL,
EEOP_AGG_PLAIN_TRANS_BYVAL,
@@ -622,6 +623,13 @@ typedef struct ExprEvalStep
int jumpnull;
} agg_strict_input_check;
+ /* for EEOP_AGG_PLAIN_PERGROUP_NULLCHECK */
+ struct
+ {
+ int setoff;
+ int jumpnull;
+ } agg_plain_pergroup_nullcheck;
+
/* for EEOP_AGG_PLAIN_TRANS_[INIT_][STRICT_]{BYVAL,BYREF} */
/* for EEOP_AGG_ORDERED_TRANS_{DATUM,TUPLE} */
struct
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 81fdfa4add3..94890512dc8 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -255,7 +255,7 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
- bool doSort, bool doHash);
+ bool doSort, bool doHash, bool nullcheck);
extern ExprState *ExecBuildGroupingEqual(TupleDesc ldesc, TupleDesc rdesc,
const TupleTableSlotOps *lops, const TupleTableSlotOps *rops,
int numCols,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 264916f9a92..014f9fb26e2 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -280,6 +280,12 @@ typedef struct AggStatePerPhaseData
Sort *sortnode; /* Sort node for input ordering for phase */
ExprState *evaltrans; /* evaluation of transition functions */
+
+ /* cached variants of the compiled expression */
+ ExprState *evaltrans_outerslot;
+ ExprState *evaltrans_minslot;
+ ExprState *evaltrans_nullcheck_outerslot;
+ ExprState *evaltrans_nullcheck_minslot;
} AggStatePerPhaseData;
/*
@@ -311,5 +317,8 @@ extern void ExecReScanAgg(AggState *node);
extern Size hash_agg_entry_size(int numAggs, Size tupleWidth,
Size transitionSpace);
+extern void hash_agg_set_limits(double hashentrysize, uint64 input_groups,
+ int used_bits, Size *mem_limit,
+ long *ngroups_limit, int *num_partitions);
#endif /* NODEAGG_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cd3ddf781f1..39b9a6df41b 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2078,13 +2078,31 @@ typedef struct AggState
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
/* these fields are used in AGG_HASHED and AGG_MIXED modes: */
bool table_filled; /* hash table filled yet? */
- int num_hashes;
+ int num_hashes; /* number of hash tables active at once */
+ double hashentrysize; /* estimate revised during execution */
+ struct HashTapeInfo *hash_tapeinfo; /* metadata for spill tapes */
+ struct HashAggSpill *hash_spills; /* HashAggSpill for each hash table,
+ exists only during first pass if spilled */
+ TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
+ bool hash_ever_spilled; /* ever spilled during this execution? */
+ bool hash_spill_mode; /* we hit a limit during the current batch
+ and we must not create new groups */
+ int hash_planned_partitions; /* number of partitions planned */
+ Size hash_mem_limit; /* limit before spilling hash table */
+ Size hash_mem_peak; /* peak hash table memory usage */
+ long hash_ngroups_current; /* number of groups currently in
+ memory in all hash tables */
+ long hash_ngroups_limit; /* limit before spilling hash table */
+ long hash_disk_used; /* kB of disk space used */
+ int hash_batches_used; /* batches used during entire execution */
+ List *hash_batches; /* hash batches remaining to be processed */
+
AggStatePerHash perhash; /* array of per-hashtable data */
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 34
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 48
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 3d3be197e0e..be592d0fee4 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1663,6 +1663,7 @@ typedef struct AggPath
AggStrategy aggstrategy; /* basic strategy, see nodes.h */
AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
double numGroups; /* estimated number of groups in input */
+ int32 transitionSpace; /* estimated transition state size */
List *groupClause; /* a list of SortGroupClause's */
List *qual; /* quals (HAVING quals), if any */
} AggPath;
@@ -1700,6 +1701,7 @@ typedef struct GroupingSetsPath
AggStrategy aggstrategy; /* basic strategy */
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
+ int32 transitionSpace; /* estimated transition state size */
} GroupingSetsPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 32c0d87f80e..f4183e1efa5 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -813,6 +813,7 @@ typedef struct Agg
Oid *grpOperators; /* equality operators to compare with */
Oid *grpCollations;
long numGroups; /* estimated number of groups in input */
+ int32 transitionSpace; /* estimated transition state size */
Bitmapset *aggParams; /* IDs of Params used in Aggref inputs */
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
List *groupingSets; /* grouping sets to use */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index cb012ba1980..5a0fbebd895 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -54,6 +54,7 @@ extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
extern PGDLLIMPORT bool enable_hashagg;
+extern PGDLLIMPORT bool enable_groupingsets_hash_disk;
extern PGDLLIMPORT bool enable_nestloop;
extern PGDLLIMPORT bool enable_material;
extern PGDLLIMPORT bool enable_mergejoin;
@@ -114,7 +115,7 @@ extern void cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples);
+ double input_tuples, double input_width);
extern void cost_windowagg(Path *path, PlannerInfo *root,
List *windowFuncs, int numPartCols, int numOrderCols,
Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index eab486a6214..c7bda2b0917 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -54,8 +54,8 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
- double dNumGroups, Plan *lefttree);
+ List *groupingSets, List *chain, double dNumGroups,
+ int32 transitionSpace, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
/*
diff --git a/src/include/utils/logtape.h b/src/include/utils/logtape.h
index 695d2c00ee4..3ebe52239f8 100644
--- a/src/include/utils/logtape.h
+++ b/src/include/utils/logtape.h
@@ -67,6 +67,8 @@ extern void LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum,
extern void LogicalTapeRewindForWrite(LogicalTapeSet *lts, int tapenum);
extern void LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum,
TapeShare *share);
+extern LogicalTapeSet *LogicalTapeSetExtend(LogicalTapeSet *lts,
+ int nAdditional);
extern size_t LogicalTapeBackspace(LogicalTapeSet *lts, int tapenum,
size_t size);
extern void LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index f457b5b150f..0073072a368 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2357,3 +2357,187 @@ explain (costs off)
-> Seq Scan on onek
(8 rows)
+--
+-- Hash Aggregation Spill tests
+--
+set enable_sort=false;
+set work_mem='64kB';
+select unique1, count(*), sum(twothousand) from tenk1
+group by unique1
+having sum(fivethous) > 4975
+order by sum(twothousand);
+ unique1 | count | sum
+---------+-------+------
+ 4976 | 1 | 976
+ 4977 | 1 | 977
+ 4978 | 1 | 978
+ 4979 | 1 | 979
+ 4980 | 1 | 980
+ 4981 | 1 | 981
+ 4982 | 1 | 982
+ 4983 | 1 | 983
+ 4984 | 1 | 984
+ 4985 | 1 | 985
+ 4986 | 1 | 986
+ 4987 | 1 | 987
+ 4988 | 1 | 988
+ 4989 | 1 | 989
+ 4990 | 1 | 990
+ 4991 | 1 | 991
+ 4992 | 1 | 992
+ 4993 | 1 | 993
+ 4994 | 1 | 994
+ 4995 | 1 | 995
+ 4996 | 1 | 996
+ 4997 | 1 | 997
+ 4998 | 1 | 998
+ 4999 | 1 | 999
+ 9976 | 1 | 1976
+ 9977 | 1 | 1977
+ 9978 | 1 | 1978
+ 9979 | 1 | 1979
+ 9980 | 1 | 1980
+ 9981 | 1 | 1981
+ 9982 | 1 | 1982
+ 9983 | 1 | 1983
+ 9984 | 1 | 1984
+ 9985 | 1 | 1985
+ 9986 | 1 | 1986
+ 9987 | 1 | 1987
+ 9988 | 1 | 1988
+ 9989 | 1 | 1989
+ 9990 | 1 | 1990
+ 9991 | 1 | 1991
+ 9992 | 1 | 1992
+ 9993 | 1 | 1993
+ 9994 | 1 | 1994
+ 9995 | 1 | 1995
+ 9996 | 1 | 1996
+ 9997 | 1 | 1997
+ 9998 | 1 | 1998
+ 9999 | 1 | 1999
+(48 rows)
+
+set work_mem to default;
+set enable_sort to default;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+set work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------------
+ GroupAggregate
+ Group Key: ((g % 100000))
+ -> Sort
+ Sort Key: ((g % 100000))
+ -> Function Scan on generate_series g
+(5 rows)
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+create table agg_group_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+set jit_above_cost to default;
+create table agg_group_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_group_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+-- Produce results with hash aggregation
+set enable_hashagg = true;
+set enable_sort = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 100000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+create table agg_hash_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+set jit_above_cost to default;
+create table agg_hash_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_hash_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+set enable_sort = true;
+set work_mem to default;
+-- Compare group aggregation results to hash aggregation results
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+ a | c1 | c2 | c3
+---+----+----+----
+(0 rows)
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_4 except select * from agg_group_4)
+ union all
+(select * from agg_group_4 except select * from agg_hash_4);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_group_4;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
+drop table agg_hash_4;
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index c1f802c88a7..dbe5140b558 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1633,4 +1633,126 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
| 1 | 2
(4 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low
+-- and turning on enable_groupingsets_hash_disk.
+--
+SET enable_groupingsets_hash_disk = true;
+SET work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+-- Produce results with hash aggregation.
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------
+ MixedAggregate
+ Hash Key: (g.g % 1000), (g.g % 100), (g.g % 10)
+ Hash Key: (g.g % 1000), (g.g % 100)
+ Hash Key: (g.g % 1000)
+ Hash Key: (g.g % 100), (g.g % 10)
+ Hash Key: (g.g % 100)
+ Hash Key: (g.g % 10), (g.g % 1000)
+ Hash Key: (g.g % 10)
+ Group Key: ()
+ -> Function Scan on generate_series g
+(10 rows)
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+set enable_sort = true;
+set work_mem to default;
+-- Compare results
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+ g100 | g10 | unnest | c | m
+------+-----+--------+---+---
+(0 rows)
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
+SET enable_groupingsets_hash_disk TO DEFAULT;
-- end
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1de..11c6f50fbfa 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -148,6 +148,68 @@ SELECT count(*) FROM
4
(1 row)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+SET enable_hashagg=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------------
+ Unique
+ -> Sort
+ Sort Key: ((g % 1000))
+ -> Function Scan on generate_series g
+(4 rows)
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_hashagg=TRUE;
+-- Produce results with hash aggregation.
+SET enable_sort=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 1000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_sort=TRUE;
+SET work_mem TO DEFAULT;
+-- Compare results
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+(SELECT * FROM distinct_hash_2 EXCEPT SELECT * FROM distinct_group_2)
+ UNION ALL
+(SELECT * FROM distinct_group_2 EXCEPT SELECT * FROM distinct_hash_2);
+ ?column?
+----------
+(0 rows)
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb9057..147486c2fc3 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -74,6 +74,7 @@ select name, setting from pg_settings where name like 'enable%';
--------------------------------+---------
enable_bitmapscan | on
enable_gathermerge | on
+ enable_groupingsets_hash_disk | off
enable_hashagg | on
enable_hashjoin | on
enable_indexonlyscan | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 3e593f2d615..02578330a6f 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1032,3 +1032,134 @@ select v||'a', case when v||'a' = 'aa' then 1 else 0 end, count(*)
explain (costs off)
select 1 from tenk1
where (hundred, thousand) in (select twothousand, twothousand from onek);
+
+--
+-- Hash Aggregation Spill tests
+--
+
+set enable_sort=false;
+set work_mem='64kB';
+
+select unique1, count(*), sum(twothousand) from tenk1
+group by unique1
+having sum(fivethous) > 4975
+order by sum(twothousand);
+
+set work_mem to default;
+set enable_sort to default;
+
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+set work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+
+set jit_above_cost to default;
+
+create table agg_group_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_group_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+-- Produce results with hash aggregation
+
+set enable_hashagg = true;
+set enable_sort = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+
+set jit_above_cost to default;
+
+create table agg_hash_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_hash_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare group aggregation results to hash aggregation results
+
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+
+(select * from agg_hash_4 except select * from agg_group_4)
+ union all
+(select * from agg_group_4 except select * from agg_hash_4);
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_group_4;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
+drop table agg_hash_4;
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 95ac3fb52f6..478f49ecab5 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -441,4 +441,107 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
from unnest(array[1,1], array['a','b']) u(i,v)
group by rollup(i, v||'a') order by 1,3;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low
+-- and turning on enable_groupingsets_hash_disk.
+--
+
+SET enable_groupingsets_hash_disk = true;
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+-- Produce results with hash aggregation.
+
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare results
+
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
+
+SET enable_groupingsets_hash_disk TO DEFAULT;
+
-- end
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449e..33102744ebf 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -45,6 +45,68 @@ SELECT count(*) FROM
SELECT count(*) FROM
(SELECT DISTINCT two, four, two FROM tenk1) ss;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+SET enable_hashagg=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_hashagg=TRUE;
+
+-- Produce results with hash aggregation.
+
+SET enable_sort=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_sort=TRUE;
+
+SET work_mem TO DEFAULT;
+
+-- Compare results
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+(SELECT * FROM distinct_hash_2 EXCEPT SELECT * FROM distinct_group_2)
+ UNION ALL
+(SELECT * FROM distinct_group_2 EXCEPT SELECT * FROM distinct_hash_2);
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
+
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
On Wed, 2020-02-26 at 19:14 -0800, Jeff Davis wrote:
> Rebased on your change. This simplified the JIT and interpretation
> code quite a bit.
Attached another version.
* tweaked EXPLAIN output some more
* rebased and cleaned up
* Added back the enable_hashagg_disk flag (defaulting to on). I've
gone back and forth on this, but it seems like a good idea to have it
there. So now there are a total of two GUCs: enable_hashagg_disk and
enable_groupingsets_hash_disk (a quick sketch of exercising both is
below).
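
A minimal sketch (not part of the patch) of exercising both GUCs and
the new EXPLAIN ANALYZE output. The query is lifted from the
regression tests; the field names in the comments follow
show_hashagg_info() in the attached patch, so treat the description
of the output as illustrative rather than exact:

set work_mem = '64kB';
-- on by default: planner may choose HashAgg even when it expects spilling
set enable_hashagg_disk = on;
-- off by default: same, but for hashed grouping sets
set enable_groupingsets_hash_disk = on;

explain (analyze, costs off)
select g % 100000 as c1, sum(g::numeric) as c2, count(*) as c3
  from generate_series(0, 199999) g
 group by g % 100000;
-- With ANALYZE, the HashAggregate node reports "Peak Memory Usage"
-- and, if it spilled, "Disk Usage" and "HashAgg Batches"; with costs
-- enabled it also reports "Planned Partitions" when a spill was
-- anticipated at plan time.

reset work_mem;
reset enable_hashagg_disk;
reset enable_groupingsets_hash_disk;
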
Unless I (or someone else) finds something significant, this is close
to commit.
Regards,
Jeff Davis
Attachments:
hashagg-20200311.patch (text/x-patch; charset=UTF-8)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 371d7838fb6..5e223c42208 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4462,6 +4462,24 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-groupingsets-hash-disk" xreflabel="enable_groupingsets_hash_disk">
+ <term><varname>enable_groupingsets_hash_disk</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_groupingsets_hash_disk</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation for
+ grouping sets when the size of the hash tables is expected to exceed
+ <varname>work_mem</varname>. See <xref
+ linkend="queries-grouping-sets"/>. Note that this setting only
+ affects the chosen plan; execution time may still require using
+ disk-based hash aggregation. The default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashagg" xreflabel="enable_hashagg">
<term><varname>enable_hashagg</varname> (<type>boolean</type>)
<indexterm>
@@ -4476,6 +4494,23 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-hashagg-disk" xreflabel="enable_hashagg_disk">
+ <term><varname>enable_hashagg_disk</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_hashagg_disk</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the memory usage is expected to exceed
+ <varname>work_mem</varname>. This only affects the planner choice;
+ execution time may still require using disk-based hash
+ aggregation. The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
<term><varname>enable_hashjoin</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d901dc4a50e..58141d8393c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -104,6 +104,7 @@ static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_hashagg_info(AggState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
static void show_instrumentation_count(const char *qlabel, int which,
@@ -1882,6 +1883,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Agg:
show_agg_keys(castNode(AggState, planstate), ancestors, es);
show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+ show_hashagg_info((AggState *) planstate, es);
if (plan->qual)
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
@@ -2769,6 +2771,41 @@ show_hash_info(HashState *hashstate, ExplainState *es)
}
}
+/*
+ * Show information on hash aggregate memory usage and batches.
+ */
+static void
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+
+ Assert(IsA(aggstate, AggState));
+
+ if (agg->aggstrategy != AGG_HASHED &&
+ agg->aggstrategy != AGG_MIXED)
+ return;
+
+ if (es->costs && aggstate->hash_planned_partitions > 0)
+ {
+ ExplainPropertyInteger("Planned Partitions", NULL,
+ aggstate->hash_planned_partitions, es);
+ }
+
+ if (!es->analyze)
+ return;
+
+ /* EXPLAIN ANALYZE */
+ ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
+ if (aggstate->hash_batches_used > 0)
+ {
+ ExplainPropertyInteger("Disk Usage", "kB",
+ aggstate->hash_disk_used, es);
+ ExplainPropertyInteger("HashAgg Batches", NULL,
+ aggstate->hash_batches_used, es);
+ }
+}
+
/*
* If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
*/
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 7aebb247d88..d5ab1769127 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -194,6 +194,29 @@
* transition values. hashcontext is the single context created to support
* all hash tables.
*
+ * Spilling To Disk
+ *
+ * When performing hash aggregation, if the hash table memory exceeds the
+ * limit (see hash_agg_check_limits()), we enter "spill mode". In spill
+ * mode, we advance the transition states only for groups already in the
+ * hash table. For tuples that would need to create new hash table
+ * entries (and initialize new transition states), we instead spill them to
+ * disk to be processed later. The tuples are spilled in a partitioned
+ * manner, so that subsequent batches are smaller and less likely to exceed
+ * work_mem (if a batch does exceed work_mem, it must be spilled
+ * recursively).
+ *
+ * Spilled data is written to logical tapes. These provide better control
+ * over memory usage, disk space, and the number of files than if we were
+ * to use a BufFile for each spill.
+ *
+ * Note that it's possible for transition states to start small but then
+ * grow very large; for instance in the case of ARRAY_AGG. In such cases,
+ * it's still possible to significantly exceed work_mem. We try to avoid
+ * this situation by estimating what will fit in the available memory, and
+ * imposing a limit on the number of groups separately from the amount of
+ * memory consumed.
+ *
* Transition / Combine function invocation:
*
* For performance reasons transition functions, including combine
@@ -233,12 +256,100 @@
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/datum.h"
+#include "utils/dynahash.h"
#include "utils/expandeddatum.h"
+#include "utils/logtape.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+/*
+ * Control how many partitions are created when spilling HashAgg to
+ * disk.
+ *
+ * HASHAGG_PARTITION_FACTOR is multiplied by the estimated number of
+ * partitions needed such that each partition will fit in memory. The factor
+ * is set higher than one because there's not a high cost to having a few too
+ * many partitions, and it makes it less likely that a partition will need to
+ * be spilled recursively. Another benefit of having more, smaller partitions
+ * is that small hash tables may perform better than large ones due to memory
+ * caching effects.
+ *
+ * We also specify a min and max number of partitions per spill. Too few might
+ * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
+ * many will result in lots of memory wasted buffering the spill files (which
+ * could instead be spent on a larger hash table).
+ *
+ * For reading from tapes, the buffer size must be a multiple of
+ * BLCKSZ. Larger values help when reading from multiple tapes concurrently,
+ * but that doesn't happen in HashAgg, so we simply use BLCKSZ. Writing to a
+ * tape always uses a buffer of size BLCKSZ.
+ */
+#define HASHAGG_PARTITION_FACTOR 1.50
+#define HASHAGG_MIN_PARTITIONS 4
+#define HASHAGG_MAX_PARTITIONS 256
+#define HASHAGG_MIN_BUCKETS 256
+#define HASHAGG_READ_BUFFER_SIZE BLCKSZ
+#define HASHAGG_WRITE_BUFFER_SIZE BLCKSZ
+
+/*
+ * Track all tapes needed for a HashAgg that spills. We don't know the maximum
+ * number of tapes needed at the start of the algorithm (because it can
+ * recurse), so one tape set is allocated and extended as needed for new
+ * tapes. When a particular tape is already read, rewind it for write mode and
+ * put it in the free list.
+ *
+ * Tapes' buffers can take up substantial memory when many tapes are open at
+ * once. We only need one tape open at a time in read mode (using a buffer
+ * that's a multiple of BLCKSZ); but we need one tape open in write mode (each
+ * requiring a buffer of size BLCKSZ) for each partition.
+ */
+typedef struct HashTapeInfo
+{
+ LogicalTapeSet *tapeset;
+ int ntapes;
+ int *freetapes;
+ int nfreetapes;
+} HashTapeInfo;
+
+/*
+ * Represents partitioned spill data for a single hashtable. Contains the
+ * necessary information to route tuples to the correct partition, and to
+ * transform the spilled data into new batches.
+ *
+ * The high bits are used for partition selection (when recursing, we ignore
+ * the bits that have already been used for partition selection at an earlier
+ * level).
+ */
+typedef struct HashAggSpill
+{
+ HashTapeInfo *tapeinfo; /* borrowed reference to tape info */
+ int npartitions; /* number of partitions */
+ int *partitions; /* spill partition tape numbers */
+ int64 *ntuples; /* number of tuples in each partition */
+ uint32 mask; /* mask to find partition from hash value */
+ int shift; /* after masking, shift by this amount */
+} HashAggSpill;
+
+/*
+ * Represents work to be done for one pass of hash aggregation (with only one
+ * grouping set).
+ *
+ * Also tracks the bits of the hash already used for partition selection by
+ * earlier iterations, so that this batch can use new bits. If all bits have
+ * already been used, no partitioning will be done (any spilled data will go
+ * to a single output tape).
+ */
+typedef struct HashAggBatch
+{
+ int setno; /* grouping set */
+ int used_bits; /* number of bits of hash already used */
+ HashTapeInfo *tapeinfo; /* borrowed reference to tape info */
+ int input_tapenum; /* input partition tape */
+ int64 input_tuples; /* number of tuples in this batch */
+} HashAggBatch;
+
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
static void initialize_phase(AggState *aggstate, int newphase);
static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -275,11 +386,41 @@ static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
static void build_hash_tables(AggState *aggstate);
static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
+static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
+ bool nullcheck);
+static long hash_choose_num_buckets(double hashentrysize,
+ long estimated_nbuckets,
+ Size memory);
+static int hash_choose_num_partitions(uint64 input_groups,
+ double hashentrysize,
+ int used_bits,
+ int *log2_npartitions);
static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
+static bool agg_refill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
+static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
+static void hash_agg_check_limits(AggState *aggstate);
+static void hash_agg_update_metrics(AggState *aggstate, bool from_tape,
+ int npartitions);
+static void hashagg_finish_initial_spills(AggState *aggstate);
+static void hashagg_reset_spill_state(AggState *aggstate);
+static HashAggBatch *hashagg_batch_new(HashTapeInfo *tapeinfo,
+ int input_tapenum, int setno,
+ int64 input_tuples, int used_bits);
+static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
+static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
+ int used_bits, uint64 input_tuples,
+ double hashentrysize);
+static Size hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot,
+ uint32 hash);
+static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
+ int setno);
+static void hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *dest,
+ int ndest);
+static void hashagg_tapeinfo_release(HashTapeInfo *tapeinfo, int tapenum);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
AggState *aggstate, EState *estate,
@@ -1264,7 +1405,7 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
}
/*
- * (Re-)initialize the hash table(s) to empty.
+ * (Re-)initialize the hash table(s).
*
* To implement hashed aggregation, we need a hashtable that stores a
* representative tuple and an array of AggStatePerGroup structs for each
@@ -1275,9 +1416,9 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
* We have a separate hashtable and associated perhash data structure for each
* grouping set for which we're doing hashing.
*
- * The contents of the hash tables always live in the hashcontext's per-tuple
- * memory context (there is only one of these for all tables together, since
- * they are all reset at the same time).
+ * The hash tables and their contents always live in the hashcontext's
+ * per-tuple memory context (there is only one of these for all tables
+ * together, since they are all reset at the same time).
*/
static void
build_hash_tables(AggState *aggstate)
@@ -1287,14 +1428,27 @@ build_hash_tables(AggState *aggstate)
for (setno = 0; setno < aggstate->num_hashes; ++setno)
{
AggStatePerHash perhash = &aggstate->perhash[setno];
+ long nbuckets;
+ Size memory;
+
+ if (perhash->hashtable != NULL)
+ {
+ ResetTupleHashTable(perhash->hashtable);
+ continue;
+ }
Assert(perhash->aggnode->numGroups > 0);
- if (perhash->hashtable)
- ResetTupleHashTable(perhash->hashtable);
- else
- build_hash_table(aggstate, setno, perhash->aggnode->numGroups);
+ memory = aggstate->hash_mem_limit / aggstate->num_hashes;
+
+ /* choose reasonable number of buckets per hashtable */
+ nbuckets = hash_choose_num_buckets(
+ aggstate->hashentrysize, perhash->aggnode->numGroups, memory);
+
+ build_hash_table(aggstate, setno, nbuckets);
}
+
+ aggstate->hash_ngroups_current = 0;
}
/*
@@ -1487,14 +1641,293 @@ hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
transitionSpace;
}
+/*
+ * hashagg_recompile_expressions()
+ *
+ * Identifies the right phase, compiles the right expression given the
+ * arguments, and then sets phase->evalfunc to that expression.
+ *
+ * Different versions of the compiled expression are needed depending on
+ * whether hash aggregation has spilled or not, and whether it's reading from
+ * the outer plan or a tape. Before spilling to disk, the expression reads
+ * from the outer plan and does not need to perform a NULL check. After
+ * HashAgg begins to spill, new groups will not be created in the hash table,
+ * and the AggStatePerGroup array may be NULL; therefore we need to add a null
+ * pointer check to the expression. Then, when reading spilled data from a
+ * tape, we change the outer slot type to be a fixed minimal tuple slot.
+ *
+ * It would be wasteful to recompile every time, so cache the compiled
+ * expressions in the AggStatePerPhase, and reuse when appropriate.
+ */
+static void
+hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
+{
+ AggStatePerPhase phase;
+ int i = minslot ? 1 : 0;
+ int j = nullcheck ? 1 : 0;
+
+ Assert(aggstate->aggstrategy == AGG_HASHED ||
+ aggstate->aggstrategy == AGG_MIXED);
+
+ if (aggstate->aggstrategy == AGG_HASHED)
+ phase = &aggstate->phases[0];
+ else /* AGG_MIXED */
+ phase = &aggstate->phases[1];
+
+ if (phase->evaltrans_cache[i][j] == NULL)
+ {
+ const TupleTableSlotOps *outerops = aggstate->ss.ps.outerops;
+ bool outerfixed = aggstate->ss.ps.outeropsfixed;
+ bool dohash = true;
+ bool dosort;
+
+ dosort = aggstate->aggstrategy == AGG_MIXED ? true : false;
+
+ /* temporarily change the outerops while compiling the expression */
+ if (minslot)
+ {
+ aggstate->ss.ps.outerops = &TTSOpsMinimalTuple;
+ aggstate->ss.ps.outeropsfixed = true;
+ }
+
+ phase->evaltrans_cache[i][j] = ExecBuildAggTrans(
+ aggstate, phase, dosort, dohash, nullcheck);
+
+ /* change back */
+ aggstate->ss.ps.outerops = outerops;
+ aggstate->ss.ps.outeropsfixed = outerfixed;
+ }
+
+ phase->evaltrans = phase->evaltrans_cache[i][j];
+}
+
+/*
+ * Set limits that trigger spilling to avoid exceeding work_mem. Consider the
+ * number of partitions we expect to create (if we do spill).
+ *
+ * There are two limits: a memory limit, and also an ngroups limit. The
+ * ngroups limit becomes important when we expect transition values to grow
+ * substantially larger than the initial value.
+ */
+void
+hash_agg_set_limits(double hashentrysize, uint64 input_groups, int used_bits,
+ Size *mem_limit, long *ngroups_limit, int *num_partitions)
+{
+ int npartitions;
+ Size partition_mem;
+
+ /* if not expected to spill, use all of work_mem */
+ if (input_groups * hashentrysize < work_mem * 1024L)
+ {
+ *mem_limit = work_mem * 1024L;
+ *ngroups_limit = *mem_limit / hashentrysize;
+ return;
+ }
+
+ /*
+ * Calculate expected memory requirements for spilling, which is the size
+ * of the buffers needed for all the tapes that need to be open at
+ * once. Then, subtract that from the memory available for holding hash
+ * tables.
+ */
+ npartitions = hash_choose_num_partitions(input_groups,
+ hashentrysize,
+ used_bits,
+ NULL);
+ if (num_partitions != NULL)
+ *num_partitions = npartitions;
+
+ partition_mem =
+ HASHAGG_READ_BUFFER_SIZE +
+ HASHAGG_WRITE_BUFFER_SIZE * npartitions;
+
+ /*
+ * Don't set the limit below 3/4 of work_mem. In that case, we are at the
+ * minimum number of partitions, so we aren't going to dramatically exceed
+ * work_mem anyway.
+ */
+ if (work_mem * 1024L > 4 * partition_mem)
+ *mem_limit = work_mem * 1024L - partition_mem;
+ else
+ *mem_limit = work_mem * 1024L * 0.75;
+
+ if (*mem_limit > hashentrysize)
+ *ngroups_limit = *mem_limit / hashentrysize;
+ else
+ *ngroups_limit = 1;
+}
+
+/*
+ * hash_agg_check_limits
+ *
+ * After adding a new group to the hash table, check whether we need to enter
+ * spill mode. Allocations may happen without adding new groups (for instance,
+ * if the transition state size grows), so this check is imperfect.
+ */
+static void
+hash_agg_check_limits(AggState *aggstate)
+{
+ long ngroups = aggstate->hash_ngroups_current;
+ Size hash_mem = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+
+ /*
+ * Don't spill unless there's at least one group in the hash table so we
+ * can be sure to make progress even in edge cases.
+ */
+ if (aggstate->hash_ngroups_current > 0 &&
+ (hash_mem > aggstate->hash_mem_limit ||
+ ngroups > aggstate->hash_ngroups_limit))
+ {
+ aggstate->hash_spill_mode = true;
+ hashagg_recompile_expressions(aggstate,
+ aggstate->table_filled,
+ true);
+
+ if (!aggstate->hash_ever_spilled)
+ {
+ aggstate->hash_ever_spilled = true;
+ aggstate->hash_spills = palloc0(
+ sizeof(HashAggSpill) * aggstate->num_hashes);
+ aggstate->hash_tapeinfo = palloc0(sizeof(HashTapeInfo));
+ }
+ }
+}
+
+/*
+ * Update metrics after filling the hash table.
+ *
+ * If reading from the outer plan, from_tape should be false; if reading from
+ * another tape, from_tape should be true.
+ */
+static void
+hash_agg_update_metrics(AggState *aggstate, bool from_tape, int npartitions)
+{
+ Size partition_mem = 0;
+ Size hash_mem = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+
+ if (aggstate->aggstrategy != AGG_MIXED &&
+ aggstate->aggstrategy != AGG_HASHED)
+ return;
+
+ /* update hashentrysize estimate based on contents */
+ if (aggstate->hash_ngroups_current > 0)
+ {
+ aggstate->hashentrysize =
+ hash_mem / (double)aggstate->hash_ngroups_current;
+ }
+
+ /*
+ * Calculate peak memory usage, which includes memory for partition tapes'
+ * read/write buffers.
+ */
+ if (from_tape)
+ partition_mem += HASHAGG_READ_BUFFER_SIZE;
+ partition_mem += npartitions * HASHAGG_WRITE_BUFFER_SIZE;
+
+ if (hash_mem + partition_mem > aggstate->hash_mem_peak)
+ aggstate->hash_mem_peak = hash_mem + partition_mem;
+
+ /* update disk usage */
+ if (aggstate->hash_tapeinfo != NULL &&
+ aggstate->hash_tapeinfo->tapeset != NULL)
+ {
+ uint64 disk_used = LogicalTapeSetBlocks(
+ aggstate->hash_tapeinfo->tapeset) * (BLCKSZ / 1024);
+
+ if (aggstate->hash_disk_used < disk_used)
+ aggstate->hash_disk_used = disk_used;
+ }
+}
+
+/*
+ * Choose a reasonable number of buckets for the initial hash table size.
+ */
+static long
+hash_choose_num_buckets(double hashentrysize, long ngroups, Size memory)
+{
+ long max_nbuckets;
+ long nbuckets = ngroups;
+
+ max_nbuckets = memory / hashentrysize;
+
+ /*
+ * Leave room for slop to avoid a case where the initial hash table size
+ * exceeds the memory limit (though that may still happen in edge cases).
+ */
+ max_nbuckets *= 0.75;
+
+ if (nbuckets > max_nbuckets)
+ nbuckets = max_nbuckets;
+ if (nbuckets < HASHAGG_MIN_BUCKETS)
+ nbuckets = HASHAGG_MIN_BUCKETS;
+ return nbuckets;
+}
+
+/*
+ * Determine the number of partitions to create when spilling, which will
+ * always be a power of two. If log2_npartitions is non-NULL, set
+ * *log2_npartitions to the log2() of the number of partitions.
+ */
+static int
+hash_choose_num_partitions(uint64 input_groups, double hashentrysize,
+ int used_bits, int *log2_npartitions)
+{
+ Size mem_wanted;
+ int partition_limit;
+ int npartitions;
+ int partition_bits;
+
+ /*
+ * Avoid creating so many partitions that the memory requirements of the
+ * open partition files are greater than 1/4 of work_mem.
+ */
+ partition_limit =
+ (work_mem * 1024L * 0.25 - HASHAGG_READ_BUFFER_SIZE) /
+ HASHAGG_WRITE_BUFFER_SIZE;
+
+ mem_wanted = HASHAGG_PARTITION_FACTOR * input_groups * hashentrysize;
+
+ /* make enough partitions so that each one is likely to fit in memory */
+ npartitions = 1 + (mem_wanted / (work_mem * 1024L));
+
+ if (npartitions > partition_limit)
+ npartitions = partition_limit;
+
+ if (npartitions < HASHAGG_MIN_PARTITIONS)
+ npartitions = HASHAGG_MIN_PARTITIONS;
+ if (npartitions > HASHAGG_MAX_PARTITIONS)
+ npartitions = HASHAGG_MAX_PARTITIONS;
+
+ /* ceil(log2(npartitions)) */
+ partition_bits = my_log2(npartitions);
+
+ /* make sure that we don't exhaust the hash bits */
+ if (partition_bits + used_bits >= 32)
+ partition_bits = 32 - used_bits;
+
+ if (log2_npartitions != NULL)
+ *log2_npartitions = partition_bits;
+
+ /* number of partitions will be a power of two */
+ npartitions = 1L << partition_bits;
+
+ return npartitions;
+}
+
/*
* Find or create a hashtable entry for the tuple group containing the current
* tuple (already set in tmpcontext's outertuple slot), in the current grouping
* set (which the caller must have selected - note that initialize_aggregate
* depends on this).
*
- * When called, CurrentMemoryContext should be the per-query context. The
- * already-calculated hash value for the tuple must be specified.
+ * When called, CurrentMemoryContext should be the per-query context.
+ *
+ * If the hash table is at the memory limit, then only find existing hashtable
+ * entries; don't create new ones. If a tuple's group is not already present
+ * in the hash table for the current grouping set, return NULL and the caller
+ * will spill it to disk.
*/
static AggStatePerGroup
lookup_hash_entry(AggState *aggstate, uint32 hash)
@@ -1502,16 +1935,26 @@ lookup_hash_entry(AggState *aggstate, uint32 hash)
AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
TupleTableSlot *hashslot = perhash->hashslot;
TupleHashEntryData *entry;
- bool isnew;
+ bool isnew = false;
+ bool *p_isnew;
+
+ /* if hash table already spilled, don't create new entries */
+ p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;
/* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, &isnew,
+ entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, p_isnew,
hash);
+ if (entry == NULL)
+ return NULL;
+
if (isnew)
{
- AggStatePerGroup pergroup;
- int transno;
+ AggStatePerGroup pergroup;
+ int transno;
+
+ aggstate->hash_ngroups_current++;
+ hash_agg_check_limits(aggstate);
pergroup = (AggStatePerGroup)
MemoryContextAlloc(perhash->hashtable->tablecxt,
@@ -1539,23 +1982,48 @@ lookup_hash_entry(AggState *aggstate, uint32 hash)
* returning an array of pergroup pointers suitable for advance_aggregates.
*
* Be aware that lookup_hash_entry can reset the tmpcontext.
+ *
+ * Some entries may be left NULL if we have reached the limit and have begun
+ * to spill. The same tuple will belong to different groups for each set, so
+ * it may match a group already in memory for one set and a group not in
+ * memory for another set. If we have begun to spill and a tuple doesn't match
+ * a group in memory for a particular set, it will be spilled.
+ *
+ * NB: It's possible to spill the same tuple for several different grouping
+ * sets. This may seem wasteful, but it's actually a trade-off: if we spill
+ * the tuple multiple times for multiple grouping sets, it can be partitioned
+ * for each grouping set, making the refilling of the hash table very
+ * efficient.
*/
static void
lookup_hash_entries(AggState *aggstate)
{
- int numHashes = aggstate->num_hashes;
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
int setno;
- for (setno = 0; setno < numHashes; setno++)
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
{
- AggStatePerHash perhash = &aggstate->perhash[setno];
+ AggStatePerHash perhash = &aggstate->perhash[setno];
uint32 hash;
select_current_set(aggstate, setno, true);
prepare_hash_slot(aggstate);
hash = TupleHashTableHash(perhash->hashtable, perhash->hashslot);
pergroup[setno] = lookup_hash_entry(aggstate, hash);
+
+ /* check to see if we need to spill the tuple for this grouping set */
+ if (pergroup[setno] == NULL)
+ {
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ TupleTableSlot *slot = aggstate->tmpcontext->ecxt_outertuple;
+
+ if (spill->partitions == NULL)
+ hashagg_spill_init(spill, aggstate->hash_tapeinfo, 0,
+ perhash->aggnode->numGroups,
+ aggstate->hashentrysize);
+
+ hashagg_spill_tuple(spill, slot, hash);
+ }
}
}
@@ -1878,6 +2346,12 @@ agg_retrieve_direct(AggState *aggstate)
if (TupIsNull(outerslot))
{
/* no more outer-plan tuples available */
+
+ /* if we built hash tables, finalize any spills */
+ if (aggstate->aggstrategy == AGG_MIXED &&
+ aggstate->current_phase == 1)
+ hashagg_finish_initial_spills(aggstate);
+
if (hasGroupingSets)
{
aggstate->input_done = true;
@@ -1980,6 +2454,10 @@ agg_fill_hash_table(AggState *aggstate)
ResetExprContext(aggstate->tmpcontext);
}
+ /* finalize spills, if any */
+ hashagg_finish_initial_spills(aggstate);
+
+ aggstate->input_done = true;
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
@@ -1987,11 +2465,182 @@ agg_fill_hash_table(AggState *aggstate)
&aggstate->perhash[0].hashiter);
}
+/*
+ * If any data was spilled during hash aggregation, reset the hash table and
+ * reprocess one batch of spilled data. After reprocessing a batch, the hash
+ * table will again contain data, ready to be consumed by
+ * agg_retrieve_hash_table_in_memory().
+ *
+ * Should only be called after all in-memory hash table entries have been
+ * consumed.
+ *
+ * Return false when input is exhausted and there's no more work to be done;
+ * otherwise return true.
+ */
+static bool
+agg_refill_hash_table(AggState *aggstate)
+{
+ HashAggBatch *batch;
+ HashAggSpill spill;
+ HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
+ long nbuckets;
+ int setno;
+
+ if (aggstate->hash_batches == NIL)
+ return false;
+
+ spill.npartitions = 0;
+ spill.partitions = NULL;
+ /*
+ * Each spill file contains spilled data for only a single grouping
+ * set. We want to ignore all others, which is done by setting the other
+ * pergroups to NULL.
+ */
+ memset(aggstate->all_pergroups, 0,
+ sizeof(AggStatePerGroup) *
+ (aggstate->maxsets + aggstate->num_hashes));
+
+ batch = linitial(aggstate->hash_batches);
+ aggstate->hash_batches = list_delete_first(aggstate->hash_batches);
+
+ /* pessimistically estimate that input tuples are equal to input groups */
+ hash_agg_set_limits(aggstate->hashentrysize, batch->input_tuples,
+ batch->used_bits, &aggstate->hash_mem_limit,
+ &aggstate->hash_ngroups_limit, NULL);
+
+ /* free memory and reset hash tables */
+ ReScanExprContext(aggstate->hashcontext);
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ ResetTupleHashTable(aggstate->perhash[setno].hashtable);
+
+ /* build a single new hashtable for this grouping set */
+ nbuckets = hash_choose_num_buckets(
+ aggstate->hashentrysize, batch->input_tuples,
+ aggstate->hash_mem_limit);
+ build_hash_table(aggstate, batch->setno, nbuckets);
+ aggstate->hash_ngroups_current = 0;
+
+ Assert(aggstate->current_phase == 0);
+
+ if (aggstate->phase->aggstrategy == AGG_MIXED)
+ {
+ aggstate->current_phase = 1;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+ }
+
+ /*
+ * The first pass (agg_fill_hash_table()) reads whatever kind of slot comes
+ * from the outer plan, and considers the slot fixed. But spilled tuples
+ * are always MinimalTuples, so we need to recompile the aggregate
+ * expressions.
+ *
+ * We still need the NULL check, because we are only processing one
+ * grouping set at a time and the rest will be NULL.
+ */
+ hashagg_recompile_expressions(aggstate, true, true);
+
+ LogicalTapeRewindForRead(tapeinfo->tapeset, batch->input_tapenum,
+ HASHAGG_READ_BUFFER_SIZE);
+ for (;;) {
+ TupleTableSlot *slot = aggstate->hash_spill_slot;
+ MinimalTuple tuple;
+ uint32 hash;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tuple = hashagg_batch_read(batch, &hash);
+ if (tuple == NULL)
+ break;
+
+ ExecStoreMinimalTuple(tuple, slot, true);
+ aggstate->tmpcontext->ecxt_outertuple = slot;
+
+ select_current_set(aggstate, batch->setno, true);
+ prepare_hash_slot(aggstate);
+ aggstate->hash_pergroup[batch->setno] = lookup_hash_entry(aggstate, hash);
+
+ /* if there's no memory for a new group, spill */
+ if (aggstate->hash_pergroup[batch->setno] == NULL)
+ {
+ /*
+ * Estimate the number of groups for this batch as the total
+ * number of tuples in its input file. Although that's a worst
+ * case, it's not bad here for two reasons: (1) overestimating
+ * is better than underestimating; and (2) we've already
+ * scanned the relation once, so it's likely that we've
+ * already finalized many of the common values.
+ */
+ if (spill.partitions == NULL)
+ hashagg_spill_init(&spill, tapeinfo, batch->used_bits,
+ batch->input_tuples,
+ aggstate->hashentrysize);
+
+ hashagg_spill_tuple(&spill, slot, hash);
+ }
+
+ /* Advance the aggregates (or combine functions) */
+ advance_aggregates(aggstate);
+
+ /*
+ * Reset per-input-tuple context after each tuple, but note that the
+ * hash lookups do this too
+ */
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ hashagg_tapeinfo_release(tapeinfo, batch->input_tapenum);
+
+ aggstate->current_phase = 0;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+
+ hash_agg_update_metrics(aggstate, true, spill.npartitions);
+ hashagg_spill_finish(aggstate, &spill, batch->setno);
+ aggstate->hash_spill_mode = false;
+
+ /* Initialize to walk the first hash table */
+ select_current_set(aggstate, batch->setno, true);
+ ResetTupleHashIterator(aggstate->perhash[batch->setno].hashtable,
+ &aggstate->perhash[batch->setno].hashiter);
+
+ pfree(batch);
+
+ return true;
+}
+
/*
* ExecAgg for hashed case: retrieving groups from hash table
+ *
+ * After exhausting in-memory tuples, also try refilling the hash table using
+ * previously-spilled tuples. Only returns NULL after all in-memory and
+ * spilled tuples are exhausted.
*/
static TupleTableSlot *
agg_retrieve_hash_table(AggState *aggstate)
+{
+ TupleTableSlot *result = NULL;
+
+ while (result == NULL)
+ {
+ result = agg_retrieve_hash_table_in_memory(aggstate);
+ if (result == NULL)
+ {
+ if (!agg_refill_hash_table(aggstate))
+ {
+ aggstate->agg_done = true;
+ break;
+ }
+ }
+ }
+
+ return result;
+}
+
+/*
+ * Retrieve the groups from the in-memory hash tables without considering any
+ * spilled tuples.
+ */
+static TupleTableSlot *
+agg_retrieve_hash_table_in_memory(AggState *aggstate)
{
ExprContext *econtext;
AggStatePerAgg peragg;
@@ -2020,7 +2669,7 @@ agg_retrieve_hash_table(AggState *aggstate)
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
- while (!aggstate->agg_done)
+ for (;;)
{
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -2051,8 +2700,6 @@ agg_retrieve_hash_table(AggState *aggstate)
}
else
{
- /* No more hashtables, so done */
- aggstate->agg_done = true;
return NULL;
}
}
@@ -2109,6 +2756,292 @@ agg_retrieve_hash_table(AggState *aggstate)
return NULL;
}
+/*
+ * Assign unused tapes to spill partitions, extending the tape set if
+ * necessary.
+ */
+static void
+hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *partitions,
+ int npartitions)
+{
+ int partidx = 0;
+
+ /* use free tapes if available */
+ while (partidx < npartitions && tapeinfo->nfreetapes > 0)
+ partitions[partidx++] = tapeinfo->freetapes[--tapeinfo->nfreetapes];
+
+ if (tapeinfo->tapeset == NULL)
+ tapeinfo->tapeset = LogicalTapeSetCreate(npartitions, NULL, NULL, -1);
+ else if (partidx < npartitions)
+ LogicalTapeSetExtend(tapeinfo->tapeset, npartitions - partidx);
+
+ while (partidx < npartitions)
+ partitions[partidx++] = tapeinfo->ntapes++;
+}
+
+/*
+ * After a tape has already been written to and then read, this function
+ * rewinds it for writing and adds it to the free list.
+ */
+static void
+hashagg_tapeinfo_release(HashTapeInfo *tapeinfo, int tapenum)
+{
+ LogicalTapeRewindForWrite(tapeinfo->tapeset, tapenum);
+ if (tapeinfo->freetapes == NULL)
+ tapeinfo->freetapes = palloc(sizeof(int));
+ else
+ tapeinfo->freetapes = repalloc(
+ tapeinfo->freetapes, sizeof(int) * (tapeinfo->nfreetapes + 1));
+ tapeinfo->freetapes[tapeinfo->nfreetapes++] = tapenum;
+}
+
+/*
+ * hashagg_spill_init
+ *
+ * Called after we determined that spilling is necessary. Chooses the number
+ * of partitions to create, and initializes them.
+ */
+static void
+hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo, int used_bits,
+ uint64 input_groups, double hashentrysize)
+{
+ int npartitions;
+ int partition_bits;
+
+ npartitions = hash_choose_num_partitions(
+ input_groups, hashentrysize, used_bits, &partition_bits);
+
+ spill->partitions = palloc0(sizeof(int) * npartitions);
+ spill->ntuples = palloc0(sizeof(int64) * npartitions);
+
+ hashagg_tapeinfo_assign(tapeinfo, spill->partitions, npartitions);
+
+ spill->tapeinfo = tapeinfo;
+ spill->shift = 32 - used_bits - partition_bits;
+ spill->mask = (npartitions - 1) << spill->shift;
+ spill->npartitions = npartitions;
+}
+
+/*
+ * hashagg_spill_tuple
+ *
+ * No room for new groups in the hash table. Save for later in the appropriate
+ * partition.
+ */
+static Size
+hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot, uint32 hash)
+{
+ LogicalTapeSet *tapeset = spill->tapeinfo->tapeset;
+ int partition;
+ MinimalTuple tuple;
+ int tapenum;
+ int total_written = 0;
+ bool shouldFree;
+
+ Assert(spill->partitions != NULL);
+
+ /* XXX: may contain unnecessary attributes, should project */
+ tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+ partition = (hash & spill->mask) >> spill->shift;
+ spill->ntuples[partition]++;
+
+ tapenum = spill->partitions[partition];
+
+ LogicalTapeWrite(tapeset, tapenum, (void *) &hash, sizeof(uint32));
+ total_written += sizeof(uint32);
+
+ LogicalTapeWrite(tapeset, tapenum, (void *) tuple, tuple->t_len);
+ total_written += tuple->t_len;
+
+ if (shouldFree)
+ pfree(tuple);
+
+ return total_written;
+}
+
+/*
+ * hashagg_batch_new
+ *
+ * Construct a HashAggBatch item, which represents one iteration of HashAgg to
+ * be done.
+ */
+static HashAggBatch *
+hashagg_batch_new(HashTapeInfo *tapeinfo, int tapenum, int setno,
+ int64 input_tuples, int used_bits)
+{
+ HashAggBatch *batch = palloc0(sizeof(HashAggBatch));
+
+ batch->setno = setno;
+ batch->used_bits = used_bits;
+ batch->tapeinfo = tapeinfo;
+ batch->input_tapenum = tapenum;
+ batch->input_tuples = input_tuples;
+
+ return batch;
+}
+
+/*
+ * hashagg_batch_read
+ * read the next tuple from a batch file. Return NULL if no more.
+ */
+static MinimalTuple
+hashagg_batch_read(HashAggBatch *batch, uint32 *hashp)
+{
+ LogicalTapeSet *tapeset = batch->tapeinfo->tapeset;
+ int tapenum = batch->input_tapenum;
+ MinimalTuple tuple;
+ uint32 t_len;
+ size_t nread;
+ uint32 hash;
+
+ nread = LogicalTapeRead(tapeset, tapenum, &hash, sizeof(uint32));
+ if (nread == 0)
+ return NULL;
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, sizeof(uint32), nread)));
+ if (hashp != NULL)
+ *hashp = hash;
+
+ nread = LogicalTapeRead(tapeset, tapenum, &t_len, sizeof(t_len));
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, sizeof(uint32), nread)));
+
+ tuple = (MinimalTuple) palloc(t_len);
+ tuple->t_len = t_len;
+
+ nread = LogicalTapeRead(tapeset, tapenum,
+ (void *)((char *)tuple + sizeof(uint32)),
+ t_len - sizeof(uint32));
+ if (nread != t_len - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, t_len - sizeof(uint32), nread)));
+
+ return tuple;
+}
+
+/*
+ * hashagg_finish_initial_spills
+ *
+ * After a HashAggBatch has been processed, it may have spilled tuples to
+ * disk. If so, turn the spilled partitions into new batches that must later
+ * be executed.
+ */
+static void
+hashagg_finish_initial_spills(AggState *aggstate)
+{
+ int setno;
+ int total_npartitions = 0;
+
+ if (aggstate->hash_spills != NULL)
+ {
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ {
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ total_npartitions += spill->npartitions;
+ hashagg_spill_finish(aggstate, spill, setno);
+ }
+
+ /*
+ * We're not processing tuples from outer plan any more; only
+ * processing batches of spilled tuples. The initial spill structures
+ * are no longer needed.
+ */
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+ }
+
+ hash_agg_update_metrics(aggstate, false, total_npartitions);
+ aggstate->hash_spill_mode = false;
+}
+
+/*
+ * hashagg_spill_finish
+ *
+ * Transform spill partitions into new batches.
+ */
+static void
+hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno)
+{
+ int i;
+ int used_bits = 32 - spill->shift;
+
+ if (spill->npartitions == 0)
+ return; /* didn't spill */
+
+ for (i = 0; i < spill->npartitions; i++)
+ {
+ int tapenum = spill->partitions[i];
+ HashAggBatch *new_batch;
+
+ new_batch = hashagg_batch_new(aggstate->hash_tapeinfo,
+ tapenum, setno, spill->ntuples[i],
+ used_bits);
+ aggstate->hash_batches = lcons(new_batch, aggstate->hash_batches);
+ aggstate->hash_batches_used++;
+ }
+
+ pfree(spill->ntuples);
+ pfree(spill->partitions);
+}
+
+/*
+ * Free resources related to a spilled HashAgg.
+ */
+static void
+hashagg_reset_spill_state(AggState *aggstate)
+{
+ ListCell *lc;
+
+ /* free spills from initial pass */
+ if (aggstate->hash_spills != NULL)
+ {
+ int setno;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ {
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ if (spill->ntuples != NULL)
+ pfree(spill->ntuples);
+ if (spill->partitions != NULL)
+ pfree(spill->partitions);
+ }
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+ }
+
+ /* free batches */
+ foreach(lc, aggstate->hash_batches)
+ {
+ HashAggBatch *batch = (HashAggBatch*) lfirst(lc);
+ pfree(batch);
+ }
+ list_free(aggstate->hash_batches);
+ aggstate->hash_batches = NIL;
+
+ /* close tape set */
+ if (aggstate->hash_tapeinfo != NULL)
+ {
+ HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
+
+ if (tapeinfo->tapeset != NULL)
+ LogicalTapeSetClose(tapeinfo->tapeset);
+ if (tapeinfo->freetapes != NULL)
+ pfree(tapeinfo->freetapes);
+ pfree(tapeinfo);
+ aggstate->hash_tapeinfo = NULL;
+ }
+}
+
+
/* -----------------
* ExecInitAgg
*
@@ -2518,9 +3451,26 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
*/
if (use_hashing)
{
+ Plan *outerplan = outerPlan(node);
+ long totalGroups = 0;
+ int i;
+
+ aggstate->hash_spill_slot = ExecInitExtraTupleSlot(
+ estate, scanDesc, &TTSOpsMinimalTuple);
+
/* this is an array of pointers, not structures */
aggstate->hash_pergroup = pergroups;
+ aggstate->hashentrysize = hash_agg_entry_size(
+ aggstate->numtrans, outerplan->plan_width, node->transitionSpace);
+
+ for (i = 0; i < aggstate->num_hashes; i++)
+ totalGroups += aggstate->perhash[i].aggnode->numGroups;
+
+ hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
+ &aggstate->hash_mem_limit,
+ &aggstate->hash_ngroups_limit,
+ &aggstate->hash_planned_partitions);
find_hash_columns(aggstate);
build_hash_tables(aggstate);
aggstate->table_filled = false;
@@ -2931,6 +3881,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash,
false);
+ /* cache compiled expression for outer slot without NULL check */
+ phase->evaltrans_cache[0][0] = phase->evaltrans;
}
return aggstate;
@@ -3424,6 +4376,8 @@ ExecEndAgg(AggState *node)
if (node->sort_out)
tuplesort_end(node->sort_out);
+ hashagg_reset_spill_state(node);
+
for (transno = 0; transno < node->numtrans; transno++)
{
AggStatePerTrans pertrans = &node->pertrans[transno];
@@ -3479,12 +4433,13 @@ ExecReScanAgg(AggState *node)
return;
/*
- * If we do have the hash table, and the subplan does not have any
- * parameter changes, and none of our own parameter changes affect
- * input expressions of the aggregated functions, then we can just
- * rescan the existing hash table; no need to build it again.
+ * If we do have the hash table, and it never spilled, and the subplan
+ * does not have any parameter changes, and none of our own parameter
+ * changes affect input expressions of the aggregated functions, then
+ * we can just rescan the existing hash table; no need to build it
+ * again.
*/
- if (outerPlan->chgParam == NULL &&
+ if (outerPlan->chgParam == NULL && !node->hash_ever_spilled &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
ResetTupleHashIterator(node->perhash[0].hashtable,
@@ -3541,11 +4496,19 @@ ExecReScanAgg(AggState *node)
*/
if (node->aggstrategy == AGG_HASHED || node->aggstrategy == AGG_MIXED)
{
+ hashagg_reset_spill_state(node);
+
+ node->hash_ever_spilled = false;
+ node->hash_spill_mode = false;
+ node->hash_ngroups_current = 0;
+
ReScanExprContext(node->hashcontext);
/* Rebuild an empty hash table */
build_hash_tables(node);
node->table_filled = false;
/* iterator will be reset when the table is filled */
+
+ hashagg_recompile_expressions(node, false, false);
}
if (node->aggstrategy != AGG_HASHED)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b5a0033721f..1cb5d0d6751 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -77,6 +77,7 @@
#include "access/htup_details.h"
#include "access/tsmapi.h"
#include "executor/executor.h"
+#include "executor/nodeAgg.h"
#include "executor/nodeHash.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -128,6 +129,8 @@ bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
bool enable_hashagg = true;
+bool enable_hashagg_disk = true;
+bool enable_groupingsets_hash_disk = false;
bool enable_nestloop = true;
bool enable_material = true;
bool enable_mergejoin = true;
@@ -2153,7 +2156,7 @@ cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples)
+ double input_tuples, double input_width)
{
double output_tuples;
Cost startup_cost;
@@ -2228,14 +2231,79 @@ cost_agg(Path *path, PlannerInfo *root,
startup_cost += disable_cost;
startup_cost += aggcosts->transCost.startup;
startup_cost += aggcosts->transCost.per_tuple * input_tuples;
+ /* cost of computing hash value */
startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;
startup_cost += aggcosts->finalCost.startup;
+
total_cost = startup_cost;
total_cost += aggcosts->finalCost.per_tuple * numGroups;
+ /* cost of retrieving from hash table */
total_cost += cpu_tuple_cost * numGroups;
output_tuples = numGroups;
}
+ /*
+ * Add the disk costs of hash aggregation that spills to disk.
+ *
+ * Groups that go into the hash table stay in memory until finalized,
+ * so spilling and reprocessing tuples doesn't incur additional
+ * invocations of transCost or finalCost. Furthermore, the computed
+ * hash value is stored with the spilled tuples, so we don't incur
+ * extra invocations of the hash function.
+ *
+ * Hash Agg begins returning tuples after the first batch is
+ * complete. Accrue writes (spilled tuples) to startup_cost and to
+ * total_cost; accrue reads only to total_cost.
+ */
+ if (aggstrategy == AGG_HASHED || aggstrategy == AGG_MIXED)
+ {
+ double pages_written = 0.0;
+ double pages_read = 0.0;
+ double hashentrysize;
+ double nbatches;
+ Size mem_limit;
+ long ngroups_limit;
+ int num_partitions;
+
+
+ /*
+ * Estimate number of batches based on the computed limits. If less
+ * than or equal to one, all groups are expected to fit in memory;
+ * otherwise we expect to spill.
+ */
+ hashentrysize = hash_agg_entry_size(
+ aggcosts->numAggs, input_width, aggcosts->transitionSpace);
+ hash_agg_set_limits(hashentrysize, numGroups, 0, &mem_limit,
+ &ngroups_limit, &num_partitions);
+
+ nbatches = Max( (numGroups * hashentrysize) / mem_limit,
+ numGroups / ngroups_limit );
+
+ /*
+ * Estimate number of pages read and written. For each level of
+ * recursion, a tuple must be written and then later read.
+ */
+ if (nbatches > 1.0)
+ {
+ double depth;
+ double pages;
+
+ pages = relation_byte_size(input_tuples, input_width) / BLCKSZ;
+
+ /*
+ * The number of partitions can change at different levels of
+ * recursion; but for the purposes of this calculation assume it
+ * stays constant.
+ */
+ depth = ceil( log(nbatches - 1) / log(num_partitions) );
+ pages_written = pages_read = pages * depth;
+ }
+
+ startup_cost += pages_written * random_page_cost;
+ total_cost += pages_written * random_page_cost;
+ total_cost += pages_read * seq_page_cost;
+ }
+
/*
* If there are quals (HAVING quals), account for their cost and
* selectivity.
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b44efd6314c..eb25c2f4707 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4258,11 +4258,12 @@ consider_groupingsets_paths(PlannerInfo *root,
dNumGroups - exclude_groups);
/*
- * gd->rollups is empty if we have only unsortable columns to work
- * with. Override work_mem in that case; otherwise, we'll rely on the
- * sorted-input case to generate usable mixed paths.
+ * If we have sortable columns to work with (gd->rollups is non-empty)
+ * and enable_groupingsets_hash_disk is disabled, don't generate
+ * hash-based paths that will exceed work_mem.
*/
- if (hashsize > work_mem * 1024L && gd->rollups)
+ if (!enable_groupingsets_hash_disk &&
+ hashsize > work_mem * 1024L && gd->rollups)
return; /* nope, won't fit */
/*
@@ -6528,7 +6529,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
* were unable to sort above, then we'd better generate a Path, so
* that we at least have one.
*/
- if (hashaggtablesize < work_mem * 1024L ||
+ if (enable_hashagg_disk ||
+ hashaggtablesize < work_mem * 1024L ||
grouped_rel->pathlist == NIL)
{
/*
@@ -6561,7 +6563,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
agg_final_costs,
dNumGroups);
- if (hashaggtablesize < work_mem * 1024L)
+ if (enable_hashagg_disk ||
+ hashaggtablesize < work_mem * 1024L)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6830,7 +6833,7 @@ create_partial_grouping_paths(PlannerInfo *root,
* Tentatively produce a partial HashAgg Path, depending on if it
* looks as if the hash table will fit in work_mem.
*/
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_disk || hashaggtablesize < work_mem * 1024L) &&
cheapest_total_path != NULL)
{
add_path(partially_grouped_rel, (Path *)
@@ -6857,7 +6860,7 @@ create_partial_grouping_paths(PlannerInfo *root,
dNumPartialPartialGroups);
/* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_disk || hashaggtablesize < work_mem * 1024L) &&
cheapest_partial_path != NULL)
{
add_partial_path(partially_grouped_rel, (Path *)
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 1a23e18970d..951aed80e7a 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1072,7 +1072,7 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
numGroupCols, dNumGroups,
NIL,
input_path->startup_cost, input_path->total_cost,
- input_path->rows);
+ input_path->rows, input_path->pathtarget->width);
/*
* Now for the sorted case. Note that the input is *always* unsorted,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d9ce5162116..8ba8122ee2f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1704,7 +1704,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
NIL,
subpath->startup_cost,
subpath->total_cost,
- rel->rows);
+ rel->rows,
+ subpath->pathtarget->width);
}
if (sjinfo->semi_can_btree && sjinfo->semi_can_hash)
@@ -2958,7 +2959,7 @@ create_agg_path(PlannerInfo *root,
list_length(groupClause), numGroups,
qual,
subpath->startup_cost, subpath->total_cost,
- subpath->rows);
+ subpath->rows, subpath->pathtarget->width);
/* add tlist eval cost for each output row */
pathnode->path.startup_cost += target->cost.startup;
@@ -3069,7 +3070,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
subpath->startup_cost,
subpath->total_cost,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
is_first = false;
if (!rollup->is_hashed)
is_first_sort = false;
@@ -3092,7 +3094,8 @@ create_groupingsets_path(PlannerInfo *root,
rollup->numGroups,
having_qual,
0.0, 0.0,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
if (!rollup->is_hashed)
is_first_sort = false;
}
@@ -3117,7 +3120,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
sort_path.startup_cost,
sort_path.total_cost,
- sort_path.rows);
+ sort_path.rows,
+ subpath->pathtarget->width);
}
pathnode->path.total_cost += agg_path.total_cost;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4c6d6486623..64da8882082 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -999,6 +999,26 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_hashagg_disk", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of hashed aggregation plans that are expected to exceed work_mem."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_hashagg_disk,
+ true,
+ NULL, NULL, NULL
+ },
+ {
+ {"enable_groupingsets_hash_disk", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of hashed aggregation plans for groupingsets when the total size of the hash tables is expected to exceed work_mem."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_groupingsets_hash_disk,
+ false,
+ NULL, NULL, NULL
+ },
{
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of materialization."),
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 264916f9a92..2341061bdf4 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -280,6 +280,11 @@ typedef struct AggStatePerPhaseData
Sort *sortnode; /* Sort node for input ordering for phase */
ExprState *evaltrans; /* evaluation of transition functions */
+
+ /* cached variants of the compiled expression */
+ ExprState *evaltrans_cache
+ [2] /* 0: outerops; 1: TTSOpsMinimalTuple */
+ [2]; /* 0: no NULL check; 1: with NULL check */
} AggStatePerPhaseData;
/*
@@ -311,5 +316,8 @@ extern void ExecReScanAgg(AggState *node);
extern Size hash_agg_entry_size(int numAggs, Size tupleWidth,
Size transitionSpace);
+extern void hash_agg_set_limits(double hashentrysize, uint64 input_groups,
+ int used_bits, Size *mem_limit,
+ long *ngroups_limit, int *num_partitions);
#endif /* NODEAGG_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cd3ddf781f1..952fa627a60 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2079,12 +2079,31 @@ typedef struct AggState
/* these fields are used in AGG_HASHED and AGG_MIXED modes: */
bool table_filled; /* hash table filled yet? */
int num_hashes;
+ struct HashTapeInfo *hash_tapeinfo; /* metadata for spill tapes */
+ struct HashAggSpill *hash_spills; /* HashAggSpill for each grouping set,
+ exists only during first pass */
+ TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
+ List *hash_batches; /* hash batches remaining to be processed */
+ bool hash_ever_spilled; /* ever spilled during this execution? */
+ bool hash_spill_mode; /* we hit a limit during the current batch
+ and we must not create new groups */
+ Size hash_mem_limit; /* limit before spilling hash table */
+ long hash_ngroups_limit; /* limit before spilling hash table */
+ int hash_planned_partitions; /* number of partitions planned
+ for first pass */
+ double hashentrysize; /* estimate revised during execution */
+ Size hash_mem_peak; /* peak hash table memory usage */
+ long hash_ngroups_current; /* number of groups currently in
+ memory in all hash tables */
+ uint64 hash_disk_used; /* kB of disk space used */
+ int hash_batches_used; /* batches used during entire execution */
+
AggStatePerHash perhash; /* array of per-hashtable data */
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 34
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 48
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index cb012ba1980..735ba096503 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -54,6 +54,8 @@ extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
extern PGDLLIMPORT bool enable_hashagg;
+extern PGDLLIMPORT bool enable_hashagg_disk;
+extern PGDLLIMPORT bool enable_groupingsets_hash_disk;
extern PGDLLIMPORT bool enable_nestloop;
extern PGDLLIMPORT bool enable_material;
extern PGDLLIMPORT bool enable_mergejoin;
@@ -114,7 +116,7 @@ extern void cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples);
+ double input_tuples, double input_width);
extern void cost_windowagg(Path *path, PlannerInfo *root,
List *windowFuncs, int numPartCols, int numOrderCols,
Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index f457b5b150f..0073072a368 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2357,3 +2357,187 @@ explain (costs off)
-> Seq Scan on onek
(8 rows)
+--
+-- Hash Aggregation Spill tests
+--
+set enable_sort=false;
+set work_mem='64kB';
+select unique1, count(*), sum(twothousand) from tenk1
+group by unique1
+having sum(fivethous) > 4975
+order by sum(twothousand);
+ unique1 | count | sum
+---------+-------+------
+ 4976 | 1 | 976
+ 4977 | 1 | 977
+ 4978 | 1 | 978
+ 4979 | 1 | 979
+ 4980 | 1 | 980
+ 4981 | 1 | 981
+ 4982 | 1 | 982
+ 4983 | 1 | 983
+ 4984 | 1 | 984
+ 4985 | 1 | 985
+ 4986 | 1 | 986
+ 4987 | 1 | 987
+ 4988 | 1 | 988
+ 4989 | 1 | 989
+ 4990 | 1 | 990
+ 4991 | 1 | 991
+ 4992 | 1 | 992
+ 4993 | 1 | 993
+ 4994 | 1 | 994
+ 4995 | 1 | 995
+ 4996 | 1 | 996
+ 4997 | 1 | 997
+ 4998 | 1 | 998
+ 4999 | 1 | 999
+ 9976 | 1 | 1976
+ 9977 | 1 | 1977
+ 9978 | 1 | 1978
+ 9979 | 1 | 1979
+ 9980 | 1 | 1980
+ 9981 | 1 | 1981
+ 9982 | 1 | 1982
+ 9983 | 1 | 1983
+ 9984 | 1 | 1984
+ 9985 | 1 | 1985
+ 9986 | 1 | 1986
+ 9987 | 1 | 1987
+ 9988 | 1 | 1988
+ 9989 | 1 | 1989
+ 9990 | 1 | 1990
+ 9991 | 1 | 1991
+ 9992 | 1 | 1992
+ 9993 | 1 | 1993
+ 9994 | 1 | 1994
+ 9995 | 1 | 1995
+ 9996 | 1 | 1996
+ 9997 | 1 | 1997
+ 9998 | 1 | 1998
+ 9999 | 1 | 1999
+(48 rows)
+
+set work_mem to default;
+set enable_sort to default;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+set work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------------
+ GroupAggregate
+ Group Key: ((g % 100000))
+ -> Sort
+ Sort Key: ((g % 100000))
+ -> Function Scan on generate_series g
+(5 rows)
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+create table agg_group_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+set jit_above_cost to default;
+create table agg_group_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_group_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+-- Produce results with hash aggregation
+set enable_hashagg = true;
+set enable_sort = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 100000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+create table agg_hash_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+set jit_above_cost to default;
+create table agg_hash_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_hash_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+set enable_sort = true;
+set work_mem to default;
+-- Compare group aggregation results to hash aggregation results
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+ a | c1 | c2 | c3
+---+----+----+----
+(0 rows)
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_4 except select * from agg_group_4)
+ union all
+(select * from agg_group_4 except select * from agg_hash_4);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_group_4;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
+drop table agg_hash_4;
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index c1f802c88a7..dbe5140b558 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1633,4 +1633,126 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
| 1 | 2
(4 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low
+-- and turning on enable_groupingsets_hash_disk.
+--
+SET enable_groupingsets_hash_disk = true;
+SET work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+-- Produce results with hash aggregation.
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------
+ MixedAggregate
+ Hash Key: (g.g % 1000), (g.g % 100), (g.g % 10)
+ Hash Key: (g.g % 1000), (g.g % 100)
+ Hash Key: (g.g % 1000)
+ Hash Key: (g.g % 100), (g.g % 10)
+ Hash Key: (g.g % 100)
+ Hash Key: (g.g % 10), (g.g % 1000)
+ Hash Key: (g.g % 10)
+ Group Key: ()
+ -> Function Scan on generate_series g
+(10 rows)
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+set enable_sort = true;
+set work_mem to default;
+-- Compare results
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+ g100 | g10 | unnest | c | m
+------+-----+--------+---+---
+(0 rows)
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
+SET enable_groupingsets_hash_disk TO DEFAULT;
-- end
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1de..11c6f50fbfa 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -148,6 +148,68 @@ SELECT count(*) FROM
4
(1 row)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+SET enable_hashagg=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------------
+ Unique
+ -> Sort
+ Sort Key: ((g % 1000))
+ -> Function Scan on generate_series g
+(4 rows)
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_hashagg=TRUE;
+-- Produce results with hash aggregation.
+SET enable_sort=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 1000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_sort=TRUE;
+SET work_mem TO DEFAULT;
+-- Compare results
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb9057..715842b87af 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -74,7 +74,9 @@ select name, setting from pg_settings where name like 'enable%';
--------------------------------+---------
enable_bitmapscan | on
enable_gathermerge | on
+ enable_groupingsets_hash_disk | off
enable_hashagg | on
+ enable_hashagg_disk | on
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
@@ -89,7 +91,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(19 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 3e593f2d615..02578330a6f 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1032,3 +1032,134 @@ select v||'a', case when v||'a' = 'aa' then 1 else 0 end, count(*)
explain (costs off)
select 1 from tenk1
where (hundred, thousand) in (select twothousand, twothousand from onek);
+
+--
+-- Hash Aggregation Spill tests
+--
+
+set enable_sort=false;
+set work_mem='64kB';
+
+select unique1, count(*), sum(twothousand) from tenk1
+group by unique1
+having sum(fivethous) > 4975
+order by sum(twothousand);
+
+set work_mem to default;
+set enable_sort to default;
+
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+set work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+
+set jit_above_cost to default;
+
+create table agg_group_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_group_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+-- Produce results with hash aggregation
+
+set enable_hashagg = true;
+set enable_sort = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+
+set jit_above_cost to default;
+
+create table agg_hash_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_hash_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare group aggregation results to hash aggregation results
+
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+
+(select * from agg_hash_4 except select * from agg_group_4)
+ union all
+(select * from agg_group_4 except select * from agg_hash_4);
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_group_4;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
+drop table agg_hash_4;
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 95ac3fb52f6..478f49ecab5 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -441,4 +441,107 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
from unnest(array[1,1], array['a','b']) u(i,v)
group by rollup(i, v||'a') order by 1,3;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low
+-- and turning on enable_groupingsets_hash_disk.
+--
+
+SET enable_groupingsets_hash_disk = true;
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+-- Produce results with hash aggregation.
+
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare results
+
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
+
+SET enable_groupingsets_hash_disk TO DEFAULT;
+
-- end
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449e..33102744ebf 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -45,6 +45,68 @@ SELECT count(*) FROM
SELECT count(*) FROM
(SELECT DISTINCT two, four, two FROM tenk1) ss;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+SET enable_hashagg=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_hashagg=TRUE;
+
+-- Produce results with hash aggregation.
+
+SET enable_sort=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_sort=TRUE;
+
+SET work_mem TO DEFAULT;
+
+-- Compare results
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
+
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
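As a rough illustration of the spill costing that cost_agg() gains above, here
is a tiny standalone sketch (not part of the patch; all inputs are invented)
that reproduces the nbatches and recursion-depth arithmetic:

/* Standalone sketch of the cost_agg() spill estimate; numbers are made up. */
#include <math.h>
#include <stdio.h>

int
main(void)
{
	double		numGroups = 1000000.0;	/* planner's group estimate */
	double		hashentrysize = 200.0;	/* bytes/group, as from hash_agg_entry_size() */
	double		mem_limit = 4.0 * 1024 * 1024;	/* ~4MB of work_mem, in bytes */
	double		ngroups_limit = mem_limit / hashentrysize;
	double		num_partitions = 32.0;	/* as hash_agg_set_limits() might choose */
	double		input_tuples = 10000000.0;
	double		input_width = 100.0;	/* plan_width */
	double		blcksz = 8192.0;

	/* same Max() of the two ratios as in the patch */
	double		nbatches = fmax((numGroups * hashentrysize) / mem_limit,
								numGroups / ngroups_limit);

	if (nbatches > 1.0)
	{
		/* relation_byte_size() also adds per-tuple overhead; ignored here */
		double		pages = (input_tuples * input_width) / blcksz;
		double		depth = ceil(log(nbatches - 1) / log(num_partitions));

		printf("nbatches=%.0f depth=%.0f pages written=read=%.0f\n",
			   nbatches, depth, pages * depth);
	}
	return 0;
}

With these inputs the estimate comes out to roughly 48 batches and a recursion
depth of 2, i.e. each spilled page is expected to be written and read about
twice.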
On Wed, Mar 11, 2020 at 11:55:35PM -0700, Jeff Davis wrote:
* tweaked EXPLAIN output some more
Unless I (or someone else) finds something significant, this is close
to commit.
Thanks for working on this ; I finally made a pass over the patch.
+++ b/doc/src/sgml/config.sgml
+ <term><varname>enable_groupingsets_hash_disk</varname> (<type>boolean</type>)
+ Enables or disables the query planner's use of hashed aggregation for
+ grouping sets when the size of the hash tables is expected to exceed
+ <varname>work_mem</varname>. See <xref
+ linkend="queries-grouping-sets"/>. Note that this setting only
+ affects the chosen plan; execution time may still require using
+ disk-based hash aggregation. ...
...
+ <term><varname>enable_hashagg_disk</varname> (<type>boolean</type>)
+ ... This only affects the planner choice;
+ execution time may still require using disk-based hash
+ aggregation. The default is <literal>on</literal>.
I don't understand what's meant by "the chosen plan".
Should it say, "at execution ..." instead of "execution time" ?
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the memory usage is expected to exceed
Either remove "plan types" for consistency with enable_groupingsets_hash_disk,
Or add it there. Maybe it should say "when the memory usage would OTHERWISE BE
expected to exceed.."
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
I see this partially duplicates my patch [0] to show memory stats for (at
Andres' suggestion) all of execGrouping.c. Perhaps you'd consider naming the
function something more generic in case my patch progresses ? I'm using:
|show_tuplehash_info(HashTableInstrumentation *inst, ExplainState *es);
Mine also shows:
|ExplainPropertyInteger("Original Hash Buckets", NULL,
|ExplainPropertyInteger("Peak Memory Usage (hashtable)", "kB",
|ExplainPropertyInteger("Peak Memory Usage (tuples)", "kB",
[0]: /messages/by-id/20200306213310.GM684@telsasoft.com
You added hash_mem_peak and hash_batches_used to struct AggState.
In my 0001 patch, I added instrumentation to struct TupleHashTable, and in my
0005 patch I move it into AggStatePerHashData and other State nodes.
+ if (from_tape)
+ partition_mem += HASHAGG_READ_BUFFER_SIZE;
+ partition_mem = npartitions * HASHAGG_WRITE_BUFFER_SIZE;
=> That looks wrong; should say += ?
+ gettext_noop("Enables the planner's use of hashed aggregation plans that are expected to exceed work_mem."),
should say:
"when the memory usage is otherwise be expected to exceed.."
--
Justin
On Thu, 2020-03-12 at 16:01 -0500, Justin Pryzby wrote:
I don't understand what's meant by "the chosen plan".
Should it say, "at execution ..." instead of "execution time"?
I removed that wording; hopefully it's more clear without it?
Either remove "plan types" for consistency with
enable_groupingsets_hash_disk,
Or add it there. Maybe it should say "when the memory usage would
OTHERWISE BE
expected to exceed.."
I added "plan types".
I don't think "otherwise be..." would quite work there. "Otherwise"
sounds to me like it's referring to another plan type (e.g.
Sort+GroupAgg), and that doesn't fit.
It's probably best to leave that level of detail out of the docs. I
think the main use case for enable_hashagg_disk is for users who
experience some plan changes and want the old behavior which favors
Sort when there are a lot of groups.
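For example (illustrative only; these are the two GUCs added by the attached
patch), someone who wants the pre-patch planner behavior back could simply do:

    -- planner no longer picks HashAgg when it expects to exceed work_mem
    SET enable_hashagg_disk = off;

    -- same idea for grouping sets; this one already defaults to off
    SET enable_groupingsets_hash_disk = off;

As the doc text in the patch says, these only affect the planner's choice; if
the estimates are wrong, execution can still spill to disk.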
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;

I see this partially duplicates my patch [0] to show memory stats for
...
You added hash_mem_peak and hash_batches_used to struct AggState.
In my 0001 patch, I added instrumentation to struct TupleHashTable
I replied in that thread and I'm not sure that tracking the memory in
the TupleHashTable is the right approach. The group keys and the
transition state data can't be estimated easily that way. Perhaps we
can do that if the THT owns the memory contexts (and can call
MemoryContextMemAllocated()), rather than using passed-in ones, but
that might require more discussion. (I'm open to that idea, by the
way.)
Also, my patch also considers the buffer space, so would that be a
third memory number?
For now, I think I'll leave the way I report it in a simpler form and
we can change it later as we sort out these details. That leaves mine
specific to HashAgg, but we can always refactor it later.
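For reference, with the patch as attached the new fields show up under the
HashAggregate node roughly like this (text format; the numbers and exact
layout here are only illustrative):

    EXPLAIN (ANALYZE, COSTS OFF)
      SELECT g % 10000 AS k, count(*)
      FROM generate_series(1, 100000) g GROUP BY 1;

    HashAggregate (actual time=... rows=10000 loops=1)
      Group Key: (g % 10000)
      Peak Memory Usage: 4177kB
      Disk Usage: 2432kB
      HashAgg Batches: 16
      ->  Function Scan on generate_series g (...)

"Planned Partitions" additionally appears when costs are shown and spilling
was anticipated up front; "Disk Usage" and "HashAgg Batches" appear only when
the node actually spilled.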
I did change my code to put the metacontext in a child context of its
own so that I could call MemoryContextMemAllocated() on it to include
it in the memory total, and that will make reporting it separately
easier when we want to do so.
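Concretely, that corresponds to the following pattern in the attached patch
(condensed here as a sketch, not the exact code):

    /* dedicated child context holding only the hash table metadata */
    aggstate->hash_metacxt = AllocSetContextCreate(estate->es_query_cxt,
                                                   "HashAgg meta context",
                                                   ALLOCSET_DEFAULT_SIZES);

    /* when checking limits / updating metrics, total each piece separately */
    meta_mem = MemoryContextMemAllocated(aggstate->hash_metacxt, true);
    hash_mem = MemoryContextMemAllocated(
        aggstate->hashcontext->ecxt_per_tuple_memory, true);

    if (meta_mem + hash_mem > aggstate->hash_mem_limit)
        hash_agg_enter_spill_mode(aggstate);

Because the metadata lives in its own context, its allocation can be totalled
(and eventually reported) independently of the group keys and transition
states.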
+ if (from_tape)
+ partition_mem += HASHAGG_READ_BUFFER_SIZE;
+ partition_mem = npartitions * HASHAGG_WRITE_BUFFER_SIZE;

=> That looks wrong; should say += ?
Good catch! Fixed.
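For what it's worth, in the attached version the accounting in
hash_agg_update_metrics() now reads:

    /* memory for read/write tape buffers, if spilled */
    buffer_mem = npartitions * HASHAGG_WRITE_BUFFER_SIZE;
    if (from_tape)
        buffer_mem += HASHAGG_READ_BUFFER_SIZE;

so the write buffers are counted once per partition and the read buffer is
added on top only when the input comes from a tape.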
Regards,
Jeff Davis
Attachments:
hashagg-20200315.patch (text/x-patch)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3cac340f323..9f7f7736665 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4462,6 +4462,23 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-groupingsets-hash-disk" xreflabel="enable_groupingsets_hash_disk">
+ <term><varname>enable_groupingsets_hash_disk</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_groupingsets_hash_disk</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation plan
+ types for grouping sets when the total size of the hash tables is
+ expected to exceed <varname>work_mem</varname>. See <xref
+ linkend="queries-grouping-sets"/>. The default is
+ <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashagg" xreflabel="enable_hashagg">
<term><varname>enable_hashagg</varname> (<type>boolean</type>)
<indexterm>
@@ -4476,6 +4493,21 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-hashagg-disk" xreflabel="enable_hashagg_disk">
+ <term><varname>enable_hashagg_disk</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_hashagg_disk</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the memory usage is expected to exceed
+ <varname>work_mem</varname>. The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
<term><varname>enable_hashjoin</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d901dc4a50e..58141d8393c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -104,6 +104,7 @@ static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_hashagg_info(AggState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
static void show_instrumentation_count(const char *qlabel, int which,
@@ -1882,6 +1883,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Agg:
show_agg_keys(castNode(AggState, planstate), ancestors, es);
show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+ show_hashagg_info((AggState *) planstate, es);
if (plan->qual)
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
@@ -2769,6 +2771,41 @@ show_hash_info(HashState *hashstate, ExplainState *es)
}
}
+/*
+ * Show information on hash aggregate memory usage and batches.
+ */
+static void
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+
+ Assert(IsA(aggstate, AggState));
+
+ if (agg->aggstrategy != AGG_HASHED &&
+ agg->aggstrategy != AGG_MIXED)
+ return;
+
+ if (es->costs && aggstate->hash_planned_partitions > 0)
+ {
+ ExplainPropertyInteger("Planned Partitions", NULL,
+ aggstate->hash_planned_partitions, es);
+ }
+
+ if (!es->analyze)
+ return;
+
+ /* EXPLAIN ANALYZE */
+ ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
+ if (aggstate->hash_batches_used > 0)
+ {
+ ExplainPropertyInteger("Disk Usage", "kB",
+ aggstate->hash_disk_used, es);
+ ExplainPropertyInteger("HashAgg Batches", NULL,
+ aggstate->hash_batches_used, es);
+ }
+}
+
/*
* If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
*/
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 7aebb247d88..2d8efb7731a 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -194,6 +194,29 @@
* transition values. hashcontext is the single context created to support
* all hash tables.
*
+ * Spilling To Disk
+ *
+ * When performing hash aggregation, if the hash table memory exceeds the
+ * limit (see hash_agg_check_limits()), we enter "spill mode". In spill
+ * mode, we advance the transition states only for groups already in the
+ * hash table. For tuples that would need to create new hash table
+ * entries (and initialize new transition states), we instead spill them to
+ * disk to be processed later. The tuples are spilled in a partitioned
+ * manner, so that subsequent batches are smaller and less likely to exceed
+ * work_mem (if a batch does exceed work_mem, it must be spilled
+ * recursively).
+ *
+ * Spilled data is written to logical tapes. These provide better control
+ * over memory usage, disk space, and the number of files than if we were
+ * to use a BufFile for each spill.
+ *
+ * Note that it's possible for transition states to start small but then
+ * grow very large; for instance in the case of ARRAY_AGG. In such cases,
+ * it's still possible to significantly exceed work_mem. We try to avoid
+ * this situation by estimating what will fit in the available memory, and
+ * imposing a limit on the number of groups separately from the amount of
+ * memory consumed.
+ *
* Transition / Combine function invocation:
*
* For performance reasons transition functions, including combine
@@ -233,12 +256,101 @@
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/datum.h"
+#include "utils/dynahash.h"
#include "utils/expandeddatum.h"
+#include "utils/logtape.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+/*
+ * Control how many partitions are created when spilling HashAgg to
+ * disk.
+ *
+ * HASHAGG_PARTITION_FACTOR is multiplied by the estimated number of
+ * partitions needed such that each partition will fit in memory. The factor
+ * is set higher than one because there's not a high cost to having a few too
+ * many partitions, and it makes it less likely that a partition will need to
+ * be spilled recursively. Another benefit of having more, smaller partitions
+ * is that small hash tables may perform better than large ones due to memory
+ * caching effects.
+ *
+ * We also specify a min and max number of partitions per spill. Too few might
+ * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
+ * many will result in lots of memory wasted buffering the spill files (which
+ * could instead be spent on a larger hash table).
+ *
+ * For reading from tapes, the buffer size must be a multiple of
+ * BLCKSZ. Larger values help when reading from multiple tapes concurrently,
+ * but that doesn't happen in HashAgg, so we simply use BLCKSZ. Writing to a
+ * tape always uses a buffer of size BLCKSZ.
+ */
+#define HASHAGG_PARTITION_FACTOR 1.50
+#define HASHAGG_MIN_PARTITIONS 4
+#define HASHAGG_MAX_PARTITIONS 256
+#define HASHAGG_MIN_BUCKETS 256
+#define HASHAGG_READ_BUFFER_SIZE BLCKSZ
+#define HASHAGG_WRITE_BUFFER_SIZE BLCKSZ
+
+/*
+ * Track all tapes needed for a HashAgg that spills. We don't know the maximum
+ * number of tapes needed at the start of the algorithm (because it can
+ * recurse), so one tape set is allocated and extended as needed for new
+ * tapes. When a particular tape is already read, rewind it for write mode and
+ * put it in the free list.
+ *
+ * Tapes' buffers can take up substantial memory when many tapes are open at
+ * once. We only need one tape open at a time in read mode (using a buffer
+ * that's a multiple of BLCKSZ); but we need one tape open in write mode (each
+ * requiring a buffer of size BLCKSZ) for each partition.
+ */
+typedef struct HashTapeInfo
+{
+ LogicalTapeSet *tapeset;
+ int ntapes;
+ int *freetapes;
+ int nfreetapes;
+ int freetapes_alloc;
+} HashTapeInfo;
+
+/*
+ * Represents partitioned spill data for a single hashtable. Contains the
+ * necessary information to route tuples to the correct partition, and to
+ * transform the spilled data into new batches.
+ *
+ * The high bits are used for partition selection (when recursing, we ignore
+ * the bits that have already been used for partition selection at an earlier
+ * level).
+ */
+typedef struct HashAggSpill
+{
+ LogicalTapeSet *tapeset; /* borrowed reference to tape set */
+ int npartitions; /* number of partitions */
+ int *partitions; /* spill partition tape numbers */
+ int64 *ntuples; /* number of tuples in each partition */
+ uint32 mask; /* mask to find partition from hash value */
+ int shift; /* after masking, shift by this amount */
+} HashAggSpill;
+
+/*
+ * Represents work to be done for one pass of hash aggregation (with only one
+ * grouping set).
+ *
+ * Also tracks the bits of the hash already used for partition selection by
+ * earlier iterations, so that this batch can use new bits. If all bits have
+ * already been used, no partitioning will be done (any spilled data will go
+ * to a single output tape).
+ */
+typedef struct HashAggBatch
+{
+ int setno; /* grouping set */
+ int used_bits; /* number of bits of hash already used */
+ LogicalTapeSet *tapeset; /* borrowed reference to tape set */
+ int input_tapenum; /* input partition tape */
+ int64 input_tuples; /* number of tuples in this batch */
+} HashAggBatch;
+
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
static void initialize_phase(AggState *aggstate, int newphase);
static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -275,11 +387,43 @@ static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
static void build_hash_tables(AggState *aggstate);
static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
+static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
+ bool nullcheck);
+static long hash_choose_num_buckets(double hashentrysize,
+ long estimated_nbuckets,
+ Size memory);
+static int hash_choose_num_partitions(uint64 input_groups,
+ double hashentrysize,
+ int used_bits,
+ int *log2_npartitions);
static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
+static bool agg_refill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
+static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
+static void hash_agg_check_limits(AggState *aggstate);
+static void hash_agg_enter_spill_mode(AggState *aggstate);
+static void hash_agg_update_metrics(AggState *aggstate, bool from_tape,
+ int npartitions);
+static void hashagg_finish_initial_spills(AggState *aggstate);
+static void hashagg_reset_spill_state(AggState *aggstate);
+static HashAggBatch *hashagg_batch_new(LogicalTapeSet *tapeset,
+ int input_tapenum, int setno,
+ int64 input_tuples, int used_bits);
+static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
+static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
+ int used_bits, uint64 input_tuples,
+ double hashentrysize);
+static Size hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot,
+ uint32 hash);
+static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
+ int setno);
+static void hashagg_tapeinfo_init(AggState *aggstate);
+static void hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *dest,
+ int ndest);
+static void hashagg_tapeinfo_release(HashTapeInfo *tapeinfo, int tapenum);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
AggState *aggstate, EState *estate,
@@ -1275,9 +1419,9 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
* We have a separate hashtable and associated perhash data structure for each
* grouping set for which we're doing hashing.
*
- * The contents of the hash tables always live in the hashcontext's per-tuple
- * memory context (there is only one of these for all tables together, since
- * they are all reset at the same time).
+ * The hash tables and their contents always live in the hashcontext's
+ * per-tuple memory context (there is only one of these for all tables
+ * together, since they are all reset at the same time).
*/
static void
build_hash_tables(AggState *aggstate)
@@ -1287,14 +1431,27 @@ build_hash_tables(AggState *aggstate)
for (setno = 0; setno < aggstate->num_hashes; ++setno)
{
AggStatePerHash perhash = &aggstate->perhash[setno];
+ long nbuckets;
+ Size memory;
+
+ if (perhash->hashtable != NULL)
+ {
+ ResetTupleHashTable(perhash->hashtable);
+ continue;
+ }
Assert(perhash->aggnode->numGroups > 0);
- if (perhash->hashtable)
- ResetTupleHashTable(perhash->hashtable);
- else
- build_hash_table(aggstate, setno, perhash->aggnode->numGroups);
+ memory = aggstate->hash_mem_limit / aggstate->num_hashes;
+
+ /* choose reasonable number of buckets per hashtable */
+ nbuckets = hash_choose_num_buckets(
+ aggstate->hashentrysize, perhash->aggnode->numGroups, memory);
+
+ build_hash_table(aggstate, setno, nbuckets);
}
+
+ aggstate->hash_ngroups_current = 0;
}
/*
@@ -1304,7 +1461,7 @@ static void
build_hash_table(AggState *aggstate, int setno, long nbuckets)
{
AggStatePerHash perhash = &aggstate->perhash[setno];
- MemoryContext metacxt = aggstate->ss.ps.state->es_query_cxt;
+ MemoryContext metacxt = aggstate->hash_metacxt;
MemoryContext hashcxt = aggstate->hashcontext->ecxt_per_tuple_memory;
MemoryContext tmpcxt = aggstate->tmpcontext->ecxt_per_tuple_memory;
Size additionalsize;
@@ -1487,14 +1644,326 @@ hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
transitionSpace;
}
+/*
+ * hashagg_recompile_expressions()
+ *
+ * Identifies the right phase, compiles the right expression given the
+ * arguments, and then sets phase->evalfunc to that expression.
+ *
+ * Different versions of the compiled expression are needed depending on
+ * whether hash aggregation has spilled or not, and whether it's reading from
+ * the outer plan or a tape. Before spilling to disk, the expression reads
+ * from the outer plan and does not need to perform a NULL check. After
+ * HashAgg begins to spill, new groups will not be created in the hash table,
+ * and the AggStatePerGroup array may be NULL; therefore we need to add a null
+ * pointer check to the expression. Then, when reading spilled data from a
+ * tape, we change the outer slot type to be a fixed minimal tuple slot.
+ *
+ * It would be wasteful to recompile every time, so cache the compiled
+ * expressions in the AggStatePerPhase, and reuse when appropriate.
+ */
+static void
+hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
+{
+ AggStatePerPhase phase;
+ int i = minslot ? 1 : 0;
+ int j = nullcheck ? 1 : 0;
+
+ Assert(aggstate->aggstrategy == AGG_HASHED ||
+ aggstate->aggstrategy == AGG_MIXED);
+
+ if (aggstate->aggstrategy == AGG_HASHED)
+ phase = &aggstate->phases[0];
+ else /* AGG_MIXED */
+ phase = &aggstate->phases[1];
+
+ if (phase->evaltrans_cache[i][j] == NULL)
+ {
+ const TupleTableSlotOps *outerops = aggstate->ss.ps.outerops;
+ bool outerfixed = aggstate->ss.ps.outeropsfixed;
+ bool dohash = true;
+ bool dosort;
+
+ dosort = aggstate->aggstrategy == AGG_MIXED ? true : false;
+
+ /* temporarily change the outerops while compiling the expression */
+ if (minslot)
+ {
+ aggstate->ss.ps.outerops = &TTSOpsMinimalTuple;
+ aggstate->ss.ps.outeropsfixed = true;
+ }
+
+ phase->evaltrans_cache[i][j] = ExecBuildAggTrans(
+ aggstate, phase, dosort, dohash, nullcheck);
+
+ /* change back */
+ aggstate->ss.ps.outerops = outerops;
+ aggstate->ss.ps.outeropsfixed = outerfixed;
+ }
+
+ phase->evaltrans = phase->evaltrans_cache[i][j];
+}
+
+/*
+ * Set limits that trigger spilling to avoid exceeding work_mem. Consider the
+ * number of partitions we expect to create (if we do spill).
+ *
+ * There are two limits: a memory limit, and also an ngroups limit. The
+ * ngroups limit becomes important when we expect transition values to grow
+ * substantially larger than the initial value.
+ */
+void
+hash_agg_set_limits(double hashentrysize, uint64 input_groups, int used_bits,
+ Size *mem_limit, uint64 *ngroups_limit,
+ int *num_partitions)
+{
+ int npartitions;
+ Size partition_mem;
+
+ /* if not expected to spill, use all of work_mem */
+ if (input_groups * hashentrysize < work_mem * 1024L)
+ {
+ *mem_limit = work_mem * 1024L;
+ *ngroups_limit = *mem_limit / hashentrysize;
+ return;
+ }
+
+ /*
+ * Calculate expected memory requirements for spilling, which is the size
+ * of the buffers needed for all the tapes that need to be open at
+ * once. Then, subtract that from the memory available for holding hash
+ * tables.
+ */
+ npartitions = hash_choose_num_partitions(input_groups,
+ hashentrysize,
+ used_bits,
+ NULL);
+ if (num_partitions != NULL)
+ *num_partitions = npartitions;
+
+ partition_mem =
+ HASHAGG_READ_BUFFER_SIZE +
+ HASHAGG_WRITE_BUFFER_SIZE * npartitions;
+
+ /*
+ * Don't set the limit below 3/4 of work_mem. In that case, we are at the
+ * minimum number of partitions, so we aren't going to dramatically exceed
+ * work mem anyway.
+ */
+ if (work_mem * 1024L > 4 * partition_mem)
+ *mem_limit = work_mem * 1024L - partition_mem;
+ else
+ *mem_limit = work_mem * 1024L * 0.75;
+
+ if (*mem_limit > hashentrysize)
+ *ngroups_limit = *mem_limit / hashentrysize;
+ else
+ *ngroups_limit = 1;
+}
+
+/*
+ * hash_agg_check_limits
+ *
+ * After adding a new group to the hash table, check whether we need to enter
+ * spill mode. Allocations may happen without adding new groups (for instance,
+ * if the transition state size grows), so this check is imperfect.
+ */
+static void
+hash_agg_check_limits(AggState *aggstate)
+{
+ uint64 ngroups = aggstate->hash_ngroups_current;
+ Size meta_mem = MemoryContextMemAllocated(
+ aggstate->hash_metacxt, true);
+ Size hash_mem = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+
+ /*
+ * Don't spill unless there's at least one group in the hash table so we
+ * can be sure to make progress even in edge cases.
+ */
+ if (aggstate->hash_ngroups_current > 0 &&
+ (meta_mem + hash_mem > aggstate->hash_mem_limit ||
+ ngroups > aggstate->hash_ngroups_limit))
+ {
+ hash_agg_enter_spill_mode(aggstate);
+ }
+}
+
+/*
+ * Enter "spill mode", meaning that no new groups are added to any of the hash
+ * tables. Tuples that would create a new group are instead spilled, and
+ * processed later.
+ */
+static void
+hash_agg_enter_spill_mode(AggState *aggstate)
+{
+ aggstate->hash_spill_mode = true;
+ hashagg_recompile_expressions(aggstate, aggstate->table_filled, true);
+
+ if (!aggstate->hash_ever_spilled)
+ {
+ Assert(aggstate->hash_tapeinfo == NULL);
+ Assert(aggstate->hash_spills == NULL);
+
+ aggstate->hash_ever_spilled = true;
+
+ hashagg_tapeinfo_init(aggstate);
+
+ aggstate->hash_spills = palloc(
+ sizeof(HashAggSpill) * aggstate->num_hashes);
+
+ for (int setno = 0; setno < aggstate->num_hashes; setno++)
+ {
+ AggStatePerHash perhash = &aggstate->perhash[setno];
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+
+ hashagg_spill_init(spill, aggstate->hash_tapeinfo, 0,
+ perhash->aggnode->numGroups,
+ aggstate->hashentrysize);
+ }
+ }
+}
+
+/*
+ * Update metrics after filling the hash table.
+ *
+ * If reading from the outer plan, from_tape should be false; if reading from
+ * another tape, from_tape should be true.
+ */
+static void
+hash_agg_update_metrics(AggState *aggstate, bool from_tape, int npartitions)
+{
+ Size meta_mem;
+ Size hash_mem;
+ Size buffer_mem;
+ Size total_mem;
+
+ if (aggstate->aggstrategy != AGG_MIXED &&
+ aggstate->aggstrategy != AGG_HASHED)
+ return;
+
+ /* memory for the hash table itself */
+ meta_mem = MemoryContextMemAllocated(aggstate->hash_metacxt, true);
+
+ /* memory for the group keys and transition states */
+ hash_mem = MemoryContextMemAllocated(
+ aggstate->hashcontext->ecxt_per_tuple_memory, true);
+
+ /* memory for read/write tape buffers, if spilled */
+ buffer_mem = npartitions * HASHAGG_WRITE_BUFFER_SIZE;
+ if (from_tape)
+ buffer_mem += HASHAGG_READ_BUFFER_SIZE;
+
+ /* update peak mem */
+ total_mem = meta_mem + hash_mem + buffer_mem;
+ if (total_mem > aggstate->hash_mem_peak)
+ aggstate->hash_mem_peak = total_mem;
+
+ /* update disk usage */
+ if (aggstate->hash_tapeinfo != NULL)
+ {
+ uint64 disk_used = LogicalTapeSetBlocks(
+ aggstate->hash_tapeinfo->tapeset) * (BLCKSZ / 1024);
+
+ if (aggstate->hash_disk_used < disk_used)
+ aggstate->hash_disk_used = disk_used;
+ }
+
+ /* update hashentrysize estimate based on contents */
+ if (aggstate->hash_ngroups_current > 0)
+ {
+ aggstate->hashentrysize =
+ hash_mem / (double)aggstate->hash_ngroups_current;
+ }
+}
+
+/*
+ * Choose a reasonable number of buckets for the initial hash table size.
+ */
+static long
+hash_choose_num_buckets(double hashentrysize, long ngroups, Size memory)
+{
+ long max_nbuckets;
+ long nbuckets = ngroups;
+
+ max_nbuckets = memory / hashentrysize;
+
+ /*
+ * Leave room for slop to avoid a case where the initial hash table size
+ * exceeds the memory limit (though that may still happen in edge cases).
+ */
+ max_nbuckets *= 0.75;
+
+ if (nbuckets > max_nbuckets)
+ nbuckets = max_nbuckets;
+ if (nbuckets < HASHAGG_MIN_BUCKETS)
+ nbuckets = HASHAGG_MIN_BUCKETS;
+ return nbuckets;
+}
+
+/*
+ * Determine the number of partitions to create when spilling, which will
+ * always be a power of two. If log2_npartitions is non-NULL, set
+ * *log2_npartitions to the log2() of the number of partitions.
+ */
+static int
+hash_choose_num_partitions(uint64 input_groups, double hashentrysize,
+ int used_bits, int *log2_npartitions)
+{
+ Size mem_wanted;
+ int partition_limit;
+ int npartitions;
+ int partition_bits;
+
+ /*
+ * Avoid creating so many partitions that the memory requirements of the
+ * open partition files are greater than 1/4 of work_mem.
+ */
+ partition_limit =
+ (work_mem * 1024L * 0.25 - HASHAGG_READ_BUFFER_SIZE) /
+ HASHAGG_WRITE_BUFFER_SIZE;
+
+ mem_wanted = HASHAGG_PARTITION_FACTOR * input_groups * hashentrysize;
+
+ /* make enough partitions so that each one is likely to fit in memory */
+ npartitions = 1 + (mem_wanted / (work_mem * 1024L));
+
+ if (npartitions > partition_limit)
+ npartitions = partition_limit;
+
+ if (npartitions < HASHAGG_MIN_PARTITIONS)
+ npartitions = HASHAGG_MIN_PARTITIONS;
+ if (npartitions > HASHAGG_MAX_PARTITIONS)
+ npartitions = HASHAGG_MAX_PARTITIONS;
+
+ /* ceil(log2(npartitions)) */
+ partition_bits = my_log2(npartitions);
+
+ /* make sure that we don't exhaust the hash bits */
+ if (partition_bits + used_bits >= 32)
+ partition_bits = 32 - used_bits;
+
+ if (log2_npartitions != NULL)
+ *log2_npartitions = partition_bits;
+
+ /* number of partitions will be a power of two */
+ npartitions = 1L << partition_bits;
+
+ return npartitions;
+}
+
/*
* Find or create a hashtable entry for the tuple group containing the current
* tuple (already set in tmpcontext's outertuple slot), in the current grouping
* set (which the caller must have selected - note that initialize_aggregate
* depends on this).
*
- * When called, CurrentMemoryContext should be the per-query context. The
- * already-calculated hash value for the tuple must be specified.
+ * When called, CurrentMemoryContext should be the per-query context.
+ *
+ * If the hash table is at the memory limit, then only find existing hashtable
+ * entries; don't create new ones. If a tuple's group is not already present
+ * in the hash table for the current grouping set, return NULL and the caller
+ * will spill it to disk.
*/
static AggStatePerGroup
lookup_hash_entry(AggState *aggstate, uint32 hash)
@@ -1502,16 +1971,26 @@ lookup_hash_entry(AggState *aggstate, uint32 hash)
AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
TupleTableSlot *hashslot = perhash->hashslot;
TupleHashEntryData *entry;
- bool isnew;
+ bool isnew = false;
+ bool *p_isnew;
+
+ /* if hash table already spilled, don't create new entries */
+ p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;
/* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, &isnew,
+ entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, p_isnew,
hash);
+ if (entry == NULL)
+ return NULL;
+
if (isnew)
{
- AggStatePerGroup pergroup;
- int transno;
+ AggStatePerGroup pergroup;
+ int transno;
+
+ aggstate->hash_ngroups_current++;
+ hash_agg_check_limits(aggstate);
pergroup = (AggStatePerGroup)
MemoryContextAlloc(perhash->hashtable->tablecxt,
@@ -1539,23 +2018,48 @@ lookup_hash_entry(AggState *aggstate, uint32 hash)
* returning an array of pergroup pointers suitable for advance_aggregates.
*
* Be aware that lookup_hash_entry can reset the tmpcontext.
+ *
+ * Some entries may be left NULL if we have reached the limit and have begun
+ * to spill. The same tuple will belong to different groups for each set, so
+ * may match a group already in memory for one set and match a group not in
+ * memory for another set. If we have begun to spill and a tuple doesn't match
+ * a group in memory for a particular set, it will be spilled.
+ *
+ * NB: It's possible to spill the same tuple for several different grouping
+ * sets. This may seem wasteful, but it's actually a trade-off: if we spill
+ * the tuple multiple times for multiple grouping sets, it can be partitioned
+ * for each grouping set, making the refilling of the hash table very
+ * efficient.
*/
static void
lookup_hash_entries(AggState *aggstate)
{
- int numHashes = aggstate->num_hashes;
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
int setno;
- for (setno = 0; setno < numHashes; setno++)
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
{
- AggStatePerHash perhash = &aggstate->perhash[setno];
+ AggStatePerHash perhash = &aggstate->perhash[setno];
uint32 hash;
select_current_set(aggstate, setno, true);
prepare_hash_slot(aggstate);
hash = TupleHashTableHash(perhash->hashtable, perhash->hashslot);
pergroup[setno] = lookup_hash_entry(aggstate, hash);
+
+ /* check to see if we need to spill the tuple for this grouping set */
+ if (pergroup[setno] == NULL)
+ {
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ TupleTableSlot *slot = aggstate->tmpcontext->ecxt_outertuple;
+
+ if (spill->partitions == NULL)
+ hashagg_spill_init(spill, aggstate->hash_tapeinfo, 0,
+ perhash->aggnode->numGroups,
+ aggstate->hashentrysize);
+
+ hashagg_spill_tuple(spill, slot, hash);
+ }
}
}
@@ -1878,6 +2382,12 @@ agg_retrieve_direct(AggState *aggstate)
if (TupIsNull(outerslot))
{
/* no more outer-plan tuples available */
+
+ /* if we built hash tables, finalize any spills */
+ if (aggstate->aggstrategy == AGG_MIXED &&
+ aggstate->current_phase == 1)
+ hashagg_finish_initial_spills(aggstate);
+
if (hasGroupingSets)
{
aggstate->input_done = true;
@@ -1980,6 +2490,10 @@ agg_fill_hash_table(AggState *aggstate)
ResetExprContext(aggstate->tmpcontext);
}
+ /* finalize spills, if any */
+ hashagg_finish_initial_spills(aggstate);
+
+ aggstate->input_done = true;
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
@@ -1987,11 +2501,171 @@ agg_fill_hash_table(AggState *aggstate)
&aggstate->perhash[0].hashiter);
}
+/*
+ * If any data was spilled during hash aggregation, reset the hash table and
+ * reprocess one batch of spilled data. After reprocessing a batch, the hash
+ * table will again contain data, ready to be consumed by
+ * agg_retrieve_hash_table_in_memory().
+ *
+ * Should only be called after all in memory hash table entries have been
+ * finalized and emitted.
+ *
+ * Return false when input is exhausted and there's no more work to be done;
+ * otherwise return true.
+ */
+static bool
+agg_refill_hash_table(AggState *aggstate)
+{
+ HashAggBatch *batch;
+ HashAggSpill spill;
+ HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
+ uint64 ngroups_estimate;
+
+ if (aggstate->hash_batches == NIL)
+ return false;
+
+ batch = linitial(aggstate->hash_batches);
+ aggstate->hash_batches = list_delete_first(aggstate->hash_batches);
+
+ /*
+ * Estimate the number of groups for this batch as the total number of
+ * tuples in its input file. Although that's a worst case, it's not bad
+ * here for two reasons: (1) overestimating is better than
+ * underestimating; and (2) we've already scanned the relation once, so
+ * it's likely that we've already finalized many of the common values.
+ */
+ ngroups_estimate = batch->input_tuples;
+
+ hashagg_spill_init(&spill, tapeinfo, batch->used_bits,
+ ngroups_estimate, aggstate->hashentrysize);
+
+ hash_agg_set_limits(aggstate->hashentrysize, ngroups_estimate,
+ batch->used_bits, &aggstate->hash_mem_limit,
+ &aggstate->hash_ngroups_limit, NULL);
+
+ /* there could be residual pergroup pointers; clear them */
+ for (int setoff = 0;
+ setoff < aggstate->maxsets + aggstate->num_hashes;
+ setoff++)
+ aggstate->all_pergroups[setoff] = NULL;
+
+ /* free memory and reset hash tables */
+ ReScanExprContext(aggstate->hashcontext);
+ for (int setno = 0; setno < aggstate->num_hashes; setno++)
+ ResetTupleHashTable(aggstate->perhash[setno].hashtable);
+
+ aggstate->hash_ngroups_current = 0;
+
+ /*
+ * In AGG_MIXED mode, hash aggregation happens in phase 1 and the output
+ * happens in phase 0. So, we switch to phase 1 when processing a batch,
+ * and back to phase 0 after the batch is done.
+ */
+ Assert(aggstate->current_phase == 0);
+ if (aggstate->phase->aggstrategy == AGG_MIXED)
+ {
+ aggstate->current_phase = 1;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+ }
+
+ select_current_set(aggstate, batch->setno, true);
+
+ /*
+ * Spilled tuples are always read back as MinimalTuples, which may be
+ * different from the outer plan, so recompile the aggregate expressions.
+ *
+ * We still need the NULL check, because we are only processing one
+ * grouping set at a time and the rest will be NULL.
+ */
+ hashagg_recompile_expressions(aggstate, true, true);
+
+ LogicalTapeRewindForRead(tapeinfo->tapeset, batch->input_tapenum,
+ HASHAGG_READ_BUFFER_SIZE);
+ for (;;) {
+ TupleTableSlot *slot = aggstate->hash_spill_slot;
+ MinimalTuple tuple;
+ uint32 hash;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tuple = hashagg_batch_read(batch, &hash);
+ if (tuple == NULL)
+ break;
+
+ ExecStoreMinimalTuple(tuple, slot, true);
+ aggstate->tmpcontext->ecxt_outertuple = slot;
+
+ prepare_hash_slot(aggstate);
+ aggstate->hash_pergroup[batch->setno] = lookup_hash_entry(aggstate, hash);
+
+ /* if there's no memory for a new group, spill */
+ if (aggstate->hash_pergroup[batch->setno] == NULL)
+ hashagg_spill_tuple(&spill, slot, hash);
+
+ /* Advance the aggregates (or combine functions) */
+ advance_aggregates(aggstate);
+
+ /*
+ * Reset per-input-tuple context after each tuple, but note that the
+ * hash lookups do this too
+ */
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ hashagg_tapeinfo_release(tapeinfo, batch->input_tapenum);
+
+ /* change back to phase 0 */
+ aggstate->current_phase = 0;
+ aggstate->phase = &aggstate->phases[aggstate->current_phase];
+
+ hash_agg_update_metrics(aggstate, true, spill.npartitions);
+ hashagg_spill_finish(aggstate, &spill, batch->setno);
+ aggstate->hash_spill_mode = false;
+
+ /* prepare to walk the first hash table */
+ select_current_set(aggstate, batch->setno, true);
+ ResetTupleHashIterator(aggstate->perhash[batch->setno].hashtable,
+ &aggstate->perhash[batch->setno].hashiter);
+
+ pfree(batch);
+
+ return true;
+}
+
/*
* ExecAgg for hashed case: retrieving groups from hash table
+ *
+ * After exhausting in-memory tuples, also try refilling the hash table using
+ * previously-spilled tuples. Only returns NULL after all in-memory and
+ * spilled tuples are exhausted.
*/
static TupleTableSlot *
agg_retrieve_hash_table(AggState *aggstate)
+{
+ TupleTableSlot *result = NULL;
+
+ while (result == NULL)
+ {
+ result = agg_retrieve_hash_table_in_memory(aggstate);
+ if (result == NULL)
+ {
+ if (!agg_refill_hash_table(aggstate))
+ {
+ aggstate->agg_done = true;
+ break;
+ }
+ }
+ }
+
+ return result;
+}
+
+/*
+ * Retrieve the groups from the in-memory hash tables without considering any
+ * spilled tuples.
+ */
+static TupleTableSlot *
+agg_retrieve_hash_table_in_memory(AggState *aggstate)
{
ExprContext *econtext;
AggStatePerAgg peragg;
@@ -2020,7 +2694,7 @@ agg_retrieve_hash_table(AggState *aggstate)
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
- while (!aggstate->agg_done)
+ for (;;)
{
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -2051,8 +2725,6 @@ agg_retrieve_hash_table(AggState *aggstate)
}
else
{
- /* No more hashtables, so done */
- aggstate->agg_done = true;
return NULL;
}
}
@@ -2109,6 +2781,315 @@ agg_retrieve_hash_table(AggState *aggstate)
return NULL;
}
+/*
+ * Initialize HashTapeInfo
+ */
+static void
+hashagg_tapeinfo_init(AggState *aggstate)
+{
+ HashTapeInfo *tapeinfo = palloc(sizeof(HashTapeInfo));
+ int init_tapes = 16; /* expanded dynamically */
+
+ tapeinfo->tapeset = LogicalTapeSetCreate(init_tapes, NULL, NULL, -1);
+ tapeinfo->ntapes = init_tapes;
+ tapeinfo->nfreetapes = init_tapes;
+ tapeinfo->freetapes_alloc = init_tapes;
+ tapeinfo->freetapes = palloc(init_tapes * sizeof(int));
+ for (int i = 0; i < init_tapes; i++)
+ tapeinfo->freetapes[i] = i;
+
+ aggstate->hash_tapeinfo = tapeinfo;
+}
+
+/*
+ * Assign unused tapes to spill partitions, extending the tape set if
+ * necessary.
+ */
+static void
+hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *partitions,
+ int npartitions)
+{
+ int partidx = 0;
+
+ /* use free tapes if available */
+ while (partidx < npartitions && tapeinfo->nfreetapes > 0)
+ partitions[partidx++] = tapeinfo->freetapes[--tapeinfo->nfreetapes];
+
+ if (partidx < npartitions)
+ {
+ LogicalTapeSetExtend(tapeinfo->tapeset, npartitions - partidx);
+
+ while (partidx < npartitions)
+ partitions[partidx++] = tapeinfo->ntapes++;
+ }
+}
+
+/*
+ * After a tape has already been written to and then read, this function
+ * rewinds it for writing and adds it to the free list.
+ */
+static void
+hashagg_tapeinfo_release(HashTapeInfo *tapeinfo, int tapenum)
+{
+ LogicalTapeRewindForWrite(tapeinfo->tapeset, tapenum);
+ if (tapeinfo->freetapes_alloc == tapeinfo->nfreetapes)
+ {
+ tapeinfo->freetapes_alloc <<= 1;
+ tapeinfo->freetapes = repalloc(
+ tapeinfo->freetapes, tapeinfo->freetapes_alloc * sizeof(int));
+ }
+ tapeinfo->freetapes[tapeinfo->nfreetapes++] = tapenum;
+}
+
+/*
+ * hashagg_spill_init
+ *
+ * Called after we determined that spilling is necessary. Chooses the number
+ * of partitions to create, and initializes them.
+ */
+static void
+hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo, int used_bits,
+ uint64 input_groups, double hashentrysize)
+{
+ int npartitions;
+ int partition_bits;
+
+ npartitions = hash_choose_num_partitions(
+ input_groups, hashentrysize, used_bits, &partition_bits);
+
+ spill->partitions = palloc0(sizeof(int) * npartitions);
+ spill->ntuples = palloc0(sizeof(int64) * npartitions);
+
+ hashagg_tapeinfo_assign(tapeinfo, spill->partitions, npartitions);
+
+ spill->tapeset = tapeinfo->tapeset;
+ spill->shift = 32 - used_bits - partition_bits;
+ spill->mask = (npartitions - 1) << spill->shift;
+ spill->npartitions = npartitions;
+}
+
+/*
+ * hashagg_spill_tuple
+ *
+ * No room for new groups in the hash table. Save for later in the appropriate
+ * partition.
+ */
+static Size
+hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot, uint32 hash)
+{
+ LogicalTapeSet *tapeset = spill->tapeset;
+ int partition;
+ MinimalTuple tuple;
+ int tapenum;
+ int total_written = 0;
+ bool shouldFree;
+
+ Assert(spill->partitions != NULL);
+
+ /* XXX: may contain unnecessary attributes, should project */
+ tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+ partition = (hash & spill->mask) >> spill->shift;
+ spill->ntuples[partition]++;
+
+ tapenum = spill->partitions[partition];
+
+ LogicalTapeWrite(tapeset, tapenum, (void *) &hash, sizeof(uint32));
+ total_written += sizeof(uint32);
+
+ LogicalTapeWrite(tapeset, tapenum, (void *) tuple, tuple->t_len);
+ total_written += tuple->t_len;
+
+ if (shouldFree)
+ pfree(tuple);
+
+ return total_written;
+}
+
+/*
+ * hashagg_batch_new
+ *
+ * Construct a HashAggBatch item, which represents one iteration of HashAgg to
+ * be done.
+ */
+static HashAggBatch *
+hashagg_batch_new(LogicalTapeSet *tapeset, int tapenum, int setno,
+ int64 input_tuples, int used_bits)
+{
+ HashAggBatch *batch = palloc0(sizeof(HashAggBatch));
+
+ batch->setno = setno;
+ batch->used_bits = used_bits;
+ batch->tapeset = tapeset;
+ batch->input_tapenum = tapenum;
+ batch->input_tuples = input_tuples;
+
+ return batch;
+}
+
+/*
+ * hashagg_batch_read
+ * read the next tuple from a batch file. Return NULL if no more.
+ */
+static MinimalTuple
+hashagg_batch_read(HashAggBatch *batch, uint32 *hashp)
+{
+ LogicalTapeSet *tapeset = batch->tapeset;
+ int tapenum = batch->input_tapenum;
+ MinimalTuple tuple;
+ uint32 t_len;
+ size_t nread;
+ uint32 hash;
+
+ nread = LogicalTapeRead(tapeset, tapenum, &hash, sizeof(uint32));
+ if (nread == 0)
+ return NULL;
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, sizeof(uint32), nread)));
+ if (hashp != NULL)
+ *hashp = hash;
+
+ nread = LogicalTapeRead(tapeset, tapenum, &t_len, sizeof(t_len));
+ if (nread != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, sizeof(uint32), nread)));
+
+ tuple = (MinimalTuple) palloc(t_len);
+ tuple->t_len = t_len;
+
+ nread = LogicalTapeRead(tapeset, tapenum,
+ (void *)((char *)tuple + sizeof(uint32)),
+ t_len - sizeof(uint32));
+ if (nread != t_len - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("unexpected EOF for tape %d: requested %zu bytes, read %zu bytes",
+ tapenum, t_len - sizeof(uint32), nread)));
+
+ return tuple;
+}
+
+/*
+ * hashagg_finish_initial_spills
+ *
+ * After a HashAggBatch has been processed, it may have spilled tuples to
+ * disk. If so, turn the spilled partitions into new batches that must later
+ * be executed.
+ */
+static void
+hashagg_finish_initial_spills(AggState *aggstate)
+{
+ int setno;
+ int total_npartitions = 0;
+
+ if (aggstate->hash_spills != NULL)
+ {
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ {
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ total_npartitions += spill->npartitions;
+ hashagg_spill_finish(aggstate, spill, setno);
+ }
+
+ /*
+ * We're not processing tuples from the outer plan any more; only
+ * processing batches of spilled tuples. The initial spill structures
+ * are no longer needed.
+ */
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+ }
+
+ hash_agg_update_metrics(aggstate, false, total_npartitions);
+ aggstate->hash_spill_mode = false;
+}
+
+/*
+ * hashagg_spill_finish
+ *
+ * Transform spill partitions into new batches.
+ */
+static void
+hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno)
+{
+ int i;
+ int used_bits = 32 - spill->shift;
+
+ if (spill->npartitions == 0)
+ return; /* didn't spill */
+
+ for (i = 0; i < spill->npartitions; i++)
+ {
+ int tapenum = spill->partitions[i];
+ HashAggBatch *new_batch;
+
+ /* if the partition is empty, don't create a new batch of work */
+ if (spill->ntuples[i] == 0)
+ continue;
+
+ new_batch = hashagg_batch_new(aggstate->hash_tapeinfo->tapeset,
+ tapenum, setno, spill->ntuples[i],
+ used_bits);
+ aggstate->hash_batches = lcons(new_batch, aggstate->hash_batches);
+ aggstate->hash_batches_used++;
+ }
+
+ pfree(spill->ntuples);
+ pfree(spill->partitions);
+}
+
+/*
+ * Free resources related to a spilled HashAgg.
+ */
+static void
+hashagg_reset_spill_state(AggState *aggstate)
+{
+ ListCell *lc;
+
+ /* free spills from initial pass */
+ if (aggstate->hash_spills != NULL)
+ {
+ int setno;
+
+ for (setno = 0; setno < aggstate->num_hashes; setno++)
+ {
+ HashAggSpill *spill = &aggstate->hash_spills[setno];
+ if (spill->ntuples != NULL)
+ pfree(spill->ntuples);
+ if (spill->partitions != NULL)
+ pfree(spill->partitions);
+ }
+ pfree(aggstate->hash_spills);
+ aggstate->hash_spills = NULL;
+ }
+
+ /* free batches */
+ foreach(lc, aggstate->hash_batches)
+ {
+ HashAggBatch *batch = (HashAggBatch*) lfirst(lc);
+ pfree(batch);
+ }
+ list_free(aggstate->hash_batches);
+ aggstate->hash_batches = NIL;
+
+ /* close tape set */
+ if (aggstate->hash_tapeinfo != NULL)
+ {
+ HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
+
+ LogicalTapeSetClose(tapeinfo->tapeset);
+ pfree(tapeinfo->freetapes);
+ pfree(tapeinfo);
+ aggstate->hash_tapeinfo = NULL;
+ }
+}
+
+
/* -----------------
* ExecInitAgg
*
@@ -2518,9 +3499,36 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
*/
if (use_hashing)
{
+ Plan *outerplan = outerPlan(node);
+ uint64 totalGroups = 0;
+ int i;
+
+ aggstate->hash_metacxt = AllocSetContextCreate(
+ aggstate->ss.ps.state->es_query_cxt,
+ "HashAgg meta context",
+ ALLOCSET_DEFAULT_SIZES);
+ aggstate->hash_spill_slot = ExecInitExtraTupleSlot(
+ estate, scanDesc, &TTSOpsMinimalTuple);
+
/* this is an array of pointers, not structures */
aggstate->hash_pergroup = pergroups;
+ aggstate->hashentrysize = hash_agg_entry_size(
+ aggstate->numtrans, outerplan->plan_width, node->transitionSpace);
+
+ /*
+ * Consider all of the grouping sets together when setting the limits
+ * and estimating the number of partitions. This can be inaccurate
+ * when there is more than one grouping set, but should still be
+ * reasonable.
+ */
+ for (i = 0; i < aggstate->num_hashes; i++)
+ totalGroups += aggstate->perhash[i].aggnode->numGroups;
+
+ hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
+ &aggstate->hash_mem_limit,
+ &aggstate->hash_ngroups_limit,
+ &aggstate->hash_planned_partitions);
find_hash_columns(aggstate);
build_hash_tables(aggstate);
aggstate->table_filled = false;
@@ -2931,6 +3939,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash,
false);
+ /* cache compiled expression for outer slot without NULL check */
+ phase->evaltrans_cache[0][0] = phase->evaltrans;
}
return aggstate;
@@ -3424,6 +4434,14 @@ ExecEndAgg(AggState *node)
if (node->sort_out)
tuplesort_end(node->sort_out);
+ hashagg_reset_spill_state(node);
+
+ if (node->hash_metacxt != NULL)
+ {
+ MemoryContextDelete(node->hash_metacxt);
+ node->hash_metacxt = NULL;
+ }
+
for (transno = 0; transno < node->numtrans; transno++)
{
AggStatePerTrans pertrans = &node->pertrans[transno];
@@ -3479,12 +4497,13 @@ ExecReScanAgg(AggState *node)
return;
/*
- * If we do have the hash table, and the subplan does not have any
- * parameter changes, and none of our own parameter changes affect
- * input expressions of the aggregated functions, then we can just
- * rescan the existing hash table; no need to build it again.
+ * If we do have the hash table, and it never spilled, and the subplan
+ * does not have any parameter changes, and none of our own parameter
+ * changes affect input expressions of the aggregated functions, then
+ * we can just rescan the existing hash table; no need to build it
+ * again.
*/
- if (outerPlan->chgParam == NULL &&
+ if (outerPlan->chgParam == NULL && !node->hash_ever_spilled &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
ResetTupleHashIterator(node->perhash[0].hashtable,
@@ -3541,11 +4560,19 @@ ExecReScanAgg(AggState *node)
*/
if (node->aggstrategy == AGG_HASHED || node->aggstrategy == AGG_MIXED)
{
+ hashagg_reset_spill_state(node);
+
+ node->hash_ever_spilled = false;
+ node->hash_spill_mode = false;
+ node->hash_ngroups_current = 0;
+
ReScanExprContext(node->hashcontext);
/* Rebuild an empty hash table */
build_hash_tables(node);
node->table_filled = false;
/* iterator will be reset when the table is filled */
+
+ hashagg_recompile_expressions(node, false, false);
}
if (node->aggstrategy != AGG_HASHED)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b5a0033721f..8cf694b61dc 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -77,6 +77,7 @@
#include "access/htup_details.h"
#include "access/tsmapi.h"
#include "executor/executor.h"
+#include "executor/nodeAgg.h"
#include "executor/nodeHash.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -128,6 +129,8 @@ bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
bool enable_hashagg = true;
+bool enable_hashagg_disk = true;
+bool enable_groupingsets_hash_disk = false;
bool enable_nestloop = true;
bool enable_material = true;
bool enable_mergejoin = true;
@@ -2153,7 +2156,7 @@ cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples)
+ double input_tuples, double input_width)
{
double output_tuples;
Cost startup_cost;
@@ -2228,14 +2231,79 @@ cost_agg(Path *path, PlannerInfo *root,
startup_cost += disable_cost;
startup_cost += aggcosts->transCost.startup;
startup_cost += aggcosts->transCost.per_tuple * input_tuples;
+ /* cost of computing hash value */
startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;
startup_cost += aggcosts->finalCost.startup;
+
total_cost = startup_cost;
total_cost += aggcosts->finalCost.per_tuple * numGroups;
+ /* cost of retrieving from hash table */
total_cost += cpu_tuple_cost * numGroups;
output_tuples = numGroups;
}
+ /*
+ * Add the disk costs of hash aggregation that spills to disk.
+ *
+ * Groups that go into the hash table stay in memory until finalized,
+ * so spilling and reprocessing tuples doesn't incur additional
+ * invocations of transCost or finalCost. Furthermore, the computed
+ * hash value is stored with the spilled tuples, so we don't incur
+ * extra invocations of the hash function.
+ *
+ * Hash Agg begins returning tuples after the first batch is
+ * complete. Accrue writes (spilled tuples) to startup_cost and to
+ * total_cost; accrue reads only to total_cost.
+ */
+ if (aggstrategy == AGG_HASHED || aggstrategy == AGG_MIXED)
+ {
+ double pages_written = 0.0;
+ double pages_read = 0.0;
+ double hashentrysize;
+ double nbatches;
+ Size mem_limit;
+ uint64 ngroups_limit;
+ int num_partitions;
+
+
+ /*
+ * Estimate number of batches based on the computed limits. If less
+ * than or equal to one, all groups are expected to fit in memory;
+ * otherwise we expect to spill.
+ */
+ hashentrysize = hash_agg_entry_size(
+ aggcosts->numAggs, input_width, aggcosts->transitionSpace);
+ hash_agg_set_limits(hashentrysize, numGroups, 0, &mem_limit,
+ &ngroups_limit, &num_partitions);
+
+ nbatches = Max( (numGroups * hashentrysize) / mem_limit,
+ numGroups / ngroups_limit );
+
+ /*
+ * Estimate number of pages read and written. For each level of
+ * recursion, a tuple must be written and then later read.
+ */
+ if (nbatches > 1.0)
+ {
+ double depth;
+ double pages;
+
+ pages = relation_byte_size(input_tuples, input_width) / BLCKSZ;
+
+ /*
+ * The number of partitions can change at different levels of
+ * recursion; but for the purposes of this calculation assume it
+ * stays constant.
+ */
+ depth = ceil( log(nbatches - 1) / log(num_partitions) );
+ pages_written = pages_read = pages * depth;
+ }
+
+ startup_cost += pages_written * random_page_cost;
+ total_cost += pages_written * random_page_cost;
+ total_cost += pages_read * seq_page_cost;
+ }
+
/*
* If there are quals (HAVING quals), account for their cost and
* selectivity.
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b44efd6314c..eb25c2f4707 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4258,11 +4258,12 @@ consider_groupingsets_paths(PlannerInfo *root,
dNumGroups - exclude_groups);
/*
- * gd->rollups is empty if we have only unsortable columns to work
- * with. Override work_mem in that case; otherwise, we'll rely on the
- * sorted-input case to generate usable mixed paths.
+ * If we have sortable columns to work with (gd->rollups is non-empty)
+ * and enable_groupingsets_hash_disk is disabled, don't generate
+ * hash-based paths that will exceed work_mem.
*/
- if (hashsize > work_mem * 1024L && gd->rollups)
+ if (!enable_groupingsets_hash_disk &&
+ hashsize > work_mem * 1024L && gd->rollups)
return; /* nope, won't fit */
/*
@@ -6528,7 +6529,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
* were unable to sort above, then we'd better generate a Path, so
* that we at least have one.
*/
- if (hashaggtablesize < work_mem * 1024L ||
+ if (enable_hashagg_disk ||
+ hashaggtablesize < work_mem * 1024L ||
grouped_rel->pathlist == NIL)
{
/*
@@ -6561,7 +6563,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
agg_final_costs,
dNumGroups);
- if (hashaggtablesize < work_mem * 1024L)
+ if (enable_hashagg_disk ||
+ hashaggtablesize < work_mem * 1024L)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6830,7 +6833,7 @@ create_partial_grouping_paths(PlannerInfo *root,
* Tentatively produce a partial HashAgg Path, depending on if it
* looks as if the hash table will fit in work_mem.
*/
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_disk || hashaggtablesize < work_mem * 1024L) &&
cheapest_total_path != NULL)
{
add_path(partially_grouped_rel, (Path *)
@@ -6857,7 +6860,7 @@ create_partial_grouping_paths(PlannerInfo *root,
dNumPartialPartialGroups);
/* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
+ if ((enable_hashagg_disk || hashaggtablesize < work_mem * 1024L) &&
cheapest_partial_path != NULL)
{
add_partial_path(partially_grouped_rel, (Path *)
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 1a23e18970d..951aed80e7a 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1072,7 +1072,7 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
numGroupCols, dNumGroups,
NIL,
input_path->startup_cost, input_path->total_cost,
- input_path->rows);
+ input_path->rows, input_path->pathtarget->width);
/*
* Now for the sorted case. Note that the input is *always* unsorted,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d9ce5162116..8ba8122ee2f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1704,7 +1704,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
NIL,
subpath->startup_cost,
subpath->total_cost,
- rel->rows);
+ rel->rows,
+ subpath->pathtarget->width);
}
if (sjinfo->semi_can_btree && sjinfo->semi_can_hash)
@@ -2958,7 +2959,7 @@ create_agg_path(PlannerInfo *root,
list_length(groupClause), numGroups,
qual,
subpath->startup_cost, subpath->total_cost,
- subpath->rows);
+ subpath->rows, subpath->pathtarget->width);
/* add tlist eval cost for each output row */
pathnode->path.startup_cost += target->cost.startup;
@@ -3069,7 +3070,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
subpath->startup_cost,
subpath->total_cost,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
is_first = false;
if (!rollup->is_hashed)
is_first_sort = false;
@@ -3092,7 +3094,8 @@ create_groupingsets_path(PlannerInfo *root,
rollup->numGroups,
having_qual,
0.0, 0.0,
- subpath->rows);
+ subpath->rows,
+ subpath->pathtarget->width);
if (!rollup->is_hashed)
is_first_sort = false;
}
@@ -3117,7 +3120,8 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
sort_path.startup_cost,
sort_path.total_cost,
- sort_path.rows);
+ sort_path.rows,
+ subpath->pathtarget->width);
}
pathnode->path.total_cost += agg_path.total_cost;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4c6d6486623..64da8882082 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -999,6 +999,26 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_hashagg_disk", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of hashed aggregation plans that are expected to exceed work_mem."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_hashagg_disk,
+ true,
+ NULL, NULL, NULL
+ },
+ {
+ {"enable_groupingsets_hash_disk", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of hashed aggregation plans for groupingsets when the total size of the hash tables is expected to exceed work_mem."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_groupingsets_hash_disk,
+ false,
+ NULL, NULL, NULL
+ },
{
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of materialization."),
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 264916f9a92..a5b8a004d1e 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -280,6 +280,11 @@ typedef struct AggStatePerPhaseData
Sort *sortnode; /* Sort node for input ordering for phase */
ExprState *evaltrans; /* evaluation of transition functions */
+
+ /* cached variants of the compiled expression */
+ ExprState *evaltrans_cache
+ [2] /* 0: outerops; 1: TTSOpsMinimalTuple */
+ [2]; /* 0: no NULL check; 1: with NULL check */
} AggStatePerPhaseData;
/*
@@ -311,5 +316,8 @@ extern void ExecReScanAgg(AggState *node);
extern Size hash_agg_entry_size(int numAggs, Size tupleWidth,
Size transitionSpace);
+extern void hash_agg_set_limits(double hashentrysize, uint64 input_groups,
+ int used_bits, Size *mem_limit,
+ uint64 *ngroups_limit, int *num_partitions);
#endif /* NODEAGG_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cd3ddf781f1..3d27d50f090 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2079,12 +2079,32 @@ typedef struct AggState
/* these fields are used in AGG_HASHED and AGG_MIXED modes: */
bool table_filled; /* hash table filled yet? */
int num_hashes;
+ MemoryContext hash_metacxt; /* memory for hash table itself */
+ struct HashTapeInfo *hash_tapeinfo; /* metadata for spill tapes */
+ struct HashAggSpill *hash_spills; /* HashAggSpill for each grouping set,
+ exists only during first pass */
+ TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
+ List *hash_batches; /* hash batches remaining to be processed */
+ bool hash_ever_spilled; /* ever spilled during this execution? */
+ bool hash_spill_mode; /* we hit a limit during the current batch
+ and we must not create new groups */
+ Size hash_mem_limit; /* limit before spilling hash table */
+ uint64 hash_ngroups_limit; /* limit before spilling hash table */
+ int hash_planned_partitions; /* number of partitions planned
+ for first pass */
+ double hashentrysize; /* estimate revised during execution */
+ Size hash_mem_peak; /* peak hash table memory usage */
+ uint64 hash_ngroups_current; /* number of groups currently in
+ memory in all hash tables */
+ uint64 hash_disk_used; /* kB of disk space used */
+ int hash_batches_used; /* batches used during entire execution */
+
AggStatePerHash perhash; /* array of per-hashtable data */
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 34
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 49
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index cb012ba1980..735ba096503 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -54,6 +54,8 @@ extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
extern PGDLLIMPORT bool enable_hashagg;
+extern PGDLLIMPORT bool enable_hashagg_disk;
+extern PGDLLIMPORT bool enable_groupingsets_hash_disk;
extern PGDLLIMPORT bool enable_nestloop;
extern PGDLLIMPORT bool enable_material;
extern PGDLLIMPORT bool enable_mergejoin;
@@ -114,7 +116,7 @@ extern void cost_agg(Path *path, PlannerInfo *root,
int numGroupCols, double numGroups,
List *quals,
Cost input_startup_cost, Cost input_total_cost,
- double input_tuples);
+ double input_tuples, double input_width);
extern void cost_windowagg(Path *path, PlannerInfo *root,
List *windowFuncs, int numPartCols, int numOrderCols,
Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index f457b5b150f..0073072a368 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2357,3 +2357,187 @@ explain (costs off)
-> Seq Scan on onek
(8 rows)
+--
+-- Hash Aggregation Spill tests
+--
+set enable_sort=false;
+set work_mem='64kB';
+select unique1, count(*), sum(twothousand) from tenk1
+group by unique1
+having sum(fivethous) > 4975
+order by sum(twothousand);
+ unique1 | count | sum
+---------+-------+------
+ 4976 | 1 | 976
+ 4977 | 1 | 977
+ 4978 | 1 | 978
+ 4979 | 1 | 979
+ 4980 | 1 | 980
+ 4981 | 1 | 981
+ 4982 | 1 | 982
+ 4983 | 1 | 983
+ 4984 | 1 | 984
+ 4985 | 1 | 985
+ 4986 | 1 | 986
+ 4987 | 1 | 987
+ 4988 | 1 | 988
+ 4989 | 1 | 989
+ 4990 | 1 | 990
+ 4991 | 1 | 991
+ 4992 | 1 | 992
+ 4993 | 1 | 993
+ 4994 | 1 | 994
+ 4995 | 1 | 995
+ 4996 | 1 | 996
+ 4997 | 1 | 997
+ 4998 | 1 | 998
+ 4999 | 1 | 999
+ 9976 | 1 | 1976
+ 9977 | 1 | 1977
+ 9978 | 1 | 1978
+ 9979 | 1 | 1979
+ 9980 | 1 | 1980
+ 9981 | 1 | 1981
+ 9982 | 1 | 1982
+ 9983 | 1 | 1983
+ 9984 | 1 | 1984
+ 9985 | 1 | 1985
+ 9986 | 1 | 1986
+ 9987 | 1 | 1987
+ 9988 | 1 | 1988
+ 9989 | 1 | 1989
+ 9990 | 1 | 1990
+ 9991 | 1 | 1991
+ 9992 | 1 | 1992
+ 9993 | 1 | 1993
+ 9994 | 1 | 1994
+ 9995 | 1 | 1995
+ 9996 | 1 | 1996
+ 9997 | 1 | 1997
+ 9998 | 1 | 1998
+ 9999 | 1 | 1999
+(48 rows)
+
+set work_mem to default;
+set enable_sort to default;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+set work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------------
+ GroupAggregate
+ Group Key: ((g % 100000))
+ -> Sort
+ Sort Key: ((g % 100000))
+ -> Function Scan on generate_series g
+(5 rows)
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+create table agg_group_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+set jit_above_cost to default;
+create table agg_group_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_group_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+-- Produce results with hash aggregation
+set enable_hashagg = true;
+set enable_sort = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 100000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+create table agg_hash_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+set jit_above_cost to default;
+create table agg_hash_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+create table agg_hash_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+set enable_sort = true;
+set work_mem to default;
+-- Compare group aggregation results to hash aggregation results
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+ a | c1 | c2 | c3
+---+----+----+----
+(0 rows)
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+(select * from agg_hash_4 except select * from agg_group_4)
+ union all
+(select * from agg_group_4 except select * from agg_hash_4);
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_group_4;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
+drop table agg_hash_4;
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index c1f802c88a7..dbe5140b558 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1633,4 +1633,126 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
| 1 | 2
(4 rows)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low
+-- and turning on enable_groupingsets_hash_disk.
+--
+SET enable_groupingsets_hash_disk = true;
+SET work_mem='64kB';
+-- Produce results with sorting.
+set enable_hashagg = false;
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------------------
+ GroupAggregate
+ Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 1000)), ((g.g % 100))
+ Group Key: ((g.g % 1000))
+ Group Key: ()
+ Sort Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100)), ((g.g % 10))
+ Group Key: ((g.g % 100))
+ Sort Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10)), ((g.g % 1000))
+ Group Key: ((g.g % 10))
+ -> Sort
+ Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
+ -> Function Scan on generate_series g
+(14 rows)
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+-- Produce results with hash aggregation.
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+set jit_above_cost = 0;
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+ QUERY PLAN
+---------------------------------------------------
+ MixedAggregate
+ Hash Key: (g.g % 1000), (g.g % 100), (g.g % 10)
+ Hash Key: (g.g % 1000), (g.g % 100)
+ Hash Key: (g.g % 1000)
+ Hash Key: (g.g % 100), (g.g % 10)
+ Hash Key: (g.g % 100)
+ Hash Key: (g.g % 10), (g.g % 1000)
+ Hash Key: (g.g % 10)
+ Group Key: ()
+ -> Function Scan on generate_series g
+(10 rows)
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+set jit_above_cost to default;
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+set enable_sort = true;
+set work_mem to default;
+-- Compare results
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+ g1000 | g100 | g10 | sum | count | max
+-------+------+-----+-----+-------+-----
+(0 rows)
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+ g100 | g10 | unnest | c | m
+------+-----+--------+---+---
+(0 rows)
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
+SET enable_groupingsets_hash_disk TO DEFAULT;
-- end
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1de..11c6f50fbfa 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -148,6 +148,68 @@ SELECT count(*) FROM
4
(1 row)
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+SET work_mem='64kB';
+-- Produce results with sorting.
+SET enable_hashagg=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------------
+ Unique
+ -> Sort
+ Sort Key: ((g % 1000))
+ -> Function Scan on generate_series g
+(4 rows)
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_hashagg=TRUE;
+-- Produce results with hash aggregation.
+SET enable_sort=FALSE;
+SET jit_above_cost=0;
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+ QUERY PLAN
+------------------------------------------
+ HashAggregate
+ Group Key: (g % 1000)
+ -> Function Scan on generate_series g
+(3 rows)
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+SET jit_above_cost TO DEFAULT;
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+SET enable_sort=TRUE;
+SET work_mem TO DEFAULT;
+-- Compare results
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+ ?column?
+----------
+(0 rows)
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb9057..715842b87af 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -74,7 +74,9 @@ select name, setting from pg_settings where name like 'enable%';
--------------------------------+---------
enable_bitmapscan | on
enable_gathermerge | on
+ enable_groupingsets_hash_disk | off
enable_hashagg | on
+ enable_hashagg_disk | on
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
@@ -89,7 +91,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(19 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 3e593f2d615..02578330a6f 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1032,3 +1032,134 @@ select v||'a', case when v||'a' = 'aa' then 1 else 0 end, count(*)
explain (costs off)
select 1 from tenk1
where (hundred, thousand) in (select twothousand, twothousand from onek);
+
+--
+-- Hash Aggregation Spill tests
+--
+
+set enable_sort=false;
+set work_mem='64kB';
+
+select unique1, count(*), sum(twothousand) from tenk1
+group by unique1
+having sum(fivethous) > 4975
+order by sum(twothousand);
+
+set work_mem to default;
+set enable_sort to default;
+
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+set work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_group_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+
+set jit_above_cost to default;
+
+create table agg_group_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_group_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+-- Produce results with hash aggregation
+
+set enable_hashagg = true;
+set enable_sort = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_1 as
+select g%100000 as c1, sum(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 199999) g
+ group by g%100000;
+
+create table agg_hash_2 as
+select * from
+ (values (100), (300), (500)) as r(a),
+ lateral (
+ select (g/2)::numeric as c1,
+ array_agg(g::numeric) as c2,
+ count(*) as c3
+ from generate_series(0, 1999) g
+ where g < r.a
+ group by g/2) as s;
+
+set jit_above_cost to default;
+
+create table agg_hash_3 as
+select (g/2)::numeric as c1, sum(7::int4) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+create table agg_hash_4 as
+select (g/2)::numeric as c1, array_agg(g::numeric) as c2, count(*) as c3
+ from generate_series(0, 1999) g
+ group by g/2;
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare group aggregation results to hash aggregation results
+
+(select * from agg_hash_1 except select * from agg_group_1)
+ union all
+(select * from agg_group_1 except select * from agg_hash_1);
+
+(select * from agg_hash_2 except select * from agg_group_2)
+ union all
+(select * from agg_group_2 except select * from agg_hash_2);
+
+(select * from agg_hash_3 except select * from agg_group_3)
+ union all
+(select * from agg_group_3 except select * from agg_hash_3);
+
+(select * from agg_hash_4 except select * from agg_group_4)
+ union all
+(select * from agg_group_4 except select * from agg_hash_4);
+
+drop table agg_group_1;
+drop table agg_group_2;
+drop table agg_group_3;
+drop table agg_group_4;
+drop table agg_hash_1;
+drop table agg_hash_2;
+drop table agg_hash_3;
+drop table agg_hash_4;
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 95ac3fb52f6..478f49ecab5 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -441,4 +441,107 @@ select v||'a', case when grouping(v||'a') = 1 then 1 else 0 end, count(*)
from unnest(array[1,1], array['a','b']) u(i,v)
group by rollup(i, v||'a') order by 1,3;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low
+-- and turning on enable_groupingsets_hash_disk.
+--
+
+SET enable_groupingsets_hash_disk = true;
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+set enable_hashagg = false;
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_group_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_group_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+-- Produce results with hash aggregation.
+
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+
+set jit_above_cost = 0;
+
+explain (costs off)
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_1 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
+ from generate_series(0,199999) g) s
+group by cube (g1000,g100,g10);
+
+set jit_above_cost to default;
+
+create table gs_hash_2 as
+select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
+ (select g/20 as g1000, g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by cube (g1000,g100,g10);
+
+create table gs_hash_3 as
+select g100, g10, array_agg(g) as a, count(*) as c, max(g::text) as m from
+ (select g/200 as g100, g/2000 as g10, g
+ from generate_series(0,19999) g) s
+group by grouping sets (g100,g10);
+
+set enable_sort = true;
+set work_mem to default;
+
+-- Compare results
+
+(select * from gs_hash_1 except select * from gs_group_1)
+ union all
+(select * from gs_group_1 except select * from gs_hash_1);
+
+(select * from gs_hash_2 except select * from gs_group_2)
+ union all
+(select * from gs_group_2 except select * from gs_hash_2);
+
+(select g100,g10,unnest(a),c,m from gs_hash_3 except
+ select g100,g10,unnest(a),c,m from gs_group_3)
+ union all
+(select g100,g10,unnest(a),c,m from gs_group_3 except
+ select g100,g10,unnest(a),c,m from gs_hash_3);
+
+drop table gs_group_1;
+drop table gs_group_2;
+drop table gs_group_3;
+drop table gs_hash_1;
+drop table gs_hash_2;
+drop table gs_hash_3;
+
+SET enable_groupingsets_hash_disk TO DEFAULT;
+
-- end
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449e..33102744ebf 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -45,6 +45,68 @@ SELECT count(*) FROM
SELECT count(*) FROM
(SELECT DISTINCT two, four, two FROM tenk1) ss;
+--
+-- Compare results between plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low.
+--
+
+SET work_mem='64kB';
+
+-- Produce results with sorting.
+
+SET enable_hashagg=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_group_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_group_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_hashagg=TRUE;
+
+-- Produce results with hash aggregation.
+
+SET enable_sort=FALSE;
+
+SET jit_above_cost=0;
+
+EXPLAIN (costs off)
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+CREATE TABLE distinct_hash_1 AS
+SELECT DISTINCT g%1000 FROM generate_series(0,9999) g;
+
+SET jit_above_cost TO DEFAULT;
+
+CREATE TABLE distinct_hash_2 AS
+SELECT DISTINCT (g%1000)::text FROM generate_series(0,9999) g;
+
+SET enable_sort=TRUE;
+
+SET work_mem TO DEFAULT;
+
+-- Compare results
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+(SELECT * FROM distinct_hash_1 EXCEPT SELECT * FROM distinct_group_1)
+ UNION ALL
+(SELECT * FROM distinct_group_1 EXCEPT SELECT * FROM distinct_hash_1);
+
+DROP TABLE distinct_hash_1;
+DROP TABLE distinct_hash_2;
+DROP TABLE distinct_group_1;
+DROP TABLE distinct_group_2;
+
--
-- Also, some tests of IS DISTINCT FROM, which doesn't quite deserve its
-- very own regression file.
Committed.
There's some future work that would be nice (some of these are just
ideas and may not be worth it):
* Refactor MemoryContextMemAllocated() to be a part of
MemoryContextStats(), but allow it to avoid walking through the blocks
and freelists.
* Improve the choice of the initial number of buckets in the hash
table. For this patch, I tried to preserve the existing behavior of
estimating the number of groups and trying to initialize with that many
buckets. But my performance tests seem to indicate this is not the best
approach. More work is needed to find what we should really do here.
* For workloads that are not in work_mem *or* system memory, and need
to actually go to storage, I see poor CPU utilization because it's not
effectively overlapping CPU and IO work. Perhaps buffering or readahead
changes can improve this, or async IO (even better).
* Project unnecessary attributes away before spilling tuples to disk.
* Improve logtape.c API so that the caller doesn't need to manage a
bunch of tape numbers.
* Improve estimate of the hash entry size. This patch doesn't change
the way the planner estimates it, but I observe that actual size as
seen at runtime is significantly different. This is connected to the
initial number of buckets for the hash table.
* In recursive steps, I don't have a good estimate for the number of
groups, so I just estimate it as the number of tuples in that spill
tape (which is pessimistic). That could be improved by doing a real
cardinality estimate as the tuples are spilling (perhaps with HLL?);
a rough sketch follows this list.
* Many aggregates with pass-by-ref transition states don't provide a
great aggtransspace. We should consider doing something smarter, like
having negative numbers represent a number that should be multiplied by
the size of the group (e.g. ARRAY_AGG would have a size dependent on
the group size, not a constant).
* If we want to handle ARRAY_AGG (and the like) well, we can consider
spilling the partial states in the hash table when memory is full.
That would add a fair amount of complexity because there would be two
types of spilled data (tuples and partial states), but it could be
useful in some cases.
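To make the HLL idea above a bit more concrete (see the recursive-steps
item), here is a rough sketch of how the spill path could feed the tuple
hash values it already computes into the existing HyperLogLog support in
lib/hyperloglog.h, and read back a group estimate before recursing. This
is only an illustration of the idea, not committed code; the
SpillPartitionEst struct and function names are made up:

/*
 * Rough sketch only -- not committed code.  The spill path already has a
 * 32-bit hash for every tuple it writes, so it could feed that hash into a
 * per-partition HyperLogLog state and use the estimate for the recursive
 * pass instead of assuming one group per spilled tuple.
 */
#include "postgres.h"

#include "lib/hyperloglog.h"

typedef struct SpillPartitionEst	/* hypothetical; not a real struct */
{
	hyperLogLogState hll;		/* distinct-hash sketch for this partition */
	uint64		ntuples;		/* tuples written to this partition */
} SpillPartitionEst;

static void
spill_est_init(SpillPartitionEst *est)
{
	/* 2^5 registers: small and coarse, enough for a batching decision */
	initHyperLogLog(&est->hll, 5);
	est->ntuples = 0;
}

static void
spill_est_add(SpillPartitionEst *est, uint32 hash)
{
	addHyperLogLog(&est->hll, hash);
	est->ntuples++;
}

static uint64
spill_est_ngroups(SpillPartitionEst *est)
{
	double		d = estimateHyperLogLog(&est->hll);

	/* never report more groups than tuples, and at least one group */
	d = Min(d, (double) est->ntuples);
	return (uint64) Max(d, 1.0);
}

static void
spill_est_done(SpillPartitionEst *est)
{
	freeHyperLogLog(&est->hll);
}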
Regards,
Jeff Davis
On Wed, Mar 18, 2020 at 04:35:57PM -0700, Jeff Davis wrote:
Committed.
\o/
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Mar 15, 2020 at 04:05:37PM -0700, Jeff Davis wrote:
+	if (from_tape)
+		partition_mem += HASHAGG_READ_BUFFER_SIZE;
+	partition_mem = npartitions * HASHAGG_WRITE_BUFFER_SIZE;

=> That looks wrong; should say += ?
Good catch! Fixed.
+++ b/src/backend/executor/nodeAgg.c
@@ -2518,9 +3499,36 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
	 */
	if (use_hashing)
	{
+		Plan	   *outerplan = outerPlan(node);
+		uint64		totalGroups = 0;
+		for (i = 0; i < aggstate->num_hashes; i++)
+			totalGroups = aggstate->perhash[i].aggnode->numGroups;
+
+		hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
I realize that I missed the train but .. that looks like another += issue?
Also, Andres was educating me about the range of behavior of "long" type, and I
see now while rebasing that you did the same thing.
/messages/by-id/20200306175859.d56ohskarwldyrrw@alap3.anarazel.de
--
Justin
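Regarding the totalGroups loop quoted above: the concern is that the
assignment inside the loop overwrites the running total on each iteration
instead of accumulating it. A trivial standalone illustration of the
intended accumulation (the struct below is a stand-in, not the real
AggStatePerHashData):

#include <stdio.h>
#include <stdint.h>

struct perhash_stub
{
	uint64_t	numGroups;		/* estimated groups for one grouping set */
};

int
main(void)
{
	struct perhash_stub perhash[] = {{100}, {250}, {50}};
	int			num_hashes = 3;
	uint64_t	totalGroups = 0;

	for (int i = 0; i < num_hashes; i++)
		totalGroups += perhash[i].numGroups;	/* "+=", not "=" */

	printf("totalGroups = %llu\n", (unsigned long long) totalGroups);
	return 0;
}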
Hi,
I happened to notice that "set enable_sort to false" does not guarantee
that the planner uses hashagg in the groupingsets.sql test, so the
following comparison of sortagg and hashagg results seems to be
meaningless.
Thanks,
Pengzhou
On Fri, Mar 20, 2020 at 1:20 PM Pengzhou Tang <ptang@pivotal.io> wrote:
Hi,
I happened to notice that "set enable_sort to false" does not guarantee
that the planner uses hashagg in the groupingsets.sql test, so the
following comparison of sortagg and hashagg results seems to be
meaningless.
Please forget my comment, I should set enable_groupingsets_hash_disk too.
Hello,
When calculating the disk costs of hash aggregation that spills to disk,
there is something wrong with how we determine depth:
depth = ceil( log(nbatches - 1) / log(num_partitions) );
If nbatches is some number between 1.0 and 2.0, we would have a negative
depth. As a result, we may have a negative cost for hash aggregation
plan node, as described in [1].
I don't think 'log(nbatches - 1)' is what we want here. Should it be
just '(nbatches - 1)'?
[1]: /messages/by-id/CAMbWs4_maqdBnRR4x01pDpoV-CiQ+RvMQaPm4JoTPbA=mZmhMw@mail.gmail.com
Thanks
Richard
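To see the failure mode Richard describes numerically (the constants here
are made up for illustration; this is not the planner code):

#include <math.h>
#include <stdio.h>

int
main(void)
{
	double		nbatches = 1.1;			/* a fractional estimate from the planner */
	double		num_partitions = 4.0;	/* assumed spill fan-out */

	/* log(0.1) is negative, so the whole expression goes negative */
	double		depth = ceil(log(nbatches - 1) / log(num_partitions));

	printf("depth = %.0f\n", depth);	/* prints -1 */
	return 0;
}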
On Thu, Mar 26, 2020 at 05:56:56PM +0800, Richard Guo wrote:
Hello,
When calculating the disk costs of hash aggregation that spills to disk,
there is something wrong with how we determine depth:

depth = ceil( log(nbatches - 1) / log(num_partitions) );
If nbatches is some number between 1.0 and 2.0, we would have a negative
depth. As a result, we may have a negative cost for hash aggregation
plan node, as described in [1].

I don't think 'log(nbatches - 1)' is what we want here. Should it be
just '(nbatches - 1)'?
I think using log() is correct, but why should we allow fractional
nbatches values between 1.0 and 2.0? You either have 1 batch or 2
batches, you can't have 1.5 batches. So I think the issue is here
nbatches = Max((numGroups * hashentrysize) / mem_limit,
numGroups / ngroups_limit );
and we should probably do
nbatches = ceil(nbatches);
right after it.
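A standalone sketch of that shape, with fmax() standing in for the
backend's Max() macro and placeholder inputs rather than the real
cost_agg() variables:

#include <math.h>
#include <stdio.h>

int
main(void)
{
	/* placeholder inputs, not taken from any real plan */
	double		numGroups = 5000.0;
	double		hashentrysize = 64.0;
	double		mem_limit = 64.0 * 1024.0;
	double		ngroups_limit = 1000.0;
	double		num_partitions = 4.0;

	double		nbatches;
	double		depth;

	nbatches = fmax((numGroups * hashentrysize) / mem_limit,
					numGroups / ngroups_limit);
	nbatches = ceil(nbatches);	/* round up: only whole batches exist */

	if (nbatches > 1.0)
		depth = ceil(log(nbatches - 1) / log(num_partitions));
	else
		depth = 0.0;			/* single batch: no recursive passes */

	printf("nbatches = %.0f, depth = %.0f\n", nbatches, depth);
	return 0;
}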
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, 2020-03-27 at 02:31 +0100, Tomas Vondra wrote:
On Thu, Mar 26, 2020 at 05:56:56PM +0800, Richard Guo wrote:
If nbatches is some number between 1.0 and 2.0, we would have a negative
depth. As a result, we may have a negative cost for hash aggregation
plan node, as described in [1].

nbatches = Max((numGroups * hashentrysize) / mem_limit,
               numGroups / ngroups_limit );

and we should probably do

nbatches = ceil(nbatches);
Thank you both. I also protected against nbatches == 0 (shouldn't
happen), and against num_partitions <= 1. That allowed me to remove the
conditional and simplify a bit.
Regards,
Jeff Davis