hashagg slowdown due to spill changes
Hi,
While preparing my pgcon talk I noticed that our hash-agg performance
degraded noticeably. Looks to me like it's due to the spilling-hashagg
changes.
Sample benchmark:
config:
-c huge_pages=on -c shared_buffers=32GB -c jit=0 -c max_parallel_workers_per_gather=0
(largely just to reduce variance)
data prep:
CREATE TABLE fewgroups_many_rows AS SELECT (random() * 4)::int cat, (random()*10000)::int val FROM generate_series(1, 100000000);
VACUUM (FREEZE, ANALYZE) fewgroups_many_rows;
test prep:
CREATE EXTENSION IF NOT EXISTS pg_prewarm;SELECT pg_prewarm('fewgroups_many_rows', 'buffer');
test:
SELECT cat, count(*) FROM fewgroups_many_rows GROUP BY 1;
Using best-of-three timing:
v12     12221.031 ms
master  13855.129 ms
While not the end of the world, that's a definitely noticeable and reproducible slowdown (~12%).
I don't think this is actually an inherent cost, but a question of how
the code ended up being organized. Here's a perf diff of profiles for
both versions:
# Baseline Delta Abs Shared Object Symbol
# ........ ......... ................ .........................................
#
+6.70% postgres [.] LookupTupleHashEntryHash
+6.37% postgres [.] prepare_hash_slot
+4.74% postgres [.] TupleHashTableHash_internal.isra.0
20.36% -2.89% postgres [.] ExecInterpExpr
6.31% -2.73% postgres [.] lookup_hash_entries
+2.36% postgres [.] lookup_hash_entry
+2.14% postgres [.] ExecJustAssignScanVar
2.28% +1.97% postgres [.] ExecScan
2.54% +1.93% postgres [.] MemoryContextReset
3.84% -1.86% postgres [.] SeqNext
10.19% -1.50% postgres [.] tts_buffer_heap_getsomeattrs
+1.42% postgres [.] hash_bytes_uint32
+1.39% postgres [.] TupleHashTableHash
+1.10% postgres [.] tts_virtual_clear
3.36% -0.74% postgres [.] ExecAgg
+0.45% postgres [.] CheckForSerializableConflictOutNeeded
0.25% +0.44% postgres [.] hashint4
5.80% -0.35% postgres [.] tts_minimal_getsomeattrs
1.91% -0.33% postgres [.] heap_getnextslot
4.86% -0.32% postgres [.] heapgettup_pagemode
1.46% -0.32% postgres [.] tts_minimal_clear
While some of this is likely just noise, it's pretty clear that we
spend a substantial amount of additional time below
lookup_hash_entries().
And looking at the code, I'm not too surprised:
Before there was basically one call from nodeAgg.c to execGrouping.c for
each tuple and hash table. Now it's a lot more complicated:
1) nodeAgg.c: prepare_hash_slot()
2) execGrouping.c: TupleHashTableHash()
3) nodeAgg.c: lookup_hash_entry()
4) execGrouping.c: LookupTupleHashEntryHash()
For each of these data needs to be peeled out of one or more of AggState
/ AggStatePerHashData / TupleHashTable. There's no way the compiler can
know that nothing inside those changes, therefore it has to reload the
contents repeatedly. Judging by the profiles, that's where most of the time is going.
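As a toy illustration (not PostgreSQL code) of why the split hurts: because each lookup is an opaque call into another translation unit, the compiler has to assume the call may modify the state struct, and so reloads its fields on every iteration.

typedef struct AggLookupState
{
    void       *hashtable;
    void       *hashslot;
} AggLookupState;

/* defined elsewhere; the compiler can't see that it leaves *state alone */
extern void *do_lookup(void *hashtable, void *hashslot);

void
process_tuples(AggLookupState *state, int ntuples)
{
    for (int i = 0; i < ntuples; i++)
    {
        void       *ht = state->hashtable;      /* reloaded every iteration */
        void       *slot = state->hashslot;     /* reloaded every iteration */

        do_lookup(ht, slot);    /* opaque call: may have changed *state */
    }
}

With several such calls per tuple, those redundant loads add up.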
There's also the issue that the signalling of whether or not to insert got unnecessarily complicated. There are several checks:
1) lookup_hash_entry() (p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;)
2) LookupTupleHashEntry_internal() (if (isnew))
3) lookup_hash_entry() (if (entry == NULL) and if (isnew))
4) lookup_hash_entries() if (!in_hash_table)
Not performance related: I am a bit confused as to why the new per-hash stuff in lookup_hash_entries() isn't in lookup_hash_entry(). I assume that's because of agg_refill_hash_table()?
Why isn't the flow more like this:
1) prepare_hash_slot()
2) if (aggstate->hash_spill_mode) goto 3; else goto 4
3) entry = LookupTupleHashEntry(&hash); if (!entry) hashagg_spill_tuple();
4) InsertTupleHashEntry(&hash, &isnew); if (isnew) initialize(entry)
That way there's again exactly one call to execGrouping.c, there's no
need for nodeAgg to separately compute the hash, there's far fewer
branches...
Doing things this way might perhaps make agg_refill_hash_table() a tiny
bit more complicated, but it'll also avoid the slowdown for the vast
majority of cases where we're not spilling.
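To make the proposed flow concrete, here's a rough sketch of one per-grouping-set lookup. None of the execGrouping.c signatures shown (the hash out-parameter, a separate InsertTupleHashEntry(), initialize()) exist at this point; they're purely illustrative of "one call into execGrouping.c per tuple and hash table":

static void
lookup_one_set(AggState *aggstate, AggStatePerHash perhash,
               HashAggSpill *spill, TupleTableSlot *inputslot)
{
    TupleHashEntry entry;
    uint32      hash;
    bool        isnew;

    prepare_hash_slot(aggstate);

    if (aggstate->hash_spill_mode)
    {
        /* only look up existing groups; spill tuples whose group is absent */
        entry = LookupTupleHashEntry(perhash->hashtable, perhash->hashslot,
                                     NULL, &hash);
        if (entry == NULL)
            hashagg_spill_tuple(spill, inputslot, hash);
    }
    else
    {
        /* find or create the group in a single call into execGrouping.c */
        entry = InsertTupleHashEntry(perhash->hashtable, perhash->hashslot,
                                     &isnew, &hash);
        if (isnew)
            initialize(aggstate, entry);
    }
}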
Greetings,
Andres Freund
On 2020-Jun-05, Andres Freund wrote:
While preparing my pgcon talk I noticed that our hash-agg performance
degraded noticeably. Looks to me like it's due to the spilling-hashagg
changes.
Jeff, what are your thoughts on this?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:
Before there was basically one call from nodeAgg.c to execGrouping.c for each tuple and hash table. Now it's a lot more complicated:
1) nodeAgg.c: prepare_hash_slot()
2) execGrouping.c: TupleHashTableHash()
3) nodeAgg.c: lookup_hash_entry()
4) execGrouping.c: LookupTupleHashEntryHash()
The reason that I did it that way was to be able to store the hash
along with the saved tuple (similar to what HashJoin does), which
avoids recalculation.
That could be a nice savings for some cases, like when work_mem is
small but the data still fits in system memory, which I expect to be
fairly common. But based on your numbers, it might be a bad trade-off
overall.
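For context, the reason storing the hash helps is that the spill path writes the already-computed hash in front of each spilled tuple, so agg_refill_hash_table() can read it back instead of rehashing. Roughly (a simplified sketch, not the exact hashagg_spill_tuple()/hashagg_batch_read() code):

static void
spill_write(LogicalTapeSet *tapeset, int tapenum,
            uint32 hash, MinimalTuple tuple)
{
    /* hash first, then the tuple itself; the reader consumes them in order */
    LogicalTapeWrite(tapeset, tapenum, (void *) &hash, sizeof(uint32));
    LogicalTapeWrite(tapeset, tapenum, (void *) tuple, tuple->t_len);
}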
Why isn't the flow more like this:
1) prepare_hash_slot()
2) if (aggstate->hash_spill_mode) goto 3; else goto 4
3) entry = LookupTupleHashEntry(&hash); if (!entry)
hashagg_spill_tuple();
4) InsertTupleHashEntry(&hash, &isnew); if (isnew) initialize(entry)
I'll work up a patch to refactor this. I'd still like to see if we can
preserve the calculate-hash-once behavior somehow.
Regards,
Jeff Davis
On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:
Why isn't the flow more like this:
1) prepare_hash_slot()
2) if (aggstate->hash_spill_mode) goto 3; else goto 4
3) entry = LookupTupleHashEntry(&hash); if (!entry)
hashagg_spill_tuple();
4) InsertTupleHashEntry(&hash, &isnew); if (isnew) initialize(entry)
I see, you are suggesting that I change around the execGrouping.c
signatures to return the hash, which will avoid the extra call. That
makes more sense.
Regards,
Jeff Davis
Hi,
On 2020-06-08 13:41:29 -0700, Jeff Davis wrote:
On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:
Before there was basically one call from nodeAgg.c to execGrouping.c for each tuple and hash table. Now it's a lot more complicated:
1) nodeAgg.c: prepare_hash_slot()
2) execGrouping.c: TupleHashTableHash()
3) nodeAgg.c: lookup_hash_entry()
4) execGrouping.c: LookupTupleHashEntryHash()
The reason that I did it that way was to be able to store the hash
along with the saved tuple (similar to what HashJoin does), which
avoids recalculation.
That makes sense. But then you can just use a separate call into
execGrouping for that purpose.
Why isn't the flow more like this:
1) prepare_hash_slot()
2) if (aggstate->hash_spill_mode) goto 3; else goto 4
3) entry = LookupTupleHashEntry(&hash); if (!entry)
hashagg_spill_tuple();
4) InsertTupleHashEntry(&hash, &isnew); if (isnew) initialize(entry)
I'll work up a patch to refactor this. I'd still like to see if we can
preserve the calculate-hash-once behavior somehow.
Cool!
Greetings,
Andres Freund
On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:
Hi,
While preparing my pgcon talk I noticed that our hash-agg performance degraded noticeably. Looks to me like it's due to the spilling-hashagg changes.
Attached a proposed fix. (Might require some minor cleanup).
The only awkward part is that LookupTupleHashEntry() needs a new out
parameter to pass the hash value back to the caller. Ordinarily, the
caller can get that from the returned entry, but if isnew==NULL, then
the function might return NULL (and the caller wouldn't have an entry
from which to read the hash value).
Regards,
Jeff Davis
Attachment: refactor-hashlookup.patch (text/x-patch)
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 8be36ca7634..27dbf264766 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -24,9 +24,10 @@
static int TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
static uint32 TupleHashTableHash_internal(struct tuplehash_hash *tb,
const MinimalTuple tuple);
-static TupleHashEntry LookupTupleHashEntry_internal(TupleHashTable hashtable,
- TupleTableSlot *slot,
- bool *isnew, uint32 hash);
+static inline TupleHashEntry LookupTupleHashEntry_internal(
+ TupleHashTable hashtable,
+ TupleTableSlot *slot,
+ bool *isnew, uint32 hash);
/*
* Define parameters for tuple hash table code generation. The interface is
@@ -291,6 +292,9 @@ ResetTupleHashTable(TupleHashTable hashtable)
* If isnew is NULL, we do not create new entries; we return NULL if no
* match is found.
*
+ * If hash is not NULL, we set it to the calculated hash value. This allows
+ * callers access to the hash value even if no entry is returned.
+ *
* If isnew isn't NULL, then a new entry is created if no existing entry
* matches. On return, *isnew is true if the entry is newly created,
* false if it existed already. ->additional_data in the new entry has
@@ -298,11 +302,11 @@ ResetTupleHashTable(TupleHashTable hashtable)
*/
TupleHashEntry
LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
- bool *isnew)
+ bool *isnew, uint32 *hash)
{
TupleHashEntry entry;
MemoryContext oldContext;
- uint32 hash;
+ uint32 local_hash;
/* Need to run the hash functions in short-lived context */
oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
@@ -312,8 +316,12 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
hashtable->cur_eq_func = hashtable->tab_eq_func;
- hash = TupleHashTableHash_internal(hashtable->hashtab, NULL);
- entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
+ local_hash = TupleHashTableHash_internal(hashtable->hashtab, NULL);
+ if (hash != NULL)
+ *hash = local_hash;
+
+ entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, local_hash);
+ Assert(entry == NULL || entry->hash == local_hash);
MemoryContextSwitchTo(oldContext);
@@ -362,6 +370,7 @@ LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot,
hashtable->cur_eq_func = hashtable->tab_eq_func;
entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
+ Assert(entry == NULL || entry->hash == hash);
MemoryContextSwitchTo(oldContext);
@@ -480,7 +489,7 @@ TupleHashTableHash_internal(struct tuplehash_hash *tb,
* NB: This function may or may not change the memory context. Caller is
* expected to change it back.
*/
-static TupleHashEntry
+static inline TupleHashEntry
LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot,
bool *isnew, uint32 hash)
{
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 331acee2814..0447d473c82 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -403,8 +403,9 @@ static int hash_choose_num_partitions(uint64 input_groups,
double hashentrysize,
int used_bits,
int *log2_npartittions);
-static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash,
- bool *in_hash_table);
+static void initialize_hash_entry(AggState *aggstate,
+ TupleHashTable hashtable,
+ TupleHashEntry entry);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
@@ -1979,75 +1980,39 @@ hash_choose_num_partitions(uint64 input_groups, double hashentrysize,
}
/*
- * Find or create a hashtable entry for the tuple group containing the current
- * tuple (already set in tmpcontext's outertuple slot), in the current grouping
- * set (which the caller must have selected - note that initialize_aggregate
- * depends on this).
- *
- * When called, CurrentMemoryContext should be the per-query context. The
- * already-calculated hash value for the tuple must be specified.
- *
- * If in "spill mode", then only find existing hashtable entries; don't create
- * new ones. If a tuple's group is not already present in the hash table for
- * the current grouping set, assign *in_hash_table=false and the caller will
- * spill it to disk.
+ * Initialize a freshly-created TupleHashEntry.
*/
-static AggStatePerGroup
-lookup_hash_entry(AggState *aggstate, uint32 hash, bool *in_hash_table)
+static void
+initialize_hash_entry(AggState *aggstate, TupleHashTable hashtable,
+ TupleHashEntry entry)
{
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
- TupleTableSlot *hashslot = perhash->hashslot;
- TupleHashEntryData *entry;
- bool isnew = false;
- bool *p_isnew;
-
- /* if hash table already spilled, don't create new entries */
- p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;
-
- /* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, p_isnew,
- hash);
-
- if (entry == NULL)
- {
- *in_hash_table = false;
- return NULL;
- }
- else
- *in_hash_table = true;
-
- if (isnew)
- {
- AggStatePerGroup pergroup;
- int transno;
+ AggStatePerGroup pergroup;
+ int transno;
- aggstate->hash_ngroups_current++;
- hash_agg_check_limits(aggstate);
+ aggstate->hash_ngroups_current++;
+ hash_agg_check_limits(aggstate);
- /* no need to allocate or initialize per-group state */
- if (aggstate->numtrans == 0)
- return NULL;
+ /* no need to allocate or initialize per-group state */
+ if (aggstate->numtrans == 0)
+ return;
- pergroup = (AggStatePerGroup)
- MemoryContextAlloc(perhash->hashtable->tablecxt,
- sizeof(AggStatePerGroupData) * aggstate->numtrans);
+ pergroup = (AggStatePerGroup)
+ MemoryContextAlloc(hashtable->tablecxt,
+ sizeof(AggStatePerGroupData) * aggstate->numtrans);
- entry->additional = pergroup;
+ entry->additional = pergroup;
- /*
- * Initialize aggregates for new tuple group, lookup_hash_entries()
- * already has selected the relevant grouping set.
- */
- for (transno = 0; transno < aggstate->numtrans; transno++)
- {
- AggStatePerTrans pertrans = &aggstate->pertrans[transno];
- AggStatePerGroup pergroupstate = &pergroup[transno];
+ /*
+ * Initialize aggregates for new tuple group, lookup_hash_entries()
+ * already has selected the relevant grouping set.
+ */
+ for (transno = 0; transno < aggstate->numtrans; transno++)
+ {
+ AggStatePerTrans pertrans = &aggstate->pertrans[transno];
+ AggStatePerGroup pergroupstate = &pergroup[transno];
- initialize_aggregate(aggstate, pertrans, pergroupstate);
- }
+ initialize_aggregate(aggstate, pertrans, pergroupstate);
}
-
- return entry->additional;
}
/*
@@ -2077,16 +2042,27 @@ lookup_hash_entries(AggState *aggstate)
for (setno = 0; setno < aggstate->num_hashes; setno++)
{
AggStatePerHash perhash = &aggstate->perhash[setno];
+ TupleHashEntry entry;
uint32 hash;
- bool in_hash_table;
+ bool isnew = false;
+ bool *p_isnew;
+
+ /* if hash table already spilled, don't create new entries */
+ p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;
select_current_set(aggstate, setno, true);
prepare_hash_slot(aggstate);
- hash = TupleHashTableHash(perhash->hashtable, perhash->hashslot);
- pergroup[setno] = lookup_hash_entry(aggstate, hash, &in_hash_table);
- /* check to see if we need to spill the tuple for this grouping set */
- if (!in_hash_table)
+ entry = LookupTupleHashEntry(perhash->hashtable, perhash->hashslot,
+ p_isnew, &hash);
+
+ if (entry != NULL)
+ {
+ if (isnew)
+ initialize_hash_entry(aggstate, perhash->hashtable, entry);
+ pergroup[setno] = entry->additional;
+ }
+ else
{
HashAggSpill *spill = &aggstate->hash_spills[setno];
TupleTableSlot *slot = aggstate->tmpcontext->ecxt_outertuple;
@@ -2097,6 +2073,7 @@ lookup_hash_entries(AggState *aggstate)
aggstate->hashentrysize);
hashagg_spill_tuple(spill, slot, hash);
+ pergroup[setno] = NULL;
}
}
}
@@ -2554,6 +2531,7 @@ static bool
agg_refill_hash_table(AggState *aggstate)
{
HashAggBatch *batch;
+ AggStatePerHash perhash;
HashAggSpill spill;
HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
uint64 ngroups_estimate;
@@ -2605,6 +2583,8 @@ agg_refill_hash_table(AggState *aggstate)
select_current_set(aggstate, batch->setno, true);
+ perhash = &aggstate->perhash[aggstate->current_set];
+
/*
* Spilled tuples are always read back as MinimalTuples, which may be
* different from the outer plan, so recompile the aggregate expressions.
@@ -2618,10 +2598,12 @@ agg_refill_hash_table(AggState *aggstate)
HASHAGG_READ_BUFFER_SIZE);
for (;;)
{
- TupleTableSlot *slot = aggstate->hash_spill_slot;
- MinimalTuple tuple;
- uint32 hash;
- bool in_hash_table;
+ TupleTableSlot *slot = aggstate->hash_spill_slot;
+ TupleHashEntry entry;
+ MinimalTuple tuple;
+ uint32 hash;
+ bool isnew = false;
+ bool *p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;
CHECK_FOR_INTERRUPTS();
@@ -2633,12 +2615,14 @@ agg_refill_hash_table(AggState *aggstate)
aggstate->tmpcontext->ecxt_outertuple = slot;
prepare_hash_slot(aggstate);
- aggstate->hash_pergroup[batch->setno] =
- lookup_hash_entry(aggstate, hash, &in_hash_table);
+ entry = LookupTupleHashEntryHash(
+ perhash->hashtable, perhash->hashslot, p_isnew, hash);
- if (in_hash_table)
+ if (entry != NULL)
{
- /* Advance the aggregates (or combine functions) */
+ if (isnew)
+ initialize_hash_entry(aggstate, perhash->hashtable, entry);
+ aggstate->hash_pergroup[batch->setno] = entry->additional;
advance_aggregates(aggstate);
}
else
@@ -2655,6 +2639,8 @@ agg_refill_hash_table(AggState *aggstate)
}
/* no memory for a new group, spill */
hashagg_spill_tuple(&spill, slot, hash);
+
+ aggstate->hash_pergroup[batch->setno] = NULL;
}
/*
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index 620414a1edc..046242682f0 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -94,7 +94,7 @@ ExecRecursiveUnion(PlanState *pstate)
if (plan->numCols > 0)
{
/* Find or build hashtable entry for this tuple's group */
- LookupTupleHashEntry(node->hashtable, slot, &isnew);
+ LookupTupleHashEntry(node->hashtable, slot, &isnew, NULL);
/* Must reset temp context after each hashtable lookup */
MemoryContextReset(node->tempContext);
/* Ignore tuple if already seen */
@@ -141,7 +141,7 @@ ExecRecursiveUnion(PlanState *pstate)
if (plan->numCols > 0)
{
/* Find or build hashtable entry for this tuple's group */
- LookupTupleHashEntry(node->hashtable, slot, &isnew);
+ LookupTupleHashEntry(node->hashtable, slot, &isnew, NULL);
/* Must reset temp context after each hashtable lookup */
MemoryContextReset(node->tempContext);
/* Ignore tuple if already seen */
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index bfd148a41a2..8d4ccff19cc 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -381,7 +381,7 @@ setop_fill_hash_table(SetOpState *setopstate)
/* Find or build hashtable entry for this tuple's group */
entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
- &isnew);
+ &isnew, NULL);
/* If new tuple group, initialize counts */
if (isnew)
@@ -402,7 +402,7 @@ setop_fill_hash_table(SetOpState *setopstate)
/* For tuples not seen previously, do not make hashtable entry */
entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
- NULL);
+ NULL, NULL);
/* Advance the counts if entry is already present */
if (entry)
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 298b7757f57..38c2fc0b50b 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -595,12 +595,12 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
*/
if (slotNoNulls(slot))
{
- (void) LookupTupleHashEntry(node->hashtable, slot, &isnew);
+ (void) LookupTupleHashEntry(node->hashtable, slot, &isnew, NULL);
node->havehashrows = true;
}
else if (node->hashnulls)
{
- (void) LookupTupleHashEntry(node->hashnulls, slot, &isnew);
+ (void) LookupTupleHashEntry(node->hashnulls, slot, &isnew, NULL);
node->havenullrows = true;
}
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c7deeac662f..415e117407c 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -139,7 +139,7 @@ extern TupleHashTable BuildTupleHashTableExt(PlanState *parent,
MemoryContext tempcxt, bool use_variable_hash_iv);
extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
- bool *isnew);
+ bool *isnew, uint32 *hash);
extern uint32 TupleHashTableHash(TupleHashTable hashtable,
TupleTableSlot *slot);
extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable,
Hi,
On June 10, 2020 6:15:39 PM PDT, Jeff Davis <pgsql@j-davis.com> wrote:
On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:
Hi,
While preparing my pgcon talk I noticed that our hash-agg performance degraded noticeably. Looks to me like it's due to the spilling-hashagg changes.
Attached a proposed fix. (Might require some minor cleanup).
The only awkward part is that LookupTupleHashEntry() needs a new out
parameter to pass the hash value back to the caller. Ordinarily, the
caller can get that from the returned entry, but if isnew==NULL, then
the function might return NULL (and the caller wouldn't have an entry
from which to read the hash value).
Great!
Did you run any performance tests?
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, 2020-06-11 at 10:45 -0700, Andres Freund wrote:
Did you run any performance tests?
Yes, I reproduced your ~12% regression from V12, and this patch nearly
eliminated it for me.
Regards,
Jeff Davis
Hi,
On 2020-06-11 11:14:02 -0700, Jeff Davis wrote:
On Thu, 2020-06-11 at 10:45 -0700, Andres Freund wrote:
Did you run any performance tests?
Yes, I reproduced your ~12% regression from V12, and this patch nearly
eliminated it for me.
I spent a fair bit of time looking at the difference. Jeff had let me
know on chat that he was still seeing some difference, but couldn't
quite figure out where that was.
Trying it out myself, I observed that the patch helped, but not that
much. After a bit I found one major reason why:
LookupTupleHashEntryHash() assigned the hash to the pointer provided by the caller before doing the insertion. That ended up causing a pipeline stall (I assume it's store forwarding, but I'm not sure). Moving the assignment to the caller's variable to after the insertion got rid of that.
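In code terms, the difference is just this reordering inside LookupTupleHashEntry() (condensed from Jeff's refactor-hashlookup.patch above and the attached aggspeed.diff):

    /* before: publish the hash to the caller, then do the insertion */
    local_hash = TupleHashTableHash_internal(hashtable->hashtab, NULL);
    if (hash != NULL)
        *hash = local_hash;        /* this store stalled the insertion below */
    entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, local_hash);

    /* after: do the insertion first, hand the hash back afterwards */
    local_hash = TupleHashTableHash_internal(hashtable->hashtab, NULL);
    entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, local_hash);
    if (hash != NULL)
        *hash = local_hash;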
It got within 3-4% after that change. I did a number of small microoptimizations that each helped, but didn't quite get to the level of v12.
Finally I figured out that that's due to an issue outside of nodeAgg.c
itself:
commit 4cad2534da6d17067d98cf04be2dfc1bda8f2cd0
Author: Tomas Vondra <tomas.vondra@postgresql.org>
Date: 2020-05-31 14:43:13 +0200
Use CP_SMALL_TLIST for hash aggregate
Due to this change we end up with an additional projection in queries
like this:
postgres[212666][1]=# \d fewgroups_many_rows
Table "public.fewgroups_many_rows"
┌────────┬─────────┬───────────┬──────────┬─────────┐
│ Column │ Type │ Collation │ Nullable │ Default │
├────────┼─────────┼───────────┼──────────┼─────────┤
│ cat │ integer │ │ not null │ │
│ val │ integer │ │ not null │ │
└────────┴─────────┴───────────┴──────────┴─────────┘
postgres[212666][1]=# explain SELECT cat, count(*) FROM fewgroups_many_rows GROUP BY 1;
┌───────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├───────────────────────────────────────────────────────────────────────────────────────┤
│ HashAggregate (cost=1942478.48..1942478.53 rows=5 width=12) │
│ Group Key: cat │
│ -> Seq Scan on fewgroups_many_rows (cost=0.00..1442478.32 rows=100000032 width=4) │
└───────────────────────────────────────────────────────────────────────────────────────┘
(3 rows)
as 'val' is "projected away"..
After neutering the tlist change, Jeff's patch and my changes to it
yield performance *above* v12.
I don't see why it's ok to force an additional projection in the very
common case of hashaggs over a few rows. So I think we need to rethink
4cad2534da6.
Greetings,
Andres Freund
Attachment: aggspeed.diff (text/x-diff)
diff --git i/src/include/executor/executor.h w/src/include/executor/executor.h
index c7deeac662f..415e117407c 100644
--- i/src/include/executor/executor.h
+++ w/src/include/executor/executor.h
@@ -139,7 +139,7 @@ extern TupleHashTable BuildTupleHashTableExt(PlanState *parent,
MemoryContext tempcxt, bool use_variable_hash_iv);
extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
- bool *isnew);
+ bool *isnew, uint32 *hash);
extern uint32 TupleHashTableHash(TupleHashTable hashtable,
TupleTableSlot *slot);
extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable,
diff --git i/src/backend/executor/execGrouping.c w/src/backend/executor/execGrouping.c
index 8be36ca7634..1e582832ea0 100644
--- i/src/backend/executor/execGrouping.c
+++ w/src/backend/executor/execGrouping.c
@@ -22,11 +22,12 @@
#include "utils/memutils.h"
static int TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
-static uint32 TupleHashTableHash_internal(struct tuplehash_hash *tb,
+static inline uint32 TupleHashTableHash_internal(struct tuplehash_hash *tb,
const MinimalTuple tuple);
-static TupleHashEntry LookupTupleHashEntry_internal(TupleHashTable hashtable,
- TupleTableSlot *slot,
- bool *isnew, uint32 hash);
+static inline TupleHashEntry LookupTupleHashEntry_internal(
+ TupleHashTable hashtable,
+ TupleTableSlot *slot,
+ bool *isnew, uint32 hash);
/*
* Define parameters for tuple hash table code generation. The interface is
@@ -291,6 +292,9 @@ ResetTupleHashTable(TupleHashTable hashtable)
* If isnew is NULL, we do not create new entries; we return NULL if no
* match is found.
*
+ * If hash is not NULL, we set it to the calculated hash value. This allows
+ * callers access to the hash value even if no entry is returned.
+ *
* If isnew isn't NULL, then a new entry is created if no existing entry
* matches. On return, *isnew is true if the entry is newly created,
* false if it existed already. ->additional_data in the new entry has
@@ -298,11 +302,11 @@ ResetTupleHashTable(TupleHashTable hashtable)
*/
TupleHashEntry
LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
- bool *isnew)
+ bool *isnew, uint32 *hash)
{
TupleHashEntry entry;
MemoryContext oldContext;
- uint32 hash;
+ uint32 local_hash;
/* Need to run the hash functions in short-lived context */
oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
@@ -312,8 +316,14 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
hashtable->cur_eq_func = hashtable->tab_eq_func;
- hash = TupleHashTableHash_internal(hashtable->hashtab, NULL);
- entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
+ local_hash = TupleHashTableHash_internal(hashtable->hashtab, NULL);
+
+ entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, local_hash);
+
+ if (hash != NULL)
+ *hash = local_hash;
+
+ Assert(entry == NULL || entry->hash == local_hash);
MemoryContextSwitchTo(oldContext);
@@ -362,6 +372,7 @@ LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot,
hashtable->cur_eq_func = hashtable->tab_eq_func;
entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
+ Assert(entry == NULL || entry->hash == hash);
MemoryContextSwitchTo(oldContext);
@@ -480,7 +491,7 @@ TupleHashTableHash_internal(struct tuplehash_hash *tb,
* NB: This function may or may not change the memory context. Caller is
* expected to change it back.
*/
-static TupleHashEntry
+static inline TupleHashEntry
LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot,
bool *isnew, uint32 hash)
{
diff --git i/src/backend/executor/nodeAgg.c w/src/backend/executor/nodeAgg.c
index 331acee2814..1b44b9f00de 100644
--- i/src/backend/executor/nodeAgg.c
+++ w/src/backend/executor/nodeAgg.c
@@ -382,7 +382,9 @@ static void finalize_partialaggregate(AggState *aggstate,
AggStatePerAgg peragg,
AggStatePerGroup pergroupstate,
Datum *resultVal, bool *resultIsNull);
-static void prepare_hash_slot(AggState *aggstate);
+static inline void prepare_hash_slot(AggStatePerHash perhash,
+ TupleTableSlot *inputslot,
+ TupleTableSlot *hashslot);
static void prepare_projection_slot(AggState *aggstate,
TupleTableSlot *slot,
int currentSet);
@@ -403,8 +405,9 @@ static int hash_choose_num_partitions(uint64 input_groups,
double hashentrysize,
int used_bits,
int *log2_npartittions);
-static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash,
- bool *in_hash_table);
+static void initialize_hash_entry(AggState *aggstate,
+ TupleHashTable hashtable,
+ TupleHashEntry entry);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
@@ -1197,12 +1200,11 @@ finalize_partialaggregate(AggState *aggstate,
* Extract the attributes that make up the grouping key into the
* hashslot. This is necessary to compute the hash or perform a lookup.
*/
-static void
-prepare_hash_slot(AggState *aggstate)
+static inline void
+prepare_hash_slot(AggStatePerHash perhash,
+ TupleTableSlot *inputslot,
+ TupleTableSlot *hashslot)
{
- TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
- TupleTableSlot *hashslot = perhash->hashslot;
int i;
/* transfer just the needed columns into hashslot */
@@ -1979,75 +1981,39 @@ hash_choose_num_partitions(uint64 input_groups, double hashentrysize,
}
/*
- * Find or create a hashtable entry for the tuple group containing the current
- * tuple (already set in tmpcontext's outertuple slot), in the current grouping
- * set (which the caller must have selected - note that initialize_aggregate
- * depends on this).
- *
- * When called, CurrentMemoryContext should be the per-query context. The
- * already-calculated hash value for the tuple must be specified.
- *
- * If in "spill mode", then only find existing hashtable entries; don't create
- * new ones. If a tuple's group is not already present in the hash table for
- * the current grouping set, assign *in_hash_table=false and the caller will
- * spill it to disk.
+ * Initialize a freshly-created TupleHashEntry.
*/
-static AggStatePerGroup
-lookup_hash_entry(AggState *aggstate, uint32 hash, bool *in_hash_table)
+static void
+initialize_hash_entry(AggState *aggstate, TupleHashTable hashtable,
+ TupleHashEntry entry)
{
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
- TupleTableSlot *hashslot = perhash->hashslot;
- TupleHashEntryData *entry;
- bool isnew = false;
- bool *p_isnew;
+ AggStatePerGroup pergroup;
+ int transno;
- /* if hash table already spilled, don't create new entries */
- p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;
+ aggstate->hash_ngroups_current++;
+ hash_agg_check_limits(aggstate);
- /* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, p_isnew,
- hash);
+ /* no need to allocate or initialize per-group state */
+ if (aggstate->numtrans == 0)
+ return;
- if (entry == NULL)
+ pergroup = (AggStatePerGroup)
+ MemoryContextAlloc(hashtable->tablecxt,
+ sizeof(AggStatePerGroupData) * aggstate->numtrans);
+
+ entry->additional = pergroup;
+
+ /*
+ * Initialize aggregates for new tuple group, lookup_hash_entries()
+ * already has selected the relevant grouping set.
+ */
+ for (transno = 0; transno < aggstate->numtrans; transno++)
{
- *in_hash_table = false;
- return NULL;
+ AggStatePerTrans pertrans = &aggstate->pertrans[transno];
+ AggStatePerGroup pergroupstate = &pergroup[transno];
+
+ initialize_aggregate(aggstate, pertrans, pergroupstate);
}
- else
- *in_hash_table = true;
-
- if (isnew)
- {
- AggStatePerGroup pergroup;
- int transno;
-
- aggstate->hash_ngroups_current++;
- hash_agg_check_limits(aggstate);
-
- /* no need to allocate or initialize per-group state */
- if (aggstate->numtrans == 0)
- return NULL;
-
- pergroup = (AggStatePerGroup)
- MemoryContextAlloc(perhash->hashtable->tablecxt,
- sizeof(AggStatePerGroupData) * aggstate->numtrans);
-
- entry->additional = pergroup;
-
- /*
- * Initialize aggregates for new tuple group, lookup_hash_entries()
- * already has selected the relevant grouping set.
- */
- for (transno = 0; transno < aggstate->numtrans; transno++)
- {
- AggStatePerTrans pertrans = &aggstate->pertrans[transno];
- AggStatePerGroup pergroupstate = &pergroup[transno];
-
- initialize_aggregate(aggstate, pertrans, pergroupstate);
- }
- }
-
- return entry->additional;
}
/*
@@ -2072,21 +2038,37 @@ static void
lookup_hash_entries(AggState *aggstate)
{
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
+ TupleTableSlot *outerslot = aggstate->tmpcontext->ecxt_outertuple;
int setno;
for (setno = 0; setno < aggstate->num_hashes; setno++)
{
AggStatePerHash perhash = &aggstate->perhash[setno];
+ TupleHashTable hashtable = perhash->hashtable;
+ TupleTableSlot *hashslot = perhash->hashslot;
+ TupleHashEntry entry;
uint32 hash;
- bool in_hash_table;
+ bool isnew = false;
+ bool *p_isnew;
+
+ /* if hash table already spilled, don't create new entries */
+ p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;
select_current_set(aggstate, setno, true);
- prepare_hash_slot(aggstate);
- hash = TupleHashTableHash(perhash->hashtable, perhash->hashslot);
- pergroup[setno] = lookup_hash_entry(aggstate, hash, &in_hash_table);
+ prepare_hash_slot(perhash,
+ outerslot,
+ hashslot);
- /* check to see if we need to spill the tuple for this grouping set */
- if (!in_hash_table)
+ entry = LookupTupleHashEntry(hashtable, hashslot,
+ p_isnew, &hash);
+
+ if (entry != NULL)
+ {
+ if (isnew)
+ initialize_hash_entry(aggstate, hashtable, entry);
+ pergroup[setno] = entry->additional;
+ }
+ else
{
HashAggSpill *spill = &aggstate->hash_spills[setno];
TupleTableSlot *slot = aggstate->tmpcontext->ecxt_outertuple;
@@ -2097,6 +2079,7 @@ lookup_hash_entries(AggState *aggstate)
aggstate->hashentrysize);
hashagg_spill_tuple(spill, slot, hash);
+ pergroup[setno] = NULL;
}
}
}
@@ -2554,6 +2537,7 @@ static bool
agg_refill_hash_table(AggState *aggstate)
{
HashAggBatch *batch;
+ AggStatePerHash perhash;
HashAggSpill spill;
HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
uint64 ngroups_estimate;
@@ -2605,6 +2589,8 @@ agg_refill_hash_table(AggState *aggstate)
select_current_set(aggstate, batch->setno, true);
+ perhash = &aggstate->perhash[aggstate->current_set];
+
/*
* Spilled tuples are always read back as MinimalTuples, which may be
* different from the outer plan, so recompile the aggregate expressions.
@@ -2618,10 +2604,13 @@ agg_refill_hash_table(AggState *aggstate)
HASHAGG_READ_BUFFER_SIZE);
for (;;)
{
- TupleTableSlot *slot = aggstate->hash_spill_slot;
- MinimalTuple tuple;
- uint32 hash;
- bool in_hash_table;
+ TupleTableSlot *spillslot = aggstate->hash_spill_slot;
+ TupleTableSlot *hashslot = perhash->hashslot;
+ TupleHashEntry entry;
+ MinimalTuple tuple;
+ uint32 hash;
+ bool isnew = false;
+ bool *p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;
CHECK_FOR_INTERRUPTS();
@@ -2629,16 +2618,20 @@ agg_refill_hash_table(AggState *aggstate)
if (tuple == NULL)
break;
- ExecStoreMinimalTuple(tuple, slot, true);
- aggstate->tmpcontext->ecxt_outertuple = slot;
+ ExecStoreMinimalTuple(tuple, spillslot, true);
+ aggstate->tmpcontext->ecxt_outertuple = spillslot;
- prepare_hash_slot(aggstate);
- aggstate->hash_pergroup[batch->setno] =
- lookup_hash_entry(aggstate, hash, &in_hash_table);
+ prepare_hash_slot(perhash,
+ aggstate->tmpcontext->ecxt_outertuple,
+ hashslot);
+ entry = LookupTupleHashEntryHash(
+ perhash->hashtable, hashslot, p_isnew, hash);
- if (in_hash_table)
+ if (entry != NULL)
{
- /* Advance the aggregates (or combine functions) */
+ if (isnew)
+ initialize_hash_entry(aggstate, perhash->hashtable, entry);
+ aggstate->hash_pergroup[batch->setno] = entry->additional;
advance_aggregates(aggstate);
}
else
@@ -2654,7 +2647,9 @@ agg_refill_hash_table(AggState *aggstate)
ngroups_estimate, aggstate->hashentrysize);
}
/* no memory for a new group, spill */
- hashagg_spill_tuple(&spill, slot, hash);
+ hashagg_spill_tuple(&spill, spillslot, hash);
+
+ aggstate->hash_pergroup[batch->setno] = NULL;
}
/*
diff --git i/src/backend/executor/nodeRecursiveunion.c w/src/backend/executor/nodeRecursiveunion.c
index 620414a1edc..046242682f0 100644
--- i/src/backend/executor/nodeRecursiveunion.c
+++ w/src/backend/executor/nodeRecursiveunion.c
@@ -94,7 +94,7 @@ ExecRecursiveUnion(PlanState *pstate)
if (plan->numCols > 0)
{
/* Find or build hashtable entry for this tuple's group */
- LookupTupleHashEntry(node->hashtable, slot, &isnew);
+ LookupTupleHashEntry(node->hashtable, slot, &isnew, NULL);
/* Must reset temp context after each hashtable lookup */
MemoryContextReset(node->tempContext);
/* Ignore tuple if already seen */
@@ -141,7 +141,7 @@ ExecRecursiveUnion(PlanState *pstate)
if (plan->numCols > 0)
{
/* Find or build hashtable entry for this tuple's group */
- LookupTupleHashEntry(node->hashtable, slot, &isnew);
+ LookupTupleHashEntry(node->hashtable, slot, &isnew, NULL);
/* Must reset temp context after each hashtable lookup */
MemoryContextReset(node->tempContext);
/* Ignore tuple if already seen */
diff --git i/src/backend/executor/nodeSetOp.c w/src/backend/executor/nodeSetOp.c
index bfd148a41a2..8d4ccff19cc 100644
--- i/src/backend/executor/nodeSetOp.c
+++ w/src/backend/executor/nodeSetOp.c
@@ -381,7 +381,7 @@ setop_fill_hash_table(SetOpState *setopstate)
/* Find or build hashtable entry for this tuple's group */
entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
- &isnew);
+ &isnew, NULL);
/* If new tuple group, initialize counts */
if (isnew)
@@ -402,7 +402,7 @@ setop_fill_hash_table(SetOpState *setopstate)
/* For tuples not seen previously, do not make hashtable entry */
entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
- NULL);
+ NULL, NULL);
/* Advance the counts if entry is already present */
if (entry)
diff --git i/src/backend/executor/nodeSubplan.c w/src/backend/executor/nodeSubplan.c
index 298b7757f57..38c2fc0b50b 100644
--- i/src/backend/executor/nodeSubplan.c
+++ w/src/backend/executor/nodeSubplan.c
@@ -595,12 +595,12 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
*/
if (slotNoNulls(slot))
{
- (void) LookupTupleHashEntry(node->hashtable, slot, &isnew);
+ (void) LookupTupleHashEntry(node->hashtable, slot, &isnew, NULL);
node->havehashrows = true;
}
else if (node->hashnulls)
{
- (void) LookupTupleHashEntry(node->hashnulls, slot, &isnew);
+ (void) LookupTupleHashEntry(node->hashnulls, slot, &isnew, NULL);
node->havenullrows = true;
}
diff --git i/src/backend/optimizer/plan/createplan.c w/src/backend/optimizer/plan/createplan.c
index eb9543f6add..0778ff29ab4 100644
--- i/src/backend/optimizer/plan/createplan.c
+++ w/src/backend/optimizer/plan/createplan.c
@@ -2124,9 +2124,11 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
*/
flags = CP_LABEL_TLIST;
+#if 0
/* ensure small tlist for hash aggregate */
if (best_path->aggstrategy == AGG_HASHED)
flags |= CP_SMALL_TLIST;
+#endif
subplan = create_plan_recurse(root, best_path->subpath, flags);
On Fri, 2020-06-12 at 14:37 -0700, Andres Freund wrote:
I don't see why it's ok to force an additional projection in the very common case of hashaggs over a few rows. So I think we need to rethink 4cad2534da6.
One possibility is to project only spilled tuples, more similar to
Melanie's patch from a while ago:
/messages/by-id/CAAKRu_aefEsv+UkQWqu+ioEnoiL2LJu9Diuh9BR8MbyXuZ0j4A@mail.gmail.com
Which makes sense, but it's also more code.
Regards,
Jeff Davis
On Fri, Jun 12, 2020 at 03:29:08PM -0700, Jeff Davis wrote:
On Fri, 2020-06-12 at 14:37 -0700, Andres Freund wrote:
I don't see why it's ok to force an additional projection in the very common case of hashaggs over a few rows. So I think we need to rethink 4cad2534da6.
One possibility is to project only spilled tuples, more similar to Melanie's patch from a while ago:
/messages/by-id/CAAKRu_aefEsv+UkQWqu+ioEnoiL2LJu9Diuh9BR8MbyXuZ0j4A@mail.gmail.com
Which makes sense, but it's also more code.
I agree, we should revert 4cad2534da and only project tuples when we
actually need to spill them. Did any of the WIP patches actually
implement that, or do we need to write that patch from scratch?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2020-06-13 01:06:25 +0200, Tomas Vondra wrote:
I agree, we should revert 4cad2534da and only project tuples when we
actually need to spill them.
There are cases where projecting helps for non-spilling aggregates too,
but only for the representative tuple. It doesn't help in the case at
hand, because there's just 5 hashtable entries but millions of rows. So
we're unnecessarily projecting all but 5 of the rows. But when there are many
different groups, it'd be different, because then the size of the
representative tuple can matter substantially.
Do you think we should tackle this for 13? To me 4cad2534da seems like a
somewhat independent improvement to spillable hashaggs.
Greetings,
Andres Freund
On Fri, 2020-06-12 at 17:12 -0700, Andres Freund wrote:
Do you think we should tackle this for 13? To me 4cad2534da seems
like a
somewhat independent improvement to spillable hashaggs.
We've gone back and forth on this issue a few times, so let's try to
get some agreement before we revert 4cad2534da. I added Robert because
he also seemed to think it was a reasonable idea.
Regards,
Jeff Davis
On Sat, Jun 13, 2020 at 11:48:09AM -0700, Jeff Davis wrote:
On Fri, 2020-06-12 at 17:12 -0700, Andres Freund wrote:
Do you think we should tackle this for 13? To me 4cad2534da seems like a somewhat independent improvement to spillable hashaggs.
We've gone back and forth on this issue a few times, so let's try to
get some agreement before we revert 4cad2534da. I added Robert because
he also seemed to think it was a reasonable idea.
I can't speak for Robert, but I hadn't expected the extra projection to cost this much. And I agree with Andres that it's not very nice that we have to do this even for aggregates with just a handful of groups that don't need to spill.
In any case, I think we need to address this somehow for v13 - either we
keep the 4cad2534da patch in, or we tweak the cost model to reflect the
extra I/O costs, or we project only when spilling.
I'm not in a position to whip up a patch soon, though :-(
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2020-06-12 15:29:08 -0700, Jeff Davis wrote:
On Fri, 2020-06-12 at 14:37 -0700, Andres Freund wrote:
I don't see why it's ok to force an additional projection in the very common case of hashaggs over a few rows. So I think we need to rethink 4cad2534da6.
One possibility is to project only spilled tuples, more similar to Melanie's patch from a while ago:
/messages/by-id/CAAKRu_aefEsv+UkQWqu+ioEnoiL2LJu9Diuh9BR8MbyXuZ0j4A@mail.gmail.com
Which makes sense, but it's also more code.
I'm somewhat inclined to think that we should revert 4cad2534da6 and
then look at how precisely to tackle this in 14.
It'd probably make sense to request small tlists when the number of
estimated groups is large, and not otherwise.
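A minimal, untested sketch of that heuristic in create_agg_plan(), where the group-count cutoff is a made-up placeholder:

    flags = CP_LABEL_TLIST;

    /*
     * Only request a narrow tlist (which forces an extra projection) when
     * the estimated number of groups is large enough that spilling, and
     * therefore writing wide tuples to disk, is a realistic possibility.
     */
    if (best_path->aggstrategy == AGG_HASHED &&
        best_path->numGroups > 10000)   /* hypothetical threshold */
        flags |= CP_SMALL_TLIST;

    subplan = create_plan_recurse(root, best_path->subpath, flags);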
Greetings,
Andres Freund
On Sun, 2020-06-14 at 11:14 -0700, Andres Freund wrote:
I'm somewhat inclined to think that we should revert 4cad2534da6 and
then look at how precisely to tackle this in 14.
I'm fine with that.
It'd probably make sense to request small tlists when the number of
estimated groups is large, and not otherwise.
That seems like a nice compromise that would be non-invasive, at least
for create_agg_plan().
Regards,
Jeff Davis
On Sun, Jun 14, 2020 at 11:09:55PM -0700, Jeff Davis wrote:
On Sun, 2020-06-14 at 11:14 -0700, Andres Freund wrote:
I'm somewhat inclined to think that we should revert 4cad2534da6 and
then look at how precisely to tackle this in 14.
I'm fine with that.
I don't see how we could just revert 4cad2534d and leave this for v14.
The hashagg spilling is IMHO almost guaranteed to be a pain point for
some users, as it will force some queries to serialize large amounts of
data. Yes, some of this is the cost of hashagg enforcing work_mem at runtime, and I'm fine with that. We'd get reports about that too, but we can justify that cost ...
But just reverting 4cad2534d will make this much worse, I think, as
illustrated by the benchmarks I did in [1]. And no, this is not really
fixable by tweaking the cost parameters - even with the current code
(i.e. 4cad2534d in place) I had to increase random_page_cost to 60 on
the temp tablespace (on SATA RAID) to get good plans with parallelism
enabled. I haven't tried, but I presume without 4cad2534d I'd have to
push r_p_c even further ...
[1]: /messages/by-id/20200519151202.u2p2gpiawoaznsv2@development
It'd probably make sense to request small tlists when the number of
estimated groups is large, and not otherwise.
That seems like a nice compromise that would be non-invasive, at least
for create_agg_plan().
Maybe. It'd certainly be better than nothing. It's not clear to me what a good threshold would be, though. And it's not going to handle cases of under-estimates.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
But just reverting 4cad2534d will make this much worse, I think, as
illustrated by the benchmarks I did in [1].
I share this concern, although I do not know what we should do about it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
But just reverting 4cad2534d will make this much worse, I think, as
illustrated by the benchmarks I did in [1].
I share this concern, although I do not know what we should do about it.
Well, it's only June. Let's put it on the open issues list for v13
and continue to think about it. I concur that the hashagg spill patch
has made this something that we should worry about for v13, so just
reverting without a better answer isn't very appetizing.
regards, tom lane
On Mon, 2020-06-15 at 11:19 -0400, Robert Haas wrote:
On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
But just reverting 4cad2534d will make this much worse, I think, as illustrated by the benchmarks I did in [1].
I share this concern, although I do not know what we should do about it.
I attached an updated version of Melanie's patch, combined with the
changes to copy only the necessary attributes to a new slot before
spilling. There are a couple of changes:
* I didn't see a reason to descend into a GroupingFunc node, so I
removed that.
* I used a flag in the context rather than two separate callbacks to
the expression walker.
This patch gives the space benefits that we see on master, without the
regression for small numbers of tuples. I saw a little bit of noise in
my test results, but I'm pretty sure it's a win all around. It could
use some review/cleanup though.
Regards,
Jeff Davis
Attachment: hashagg-spill-project.patch (text/x-patch)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 331acee2814..5b5aa96c517 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -358,6 +358,14 @@ typedef struct HashAggBatch
int64 input_tuples; /* number of tuples in this batch */
} HashAggBatch;
+/* used to find referenced colnos */
+typedef struct FindColsContext
+{
+ bool is_aggref; /* is under an aggref */
+ Bitmapset *aggregated; /* column references under an aggref */
+ Bitmapset *unaggregated; /* other column references */
+} FindColsContext;
+
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
static void initialize_phase(AggState *aggstate, int newphase);
static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -390,8 +398,9 @@ static void finalize_aggregates(AggState *aggstate,
AggStatePerAgg peragg,
AggStatePerGroup pergroup);
static TupleTableSlot *project_aggregates(AggState *aggstate);
-static Bitmapset *find_unaggregated_cols(AggState *aggstate);
-static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
+static void find_cols(AggState *aggstate, Bitmapset **aggregated,
+ Bitmapset **unaggregated);
+static bool find_cols_walker(Node *node, FindColsContext *context);
static void build_hash_tables(AggState *aggstate);
static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
@@ -424,8 +433,8 @@ static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
int used_bits, uint64 input_tuples,
double hashentrysize);
-static Size hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot,
- uint32 hash);
+static Size hashagg_spill_tuple(AggState *aggstate, HashAggSpill *spill,
+ TupleTableSlot *slot, uint32 hash);
static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
int setno);
static void hashagg_tapeinfo_init(AggState *aggstate);
@@ -1374,26 +1383,28 @@ project_aggregates(AggState *aggstate)
}
/*
- * find_unaggregated_cols
- * Construct a bitmapset of the column numbers of un-aggregated Vars
- * appearing in our targetlist and qual (HAVING clause)
+ * Walk tlist and qual to find referenced colnos, dividing them into
+ * aggregated and unaggregated sets.
*/
-static Bitmapset *
-find_unaggregated_cols(AggState *aggstate)
+static void
+find_cols(AggState *aggstate, Bitmapset **aggregated, Bitmapset **unaggregated)
{
- Agg *node = (Agg *) aggstate->ss.ps.plan;
- Bitmapset *colnos;
-
- colnos = NULL;
- (void) find_unaggregated_cols_walker((Node *) node->plan.targetlist,
- &colnos);
- (void) find_unaggregated_cols_walker((Node *) node->plan.qual,
- &colnos);
- return colnos;
+ Agg *agg = (Agg *) aggstate->ss.ps.plan;
+ FindColsContext context;
+
+ context.is_aggref = false;
+ context.aggregated = NULL;
+ context.unaggregated = NULL;
+
+ (void) find_cols_walker((Node *) agg->plan.targetlist, &context);
+ (void) find_cols_walker((Node *) agg->plan.qual, &context);
+
+ *aggregated = context.aggregated;
+ *unaggregated = context.unaggregated;
}
static bool
-find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
+find_cols_walker(Node *node, FindColsContext *context)
{
if (node == NULL)
return false;
@@ -1404,16 +1415,24 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
/* setrefs.c should have set the varno to OUTER_VAR */
Assert(var->varno == OUTER_VAR);
Assert(var->varlevelsup == 0);
- *colnos = bms_add_member(*colnos, var->varattno);
+ if (context->is_aggref)
+ context->aggregated = bms_add_member(context->aggregated,
+ var->varattno);
+ else
+ context->unaggregated = bms_add_member(context->unaggregated,
+ var->varattno);
return false;
}
- if (IsA(node, Aggref) || IsA(node, GroupingFunc))
+ if (IsA(node, Aggref))
{
- /* do not descend into aggregate exprs */
+ Assert(!context->is_aggref);
+ context->is_aggref = true;
+ expression_tree_walker(node, find_cols_walker, (void *) context);
+ context->is_aggref = false;
return false;
}
- return expression_tree_walker(node, find_unaggregated_cols_walker,
- (void *) colnos);
+ return expression_tree_walker(node, find_cols_walker,
+ (void *) context);
}
/*
@@ -1531,13 +1550,27 @@ static void
find_hash_columns(AggState *aggstate)
{
Bitmapset *base_colnos;
+ Bitmapset *aggregated_colnos;
+ TupleDesc scanDesc = aggstate->ss.ss_ScanTupleSlot->tts_tupleDescriptor;
List *outerTlist = outerPlanState(aggstate)->plan->targetlist;
int numHashes = aggstate->num_hashes;
EState *estate = aggstate->ss.ps.state;
int j;
/* Find Vars that will be needed in tlist and qual */
- base_colnos = find_unaggregated_cols(aggstate);
+ find_cols(aggstate, &aggregated_colnos, &base_colnos);
+ aggstate->colnos_needed = bms_union(base_colnos, aggregated_colnos);
+ aggstate->max_colno_needed = 0;
+ aggstate->all_cols_needed = true;
+
+ for (int i = 0; i < scanDesc->natts; i++)
+ {
+ int colno = i + 1;
+ if (bms_is_member(colno, aggstate->colnos_needed))
+ aggstate->max_colno_needed = colno;
+ else
+ aggstate->all_cols_needed = false;
+ }
for (j = 0; j < numHashes; ++j)
{
@@ -2096,7 +2129,7 @@ lookup_hash_entries(AggState *aggstate)
perhash->aggnode->numGroups,
aggstate->hashentrysize);
- hashagg_spill_tuple(spill, slot, hash);
+ hashagg_spill_tuple(aggstate, spill, slot, hash);
}
}
}
@@ -2618,7 +2651,7 @@ agg_refill_hash_table(AggState *aggstate)
HASHAGG_READ_BUFFER_SIZE);
for (;;)
{
- TupleTableSlot *slot = aggstate->hash_spill_slot;
+ TupleTableSlot *slot = aggstate->hash_spill_rslot;
MinimalTuple tuple;
uint32 hash;
bool in_hash_table;
@@ -2654,7 +2687,7 @@ agg_refill_hash_table(AggState *aggstate)
ngroups_estimate, aggstate->hashentrysize);
}
/* no memory for a new group, spill */
- hashagg_spill_tuple(&spill, slot, hash);
+ hashagg_spill_tuple(aggstate, &spill, slot, hash);
}
/*
@@ -2933,9 +2966,11 @@ hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo, int used_bits,
* partition.
*/
static Size
-hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot, uint32 hash)
+hashagg_spill_tuple(AggState *aggstate, HashAggSpill *spill,
+ TupleTableSlot *inputslot, uint32 hash)
{
LogicalTapeSet *tapeset = spill->tapeset;
+ TupleTableSlot *spillslot;
int partition;
MinimalTuple tuple;
int tapenum;
@@ -2944,8 +2979,28 @@ hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot, uint32 hash)
Assert(spill->partitions != NULL);
- /* XXX: may contain unnecessary attributes, should project */
- tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+ /* spill only attributes that we actually need */
+ if (!aggstate->all_cols_needed)
+ {
+ spillslot = aggstate->hash_spill_wslot;
+ slot_getsomeattrs(inputslot, aggstate->max_colno_needed);
+ ExecClearTuple(spillslot);
+ for (int i = 0; i < spillslot->tts_tupleDescriptor->natts; i++)
+ {
+ if (bms_is_member(i + 1, aggstate->colnos_needed))
+ {
+ spillslot->tts_values[i] = inputslot->tts_values[i];
+ spillslot->tts_isnull[i] = inputslot->tts_isnull[i];
+ }
+ else
+ spillslot->tts_isnull[i] = true;
+ }
+ ExecStoreVirtualTuple(spillslot);
+ }
+ else
+ spillslot = inputslot;
+
+ tuple = ExecFetchSlotMinimalTuple(spillslot, &shouldFree);
partition = (hash & spill->mask) >> spill->shift;
spill->ntuples[partition]++;
@@ -3562,8 +3617,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->hash_metacxt = AllocSetContextCreate(aggstate->ss.ps.state->es_query_cxt,
"HashAgg meta context",
ALLOCSET_DEFAULT_SIZES);
- aggstate->hash_spill_slot = ExecInitExtraTupleSlot(estate, scanDesc,
- &TTSOpsMinimalTuple);
+ aggstate->hash_spill_rslot = ExecInitExtraTupleSlot(estate, scanDesc,
+ &TTSOpsMinimalTuple);
+ aggstate->hash_spill_wslot = ExecInitExtraTupleSlot(estate, scanDesc,
+ &TTSOpsVirtual);
/* this is an array of pointers, not structures */
aggstate->hash_pergroup = pergroups;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 98e0072b8ad..32d6741c5d5 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2148,6 +2148,9 @@ typedef struct AggState
int current_set; /* The current grouping set being evaluated */
Bitmapset *grouped_cols; /* grouped cols in current projection */
List *all_grouped_cols; /* list of all grouped cols in DESC order */
+ Bitmapset *colnos_needed; /* all columns needed from the outer plan */
+ int max_colno_needed; /* highest colno needed from outer plan */
+ bool all_cols_needed; /* are all cols from outer plan needed? */
/* These fields are for grouping set phase data */
int maxsets; /* The max number of sets in any phase */
AggStatePerPhase phases; /* array of all phases */
@@ -2165,7 +2168,8 @@ typedef struct AggState
struct HashTapeInfo *hash_tapeinfo; /* metadata for spill tapes */
struct HashAggSpill *hash_spills; /* HashAggSpill for each grouping set,
* exists only during first pass */
- TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
+ TupleTableSlot *hash_spill_rslot; /* for reading spill files */
+ TupleTableSlot *hash_spill_wslot; /* for writing spill files */
List *hash_batches; /* hash batches remaining to be processed */
bool hash_ever_spilled; /* ever spilled during this execution? */
bool hash_spill_mode; /* we hit a limit during the current batch
@@ -2186,7 +2190,7 @@ typedef struct AggState
* per-group pointers */
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 49
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 53
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
On Mon, Jun 15, 2020 at 07:38:45PM -0700, Jeff Davis wrote:
On Mon, 2020-06-15 at 11:19 -0400, Robert Haas wrote:
On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
But just reverting 4cad2534d will make this much worse, I think, as
illustrated by the benchmarks I did in [1].
I share this concern, although I do not know what we should do about
it.
I attached an updated version of Melanie's patch, combined with the
changes to copy only the necessary attributes to a new slot before
spilling. There are a couple of changes:
* I didn't see a reason to descend into a GroupingFunc node, so I
removed that.
* I used a flag in the context rather than two separate callbacks to
the expression walker.
This patch gives the space benefits that we see on master, without the
regression for small numbers of tuples. I saw a little bit of noise in
my test results, but I'm pretty sure it's a win all around. It could
use some review/cleanup though.
Looks reasonable. I can't redo my tests at the moment, the machine is
busy with something else. I'll give it a try over the weekend.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
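To make the "flag in the context rather than two separate callbacks"
idea concrete, here is a standalone toy sketch of that walker pattern.
The names are invented for illustration and are not from the patch
itself; the real code is in the diffs attached downthread. A single
callback carries an under_agg flag in its context and flips it while
descending into an aggregate node, so one pass splits column references
into aggregated and unaggregated sets.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef enum { NODE_COLREF, NODE_AGG, NODE_OP } NodeKind;

typedef struct Node
{
	NodeKind	kind;
	int			colno;			/* only meaningful for NODE_COLREF */
	struct Node *left;
	struct Node *right;
} Node;

typedef struct WalkContext
{
	bool		under_agg;		/* currently below an aggregate node? */
	uint64_t	aggregated;		/* colnos seen under an aggregate */
	uint64_t	unaggregated;	/* all other colnos */
} WalkContext;

static void
walk(Node *node, WalkContext *cxt)
{
	if (node == NULL)
		return;

	if (node->kind == NODE_COLREF)
	{
		if (cxt->under_agg)
			cxt->aggregated |= UINT64_C(1) << node->colno;
		else
			cxt->unaggregated |= UINT64_C(1) << node->colno;
		return;
	}

	if (node->kind == NODE_AGG)
	{
		/* set the flag for the subtree, then restore it */
		bool		saved = cxt->under_agg;

		cxt->under_agg = true;
		walk(node->left, cxt);
		walk(node->right, cxt);
		cxt->under_agg = saved;
		return;
	}

	walk(node->left, cxt);
	walk(node->right, cxt);
}

int
main(void)
{
	/* roughly "col1 + sum(col2)": col1 unaggregated, col2 aggregated */
	Node		c1 = {NODE_COLREF, 1, NULL, NULL};
	Node		c2 = {NODE_COLREF, 2, NULL, NULL};
	Node		agg = {NODE_AGG, 0, &c2, NULL};
	Node		top = {NODE_OP, 0, &c1, &agg};
	WalkContext cxt = {false, 0, 0};

	walk(&top, &cxt);
	printf("aggregated=0x%llx unaggregated=0x%llx\n",
		   (unsigned long long) cxt.aggregated,
		   (unsigned long long) cxt.unaggregated);
	return 0;
}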
Hi,
I think it'd be good to get the changes that aren't related to
projection merged. As far as I can tell there are performance regressions
both because of the things I'd listed upthread, and due to the
projection issue. That's not obvious because we first won
performance and then lost it again in several incremental steps.
COPY (SELECT (random() * 4)::int cat, (random()*10000)::int val FROM generate_series(1, 100000000)) TO '/tmp/data' WITH BINARY;
BEGIN;
DROP TABLE IF EXISTS fewgroups_many_rows;
CREATE TABLE fewgroups_many_rows(cat int4 not null, val int4 not null);
COPY fewgroups_many_rows FROM '/tmp/data' WITH (FORMAT BINARY, FREEZE);
COMMIT;
VACUUM FREEZE fewgroups_many_rows;
Test prep:
Test query:
SET seed=0;SELECT cat, count(*) FROM fewgroups_many_rows GROUP BY 1;
(the seed seems to reduce noise due to hashtable iv being the same)
best of six:
9e1c9f959422192bbe1b842a2a1ffaf76b080196 12031.906 ms
d52eaa094847d395f942827a6f413904e516994c 12045.487 ms
ac88807f9b227ddcd92b8be9a053094837c1b99a 11950.006 ms
36d22dd95bc87ca68e742da91f47f8826f8758c9 11769.991 ms
5ac4e9a12c6543414891cd8972b2cd36a08e40cc 11551.932 ms
1fdb7f9789c4550204cd62d1746a7deed1dc4c29 11706.948 ms
4eaea3db150af56aa2e40efe91997fd25f3b6d73 11999.908 ms
11de6c903da99a4b2220acfa776fc26c7f384ccc 11999.054 ms
b7fabe80df9a65010bfe5e5d0a979bacebfec382 12165.463 ms
2742c45080077ed3b08b810bb96341499b86d530 12137.505 ms
1f39bce021540fde00990af55b4432c55ef4b3c7 12501.764 ms
9b60c4b979bce060495e2b05ba01d1cc6bffdd2d 12389.047 ms
4cad2534da6d17067d98cf04be2dfc1bda8f2cd0 13319.786 ms
1b2c29469a58cd9086bd86e20c708eb437564a80 13330.616 ms
There's certainly some noise in here, but I think the trends are valid.
 /*
- * find_unaggregated_cols
- * Construct a bitmapset of the column numbers of un-aggregated Vars
- * appearing in our targetlist and qual (HAVING clause)
+ * Walk tlist and qual to find referenced colnos, dividing them into
+ * aggregated and unaggregated sets.
  */
-static Bitmapset *
-find_unaggregated_cols(AggState *aggstate)
+static void
+find_cols(AggState *aggstate, Bitmapset **aggregated, Bitmapset **unaggregated)
 {
It's not this patch's fault, but none, really none, of this stuff should
be in the executor.
Greetings,
Andres Freund
On Mon, Jun 22, 2020 at 9:02 PM Andres Freund <andres@anarazel.de> wrote:
 /*
- * find_unaggregated_cols
- * Construct a bitmapset of the column numbers of un-aggregated Vars
- * appearing in our targetlist and qual (HAVING clause)
+ * Walk tlist and qual to find referenced colnos, dividing them into
+ * aggregated and unaggregated sets.
  */
-static Bitmapset *
-find_unaggregated_cols(AggState *aggstate)
+static void
+find_cols(AggState *aggstate, Bitmapset **aggregated, Bitmapset **unaggregated)
{
It's not this patch's fault, but none, really none, of this stuff should
be in the executor.
Were you thinking it could be done in grouping_planner() and then the
bitmaps could be saved in the PlannedStmt?
Or would you have to wait until query_planner()? Or are you imagining
somewhere else entirely?
--
Melanie Plageman
Hi,
On 2020-06-23 09:23:57 -0700, Melanie Plageman wrote:
On Mon, Jun 22, 2020 at 9:02 PM Andres Freund <andres@anarazel.de> wrote:
It's not this patch's fault, but none, really none, of this stuff should
be in the executor.
Were you thinking it could be done in grouping_planner() and then the
bitmaps could be saved in the PlannedStmt?
Or would you have to wait until query_planner()? Or are you imagining
somewhere else entirely?
I haven't thought about it in too much detail, but I would say
create_agg_plan() et al. I guess there's some argument to be made to do
it in setrefs.c, because we already do convert_combining_aggrefs() there
(but I don't like that much).
There's no reason to do it before we actually decided on one specific
path, so doing it earlier than create_plan() seems unnecessary. And
having it in agg specific code seems better than putting it into global
routines.
There's probably an argument for having a bit more shared code between
create_agg_plan(), create_group_plan() and
create_groupingsets_plan(). But even just adding a new extract_*_cols()
call to each of those would probably be ok.
Greetings,
Andres Freund
On Tue, Jun 23, 2020 at 10:06 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2020-06-23 09:23:57 -0700, Melanie Plageman wrote:
On Mon, Jun 22, 2020 at 9:02 PM Andres Freund <andres@anarazel.de>
wrote:
It's not this patch's fault, but none, really none, of this stuff
should
be in the executor.
Were you thinking it could be done in grouping_planner() and then the
bitmaps could be saved in the PlannedStmt?
Or would you have to wait until query_planner()? Or are you imagining
somewhere else entirely?
I haven't thought about it in too much detail, but I would say
create_agg_plan() et al. I guess there's some argument to be made to do
it in setrefs.c, because we already do convert_combining_aggrefs() there
(but I don't like that much).
There's no reason to do it before we actually decided on one specific
path, so doing it earlier than create_plan() seems unnecessary. And
having it in agg specific code seems better than putting it into global
routines.
There's probably an argument for having a bit more shared code between
create_agg_plan(), create_group_plan() and
create_groupingsets_plan(). But even just adding a new extract_*_cols()
call to each of those would probably be ok.
So, my summary of this point in the context of the other discussion
upthread is:
Planner should extract the columns that hashagg will need later during
planning. Planner should not have HashAgg/MixedAgg nodes request smaller
targetlists from their children with CP_SMALL_TLIST to avoid unneeded
projection overhead.
Also, even this extraction should only be done when the number of groups
is large enough to suspect a spill.
So, I wrote a patch that extracts the columns the same way as in
ExecInitAgg, but in create_agg_plan(), and it doesn't work because we
haven't called set_plan_references() yet at that point.
Then, I wrote a patch that does this in set_upper_references(), and it
seems to work. I've attached that one.
It is basically Jeff's patch (based somewhat on my patch) which extracts
the columns in ExecInitAgg but I moved the functions over to setrefs.c
and gave them a different name.
It's not very elegant.
I shoved it in at the end of set_upper_references(), but I think there
should be a nice way to do it while setting the references for each var
instead of walking over the nodes again.
Also, I think that the bitmapsets for the colnos should maybe be put
somewhere less prominent (than in the main Agg plan node?), since they
are only used in one small place.
I tried putting both bitmaps in an array of two bitmaps in the Agg node
(since there will always be two) to make it look a bit neater, but it
was pretty confusing and error prone to remember which one was
aggregated and which one was unaggregated.
Note that I didn't do anything with costing like only extracting the
columns if there are a lot of groups.
Also, I didn't revert the CP_SMALL_TLIST change in create_agg_plan() or
create_groupingsets_plan().
Not to stir the pot, but I did notice that hashjoin uses CP_SMALL_TLIST
in create_hashjoin_plan() for the inner side sub-tree and the outer side
one if there are multiple batches. I wondered what was different about
that vs hashagg (i.e. why it is okay to do that there).
--
Melanie Plageman
Attachments:
v1-0001-Move-extracting-columns-for-hashagg-to-planner.patch (text/x-patch)
From 0b40ad205cf8fc8a83b8c65f5f18502a51c47370 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Jun 2020 17:07:14 -0700
Subject: [PATCH v1] Move extracting columns for hashagg to planner
Find the columns the executor will need to know about for hashagg
and bifurcate them into aggregated and unaggregated column bitmapsets.
Save them in the plan tree for use during execution.
This is done after set_plan_refs() so all the references are correct.
---
src/backend/executor/nodeAgg.c | 117 +++++++++++++--------------
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 2 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/plan/setrefs.c | 69 ++++++++++++++++
src/include/nodes/execnodes.h | 8 +-
src/include/nodes/plannodes.h | 2 +
7 files changed, 140 insertions(+), 62 deletions(-)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index a20554ae65..2f340e4031 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -359,6 +359,14 @@ typedef struct HashAggBatch
int64 input_tuples; /* number of tuples in this batch */
} HashAggBatch;
+/* used to find referenced colnos */
+typedef struct FindColsContext
+{
+ bool is_aggref; /* is under an aggref */
+ Bitmapset *aggregated; /* column references under an aggref */
+ Bitmapset *unaggregated; /* other column references */
+} FindColsContext;
+
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
static void initialize_phase(AggState *aggstate, int newphase);
static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -391,8 +399,6 @@ static void finalize_aggregates(AggState *aggstate,
AggStatePerAgg peragg,
AggStatePerGroup pergroup);
static TupleTableSlot *project_aggregates(AggState *aggstate);
-static Bitmapset *find_unaggregated_cols(AggState *aggstate);
-static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
static void build_hash_tables(AggState *aggstate);
static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
@@ -425,8 +431,8 @@ static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
int used_bits, uint64 input_tuples,
double hashentrysize);
-static Size hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot,
- uint32 hash);
+static Size hashagg_spill_tuple(AggState *aggstate, HashAggSpill *spill,
+ TupleTableSlot *slot, uint32 hash);
static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
int setno);
static void hashagg_tapeinfo_init(AggState *aggstate);
@@ -1374,49 +1380,6 @@ project_aggregates(AggState *aggstate)
return NULL;
}
-/*
- * find_unaggregated_cols
- * Construct a bitmapset of the column numbers of un-aggregated Vars
- * appearing in our targetlist and qual (HAVING clause)
- */
-static Bitmapset *
-find_unaggregated_cols(AggState *aggstate)
-{
- Agg *node = (Agg *) aggstate->ss.ps.plan;
- Bitmapset *colnos;
-
- colnos = NULL;
- (void) find_unaggregated_cols_walker((Node *) node->plan.targetlist,
- &colnos);
- (void) find_unaggregated_cols_walker((Node *) node->plan.qual,
- &colnos);
- return colnos;
-}
-
-static bool
-find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
-{
- if (node == NULL)
- return false;
- if (IsA(node, Var))
- {
- Var *var = (Var *) node;
-
- /* setrefs.c should have set the varno to OUTER_VAR */
- Assert(var->varno == OUTER_VAR);
- Assert(var->varlevelsup == 0);
- *colnos = bms_add_member(*colnos, var->varattno);
- return false;
- }
- if (IsA(node, Aggref) || IsA(node, GroupingFunc))
- {
- /* do not descend into aggregate exprs */
- return false;
- }
- return expression_tree_walker(node, find_unaggregated_cols_walker,
- (void *) colnos);
-}
-
/*
* (Re-)initialize the hash table(s) to empty.
*
@@ -1531,19 +1494,30 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
static void
find_hash_columns(AggState *aggstate)
{
- Bitmapset *base_colnos;
+ TupleDesc scanDesc = aggstate->ss.ss_ScanTupleSlot->tts_tupleDescriptor;
List *outerTlist = outerPlanState(aggstate)->plan->targetlist;
int numHashes = aggstate->num_hashes;
EState *estate = aggstate->ss.ps.state;
int j;
- /* Find Vars that will be needed in tlist and qual */
- base_colnos = find_unaggregated_cols(aggstate);
+ Agg *agg = (Agg *) aggstate->ss.ps.plan;
+ aggstate->colnos_needed = bms_union(agg->unaggregated_colnos, agg->aggregated_colnos);
+ aggstate->max_colno_needed = 0;
+ aggstate->all_cols_needed = true;
+
+ for (int i = 0; i < scanDesc->natts; i++)
+ {
+ int colno = i + 1;
+ if (bms_is_member(colno, aggstate->colnos_needed))
+ aggstate->max_colno_needed = colno;
+ else
+ aggstate->all_cols_needed = false;
+ }
for (j = 0; j < numHashes; ++j)
{
AggStatePerHash perhash = &aggstate->perhash[j];
- Bitmapset *colnos = bms_copy(base_colnos);
+ Bitmapset *colnos = bms_copy(agg->unaggregated_colnos);
AttrNumber *grpColIdx = perhash->aggnode->grpColIdx;
List *hashTlist = NIL;
TupleDesc hashDesc;
@@ -1637,7 +1611,6 @@ find_hash_columns(AggState *aggstate)
bms_free(colnos);
}
- bms_free(base_colnos);
}
/*
@@ -2097,7 +2070,7 @@ lookup_hash_entries(AggState *aggstate)
perhash->aggnode->numGroups,
aggstate->hashentrysize);
- hashagg_spill_tuple(spill, slot, hash);
+ hashagg_spill_tuple(aggstate, spill, slot, hash);
}
}
}
@@ -2619,7 +2592,7 @@ agg_refill_hash_table(AggState *aggstate)
HASHAGG_READ_BUFFER_SIZE);
for (;;)
{
- TupleTableSlot *slot = aggstate->hash_spill_slot;
+ TupleTableSlot *slot = aggstate->hash_spill_rslot;
MinimalTuple tuple;
uint32 hash;
bool in_hash_table;
@@ -2655,7 +2628,7 @@ agg_refill_hash_table(AggState *aggstate)
ngroups_estimate, aggstate->hashentrysize);
}
/* no memory for a new group, spill */
- hashagg_spill_tuple(&spill, slot, hash);
+ hashagg_spill_tuple(aggstate, &spill, slot, hash);
}
/*
@@ -2934,9 +2907,11 @@ hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo, int used_bits,
* partition.
*/
static Size
-hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot, uint32 hash)
+hashagg_spill_tuple(AggState *aggstate, HashAggSpill *spill,
+ TupleTableSlot *inputslot, uint32 hash)
{
LogicalTapeSet *tapeset = spill->tapeset;
+ TupleTableSlot *spillslot;
int partition;
MinimalTuple tuple;
int tapenum;
@@ -2945,8 +2920,28 @@ hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot, uint32 hash)
Assert(spill->partitions != NULL);
- /* XXX: may contain unnecessary attributes, should project */
- tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+ /* spill only attributes that we actually need */
+ if (!aggstate->all_cols_needed)
+ {
+ spillslot = aggstate->hash_spill_wslot;
+ slot_getsomeattrs(inputslot, aggstate->max_colno_needed);
+ ExecClearTuple(spillslot);
+ for (int i = 0; i < spillslot->tts_tupleDescriptor->natts; i++)
+ {
+ if (bms_is_member(i + 1, aggstate->colnos_needed))
+ {
+ spillslot->tts_values[i] = inputslot->tts_values[i];
+ spillslot->tts_isnull[i] = inputslot->tts_isnull[i];
+ }
+ else
+ spillslot->tts_isnull[i] = true;
+ }
+ ExecStoreVirtualTuple(spillslot);
+ }
+ else
+ spillslot = inputslot;
+
+ tuple = ExecFetchSlotMinimalTuple(spillslot, &shouldFree);
partition = (hash & spill->mask) >> spill->shift;
spill->ntuples[partition]++;
@@ -3563,8 +3558,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->hash_metacxt = AllocSetContextCreate(aggstate->ss.ps.state->es_query_cxt,
"HashAgg meta context",
ALLOCSET_DEFAULT_SIZES);
- aggstate->hash_spill_slot = ExecInitExtraTupleSlot(estate, scanDesc,
- &TTSOpsMinimalTuple);
+ aggstate->hash_spill_rslot = ExecInitExtraTupleSlot(estate, scanDesc,
+ &TTSOpsMinimalTuple);
+ aggstate->hash_spill_wslot = ExecInitExtraTupleSlot(estate, scanDesc,
+ &TTSOpsVirtual);
/* this is an array of pointers, not structures */
aggstate->hash_pergroup = pergroups;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d8cf87e6d0..5f58a2fc27 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -1015,6 +1015,8 @@ _copyAgg(const Agg *from)
COPY_SCALAR_FIELD(aggstrategy);
COPY_SCALAR_FIELD(aggsplit);
COPY_SCALAR_FIELD(numCols);
+ COPY_BITMAPSET_FIELD(aggregated_colnos);
+ COPY_BITMAPSET_FIELD(unaggregated_colnos);
if (from->numCols > 0)
{
COPY_POINTER_FIELD(grpColIdx, from->numCols * sizeof(AttrNumber));
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e2f177515d..1d544a7f02 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -779,6 +779,8 @@ _outAgg(StringInfo str, const Agg *node)
WRITE_ENUM_FIELD(aggstrategy, AggStrategy);
WRITE_ENUM_FIELD(aggsplit, AggSplit);
WRITE_INT_FIELD(numCols);
+ WRITE_BITMAPSET_FIELD(aggregated_colnos);
+ WRITE_BITMAPSET_FIELD(unaggregated_colnos);
WRITE_ATTRNUMBER_ARRAY(grpColIdx, node->numCols);
WRITE_OID_ARRAY(grpOperators, node->numCols);
WRITE_OID_ARRAY(grpCollations, node->numCols);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab719..7874899972 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2227,6 +2227,8 @@ _readAgg(void)
READ_ENUM_FIELD(aggstrategy, AggStrategy);
READ_ENUM_FIELD(aggsplit, AggSplit);
READ_INT_FIELD(numCols);
+ READ_BITMAPSET_FIELD(aggregated_colnos);
+ READ_BITMAPSET_FIELD(unaggregated_colnos);
READ_ATTRNUMBER_ARRAY(grpColIdx, local_node->numCols);
READ_OID_ARRAY(grpOperators, local_node->numCols);
READ_OID_ARRAY(grpCollations, local_node->numCols);
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index baefe0e946..340ec3ce66 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -68,6 +68,14 @@ typedef struct
int rtoffset;
} fix_upper_expr_context;
+/* used to find referenced colnos */
+typedef struct BifurcateColsContext
+{
+ bool is_aggref; /* is under an aggref */
+ Bitmapset *aggregated; /* column references under an aggref */
+ Bitmapset *unaggregated; /* other column references */
+} BifurcateColsContext;
+
/*
* Check if a Const node is a regclass value. We accept plain OID too,
* since a regclass Const will get folded to that type if it's an argument
@@ -113,6 +121,8 @@ static Node *fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset);
static Node *fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context);
static bool fix_scan_expr_walker(Node *node, fix_scan_expr_context *context);
static void set_join_references(PlannerInfo *root, Join *join, int rtoffset);
+static void bifurcate_agg_cols(List *quals, List *tlist, Bitmapset **aggregated, Bitmapset **unaggregated);
+static bool bifurcate_agg_cols_walker(Node *node, BifurcateColsContext *context);
static void set_upper_references(PlannerInfo *root, Plan *plan, int rtoffset);
static void set_param_references(PlannerInfo *root, Plan *plan);
static Node *convert_combining_aggrefs(Node *node, void *context);
@@ -1879,6 +1889,58 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
pfree(inner_itlist);
}
+
+/*
+ * Walk tlist and qual to find referenced colnos, dividing them into
+ * aggregated and unaggregated sets.
+ */
+static void
+bifurcate_agg_cols(List *quals, List *tlist, Bitmapset **aggregated, Bitmapset **unaggregated)
+{
+ BifurcateColsContext context;
+
+ context.is_aggref = false;
+ context.aggregated = NULL;
+ context.unaggregated = NULL;
+
+ (void) bifurcate_agg_cols_walker((Node *) tlist, &context);
+ (void) bifurcate_agg_cols_walker((Node *) quals, &context);
+
+ *aggregated = context.aggregated;
+ *unaggregated = context.unaggregated;
+}
+
+
+static bool
+bifurcate_agg_cols_walker(Node *node, BifurcateColsContext *context)
+{
+ if (node == NULL)
+ return false;
+ if (IsA(node, Var))
+ {
+ Var *var = (Var *) node;
+
+ if (var->varno != OUTER_VAR && var->varlevelsup != 0)
+ return false;
+
+ if (context->is_aggref)
+ context->aggregated = bms_add_member(context->aggregated, var->varattno);
+ else
+ context->unaggregated = bms_add_member(context->unaggregated, var->varattno);
+ return false;
+ }
+ if (IsA(node, Aggref))
+ {
+ Assert(!context->is_aggref);
+ context->is_aggref = true;
+ expression_tree_walker(node, bifurcate_agg_cols_walker, (void *) context);
+ context->is_aggref = false;
+ return false;
+ }
+ return expression_tree_walker(node, bifurcate_agg_cols_walker, (void *) context);
+}
+
+
/*
* set_upper_references
* Update the targetlist and quals of an upper-level plan node
@@ -1947,6 +2009,13 @@ set_upper_references(PlannerInfo *root, Plan *plan, int rtoffset)
OUTER_VAR,
rtoffset);
+ if (IsA(plan, Agg))
+ {
+ Agg *agg = (Agg *)plan;
+ if (agg->aggstrategy == AGG_HASHED || agg->aggstrategy == AGG_MIXED)
+ bifurcate_agg_cols(plan->qual, plan->targetlist, &agg->aggregated_colnos, &agg->unaggregated_colnos);
+ }
+
pfree(subplan_itlist);
}
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f5dfa32d55..e2aa07cb45 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2169,6 +2169,9 @@ typedef struct AggState
int current_set; /* The current grouping set being evaluated */
Bitmapset *grouped_cols; /* grouped cols in current projection */
List *all_grouped_cols; /* list of all grouped cols in DESC order */
+ Bitmapset *colnos_needed; /* all columns needed from the outer plan */
+ int max_colno_needed; /* highest colno needed from outer plan */
+ bool all_cols_needed; /* are all cols from outer plan needed? */
/* These fields are for grouping set phase data */
int maxsets; /* The max number of sets in any phase */
AggStatePerPhase phases; /* array of all phases */
@@ -2186,7 +2189,8 @@ typedef struct AggState
struct HashTapeInfo *hash_tapeinfo; /* metadata for spill tapes */
struct HashAggSpill *hash_spills; /* HashAggSpill for each grouping set,
* exists only during first pass */
- TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
+ TupleTableSlot *hash_spill_rslot; /* for reading spill files */
+ TupleTableSlot *hash_spill_wslot; /* for writing spill files */
List *hash_batches; /* hash batches remaining to be processed */
bool hash_ever_spilled; /* ever spilled during this execution? */
bool hash_spill_mode; /* we hit a limit during the current batch
@@ -2207,7 +2211,7 @@ typedef struct AggState
* per-group pointers */
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 49
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 53
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 83e01074ed..b0c50d2751 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -819,6 +819,8 @@ typedef struct Agg
AggStrategy aggstrategy; /* basic strategy, see nodes.h */
AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
int numCols; /* number of grouping columns */
+ Bitmapset *aggregated_colnos;
+ Bitmapset *unaggregated_colnos;
AttrNumber *grpColIdx; /* their indexes in the target list */
Oid *grpOperators; /* equality operators to compare with */
Oid *grpCollations;
--
2.20.1
On Wed, Jun 24, 2020 at 05:26:07PM -0700, Melanie Plageman wrote:
On Tue, Jun 23, 2020 at 10:06 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2020-06-23 09:23:57 -0700, Melanie Plageman wrote:
On Mon, Jun 22, 2020 at 9:02 PM Andres Freund <andres@anarazel.de>
wrote:
It's not this patch's fault, but none, really none, of this stuff
should
be in the executor.
Were you thinking it could be done in grouping_planner() and then the
bitmaps could be saved in the PlannedStmt?
Or would you have to wait until query_planner()? Or are you imagining
somewhere else entirely?
I haven't thought about it in too much detail, but I would say
create_agg_plan() et al. I guess there's some argument to be made to do
it in setrefs.c, because we already do convert_combining_aggrefs() there
(but I don't like that much).
There's no reason to do it before we actually decided on one specific
path, so doing it earlier than create_plan() seems unnecessary. And
having it in agg specific code seems better than putting it into global
routines.
There's probably an argument for having a bit more shared code between
create_agg_plan(), create_group_plan() and
create_groupingsets_plan(). But even just adding a new extract_*_cols()
call to each of those would probably be ok.
So, my summary of this point in the context of the other discussion
upthread is:
Planner should extract the columns that hashagg will need later during
planning. Planner should not have HashAgg/MixedAgg nodes request smaller
targetlists from their children with CP_SMALL_TLIST to avoid unneeded
projection overhead.
Also, even this extraction should only be done when the number of groups
is large enough to suspect a spill.
IMO we should extract the columns irrespective of the estimates,
otherwise we won't be able to handle underestimates efficiently.
Not to stir the pot, but I did notice that hashjoin uses CP_SMALL_TLIST
in create_hashjoin_plan() for the inner side sub-tree and the outer side
one if there are multiple batches. I wondered what was different about
that vs hashagg (i.e. why it is okay to do that there).
Yeah. That means that if we have to start batching during execution, we
may need to spill much more data. I'd say that's a hashjoin issue that
we should fix too (in v14).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2020-06-12 14:37:15 -0700, Andres Freund wrote:
On 2020-06-11 11:14:02 -0700, Jeff Davis wrote:
On Thu, 2020-06-11 at 10:45 -0700, Andres Freund wrote:
Did you run any performance tests?
Yes, I reproduced your ~12% regression from V12, and this patch nearly
eliminated it for me.
I spent a fair bit of time looking at the difference. Jeff had let me
know on chat that he was still seeing some difference, but couldn't
quite figure out where that was.
Trying it out myself, I observed that the patch helped, but not that
much. After a bit I found one major reason why:
LookupTupleHashEntryHash() assigned the hash to the pointer provided by
the caller before doing the insertion. That ended up causing a pipeline
stall (I assume it's store forwarding, but not sure). Moving the
assignment to the caller's variable until after the insertion got rid of
that.
This is still not resolved. We're right now slower than 12. It's
effectively not possible to do performance comparisons right now. This
was nearly two months ago.
Greetings,
Andres Freund
On Fri, Jul 24, 2020 at 4:51 PM Andres Freund <andres@anarazel.de> wrote:
This is still not resolved. We're right now slower than 12. It's
effectively not possible to do performance comparisons right now. This
was nearly two months ago.
I have added a new open item for this separate
LookupTupleHashEntryHash()/lookup_hash_entry() pipeline-stall issue.
(For the record I mistakenly believed that commit 23023022 resolved
all of the concerns raised on this thread, which is why I closed out
the open item associated with this thread. Evidently work remains to
fix a remaining regression that affects simple in-memory hash
aggregation, though.)
--
Peter Geoghegan
On Sat, Jul 25, 2020 at 12:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
I have added a new open item for this separate
LookupTupleHashEntryHash()/lookup_hash_entry() pipeline-stall issue.
Attached is a rebased version of Andres' now-bitrotted 2020-06-12 patch
("aggspeed.diff").
I find that Andres' original "SELECT cat, count(*) FROM
fewgroups_many_rows GROUP BY 1;" test case is noticeably improved by
the patch. Without the patch, v13 takes ~11.46 seconds. With the
patch, it takes only ~10.64 seconds.
Didn't test it against v12 yet, but I have no reason to doubt Andres'
explanation. I gather that if we can get this patch committed, we can
close the relevant LookupTupleHashEntryHash() open item.
Can you take this off my hands, Jeff?
Thanks
--
Peter Geoghegan
Attachments:
0001-Fix-LookupTupleHashEntryHash-pipeline-stall-issue.patch (application/octet-stream)
From 64db73a86b06a57ed9ab241b47f511a084861dfa Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 25 Jul 2020 14:16:44 -0700
Subject: [PATCH] Fix LookupTupleHashEntryHash() pipeline-stall issue.
Rebased version of Andres' aggspeed.diff patch from:
https://postgr.es/m/20200612213715.op4ye4q7gktqvpuo%40alap3.anarazel.de
---
src/include/executor/executor.h | 2 +-
src/backend/executor/execGrouping.c | 29 ++--
src/backend/executor/nodeAgg.c | 167 +++++++++++-----------
src/backend/executor/nodeRecursiveunion.c | 4 +-
src/backend/executor/nodeSetOp.c | 4 +-
src/backend/executor/nodeSubplan.c | 4 +-
6 files changed, 107 insertions(+), 103 deletions(-)
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c7deeac662..415e117407 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -139,7 +139,7 @@ extern TupleHashTable BuildTupleHashTableExt(PlanState *parent,
MemoryContext tempcxt, bool use_variable_hash_iv);
extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
- bool *isnew);
+ bool *isnew, uint32 *hash);
extern uint32 TupleHashTableHash(TupleHashTable hashtable,
TupleTableSlot *slot);
extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable,
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 019b87df21..321f427e47 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -22,11 +22,11 @@
#include "utils/memutils.h"
static int TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
-static uint32 TupleHashTableHash_internal(struct tuplehash_hash *tb,
- const MinimalTuple tuple);
-static TupleHashEntry LookupTupleHashEntry_internal(TupleHashTable hashtable,
- TupleTableSlot *slot,
- bool *isnew, uint32 hash);
+static inline uint32 TupleHashTableHash_internal(struct tuplehash_hash *tb,
+ const MinimalTuple tuple);
+static inline TupleHashEntry LookupTupleHashEntry_internal(TupleHashTable hashtable,
+ TupleTableSlot *slot,
+ bool *isnew, uint32 hash);
/*
* Define parameters for tuple hash table code generation. The interface is
@@ -291,6 +291,9 @@ ResetTupleHashTable(TupleHashTable hashtable)
* If isnew is NULL, we do not create new entries; we return NULL if no
* match is found.
*
+ * If hash is not NULL, we set it to the calculated hash value. This allows
+ * callers access to the hash value even if no entry is returned.
+ *
* If isnew isn't NULL, then a new entry is created if no existing entry
* matches. On return, *isnew is true if the entry is newly created,
* false if it existed already. ->additional_data in the new entry has
@@ -298,11 +301,11 @@ ResetTupleHashTable(TupleHashTable hashtable)
*/
TupleHashEntry
LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
- bool *isnew)
+ bool *isnew, uint32 *hash)
{
TupleHashEntry entry;
MemoryContext oldContext;
- uint32 hash;
+ uint32 local_hash;
/* Need to run the hash functions in short-lived context */
oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
@@ -312,8 +315,13 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
hashtable->cur_eq_func = hashtable->tab_eq_func;
- hash = TupleHashTableHash_internal(hashtable->hashtab, NULL);
- entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
+ local_hash = TupleHashTableHash_internal(hashtable->hashtab, NULL);
+ entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, local_hash);
+
+ if (hash != NULL)
+ *hash = local_hash;
+
+ Assert(entry == NULL || entry->hash == local_hash);
MemoryContextSwitchTo(oldContext);
@@ -362,6 +370,7 @@ LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot,
hashtable->cur_eq_func = hashtable->tab_eq_func;
entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
+ Assert(entry == NULL || entry->hash == hash);
MemoryContextSwitchTo(oldContext);
@@ -480,7 +489,7 @@ TupleHashTableHash_internal(struct tuplehash_hash *tb,
* NB: This function may or may not change the memory context. Caller is
* expected to change it back.
*/
-static TupleHashEntry
+static inline TupleHashEntry
LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot,
bool *isnew, uint32 hash)
{
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index b79c845a6b..bbfc4af1ec 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -391,7 +391,9 @@ static void finalize_partialaggregate(AggState *aggstate,
AggStatePerAgg peragg,
AggStatePerGroup pergroupstate,
Datum *resultVal, bool *resultIsNull);
-static void prepare_hash_slot(AggState *aggstate);
+static inline void prepare_hash_slot(AggStatePerHash perhash,
+ TupleTableSlot *inputslot,
+ TupleTableSlot *hashslot);
static void prepare_projection_slot(AggState *aggstate,
TupleTableSlot *slot,
int currentSet);
@@ -413,8 +415,9 @@ static int hash_choose_num_partitions(uint64 input_groups,
double hashentrysize,
int used_bits,
int *log2_npartittions);
-static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash,
- bool *in_hash_table);
+static void initialize_hash_entry(AggState *aggstate,
+ TupleHashTable hashtable,
+ TupleHashEntry entry);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
@@ -1207,12 +1210,11 @@ finalize_partialaggregate(AggState *aggstate,
* Extract the attributes that make up the grouping key into the
* hashslot. This is necessary to compute the hash or perform a lookup.
*/
-static void
-prepare_hash_slot(AggState *aggstate)
+static inline void
+prepare_hash_slot(AggStatePerHash perhash,
+ TupleTableSlot *inputslot,
+ TupleTableSlot *hashslot)
{
- TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
- TupleTableSlot *hashslot = perhash->hashslot;
int i;
/* transfer just the needed columns into hashslot */
@@ -2013,75 +2015,39 @@ hash_choose_num_partitions(uint64 input_groups, double hashentrysize,
}
/*
- * Find or create a hashtable entry for the tuple group containing the current
- * tuple (already set in tmpcontext's outertuple slot), in the current grouping
- * set (which the caller must have selected - note that initialize_aggregate
- * depends on this).
- *
- * When called, CurrentMemoryContext should be the per-query context. The
- * already-calculated hash value for the tuple must be specified.
- *
- * If in "spill mode", then only find existing hashtable entries; don't create
- * new ones. If a tuple's group is not already present in the hash table for
- * the current grouping set, assign *in_hash_table=false and the caller will
- * spill it to disk.
+ * Initialize a freshly-created TupleHashEntry.
*/
-static AggStatePerGroup
-lookup_hash_entry(AggState *aggstate, uint32 hash, bool *in_hash_table)
+static void
+initialize_hash_entry(AggState *aggstate, TupleHashTable hashtable,
+ TupleHashEntry entry)
{
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
- TupleTableSlot *hashslot = perhash->hashslot;
- TupleHashEntryData *entry;
- bool isnew = false;
- bool *p_isnew;
+ AggStatePerGroup pergroup;
+ int transno;
- /* if hash table already spilled, don't create new entries */
- p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;
+ aggstate->hash_ngroups_current++;
+ hash_agg_check_limits(aggstate);
- /* find or create the hashtable entry using the filtered tuple */
- entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, p_isnew,
- hash);
+ /* no need to allocate or initialize per-group state */
+ if (aggstate->numtrans == 0)
+ return;
- if (entry == NULL)
+ pergroup = (AggStatePerGroup)
+ MemoryContextAlloc(hashtable->tablecxt,
+ sizeof(AggStatePerGroupData) * aggstate->numtrans);
+
+ entry->additional = pergroup;
+
+ /*
+ * Initialize aggregates for new tuple group, lookup_hash_entries()
+ * already has selected the relevant grouping set.
+ */
+ for (transno = 0; transno < aggstate->numtrans; transno++)
{
- *in_hash_table = false;
- return NULL;
+ AggStatePerTrans pertrans = &aggstate->pertrans[transno];
+ AggStatePerGroup pergroupstate = &pergroup[transno];
+
+ initialize_aggregate(aggstate, pertrans, pergroupstate);
}
- else
- *in_hash_table = true;
-
- if (isnew)
- {
- AggStatePerGroup pergroup;
- int transno;
-
- aggstate->hash_ngroups_current++;
- hash_agg_check_limits(aggstate);
-
- /* no need to allocate or initialize per-group state */
- if (aggstate->numtrans == 0)
- return NULL;
-
- pergroup = (AggStatePerGroup)
- MemoryContextAlloc(perhash->hashtable->tablecxt,
- sizeof(AggStatePerGroupData) * aggstate->numtrans);
-
- entry->additional = pergroup;
-
- /*
- * Initialize aggregates for new tuple group, lookup_hash_entries()
- * already has selected the relevant grouping set.
- */
- for (transno = 0; transno < aggstate->numtrans; transno++)
- {
- AggStatePerTrans pertrans = &aggstate->pertrans[transno];
- AggStatePerGroup pergroupstate = &pergroup[transno];
-
- initialize_aggregate(aggstate, pertrans, pergroupstate);
- }
- }
-
- return entry->additional;
}
/*
@@ -2106,21 +2072,37 @@ static void
lookup_hash_entries(AggState *aggstate)
{
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
+ TupleTableSlot *outerslot = aggstate->tmpcontext->ecxt_outertuple;
int setno;
for (setno = 0; setno < aggstate->num_hashes; setno++)
{
AggStatePerHash perhash = &aggstate->perhash[setno];
+ TupleHashTable hashtable = perhash->hashtable;
+ TupleTableSlot *hashslot = perhash->hashslot;
+ TupleHashEntry entry;
uint32 hash;
- bool in_hash_table;
+ bool isnew = false;
+ bool *p_isnew;
+
+ /* if hash table already spilled, don't create new entries */
+ p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;
select_current_set(aggstate, setno, true);
- prepare_hash_slot(aggstate);
- hash = TupleHashTableHash(perhash->hashtable, perhash->hashslot);
- pergroup[setno] = lookup_hash_entry(aggstate, hash, &in_hash_table);
+ prepare_hash_slot(perhash,
+ outerslot,
+ hashslot);
- /* check to see if we need to spill the tuple for this grouping set */
- if (!in_hash_table)
+ entry = LookupTupleHashEntry(hashtable, hashslot,
+ p_isnew, &hash);
+
+ if (entry != NULL)
+ {
+ if (isnew)
+ initialize_hash_entry(aggstate, hashtable, entry);
+ pergroup[setno] = entry->additional;
+ }
+ else
{
HashAggSpill *spill = &aggstate->hash_spills[setno];
TupleTableSlot *slot = aggstate->tmpcontext->ecxt_outertuple;
@@ -2131,6 +2113,7 @@ lookup_hash_entries(AggState *aggstate)
aggstate->hashentrysize);
hashagg_spill_tuple(aggstate, spill, slot, hash);
+ pergroup[setno] = NULL;
}
}
}
@@ -2588,6 +2571,7 @@ static bool
agg_refill_hash_table(AggState *aggstate)
{
HashAggBatch *batch;
+ AggStatePerHash perhash;
HashAggSpill spill;
HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
uint64 ngroups_estimate;
@@ -2639,6 +2623,8 @@ agg_refill_hash_table(AggState *aggstate)
select_current_set(aggstate, batch->setno, true);
+ perhash = &aggstate->perhash[aggstate->current_set];
+
/*
* Spilled tuples are always read back as MinimalTuples, which may be
* different from the outer plan, so recompile the aggregate expressions.
@@ -2652,10 +2638,13 @@ agg_refill_hash_table(AggState *aggstate)
HASHAGG_READ_BUFFER_SIZE);
for (;;)
{
- TupleTableSlot *slot = aggstate->hash_spill_rslot;
+ TupleTableSlot *spillslot = aggstate->hash_spill_rslot;
+ TupleTableSlot *hashslot = perhash->hashslot;
+ TupleHashEntry entry;
MinimalTuple tuple;
uint32 hash;
- bool in_hash_table;
+ bool isnew = false;
+ bool *p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;
CHECK_FOR_INTERRUPTS();
@@ -2663,16 +2652,20 @@ agg_refill_hash_table(AggState *aggstate)
if (tuple == NULL)
break;
- ExecStoreMinimalTuple(tuple, slot, true);
- aggstate->tmpcontext->ecxt_outertuple = slot;
+ ExecStoreMinimalTuple(tuple, spillslot, true);
+ aggstate->tmpcontext->ecxt_outertuple = spillslot;
- prepare_hash_slot(aggstate);
- aggstate->hash_pergroup[batch->setno] =
- lookup_hash_entry(aggstate, hash, &in_hash_table);
+ prepare_hash_slot(perhash,
+ aggstate->tmpcontext->ecxt_outertuple,
+ hashslot);
+ entry = LookupTupleHashEntryHash(
+ perhash->hashtable, hashslot, p_isnew, hash);
- if (in_hash_table)
+ if (entry != NULL)
{
- /* Advance the aggregates (or combine functions) */
+ if (isnew)
+ initialize_hash_entry(aggstate, perhash->hashtable, entry);
+ aggstate->hash_pergroup[batch->setno] = entry->additional;
advance_aggregates(aggstate);
}
else
@@ -2688,7 +2681,9 @@ agg_refill_hash_table(AggState *aggstate)
ngroups_estimate, aggstate->hashentrysize);
}
/* no memory for a new group, spill */
- hashagg_spill_tuple(aggstate, &spill, slot, hash);
+ hashagg_spill_tuple(aggstate, &spill, spillslot, hash);
+
+ aggstate->hash_pergroup[batch->setno] = NULL;
}
/*
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index 620414a1ed..046242682f 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -94,7 +94,7 @@ ExecRecursiveUnion(PlanState *pstate)
if (plan->numCols > 0)
{
/* Find or build hashtable entry for this tuple's group */
- LookupTupleHashEntry(node->hashtable, slot, &isnew);
+ LookupTupleHashEntry(node->hashtable, slot, &isnew, NULL);
/* Must reset temp context after each hashtable lookup */
MemoryContextReset(node->tempContext);
/* Ignore tuple if already seen */
@@ -141,7 +141,7 @@ ExecRecursiveUnion(PlanState *pstate)
if (plan->numCols > 0)
{
/* Find or build hashtable entry for this tuple's group */
- LookupTupleHashEntry(node->hashtable, slot, &isnew);
+ LookupTupleHashEntry(node->hashtable, slot, &isnew, NULL);
/* Must reset temp context after each hashtable lookup */
MemoryContextReset(node->tempContext);
/* Ignore tuple if already seen */
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index bfd148a41a..8d4ccff19c 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -381,7 +381,7 @@ setop_fill_hash_table(SetOpState *setopstate)
/* Find or build hashtable entry for this tuple's group */
entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
- &isnew);
+ &isnew, NULL);
/* If new tuple group, initialize counts */
if (isnew)
@@ -402,7 +402,7 @@ setop_fill_hash_table(SetOpState *setopstate)
/* For tuples not seen previously, do not make hashtable entry */
entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
- NULL);
+ NULL, NULL);
/* Advance the counts if entry is already present */
if (entry)
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 298b7757f5..38c2fc0b50 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -595,12 +595,12 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
*/
if (slotNoNulls(slot))
{
- (void) LookupTupleHashEntry(node->hashtable, slot, &isnew);
+ (void) LookupTupleHashEntry(node->hashtable, slot, &isnew, NULL);
node->havehashrows = true;
}
else if (node->hashnulls)
{
- (void) LookupTupleHashEntry(node->hashnulls, slot, &isnew);
+ (void) LookupTupleHashEntry(node->hashnulls, slot, &isnew, NULL);
node->havenullrows = true;
}
--
2.25.1
On Sat, 2020-07-25 at 15:08 -0700, Peter Geoghegan wrote:
I find that Andres original "SELECT cat, count(*) FROM
fewgroups_many_rows GROUP BY 1;" test case is noticeably improved by
the patch. Without the patch, v13 takes ~11.46 seconds. With the
patch, it takes only ~10.64 seconds.
I saw less of an improvement than you or Andres (perhaps just more
noise). But given that both you and Andres are reporting a measurable
improvement, I went ahead and committed it. Thank you.
Regards,
Jeff Davis
On Sun, Jul 26, 2020 at 4:17 PM Jeff Davis <pgsql@j-davis.com> wrote:
I saw less of an improvement than you or Andres (perhaps just more
noise). But given that both you and Andres are reporting a measurable
improvement, I went ahead and committed it. Thank you.
Thanks!
--
Peter Geoghegan