Improve hash-agg performance
Hi,
There are two things I found while working on faster expression
evaluation, slot deforming and batched execution. As those two issues
often seemed quite dominant cost-wise, it seemed worthwhile to address
them independently.
1) We atm do one ExecProject() to compute each aggregate's
arguments. Turns out it's noticeably faster to compute the arguments
for all aggregates in one go, both because it reduces the number of
function calls and moves more work into a relatively tight loop, and
because it allows deforming all the required columns at once, rather
than one-by-one. For a single aggregate it'd be faster to avoid
ExecProject altogether (i.e. directly evaluate the expression as we
used to), but as soon as you have two, the new approach is faster.
2) For hash-aggs, right now we store the representative tuple using
the input tuple's format, with unneeded columns set to NULL. That
turns out to be expensive if the aggregated-on columns are not the
leading columns, because we have to skip over a potentially large
number of NULLs. The fix here is simply to use a different tuple
format for the hashtable. That doesn't cause overhead, because we
already move columns in/out of the hashslot explicitly anyway.
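The narrower-tuple idea from point 2 boils down to a mapping array between the wide input layout and a compact hashtable layout. Below is a minimal C sketch under invented names (Slot, pack_hash_slot, unpack_hash_slot are simplified stand-ins, not the actual TupleTableSlot API): packing copies only the needed columns into the front of a narrow slot, and unpacking scatters them back into the wide layout with everything else NULL.

```c
#include <stdbool.h>

#define MAX_COLS 8

/* Simplified virtual-tuple "slot": hypothetical stand-in for
 * PostgreSQL's TupleTableSlot. */
typedef struct
{
    long    values[MAX_COLS];
    bool    isnull[MAX_COLS];
    int     natts;
} Slot;

/* Pack just the needed input columns into a narrow hash-slot layout.
 * colmap[i] is the 0-based input column stored at hash column i, so
 * the stored tuples contain no padding NULLs to skip over. */
static void
pack_hash_slot(const Slot *input, Slot *hash, const int *colmap, int ncols)
{
    for (int i = 0; i < ncols; i++)
    {
        hash->values[i] = input->values[colmap[i]];
        hash->isnull[i] = input->isnull[colmap[i]];
    }
    hash->natts = ncols;
}

/* Expand a representative tuple back into the wide input layout,
 * with all unneeded columns NULL, for use in the output projection. */
static void
unpack_hash_slot(const Slot *hash, Slot *out, const int *colmap,
                 int ncols, int out_natts)
{
    for (int i = 0; i < out_natts; i++)
    {
        out->values[i] = 0;
        out->isnull[i] = true;
    }
    out->natts = out_natts;

    for (int i = 0; i < ncols; i++)
    {
        out->values[colmap[i]] = hash->values[i];
        out->isnull[colmap[i]] = hash->isnull[i];
    }
}
```

Since the explicit copy in/out happens anyway, switching the hashtable to the compact layout adds no extra per-tuple work; it only removes the NULL-skipping during deforming.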
Comments?
Regards,
Andres Freund
Attachments:
0001-Perform-one-only-projection-to-compute-agg-arguments.patch (text/x-patch)
From 36b022c8a0f74a037c01c085810365b20830ba34 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 12 Jul 2016 01:01:28 -0700
Subject: [PATCH 1/2] Perform only one projection to compute agg arguments.
Previously we did one ExecProject() for each individual aggregate's
arguments. Doing it for all aggregates at once is quite a bit cheaper,
because ExecProject's fastpath can do the work in one relatively tight
loop, and because it allows slot_getsomeattrs to easily fetch the right
number of columns for all the aggregates.
---
src/backend/executor/nodeAgg.c | 136 ++++++++++++++++++++++++++++++-----------
src/include/nodes/execnodes.h | 4 ++
2 files changed, 105 insertions(+), 35 deletions(-)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 28c15ba..3ed6fb2 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -213,6 +213,9 @@ typedef struct AggStatePerTransData
*/
int numInputs;
+ /* offset of input columns in AggState->evalslot */
+ int inputoff;
+
/*
* Number of aggregated input columns to pass to the transfn. This
* includes the ORDER BY columns for ordered-set aggs, but not for plain
@@ -297,7 +300,6 @@ typedef struct AggStatePerTransData
* well go whole hog and use ExecProject too.
*/
TupleDesc evaldesc; /* descriptor of input tuples */
- ProjectionInfo *evalproj; /* projection machinery */
/*
* Slots for holding the evaluated input arguments. These are set up
@@ -847,14 +849,20 @@ advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
int setno = 0;
int numGroupingSets = Max(aggstate->phase->numsets, 1);
int numTrans = aggstate->numtrans;
+ TupleTableSlot *slot = aggstate->evalslot;
+ AggStatePerTrans pertrans;
- for (transno = 0; transno < numTrans; transno++)
+ /* compute input for all aggregates */
+ if (aggstate->evalproj)
+ aggstate->evalslot = ExecProject(aggstate->evalproj, NULL);
+
+ for (transno = 0, pertrans = aggstate->pertrans; transno < numTrans;
+ transno++, pertrans++)
{
- AggStatePerTrans pertrans = &aggstate->pertrans[transno];
ExprState *filter = pertrans->aggfilter;
int numTransInputs = pertrans->numTransInputs;
int i;
- TupleTableSlot *slot;
+ int inputoff = pertrans->inputoff;
/* Skip anything FILTERed out */
if (filter)
@@ -868,13 +876,10 @@ advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
continue;
}
- /* Evaluate the current input expressions for this aggregate */
- slot = ExecProject(pertrans->evalproj, NULL);
-
if (pertrans->numSortCols > 0)
{
/* DISTINCT and/or ORDER BY case */
- Assert(slot->tts_nvalid == pertrans->numInputs);
+ Assert(slot->tts_nvalid >= pertrans->numInputs);
/*
* If the transfn is strict, we want to check for nullity before
@@ -887,7 +892,7 @@ advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
{
for (i = 0; i < numTransInputs; i++)
{
- if (slot->tts_isnull[i])
+ if (slot->tts_isnull[i + inputoff])
break;
}
if (i < numTransInputs)
@@ -899,10 +904,25 @@ advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
/* OK, put the tuple into the tuplesort object */
if (pertrans->numInputs == 1)
tuplesort_putdatum(pertrans->sortstates[setno],
- slot->tts_values[0],
- slot->tts_isnull[0]);
+ slot->tts_values[inputoff],
+ slot->tts_isnull[inputoff]);
else
- tuplesort_puttupleslot(pertrans->sortstates[setno], slot);
+ {
+ /*
+ * Copy slot contents, starting from inputoff, into sort
+ * slot.
+ */
+ ExecClearTuple(pertrans->evalslot);
+ memcpy(pertrans->evalslot->tts_values,
+ &slot->tts_values[inputoff],
+ pertrans->numInputs * sizeof(Datum));
+ memcpy(pertrans->evalslot->tts_isnull,
+ &slot->tts_isnull[inputoff],
+ pertrans->numInputs * sizeof(bool));
+ pertrans->evalslot->tts_nvalid = pertrans->numInputs;
+ ExecStoreVirtualTuple(pertrans->evalslot);
+ tuplesort_puttupleslot(pertrans->sortstates[setno], pertrans->evalslot);
+ }
}
}
else
@@ -915,8 +935,8 @@ advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
Assert(slot->tts_nvalid >= numTransInputs);
for (i = 0; i < numTransInputs; i++)
{
- fcinfo->arg[i + 1] = slot->tts_values[i];
- fcinfo->argnull[i + 1] = slot->tts_isnull[i];
+ fcinfo->arg[i + 1] = slot->tts_values[i + inputoff];
+ fcinfo->argnull[i + 1] = slot->tts_isnull[i + inputoff];
}
for (setno = 0; setno < numGroupingSets; setno++)
@@ -943,20 +963,25 @@ combine_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
{
int transno;
int numTrans = aggstate->numtrans;
+ TupleTableSlot *slot = NULL;
+ AggStatePerTrans pertrans;
/* combine not supported with grouping sets */
Assert(aggstate->phase->numsets == 0);
- for (transno = 0; transno < numTrans; transno++)
- {
- AggStatePerTrans pertrans = &aggstate->pertrans[transno];
- AggStatePerGroup pergroupstate = &pergroup[transno];
- TupleTableSlot *slot;
- FunctionCallInfo fcinfo = &pertrans->transfn_fcinfo;
+ /* compute input for all aggregates */
+ if (aggstate->evalproj)
+ slot = ExecProject(aggstate->evalproj, NULL);
+
+ for (transno = 0, pertrans = aggstate->pertrans; transno < numTrans;
+ transno++, pertrans++)
+ {
+ AggStatePerGroup pergroupstate = &pergroup[transno];
+ FunctionCallInfo fcinfo = &pertrans->transfn_fcinfo;
+ int inputoff = pertrans->inputoff;
- /* Evaluate the current input expressions for this aggregate */
- slot = ExecProject(pertrans->evalproj, NULL);
Assert(slot->tts_nvalid >= 1);
+ Assert(slot->tts_nvalid + inputoff >= 1);
/*
* deserialfn_oid will be set if we must deserialize the input state
@@ -965,18 +990,18 @@ combine_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
if (OidIsValid(pertrans->deserialfn_oid))
{
/* Don't call a strict deserialization function with NULL input */
- if (pertrans->deserialfn.fn_strict && slot->tts_isnull[0])
+ if (pertrans->deserialfn.fn_strict && slot->tts_isnull[inputoff])
{
- fcinfo->arg[1] = slot->tts_values[0];
- fcinfo->argnull[1] = slot->tts_isnull[0];
+ fcinfo->arg[1] = slot->tts_values[inputoff];
+ fcinfo->argnull[1] = slot->tts_isnull[inputoff];
}
else
{
FunctionCallInfo dsinfo = &pertrans->deserialfn_fcinfo;
MemoryContext oldContext;
- dsinfo->arg[0] = slot->tts_values[0];
- dsinfo->argnull[0] = slot->tts_isnull[0];
+ dsinfo->arg[0] = slot->tts_values[inputoff];
+ dsinfo->argnull[0] = slot->tts_isnull[inputoff];
/* Dummy second argument for type-safety reasons */
dsinfo->arg[1] = PointerGetDatum(NULL);
dsinfo->argnull[1] = false;
@@ -995,8 +1020,8 @@ combine_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
}
else
{
- fcinfo->arg[1] = slot->tts_values[0];
- fcinfo->argnull[1] = slot->tts_isnull[0];
+ fcinfo->arg[1] = slot->tts_values[inputoff];
+ fcinfo->argnull[1] = slot->tts_isnull[inputoff];
}
advance_combine_function(aggstate, pertrans, pergroupstate);
@@ -2928,6 +2953,53 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->numaggs = aggno + 1;
aggstate->numtrans = transno + 1;
+ /*
+ * Build a single projection computing the aggregate arguments for all
+ * aggregates at once, that's considerably faster than doing it separately
+ * for each.
+ */
+ {
+ List *inputeval = NIL;
+ int offset = 0;
+
+ for (transno = 0; transno < aggstate->numtrans; transno++)
+ {
+ AggStatePerTrans pertrans = &pertransstates[transno];
+ ListCell *arg;
+
+ pertrans->inputoff = offset;
+
+ /*
+ * Adjust resno in a copied target entry, to point into the
+ * combined slot.
+ */
+ foreach(arg, pertrans->aggref->args)
+ {
+ TargetEntry *tle;
+
+ Assert(IsA(lfirst(arg), TargetEntry));
+ tle = copyObject(lfirst(arg));
+ tle->resno += offset;
+
+ inputeval = lappend(inputeval, tle);
+ }
+
+ offset += list_length(pertrans->aggref->args);
+ }
+
+ aggstate->evaldesc = ExecTypeFromTL(inputeval, false);
+
+ aggstate->evalslot = ExecInitExtraTupleSlot(estate);
+
+ inputeval = (List *) ExecInitExpr((Expr *) inputeval,
+ (PlanState *) aggstate);
+ aggstate->evalproj = ExecBuildProjectionInfo(inputeval,
+ aggstate->tmpcontext,
+ aggstate->evalslot,
+ NULL);
+ ExecSetSlotDescriptor(aggstate->evalslot, aggstate->evaldesc);
+ }
+
return aggstate;
}
@@ -3127,12 +3199,6 @@ build_pertrans_for_aggref(AggStatePerTrans pertrans,
(errcode(ERRCODE_GROUPING_ERROR),
errmsg("aggregate function calls cannot be nested")));
- /* Set up projection info for evaluation */
- pertrans->evalproj = ExecBuildProjectionInfo(pertrans->args,
- aggstate->tmpcontext,
- pertrans->evalslot,
- NULL);
-
/*
* If we're doing either DISTINCT or ORDER BY for a plain agg, then we
* have a list of SortGroupClause nodes; fish out the data in them and
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f6f73f3..30e1aee 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1863,6 +1863,10 @@ typedef struct AggState
List *hash_needed; /* list of columns needed in hash table */
bool table_filled; /* hash table filled yet? */
TupleHashIterator hashiter; /* for iterating through hash table */
+ /* support for evaluation of agg inputs */
+ TupleTableSlot *evalslot; /* slot for agg inputs */
+ ProjectionInfo *evalproj; /* projection machinery */
+ TupleDesc evaldesc; /* descriptor of input tuples */
} AggState;
/* ----------------
--
2.10.2
0002-User-narrower-representative-tuples-in-the-hash-agg-.patch (text/x-patch)
From a51a696ff7c56e7c8d6720f8375a0253462eee1f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 2 Nov 2016 05:24:18 -0700
Subject: [PATCH 2/2] Use narrower representative tuples in the hash-agg
hashtable.
So far the hashtable stored representative tuples in the form of its
input slot, with all columns in the hashtable that are not
needed (i.e. not grouped upon or functionally dependent) set to NULL.
That's good for saving memory, but it turns out that having NULLs in
tuples isn't free: slot_deform_tuple is faster if there's no NULL bitmap
even when no NULLs are encountered, and skipping over NULLs isn't free
either. So compute a separate tuple descriptor that contains only the
needed columns. As columns are already moved in/out of the slot for the
hashtable, that does not imply additional overhead.
Author: Andres Freund
---
src/backend/executor/nodeAgg.c | 162 +++++++++++++++++++++++++++--------------
src/include/nodes/execnodes.h | 5 +-
2 files changed, 111 insertions(+), 56 deletions(-)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 3ed6fb2..f00ef02 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1718,7 +1718,7 @@ build_hash_table(AggState *aggstate)
additionalsize = aggstate->numaggs * sizeof(AggStatePerGroupData);
aggstate->hashtable = BuildTupleHashTable(node->numCols,
- node->grpColIdx,
+ aggstate->hashGrpColIdxHash,
aggstate->phase->eqfunctions,
aggstate->hashfunctions,
node->numGroups,
@@ -1728,29 +1728,24 @@ build_hash_table(AggState *aggstate)
}
/*
- * Create a list of the tuple columns that actually need to be stored in
- * hashtable entries. The incoming tuples from the child plan node will
- * contain grouping columns, other columns referenced in our targetlist and
- * qual, columns used to compute the aggregate functions, and perhaps just
- * junk columns we don't use at all. Only columns of the first two types
- * need to be stored in the hashtable, and getting rid of the others can
- * make the table entries significantly smaller. To avoid messing up Var
- * numbering, we keep the same tuple descriptor for hashtable entries as the
- * incoming tuples have, but set unwanted columns to NULL in the tuples that
- * go into the table.
+ * Compute the columns that actually need to be stored in hashtable entries.
+ * The incoming tuples from the child plan node will contain grouping columns,
+ * other columns referenced in our targetlist and qual, columns used to
+ * compute the aggregate functions, and perhaps just junk columns we don't use
+ * at all. Only columns of the first two types need to be stored in the
+ * hashtable, and getting rid of the others can make the table entries
+ * significantly smaller. The hashtable only contains the relevant columns,
+ * and is packed/unpacked in lookup_hash_entry() / agg_retrieve_hash_table()
+ * into the format of the normal input descriptor.
*
- * To eliminate duplicates, we build a bitmapset of the needed columns, then
- * convert it to an integer list (cheaper to scan at runtime). The list is
- * in decreasing order so that the first entry is the largest;
- * lookup_hash_entry depends on this to use slot_getsomeattrs correctly.
- * Note that the list is preserved over ExecReScanAgg, so we allocate it in
- * the per-query context (unlike the hash table itself).
+ * Additional columns, in addition to the columns grouped by, come from two
+ * sources: Firstly functionally dependent columns that we don't need to group
+ * by themselves, and secondly ctids for row-marks.
*
- * Note: at present, searching the tlist/qual is not really necessary since
- * the parser should disallow any unaggregated references to ungrouped
- * columns. However, the search will be needed when we add support for
- * SQL99 semantics that allow use of "functionally dependent" columns that
- * haven't been explicitly grouped by.
+ * To eliminate duplicates, we build a bitmapset of the needed columns, and
+ * then build an array of the columns included in the hashtable. Note that
+ * the array is preserved over ExecReScanAgg, so we allocate it in the
+ * per-query context (unlike the hash table itself).
*/
static List *
find_hash_columns(AggState *aggstate)
@@ -1758,8 +1753,13 @@ find_hash_columns(AggState *aggstate)
Agg *node = (Agg *) aggstate->ss.ps.plan;
Bitmapset *colnos;
List *collist;
+ TupleDesc hashDesc;
+ List *outerTlist = outerPlanState(aggstate)->plan->targetlist;
+ List *hashTlist = NIL;
int i;
+ aggstate->largestGrpColIdx = 0;
+
/* Find Vars that will be needed in tlist and qual */
colnos = find_unaggregated_cols(aggstate);
/* Add in all the grouping columns */
@@ -1767,8 +1767,49 @@ find_hash_columns(AggState *aggstate)
colnos = bms_add_member(colnos, node->grpColIdx[i]);
/* Convert to list, using lcons so largest element ends up first */
collist = NIL;
+
+ aggstate->hashGrpColIdxInput =
+ palloc(bms_num_members(colnos) * sizeof(AttrNumber));
+ aggstate->hashGrpColIdxHash =
+ palloc(node->numCols * sizeof(AttrNumber));
+
+ /*
+ * First build mapping for columns directly hashed. These are the first,
+ * because they'll be accessed when computing hash values and comparing
+ * tuples for exact matches. We also build simple mapping for
+ * execGrouping, so it knows where to find the to-be-hashed / compared
+ * columns in the input.
+ */
+ for (i = 0; i < node->numCols; i++)
+ {
+ aggstate->hashGrpColIdxInput[i] = node->grpColIdx[i];
+ aggstate->hashGrpColIdxHash[i] = i + 1;
+ aggstate->numhashGrpCols++;
+ /* delete already mapped columns */
+ bms_del_member(colnos, node->grpColIdx[i]);
+ }
+
+ /* and add the remaining columns */
while ((i = bms_first_member(colnos)) >= 0)
- collist = lcons_int(i, collist);
+ {
+ aggstate->hashGrpColIdxInput[aggstate->numhashGrpCols] = i;
+ aggstate->numhashGrpCols++;
+ }
+
+ /* and build a tuple descriptor for the hashtable */
+ for (i = 0; i < aggstate->numhashGrpCols; i++)
+ {
+ int varNumber = aggstate->hashGrpColIdxInput[i] - 1;
+
+ hashTlist = lappend(hashTlist, list_nth(outerTlist, varNumber));
+ aggstate->largestGrpColIdx =
+ Max(varNumber + 1, aggstate->largestGrpColIdx);
+ }
+
+ hashDesc = ExecTypeFromTL(hashTlist, false);
+ ExecSetSlotDescriptor(aggstate->hashslot, hashDesc);
+
+ list_free(hashTlist);
bms_free(colnos);
return collist;
@@ -1805,27 +1846,22 @@ static TupleHashEntryData *
lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
{
TupleTableSlot *hashslot = aggstate->hashslot;
- ListCell *l;
TupleHashEntryData *entry;
bool isnew;
-
- /* if first time through, initialize hashslot by cloning input slot */
- if (hashslot->tts_tupleDescriptor == NULL)
- {
- ExecSetSlotDescriptor(hashslot, inputslot->tts_tupleDescriptor);
- /* Make sure all unused columns are NULLs */
- ExecStoreAllNullTuple(hashslot);
- }
+ int i;
/* transfer just the needed columns into hashslot */
- slot_getsomeattrs(inputslot, linitial_int(aggstate->hash_needed));
- foreach(l, aggstate->hash_needed)
- {
- int varNumber = lfirst_int(l) - 1;
+ slot_getsomeattrs(inputslot, aggstate->largestGrpColIdx);
+ ExecClearTuple(hashslot);
- hashslot->tts_values[varNumber] = inputslot->tts_values[varNumber];
- hashslot->tts_isnull[varNumber] = inputslot->tts_isnull[varNumber];
+ for (i = 0; i < aggstate->numhashGrpCols; i++)
+ {
+ int varNumber = aggstate->hashGrpColIdxInput[i] - 1;
+
+ hashslot->tts_values[i] = inputslot->tts_values[varNumber];
+ hashslot->tts_isnull[i] = inputslot->tts_isnull[varNumber];
}
+ ExecStoreVirtualTuple(hashslot);
/* find or create the hashtable entry using the filtered tuple */
entry = LookupTupleHashEntry(aggstate->hashtable, hashslot, &isnew);
@@ -2287,6 +2323,7 @@ agg_retrieve_hash_table(AggState *aggstate)
TupleHashEntryData *entry;
TupleTableSlot *firstSlot;
TupleTableSlot *result;
+ TupleTableSlot *hashslot;
/*
* get state info from node
@@ -2295,6 +2332,8 @@ agg_retrieve_hash_table(AggState *aggstate)
econtext = aggstate->ss.ps.ps_ExprContext;
peragg = aggstate->peragg;
firstSlot = aggstate->ss.ss_ScanTupleSlot;
+ hashslot = aggstate->hashslot;
+
/*
* We loop retrieving groups until we find one satisfying
@@ -2302,6 +2341,8 @@ agg_retrieve_hash_table(AggState *aggstate)
*/
while (!aggstate->agg_done)
{
+ int i;
+
/*
* Find the next entry in the hash table
*/
@@ -2323,12 +2364,24 @@ agg_retrieve_hash_table(AggState *aggstate)
ResetExprContext(econtext);
/*
- * Store the copied first input tuple in the tuple table slot reserved
- * for it, so that it can be used in ExecProject.
+ * Transform representative tuple back into one with the right
+ * columns.
*/
- ExecStoreMinimalTuple(entry->firstTuple,
- firstSlot,
- false);
+ ExecStoreMinimalTuple(entry->firstTuple, hashslot, false);
+ slot_getallattrs(hashslot);
+ ExecClearTuple(firstSlot);
+ memset(firstSlot->tts_isnull, true,
+ firstSlot->tts_tupleDescriptor->natts * sizeof(bool));
+
+
+ for (i = 0; i < aggstate->numhashGrpCols; i++)
+ {
+ int varNumber = aggstate->hashGrpColIdxInput[i] - 1;
+
+ firstSlot->tts_values[varNumber] = hashslot->tts_values[i];
+ firstSlot->tts_isnull[varNumber] = hashslot->tts_isnull[i];
+ }
+ ExecStoreVirtualTuple(firstSlot);
pergroup = (AggStatePerGroup) entry->additional;
@@ -2604,16 +2657,6 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->all_grouped_cols = lcons_int(i, aggstate->all_grouped_cols);
/*
- * Hashing can only appear in the initial phase.
- */
-
- if (node->aggstrategy == AGG_HASHED)
- execTuplesHashPrepare(node->numCols,
- node->grpOperators,
- &aggstate->phases[0].eqfunctions,
- &aggstate->hashfunctions);
-
- /*
* Initialize current phase-dependent values to initial phase
*/
@@ -2634,12 +2677,21 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->peragg = peraggs;
aggstate->pertrans = pertransstates;
+
+ /*
+ * Hashing can only appear in the initial phase.
+ */
if (node->aggstrategy == AGG_HASHED)
{
+ find_hash_columns(aggstate);
+
+ execTuplesHashPrepare(node->numCols,
+ node->grpOperators,
+ &aggstate->phases[0].eqfunctions,
+ &aggstate->hashfunctions);
+
build_hash_table(aggstate);
aggstate->table_filled = false;
- /* Compute the columns we actually need to hash on */
- aggstate->hash_needed = find_hash_columns(aggstate);
}
else
{
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 30e1aee..4fd5011 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1860,7 +1860,10 @@ typedef struct AggState
/* these fields are used in AGG_HASHED mode: */
TupleHashTable hashtable; /* hash table with one entry per group */
TupleTableSlot *hashslot; /* slot for loading hash table */
- List *hash_needed; /* list of columns needed in hash table */
+ int numhashGrpCols; /* number of columns in hash table */
+ int largestGrpColIdx; /* largest column required for hashing */
+ AttrNumber *hashGrpColIdxInput; /* and their indices in input slot */
+ AttrNumber *hashGrpColIdxHash; /* indices for execGrouping in hashtbl */
bool table_filled; /* hash table filled yet? */
TupleHashIterator hashiter; /* for iterating through hash table */
/* support for evaluation of agg inputs */
--
2.10.2
On 11/03/2016 01:07 PM, Andres Freund wrote:
Hi,
There's two things I found while working on faster expression
evaluation, slot deforming and batched execution. As those two issues
often seemed quite dominant cost-wise it seemed worthwhile to evaluate
them independently.
1) We atm do one ExecProject() to compute each aggregate's
arguments. Turns out it's noticeably faster to compute the argument
for all aggregates in one go. Both because it reduces the amount of
function call / moves more things into a relatively tight loop, and
because it allows to deform all the required columns at once, rather
than one-by-one. For a single aggregate it'd be faster to avoid
ExecProject alltogether (i.e. directly evaluate the expression as we
used to), but as soon you have two the new approach is faster.
Makes sense. If we do your refactoring of ExecEvalExpr into an
intermediate opcode representation, I assume the performance difference
will go away anyway.
2) For hash-aggs we right now we store the representative tuple using
the input tuple's format, with unneeded columns set to NULL. That
turns out to be expensive if the aggregated-on columns are not
leading columns, because we have to skip over a potentially large
number of NULLs. The fix here is to simply use a different tuple
format for the hashtable. That doesn't cause overhead, because we
already move columns in/out of the hashslot explicitly anyway.
Heh, I came to the same conclusion a couple of months ago when I was
profiling the aggregate code. I never got around to finish up and post
the patch I wrote back then, but here you go, for comparison. It's
pretty much the same as what you got here. So yeah, seems like a good idea.
- Heikki
Attachments:
0001-Don-t-store-unused-columns-in-hash-table.patch (text/x-diff)
From 18a5098a340e7c8a18bea7e83f87b181f65d1976 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 1 Sep 2016 10:42:32 +0300
Subject: [PATCH 1/1] Don't store unused columns in hash table.
---
src/backend/executor/nodeAgg.c | 129 ++++++++++++++++++++++++++++++++---------
src/include/nodes/execnodes.h | 6 +-
2 files changed, 108 insertions(+), 27 deletions(-)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index ce2fc28..2521423 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1654,6 +1654,13 @@ build_hash_table(AggState *aggstate)
Agg *node = (Agg *) aggstate->ss.ps.plan;
MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
Size entrysize;
+ int i;
+ AttrNumber *dummyColIdx;
+
+ dummyColIdx = MemoryContextAlloc(aggstate->aggcontexts[0]->ecxt_per_tuple_memory,
+ aggstate->numHashCols * sizeof(AttrNumber));
+ for (i = 0; i < aggstate->numHashCols; i++)
+ dummyColIdx[i] = i + 1;
Assert(node->aggstrategy == AGG_HASHED);
Assert(node->numGroups > 0);
@@ -1661,8 +1668,8 @@ build_hash_table(AggState *aggstate)
entrysize = offsetof(AggHashEntryData, pergroup) +
aggstate->numaggs * sizeof(AggStatePerGroupData);
- aggstate->hashtable = BuildTupleHashTable(node->numCols,
- node->grpColIdx,
+ aggstate->hashtable = BuildTupleHashTable(aggstate->numHashCols,
+ dummyColIdx,
aggstate->phase->eqfunctions,
aggstate->hashfunctions,
node->numGroups,
@@ -1695,27 +1702,71 @@ build_hash_table(AggState *aggstate)
* columns. However, the search will be needed when we add support for
* SQL99 semantics that allow use of "functionally dependent" columns that
* haven't been explicitly grouped by.
+ *
+ *
+ *
+ * We build two mapping arrays. One to convert an original input tuple to
+ * the format stored in the hash table. Another to convert back.
*/
-static List *
+static void
find_hash_columns(AggState *aggstate)
{
Agg *node = (Agg *) aggstate->ss.ps.plan;
- Bitmapset *colnos;
- List *collist;
+ Bitmapset *colnos = NULL;
+ Bitmapset *other_colnos;
int i;
+ int numHashCols;
+ AttrNumber *mapping;
+ AttrNumber maxHashColIdx = 0;
/* Find Vars that will be needed in tlist and qual */
- colnos = find_unaggregated_cols(aggstate);
- /* Add in all the grouping columns */
+ other_colnos = find_unaggregated_cols(aggstate);
+
+ mapping = palloc((bms_num_members(other_colnos) + node->numCols) * sizeof(AttrNumber));
+
+ numHashCols = 0;
+
+ /*
+ * Begin by putting all the grouping columns in the front, so that they're
+ * fast to access.
+ */
for (i = 0; i < node->numCols; i++)
- colnos = bms_add_member(colnos, node->grpColIdx[i]);
- /* Convert to list, using lcons so largest element ends up first */
- collist = NIL;
- while ((i = bms_first_member(colnos)) >= 0)
- collist = lcons_int(i, collist);
+ {
+ AttrNumber keyColIdx = node->grpColIdx[i];
+
+ mapping[numHashCols++] = keyColIdx;
+ if (keyColIdx > maxHashColIdx)
+ maxHashColIdx = keyColIdx;
+
+ /*
+ * Note that we don't deduplicate key columns. That would complicate
+ * the comparisons. Don't write silly queries! (Not sure if that would get
+ * through the parser and optimizer, though). But make note of the
+ * columns, so that if they also appear in the targetlist or quals,
+ * we don't duplicate with those.
+ */
+ colnos = bms_add_member(colnos, keyColIdx);
+ }
+
+ /*
+ * Add the other columns to the mapping.
+ */
+ while ((i = bms_first_member(other_colnos)) >= 0)
+ {
+ if (!bms_is_member(i, colnos))
+ {
+ mapping[numHashCols++] = i;
+ if (i > maxHashColIdx)
+ maxHashColIdx = i;
+ colnos = bms_add_member(colnos, i);
+ }
+ }
bms_free(colnos);
+ bms_free(other_colnos);
- return collist;
+ aggstate->hashCols = mapping;
+ aggstate->numHashCols = numHashCols;
+ aggstate->maxHashColIdx = maxHashColIdx;
}
/*
@@ -1748,27 +1799,39 @@ static AggHashEntry
lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
{
TupleTableSlot *hashslot = aggstate->hashslot;
- ListCell *l;
+ int i;
AggHashEntry entry;
bool isnew;
/* if first time through, initialize hashslot by cloning input slot */
if (hashslot->tts_tupleDescriptor == NULL)
{
- ExecSetSlotDescriptor(hashslot, inputslot->tts_tupleDescriptor);
- /* Make sure all unused columns are NULLs */
- ExecStoreAllNullTuple(hashslot);
+ TupleDesc hashDesc;
+
+ hashDesc = CreateTemplateTupleDesc(aggstate->numHashCols, false);
+ for (i = 0; i < aggstate->numHashCols; i++)
+ {
+ int varNumber = aggstate->hashCols[i];
+
+ TupleDescCopyEntry(hashDesc, i + 1,
+ inputslot->tts_tupleDescriptor,
+ varNumber);
+ }
+
+ ExecSetSlotDescriptor(hashslot, hashDesc);
}
/* transfer just the needed columns into hashslot */
- slot_getsomeattrs(inputslot, linitial_int(aggstate->hash_needed));
- foreach(l, aggstate->hash_needed)
+ slot_getsomeattrs(inputslot, aggstate->maxHashColIdx);
+ ExecClearTuple(hashslot);
+ for (i = 0; i < aggstate->numHashCols; i++)
{
- int varNumber = lfirst_int(l) - 1;
+ int varNumber = aggstate->hashCols[i];
- hashslot->tts_values[varNumber] = inputslot->tts_values[varNumber];
- hashslot->tts_isnull[varNumber] = inputslot->tts_isnull[varNumber];
- }
+ hashslot->tts_values[i] = inputslot->tts_values[varNumber - 1];
+ hashslot->tts_isnull[i] = inputslot->tts_isnull[varNumber - 1];
+ }
+ ExecStoreVirtualTuple(hashslot);
/* find or create the hashtable entry using the filtered tuple */
entry = (AggHashEntry) LookupTupleHashEntry(aggstate->hashtable,
@@ -2228,6 +2291,8 @@ agg_retrieve_hash_table(AggState *aggstate)
AggHashEntry entry;
TupleTableSlot *firstSlot;
TupleTableSlot *result;
+ TupleTableSlot *hashslot = aggstate->hashslot;
+ int i;
/*
* get state info from node
@@ -2268,8 +2333,19 @@ agg_retrieve_hash_table(AggState *aggstate)
* for it, so that it can be used in ExecProject.
*/
ExecStoreMinimalTuple(entry->shared.firstTuple,
- firstSlot,
+ hashslot,
false);
+ slot_getallattrs(hashslot);
+
+ ExecStoreAllNullTuple(firstSlot);
+ ExecClearTuple(firstSlot);
+ for (i = 0; i < aggstate->numHashCols; i++)
+ {
+ AttrNumber varNumber = aggstate->hashCols[i];
+ firstSlot->tts_values[varNumber - 1] = hashslot->tts_values[i];
+ firstSlot->tts_isnull[varNumber - 1] = hashslot->tts_isnull[i];
+ }
+ ExecStoreVirtualTuple(firstSlot);
pergroup = entry->pergroup;
@@ -2577,10 +2653,11 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
if (node->aggstrategy == AGG_HASHED)
{
+ /* Compute the columns we actually need to hash on */
+ find_hash_columns(aggstate);
+
build_hash_table(aggstate);
aggstate->table_filled = false;
- /* Compute the columns we actually need to hash on */
- aggstate->hash_needed = find_hash_columns(aggstate);
}
else
{
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a4ea1b9..bc23c5f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1853,7 +1853,11 @@ typedef struct AggState
/* these fields are used in AGG_HASHED mode: */
TupleHashTable hashtable; /* hash table with one entry per group */
TupleTableSlot *hashslot; /* slot for loading hash table */
- List *hash_needed; /* list of columns needed in hash table */
+
+ AttrNumber *hashCols; /* mapping from input tuple to hashed tuple */
+ int numHashCols;
+ AttrNumber maxHashColIdx; /* highest column from input tuple that we need to include in the hash table */
+
bool table_filled; /* hash table filled yet? */
TupleHashIterator hashiter; /* for iterating through hash table */
} AggState;
--
2.9.3
On 2016-11-04 15:18:49 +0200, Heikki Linnakangas wrote:
On 11/03/2016 01:07 PM, Andres Freund wrote:
Hi,
There's two things I found while working on faster expression
evaluation, slot deforming and batched execution. As those two issues
often seemed quite dominant cost-wise it seemed worthwhile to evaluate
them independently.
1) We atm do one ExecProject() to compute each aggregate's
arguments. Turns out it's noticeably faster to compute the argument
for all aggregates in one go. Both because it reduces the amount of
function call / moves more things into a relatively tight loop, and
because it allows to deform all the required columns at once, rather
than one-by-one. For a single aggregate it'd be faster to avoid
ExecProject alltogether (i.e. directly evaluate the expression as we
used to), but as soon you have two the new approach is faster.
Makes sense. If we do your refactoring of ExecEvalExpr into an intermediate
opcode representation, I assume the performance difference will go away
anyway.
Precisely.
2) For hash-aggs we right now we store the representative tuple using
the input tuple's format, with unneeded columns set to NULL. That
turns out to be expensive if the aggregated-on columns are not
leading columns, because we have to skip over a potentially large
number of NULLs. The fix here is to simply use a different tuple
format for the hashtable. That doesn't cause overhead, because we
already move columns in/out of the hashslot explicitly anyway.
Heh, I came to the same conclusion a couple of months ago when I was
profiling the aggregate code. I never got around to finish up and post the
patch I wrote back then, but here you go, for comparison. It's pretty much
the same as what you got here. So yeah, seems like a good idea.
+ /*
+ * Note that we don't deduplicate key columns. That would complicate
+ * the comparisons. Don't write silly queries! (Not sure if that would get
+ * through the parser and optimizer, though).
Hehe. You made me spill more coffee.
Thanks for looking!
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi,
On 2016-11-03 04:07:21 -0700, Andres Freund wrote:
Hi,
There's two things I found while working on faster expression
evaluation, slot deforming and batched execution. As those two issues
often seemed quite dominant cost-wise it seemed worthwhile to evaluate
them independently.
1) We atm do one ExecProject() to compute each aggregate's
arguments. Turns out it's noticeably faster to compute the argument
for all aggregates in one go. Both because it reduces the amount of
function call / moves more things into a relatively tight loop, and
because it allows to deform all the required columns at once, rather
than one-by-one. For a single aggregate it'd be faster to avoid
ExecProject alltogether (i.e. directly evaluate the expression as we
used to), but as soon you have two the new approach is faster.
2) For hash-aggs we right now we store the representative tuple using
the input tuple's format, with unneeded columns set to NULL. That
turns out to be expensive if the aggregated-on columns are not
leading columns, because we have to skip over a potentially large
number of NULLs. The fix here is to simply use a different tuple
format for the hashtable. That doesn't cause overhead, because we
already move columns in/out of the hashslot explicitly anyway.
I pushed a bit more polished versions of these.
Andres