9.5: Memory-bounded HashAgg
This patch requires the Memory Accounting patch, or something similar
to track memory usage.
The attached patch enables hashagg to spill to disk, which means that
hashagg will contain itself to work_mem even if the planner badly
misestimates the cardinality.
This is a well-known concept; there's even a Berkeley homework
assignment floating around to implement it -- in postgres 7.2, no
less. I didn't take the exact same approach as the homework assignment
suggests, but it's not much different, either. My apologies if some
classes are still using this as a homework assignment, but postgres
needs to eventually have an answer to this problem.
Included is a GUC, "enable_hashagg_disk" (default on), which allows
the planner to choose hashagg even if it doesn't expect the hashtable
to fit in memory. If it's off, and the planner misestimates the
cardinality, hashagg will still use the disk to contain itself to
work_mem.
One situation that might surprise the user is if work_mem is set too
low, and the user is *relying* on a misestimate to pick hashagg. With
this patch, it would end up going to disk, which might be
significantly slower. The solution for the user is to increase
work_mem.
Rough Design:
Change the hash aggregate algorithm to accept a generic "work item",
which consists of an input file as well as some other bookkeeping
information.
Initially prime the algorithm by adding a single work item where the
file is NULL, indicating that it should read from the outer plan.
If the memory is exhausted during execution of a work item, then
continue to allow existing groups to be aggregated, but do not allow new
groups to be created in the hash table. Tuples representing new groups
are saved in an output partition file referenced in the work item that
is currently being executed.
When the work item is done, emit any groups in the hash table, clear the
hash table, and turn each output partition file into a new work item.
Each time through at least some groups are able to stay in the hash
table, so eventually none will need to be saved in output partitions, no
new work items will be created, and the algorithm will terminate. This
is true even if the number of output partitions is always one.
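To make the control flow concrete, here is a deliberately tiny,
self-contained C model of the algorithm above: it counts integer keys
with a toy four-slot "hash table", spills keys for new groups to
tmpfile() partitions once the table is full, and then recurses over each
spill file. Every name and limit in it is invented for illustration; the
real patch works on MinimalTuples, BufFiles and work_mem, and keeps a
list of work items instead of recursing.

#include <stdio.h>

#define TABLE_SLOTS 4        /* stand-in for the work_mem limit */
#define NPARTITIONS 2        /* stand-in for the output partitions */

typedef struct WorkItem
{
    FILE   *input;           /* NULL means "read from the outer plan" */
} WorkItem;

static int  keys[TABLE_SLOTS];
static long counts[TABLE_SLOTS];
static int  nslots;

/* Advance the key's group if it is (or still fits) in the table. */
static int
advance(int key)
{
    int i;

    for (i = 0; i < nslots; i++)
        if (keys[i] == key)
        {
            counts[i]++;
            return 1;
        }
    if (nslots < TABLE_SLOTS)
    {
        keys[nslots] = key;
        counts[nslots++] = 1;
        return 1;
    }
    return 0;                /* table "full": caller must spill the tuple */
}

static void
process(WorkItem *work, const int *outer, int nouter)
{
    FILE   *parts[NPARTITIONS] = {NULL, NULL};
    int     key, i;

    nslots = 0;
    for (i = 0;; i++)
    {
        if (work->input == NULL)
        {
            if (i >= nouter)
                break;
            key = outer[i];
        }
        else if (fread(&key, sizeof(key), 1, work->input) != 1)
            break;

        /* Existing groups keep aggregating; new groups are spilled. */
        if (!advance(key))
        {
            int p = (unsigned) key % NPARTITIONS;

            if (parts[p] == NULL)
                parts[p] = tmpfile();
            fwrite(&key, sizeof(key), 1, parts[p]);
        }
    }

    /* Emit the finished groups; each one saw all of its tuples. */
    for (i = 0; i < nslots; i++)
        printf("key %d -> count %ld\n", keys[i], counts[i]);

    if (work->input)
        fclose(work->input);

    /* Each non-empty output partition becomes a new work item. */
    for (i = 0; i < NPARTITIONS; i++)
        if (parts[i])
        {
            WorkItem child = {parts[i]};

            rewind(parts[i]);
            process(&child, NULL, 0);
        }
}

int
main(void)
{
    int      sample[] = {1, 2, 3, 4, 5, 6, 1, 2, 7, 8, 5};
    WorkItem top = {NULL};

    process(&top, sample, (int) (sizeof(sample) / sizeof(sample[0])));
    return 0;
}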
Open items:
* costing
* EXPLAIN details for disk usage
* choose number of partitions intelligently
* performance testing
Initial tests indicate that it can be competitive with sort+groupagg
when the disk is involved, but more testing is required.
Feedback welcome.
Regards,
Jeff Davis
Attachments:
hashagg-disk-20140810.patch (text/x-patch)
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 2884,2889 **** include_dir 'conf.d'
--- 2884,2904 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-hashagg-disk" xreflabel="enable_hashagg_disk">
+ <term><varname>enable_hashagg_disk</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_hashagg_disk</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the planner expects the hash table size to exceed
+ <varname>work_mem</varname>. The default is <literal>on</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
<term><varname>enable_hashjoin</varname> (<type>boolean</type>)
<indexterm>
*** a/src/backend/executor/execGrouping.c
--- b/src/backend/executor/execGrouping.c
***************
*** 331,336 **** TupleHashEntry
--- 331,385 ----
LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
bool *isnew)
{
+ uint32 hashvalue;
+
+ hashvalue = TupleHashEntryHash(hashtable, slot);
+ return LookupTupleHashEntryHash(hashtable, slot, hashvalue, isnew);
+ }
+
+ /*
+ * TupleHashEntryHash
+ *
+ * Calculate the hash value of the tuple.
+ */
+ uint32
+ TupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot)
+ {
+ TupleHashEntryData dummy;
+ TupleHashTable saveCurHT;
+ uint32 hashvalue;
+
+ /*
+ * Set up data needed by hash function.
+ *
+ * We save and restore CurTupleHashTable just in case someone manages to
+ * invoke this code re-entrantly.
+ */
+ hashtable->inputslot = slot;
+ hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ hashtable->cur_eq_funcs = hashtable->tab_eq_funcs;
+
+ saveCurHT = CurTupleHashTable;
+ CurTupleHashTable = hashtable;
+
+ dummy.firstTuple = NULL; /* flag to reference inputslot */
+ hashvalue = TupleHashTableHash(&dummy, sizeof(TupleHashEntryData));
+
+ CurTupleHashTable = saveCurHT;
+
+ return hashvalue;
+ }
+
+ /*
+ * LookupTupleHashEntryHash
+ *
+ * Like LookupTupleHashEntry, but allows the caller to specify the tuple's
+ * hash value, to avoid recalculating it.
+ */
+ TupleHashEntry
+ LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot,
+ uint32 hashvalue, bool *isnew)
+ {
TupleHashEntry entry;
MemoryContext oldContext;
TupleHashTable saveCurHT;
***************
*** 371,380 **** LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
/* Search the hash table */
dummy.firstTuple = NULL; /* flag to reference inputslot */
! entry = (TupleHashEntry) hash_search(hashtable->hashtab,
! &dummy,
! isnew ? HASH_ENTER : HASH_FIND,
! &found);
if (isnew)
{
--- 420,428 ----
/* Search the hash table */
dummy.firstTuple = NULL; /* flag to reference inputslot */
! entry = (TupleHashEntry) hash_search_with_hash_value(
! hashtable->hashtab, &dummy, hashvalue, isnew ? HASH_ENTER : HASH_FIND,
! &found);
if (isnew)
{
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
***************
*** 108,121 ****
--- 108,126 ----
#include "optimizer/tlist.h"
#include "parser/parse_agg.h"
#include "parser/parse_coerce.h"
+ #include "storage/buffile.h"
#include "utils/acl.h"
#include "utils/builtins.h"
+ #include "utils/dynahash.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
#include "utils/datum.h"
+ #define HASH_DISK_MIN_PARTITIONS 1
+ #define HASH_DISK_DEFAULT_PARTITIONS 16
+ #define HASH_DISK_MAX_PARTITIONS 256
/*
* AggStatePerAggData - per-aggregate working state for the Agg scan
***************
*** 301,306 **** typedef struct AggHashEntryData
--- 306,321 ----
AggStatePerGroupData pergroup[1]; /* VARIABLE LENGTH ARRAY */
} AggHashEntryData; /* VARIABLE LENGTH STRUCT */
+ typedef struct HashWork
+ {
+ BufFile *input_file; /* input partition, NULL for outer plan */
+ int input_bits; /* number of bits for input partition mask */
+
+ int n_output_partitions; /* number of output partitions */
+ BufFile **output_partitions; /* output partition files */
+ int *output_ntuples; /* number of tuples in each partition */
+ int output_bits; /* log2(n_output_partitions) + input_bits */
+ } HashWork;
static void initialize_aggregates(AggState *aggstate,
AggStatePerAgg peragg,
***************
*** 322,331 **** static void finalize_aggregate(AggState *aggstate,
static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
static void build_hash_table(AggState *aggstate);
! static AggHashEntry lookup_hash_entry(AggState *aggstate,
! TupleTableSlot *inputslot);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
! static void agg_fill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
--- 337,349 ----
static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
static void build_hash_table(AggState *aggstate);
! static AggHashEntry lookup_hash_entry(AggState *aggstate, HashWork *work,
! TupleTableSlot *inputslot, uint32 hashvalue);
! static HashWork *hash_work(BufFile *input_file, int input_bits);
! static void save_tuple(AggState *aggstate, HashWork *work,
! TupleTableSlot *slot, uint32 hashvalue);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
! static bool agg_fill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
***************
*** 946,953 **** build_hash_table(AggState *aggstate)
aggstate->hashfunctions,
node->numGroups,
entrysize,
! aggstate->aggcontext,
tmpmem);
}
/*
--- 964,974 ----
aggstate->hashfunctions,
node->numGroups,
entrysize,
! aggstate->hashcontext,
tmpmem);
+
+ aggstate->hash_mem_min = MemoryContextGetAllocated(
+ aggstate->hashcontext, true);
}
/*
***************
*** 1024,1035 **** hash_agg_entry_size(int numAggs)
* When called, CurrentMemoryContext should be the per-query context.
*/
static AggHashEntry
! lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
{
TupleTableSlot *hashslot = aggstate->hashslot;
ListCell *l;
AggHashEntry entry;
! bool isnew;
/* if first time through, initialize hashslot by cloning input slot */
if (hashslot->tts_tupleDescriptor == NULL)
--- 1045,1059 ----
* When called, CurrentMemoryContext should be the per-query context.
*/
static AggHashEntry
! lookup_hash_entry(AggState *aggstate, HashWork *work,
! TupleTableSlot *inputslot, uint32 hashvalue)
{
TupleTableSlot *hashslot = aggstate->hashslot;
ListCell *l;
AggHashEntry entry;
! int64 hash_mem;
! bool isnew = false;
! bool *p_isnew;
/* if first time through, initialize hashslot by cloning input slot */
if (hashslot->tts_tupleDescriptor == NULL)
***************
*** 1049,1058 **** lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
hashslot->tts_isnull[varNumber] = inputslot->tts_isnull[varNumber];
}
/* find or create the hashtable entry using the filtered tuple */
! entry = (AggHashEntry) LookupTupleHashEntry(aggstate->hashtable,
! hashslot,
! &isnew);
if (isnew)
{
--- 1073,1089 ----
hashslot->tts_isnull[varNumber] = inputslot->tts_isnull[varNumber];
}
+ hash_mem = MemoryContextGetAllocated(aggstate->hashcontext, true);
+ if (hash_mem == aggstate->hash_mem_min ||
+ hash_mem < work_mem * 1024L)
+ p_isnew = &isnew;
+ else
+ p_isnew = NULL;
+
/* find or create the hashtable entry using the filtered tuple */
! entry = (AggHashEntry) LookupTupleHashEntryHash(aggstate->hashtable,
! hashslot, hashvalue,
! p_isnew);
if (isnew)
{
***************
*** 1060,1068 **** lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
--- 1091,1242 ----
initialize_aggregates(aggstate, aggstate->peragg, entry->pergroup);
}
+ if (entry == NULL)
+ save_tuple(aggstate, work, inputslot, hashvalue);
+
return entry;
}
+
+ /*
+ * hash_work
+ *
+ * Construct a HashWork item, which represents one iteration of HashAgg to be
+ * done. Should be called in the aggregate's memory context.
+ */
+ static HashWork *
+ hash_work(BufFile *input_file, int input_bits)
+ {
+ HashWork *work = palloc(sizeof(HashWork));
+
+ work->input_file = input_file;
+ work->input_bits = input_bits;
+
+ /*
+ * Will be set only if we run out of memory and need to partition an
+ * additional level.
+ */
+ work->n_output_partitions = 0;
+ work->output_partitions = NULL;
+ work->output_ntuples = NULL;
+ work->output_bits = 0;
+
+ return work;
+ }
+
+ /*
+ * save_tuple
+ *
+ * Not enough memory to add tuple as new entry in hash table. Save for later
+ * in the appropriate partition.
+ */
+ static void
+ save_tuple(AggState *aggstate, HashWork *work, TupleTableSlot *slot,
+ uint32 hashvalue)
+ {
+ int partition;
+ MinimalTuple tuple;
+ BufFile *file;
+ int written;
+
+ if (work->output_partitions == NULL)
+ {
+ int npartitions = HASH_DISK_DEFAULT_PARTITIONS; /* TODO: choose intelligently */
+ int partition_bits;
+ int i;
+
+ if (npartitions < HASH_DISK_MIN_PARTITIONS)
+ npartitions = HASH_DISK_MIN_PARTITIONS;
+ if (npartitions > HASH_DISK_MAX_PARTITIONS)
+ npartitions = HASH_DISK_MAX_PARTITIONS;
+
+ partition_bits = my_log2(npartitions);
+
+ /* make sure that we don't exhaust the hash bits */
+ if (partition_bits + work->input_bits >= 32)
+ partition_bits = 32 - work->input_bits;
+
+ /* number of partitions will be a power of two */
+ npartitions = 1L << partition_bits;
+
+ work->output_bits = partition_bits;
+ work->n_output_partitions = npartitions;
+ work->output_partitions = palloc(sizeof(BufFile *) * npartitions);
+ work->output_ntuples = palloc0(sizeof(int) * npartitions);
+
+ for (i = 0; i < npartitions; i++)
+ work->output_partitions[i] = BufFileCreateTemp(false);
+ }
+
+ if (work->output_bits == 0)
+ partition = 0;
+ else
+ partition = (hashvalue << work->input_bits) >>
+ (32 - work->output_bits);
+
+ work->output_ntuples[partition]++;
+ file = work->output_partitions[partition];
+ tuple = ExecFetchSlotMinimalTuple(slot);
+
+ written = BufFileWrite(file, (void *) &hashvalue, sizeof(uint32));
+ if (written != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+
+ written = BufFileWrite(file, (void *) tuple, tuple->t_len);
+ if (written != tuple->t_len)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+ }
+
+
+ /*
+ * read_saved_tuple
+ * read the next tuple from a batch file. Return NULL if no more.
+ *
+ * On success, *hashvalue is set to the tuple's hash value, and the tuple
+ * itself is stored in the given slot.
+ *
+ * Copied with minor modifications from ExecHashJoinGetSavedTuple.
+ */
+ static TupleTableSlot *
+ read_saved_tuple(BufFile *file, uint32 *hashvalue, TupleTableSlot *tupleSlot)
+ {
+ uint32 header[2];
+ size_t nread;
+ MinimalTuple tuple;
+
+ /*
+ * Since both the hash value and the MinimalTuple length word are uint32,
+ * we can read them both in one BufFileRead() call without any type
+ * cheating.
+ */
+ nread = BufFileRead(file, (void *) header, sizeof(header));
+ if (nread == 0) /* end of file */
+ {
+ ExecClearTuple(tupleSlot);
+ return NULL;
+ }
+ if (nread != sizeof(header))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+ *hashvalue = header[0];
+ tuple = (MinimalTuple) palloc(header[1]);
+ tuple->t_len = header[1];
+ nread = BufFileRead(file,
+ (void *) ((char *) tuple + sizeof(uint32)),
+ header[1] - sizeof(uint32));
+ if (nread != header[1] - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+ return ExecStoreMinimalTuple(tuple, tupleSlot, true);
+ }
+
+
/*
* ExecAgg -
*
***************
*** 1107,1115 **** ExecAgg(AggState *node)
/* Dispatch based on strategy */
if (((Agg *) node->ss.ps.plan)->aggstrategy == AGG_HASHED)
{
! if (!node->table_filled)
! agg_fill_hash_table(node);
! return agg_retrieve_hash_table(node);
}
else
return agg_retrieve_direct(node);
--- 1281,1296 ----
/* Dispatch based on strategy */
if (((Agg *) node->ss.ps.plan)->aggstrategy == AGG_HASHED)
{
! TupleTableSlot *slot = NULL;
!
! while (slot == NULL)
! {
! if (!node->table_filled)
! if (!agg_fill_hash_table(node))
! break;
! slot = agg_retrieve_hash_table(node);
! }
! return slot;
}
else
return agg_retrieve_direct(node);
***************
*** 1325,1337 **** agg_retrieve_direct(AggState *aggstate)
/*
* ExecAgg for hashed case: phase 1, read input and build hash table
*/
! static void
agg_fill_hash_table(AggState *aggstate)
{
PlanState *outerPlan;
ExprContext *tmpcontext;
AggHashEntry entry;
TupleTableSlot *outerslot;
/*
* get state info from node
--- 1506,1520 ----
/*
* ExecAgg for hashed case: phase 1, read input and build hash table
*/
! static bool
agg_fill_hash_table(AggState *aggstate)
{
PlanState *outerPlan;
ExprContext *tmpcontext;
AggHashEntry entry;
TupleTableSlot *outerslot;
+ HashWork *work;
+ int i;
/*
* get state info from node
***************
*** 1340,1359 **** agg_fill_hash_table(AggState *aggstate)
/* tmpcontext is the per-input-tuple expression context */
tmpcontext = aggstate->tmpcontext;
/*
* Process each outer-plan tuple, and then fetch the next one, until we
* exhaust the outer plan.
*/
for (;;)
{
! outerslot = ExecProcNode(outerPlan);
! if (TupIsNull(outerslot))
! break;
/* set up for advance_aggregates call */
tmpcontext->ecxt_outertuple = outerslot;
/* Find or build hashtable entry for this tuple's group */
! entry = lookup_hash_entry(aggstate, outerslot);
/* Advance the aggregates */
advance_aggregates(aggstate, entry->pergroup);
--- 1523,1581 ----
/* tmpcontext is the per-input-tuple expression context */
tmpcontext = aggstate->tmpcontext;
+ if (aggstate->hash_work == NIL)
+ {
+ aggstate->agg_done = true;
+ return false;
+ }
+
+ work = linitial(aggstate->hash_work);
+ aggstate->hash_work = list_delete_first(aggstate->hash_work);
+
+ /* if not the first time through, reinitialize */
+ if (!aggstate->hash_init_state)
+ {
+ MemoryContextResetAndDeleteChildren(aggstate->hashcontext);
+ build_hash_table(aggstate);
+ }
+
+ aggstate->hash_init_state = false;
+
/*
* Process each outer-plan tuple, and then fetch the next one, until we
* exhaust the outer plan.
*/
for (;;)
{
! uint32 hashvalue;
!
! CHECK_FOR_INTERRUPTS();
!
! if (work->input_file == NULL)
! {
! outerslot = ExecProcNode(outerPlan);
! if (TupIsNull(outerslot))
! break;
!
! hashvalue = TupleHashEntryHash(aggstate->hashtable, outerslot);
! }
! else
! {
! outerslot = read_saved_tuple(work->input_file, &hashvalue,
! aggstate->hashslot);
! if (TupIsNull(outerslot))
! {
! BufFileClose(work->input_file);
! work->input_file = NULL;
! break;
! }
! }
!
/* set up for advance_aggregates call */
tmpcontext->ecxt_outertuple = outerslot;
/* Find or build hashtable entry for this tuple's group */
! entry = lookup_hash_entry(aggstate, work, outerslot, hashvalue);
/* Advance the aggregates */
advance_aggregates(aggstate, entry->pergroup);
***************
*** 1362,1370 **** agg_fill_hash_table(AggState *aggstate)
--- 1584,1619 ----
ResetExprContext(tmpcontext);
}
+ /* add each output partition as a new work item */
+ for (i = 0; i < work->n_output_partitions; i++)
+ {
+ BufFile *file = work->output_partitions[i];
+ MemoryContext oldContext;
+
+ /* partition is empty */
+ if (work->output_ntuples[i] == 0)
+ continue;
+
+ /* rewind file for reading */
+ if (BufFileSeek(file, 0, 0L, SEEK_SET))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rewind HashAgg temporary file: %m")));
+
+ oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
+ aggstate->hash_work = lappend(aggstate->hash_work,
+ hash_work(file,
+ work->output_bits + work->input_bits));
+ MemoryContextSwitchTo(oldContext);
+ }
+
+ pfree(work);
+
aggstate->table_filled = true;
/* Initialize to walk the hash table */
ResetTupleHashIterator(aggstate->hashtable, &aggstate->hashiter);
+
+ return true;
}
/*
***************
*** 1396,1411 **** agg_retrieve_hash_table(AggState *aggstate)
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
! while (!aggstate->agg_done)
{
/*
* Find the next entry in the hash table
*/
entry = (AggHashEntry) ScanTupleHashTable(&aggstate->hashiter);
if (entry == NULL)
{
! /* No more entries in hashtable, so done */
! aggstate->agg_done = TRUE;
return NULL;
}
--- 1645,1662 ----
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
! for (;;)
{
+ CHECK_FOR_INTERRUPTS();
+
/*
* Find the next entry in the hash table
*/
entry = (AggHashEntry) ScanTupleHashTable(&aggstate->hashiter);
if (entry == NULL)
{
! /* No more entries in hashtable, so done with this batch */
! aggstate->table_filled = false;
return NULL;
}
***************
*** 1636,1645 **** ExecInitAgg(Agg *node, EState *estate, int eflags)
--- 1887,1914 ----
if (node->aggstrategy == AGG_HASHED)
{
+ MemoryContext oldContext;
+
+ aggstate->hashcontext =
+ AllocSetContextCreateTracked(aggstate->aggcontext,
+ "HashAgg Hash Table Context",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE,
+ true);
+
build_hash_table(aggstate);
+ aggstate->hash_init_state = true;
aggstate->table_filled = false;
+ aggstate->hash_disk = false;
/* Compute the columns we actually need to hash on */
aggstate->hash_needed = find_hash_columns(aggstate);
+
+ /* prime with initial work item to read from outer plan */
+ oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
+ aggstate->hash_work = lappend(aggstate->hash_work,
+ hash_work(NULL, 0));
+ MemoryContextSwitchTo(oldContext);
}
else
{
***************
*** 2058,2079 **** ExecReScanAgg(AggState *node)
if (((Agg *) node->ss.ps.plan)->aggstrategy == AGG_HASHED)
{
/*
! * In the hashed case, if we haven't yet built the hash table then we
! * can just return; nothing done yet, so nothing to undo. If subnode's
! * chgParam is not NULL then it will be re-scanned by ExecProcNode,
! * else no reason to re-scan it at all.
*/
! if (!node->table_filled)
return;
/*
! * If we do have the hash table and the subplan does not have any
! * parameter changes, then we can just rescan the existing hash table;
! * no need to build it again.
*/
! if (node->ss.ps.lefttree->chgParam == NULL)
{
ResetTupleHashIterator(node->hashtable, &node->hashiter);
return;
}
}
--- 2327,2349 ----
if (((Agg *) node->ss.ps.plan)->aggstrategy == AGG_HASHED)
{
/*
! * In the hashed case, if we haven't done any execution work yet, we
! * can just return; nothing to undo. If subnode's chgParam is not NULL
! * then it will be re-scanned by ExecProcNode, else no reason to
! * re-scan it at all.
*/
! if (node->hash_init_state)
return;
/*
! * If we do have the hash table, it never went to disk, and the
! * subplan does not have any parameter changes, then we can just
! * rescan the existing hash table; no need to build it again.
*/
! if (node->ss.ps.lefttree->chgParam == NULL && !node->hash_disk)
{
ResetTupleHashIterator(node->hashtable, &node->hashiter);
+ node->table_filled = true;
return;
}
}
***************
*** 2112,2120 **** ExecReScanAgg(AggState *node)
--- 2382,2409 ----
if (((Agg *) node->ss.ps.plan)->aggstrategy == AGG_HASHED)
{
+ MemoryContext oldContext;
+
+ node->hashcontext =
+ AllocSetContextCreateTracked(node->aggcontext,
+ "HashAgg Hash Table Context",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE,
+ true);
+
/* Rebuild an empty hash table */
build_hash_table(node);
+ node->hash_init_state = true;
node->table_filled = false;
+ node->hash_disk = false;
+ node->hash_work = NIL;
+
+ /* prime with initial work item to read from outer plan */
+ oldContext = MemoryContextSwitchTo(node->aggcontext);
+ node->hash_work = lappend(node->hash_work,
+ hash_work(NULL, 0));
+ MemoryContextSwitchTo(oldContext);
}
else
{
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
***************
*** 113,118 **** bool enable_bitmapscan = true;
--- 113,119 ----
bool enable_tidscan = true;
bool enable_sort = true;
bool enable_hashagg = true;
+ bool enable_hashagg_disk = true;
bool enable_nestloop = true;
bool enable_material = true;
bool enable_mergejoin = true;
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
***************
*** 2741,2747 **** choose_hashed_grouping(PlannerInfo *root,
/* plus the per-hash-entry overhead */
hashentrysize += hash_agg_entry_size(agg_costs->numAggs);
! if (hashentrysize * dNumGroups > work_mem * 1024L)
return false;
/*
--- 2741,2748 ----
/* plus the per-hash-entry overhead */
hashentrysize += hash_agg_entry_size(agg_costs->numAggs);
! if (!enable_hashagg_disk &&
! hashentrysize * dNumGroups > work_mem * 1024L)
return false;
/*
***************
*** 2907,2913 **** choose_hashed_distinct(PlannerInfo *root,
/* plus the per-hash-entry overhead */
hashentrysize += hash_agg_entry_size(0);
! if (hashentrysize * dNumDistinctRows > work_mem * 1024L)
return false;
/*
--- 2908,2915 ----
/* plus the per-hash-entry overhead */
hashentrysize += hash_agg_entry_size(0);
! if (!enable_hashagg_disk &&
! hashentrysize * dNumDistinctRows > work_mem * 1024L)
return false;
/*
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 752,757 **** static struct config_bool ConfigureNamesBool[] =
--- 752,766 ----
NULL, NULL, NULL
},
{
+ {"enable_hashagg_disk", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of disk-based hashed aggregation plans."),
+ NULL
+ },
+ &enable_hashagg_disk,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of materialization."),
NULL
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 266,271 ****
--- 266,272 ----
#enable_bitmapscan = on
#enable_hashagg = on
+ #enable_hashagg_disk = on
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***************
*** 242,247 **** typedef struct AllocChunkData
--- 242,249 ----
#define AllocChunkGetPointer(chk) \
((AllocPointer)(((char *)(chk)) + ALLOC_CHUNKHDRSZ))
+ static void update_allocation(MemoryContext context, int64 size);
+
/*
* These functions implement the MemoryContext API for AllocSet contexts.
*/
***************
*** 250,256 **** static void AllocSetFree(MemoryContext context, void *pointer);
static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size);
static void AllocSetInit(MemoryContext context);
static void AllocSetReset(MemoryContext context);
! static void AllocSetDelete(MemoryContext context);
static Size AllocSetGetChunkSpace(MemoryContext context, void *pointer);
static bool AllocSetIsEmpty(MemoryContext context);
static void AllocSetStats(MemoryContext context, int level);
--- 252,258 ----
static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size);
static void AllocSetInit(MemoryContext context);
static void AllocSetReset(MemoryContext context);
! static void AllocSetDelete(MemoryContext context, MemoryContext parent);
static Size AllocSetGetChunkSpace(MemoryContext context, void *pointer);
static bool AllocSetIsEmpty(MemoryContext context);
static void AllocSetStats(MemoryContext context, int level);
***************
*** 430,435 **** randomize_mem(char *ptr, size_t size)
--- 432,440 ----
* minContextSize: minimum context size
* initBlockSize: initial allocation block size
* maxBlockSize: maximum allocation block size
+ *
+ * The flag determining whether this context tracks memory usage is inherited
+ * from the parent context.
*/
MemoryContext
AllocSetContextCreate(MemoryContext parent,
***************
*** 438,443 **** AllocSetContextCreate(MemoryContext parent,
--- 443,468 ----
Size initBlockSize,
Size maxBlockSize)
{
+ return AllocSetContextCreateTracked(
+ parent, name, minContextSize, initBlockSize, maxBlockSize,
+ (parent == NULL) ? false : parent->track_mem);
+ }
+
+ /*
+ * AllocSetContextCreateTracked
+ * Create a new AllocSet context.
+ *
+ * Implementation for AllocSetContextCreate, but also allows the caller to
+ * specify whether memory usage should be tracked or not.
+ */
+ MemoryContext
+ AllocSetContextCreateTracked(MemoryContext parent,
+ const char *name,
+ Size minContextSize,
+ Size initBlockSize,
+ Size maxBlockSize,
+ bool track_mem)
+ {
AllocSet context;
/* Do the type-independent part of context creation */
***************
*** 445,451 **** AllocSetContextCreate(MemoryContext parent,
sizeof(AllocSetContext),
&AllocSetMethods,
parent,
! name);
/*
* Make sure alloc parameters are reasonable, and save them.
--- 470,477 ----
sizeof(AllocSetContext),
&AllocSetMethods,
parent,
! name,
! track_mem);
/*
* Make sure alloc parameters are reasonable, and save them.
***************
*** 500,505 **** AllocSetContextCreate(MemoryContext parent,
--- 526,534 ----
errdetail("Failed while creating memory context \"%s\".",
name)));
}
+
+ update_allocation((MemoryContext) context, blksize);
+
block->aset = context;
block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
block->endptr = ((char *) block) + blksize;
***************
*** 590,595 **** AllocSetReset(MemoryContext context)
--- 619,625 ----
else
{
/* Normal case, release the block */
+ update_allocation(context, -(block->endptr - ((char*) block)));
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
***************
*** 611,617 **** AllocSetReset(MemoryContext context)
* But note we are not responsible for deleting the context node itself.
*/
static void
! AllocSetDelete(MemoryContext context)
{
AllocSet set = (AllocSet) context;
AllocBlock block = set->blocks;
--- 641,647 ----
* But note we are not responsible for deleting the context node itself.
*/
static void
! AllocSetDelete(MemoryContext context, MemoryContext parent)
{
AllocSet set = (AllocSet) context;
AllocBlock block = set->blocks;
***************
*** 623,628 **** AllocSetDelete(MemoryContext context)
--- 653,668 ----
AllocSetCheck(context);
#endif
+ /*
+ * Parent is already unlinked from context, so can't use
+ * update_allocation().
+ */
+ while (parent != NULL)
+ {
+ parent->total_allocated -= context->total_allocated;
+ parent = parent->parent;
+ }
+
/* Make it look empty, just in case... */
MemSetAligned(set->freelist, 0, sizeof(set->freelist));
set->blocks = NULL;
***************
*** 678,683 **** AllocSetAlloc(MemoryContext context, Size size)
--- 718,726 ----
errmsg("out of memory"),
errdetail("Failed on request of size %zu.", size)));
}
+
+ update_allocation(context, blksize);
+
block->aset = set;
block->freeptr = block->endptr = ((char *) block) + blksize;
***************
*** 873,878 **** AllocSetAlloc(MemoryContext context, Size size)
--- 916,923 ----
errdetail("Failed on request of size %zu.", size)));
}
+ update_allocation(context, blksize);
+
block->aset = set;
block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
block->endptr = ((char *) block) + blksize;
***************
*** 976,981 **** AllocSetFree(MemoryContext context, void *pointer)
--- 1021,1027 ----
set->blocks = block->next;
else
prevblock->next = block->next;
+ update_allocation(context, -(block->endptr - ((char*) block)));
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
***************
*** 1088,1093 **** AllocSetRealloc(MemoryContext context, void *pointer, Size size)
--- 1134,1140 ----
AllocBlock prevblock = NULL;
Size chksize;
Size blksize;
+ Size oldblksize;
while (block != NULL)
{
***************
*** 1105,1110 **** AllocSetRealloc(MemoryContext context, void *pointer, Size size)
--- 1152,1159 ----
/* Do the realloc */
chksize = MAXALIGN(size);
blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+ oldblksize = block->endptr - ((char *)block);
+
block = (AllocBlock) realloc(block, blksize);
if (block == NULL)
{
***************
*** 1114,1119 **** AllocSetRealloc(MemoryContext context, void *pointer, Size size)
--- 1163,1169 ----
errmsg("out of memory"),
errdetail("Failed on request of size %zu.", size)));
}
+ update_allocation(context, blksize - oldblksize);
block->freeptr = block->endptr = ((char *) block) + blksize;
/* Update pointers since block has likely been moved */
***************
*** 1277,1282 **** AllocSetStats(MemoryContext context, int level)
--- 1327,1359 ----
}
+ /*
+ * update_allocation
+ *
+ * Track newly-allocated or newly-freed memory (freed memory should be
+ * negative).
+ */
+ static void
+ update_allocation(MemoryContext context, int64 size)
+ {
+ MemoryContext parent;
+
+ if (!context->track_mem)
+ return;
+
+ context->self_allocated += size;
+
+ for (parent = context; parent != NULL; parent = parent->parent)
+ {
+ if (!parent->track_mem)
+ break;
+
+ parent->total_allocated += size;
+ Assert(parent->self_allocated >= 0);
+ Assert(parent->total_allocated >= 0);
+ }
+ }
+
#ifdef MEMORY_CONTEXT_CHECKING
/*
*** a/src/backend/utils/mmgr/mcxt.c
--- b/src/backend/utils/mmgr/mcxt.c
***************
*** 187,192 **** MemoryContextResetChildren(MemoryContext context)
--- 187,194 ----
void
MemoryContextDelete(MemoryContext context)
{
+ MemoryContext parent = context->parent;
+
AssertArg(MemoryContextIsValid(context));
/* We had better not be deleting TopMemoryContext ... */
Assert(context != TopMemoryContext);
***************
*** 202,208 **** MemoryContextDelete(MemoryContext context)
*/
MemoryContextSetParent(context, NULL);
! (*context->methods->delete_context) (context);
VALGRIND_DESTROY_MEMPOOL(context);
pfree(context);
}
--- 204,211 ----
*/
MemoryContextSetParent(context, NULL);
! /* pass the parent in case it's needed, however */
! (*context->methods->delete_context) (context, parent);
VALGRIND_DESTROY_MEMPOOL(context);
pfree(context);
}
***************
*** 324,329 **** MemoryContextAllowInCriticalSection(MemoryContext context, bool allow)
--- 327,349 ----
}
/*
+ * MemoryContextGetAllocated
+ *
+ * Return memory allocated by the system to this context. If total is true,
+ * include child contexts. Context must have track_mem set.
+ */
+ int64
+ MemoryContextGetAllocated(MemoryContext context, bool total)
+ {
+ Assert(context->track_mem);
+
+ if (total)
+ return context->total_allocated;
+ else
+ return context->self_allocated;
+ }
+
+ /*
* GetMemoryChunkSpace
* Given a currently-allocated chunk, determine the total space
* it occupies (including all memory-allocation overhead).
***************
*** 546,552 **** MemoryContext
MemoryContextCreate(NodeTag tag, Size size,
MemoryContextMethods *methods,
MemoryContext parent,
! const char *name)
{
MemoryContext node;
Size needed = size + strlen(name) + 1;
--- 566,573 ----
MemoryContextCreate(NodeTag tag, Size size,
MemoryContextMethods *methods,
MemoryContext parent,
! const char *name,
! bool track_mem)
{
MemoryContext node;
Size needed = size + strlen(name) + 1;
***************
*** 576,581 **** MemoryContextCreate(NodeTag tag, Size size,
--- 597,605 ----
node->firstchild = NULL;
node->nextchild = NULL;
node->isReset = true;
+ node->track_mem = track_mem;
+ node->total_allocated = 0;
+ node->self_allocated = 0;
node->name = ((char *) node) + size;
strcpy(node->name, name);
*** a/src/include/executor/executor.h
--- b/src/include/executor/executor.h
***************
*** 144,149 **** extern TupleHashTable BuildTupleHashTable(int numCols, AttrNumber *keyColIdx,
--- 144,155 ----
extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
bool *isnew);
+ extern uint32 TupleHashEntryHash(TupleHashTable hashtable,
+ TupleTableSlot *slot);
+ extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable,
+ TupleTableSlot *slot,
+ uint32 hashvalue,
+ bool *isnew);
extern TupleHashEntry FindTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
FmgrInfo *eqfunctions,
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
***************
*** 1718,1728 **** typedef struct AggState
--- 1718,1733 ----
AggStatePerGroup pergroup; /* per-Aggref-per-group working state */
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
/* these fields are used in AGG_HASHED mode: */
+ MemoryContext hashcontext; /* subcontext to use for hash table */
TupleHashTable hashtable; /* hash table with one entry per group */
TupleTableSlot *hashslot; /* slot for loading hash table */
List *hash_needed; /* list of columns needed in hash table */
+ bool hash_init_state; /* in initial state before execution? */
bool table_filled; /* hash table filled yet? */
+ bool hash_disk; /* have we exceeded memory yet? */
+ int64 hash_mem_min; /* memory used by empty hash table */
TupleHashIterator hashiter; /* for iterating through hash table */
+ List *hash_work; /* remaining work to be done */
} AggState;
/* ----------------
*** a/src/include/nodes/memnodes.h
--- b/src/include/nodes/memnodes.h
***************
*** 41,47 **** typedef struct MemoryContextMethods
void *(*realloc) (MemoryContext context, void *pointer, Size size);
void (*init) (MemoryContext context);
void (*reset) (MemoryContext context);
! void (*delete_context) (MemoryContext context);
Size (*get_chunk_space) (MemoryContext context, void *pointer);
bool (*is_empty) (MemoryContext context);
void (*stats) (MemoryContext context, int level);
--- 41,48 ----
void *(*realloc) (MemoryContext context, void *pointer, Size size);
void (*init) (MemoryContext context);
void (*reset) (MemoryContext context);
! void (*delete_context) (MemoryContext context,
! MemoryContext parent);
Size (*get_chunk_space) (MemoryContext context, void *pointer);
bool (*is_empty) (MemoryContext context);
void (*stats) (MemoryContext context, int level);
***************
*** 60,65 **** typedef struct MemoryContextData
--- 61,69 ----
MemoryContext nextchild; /* next child of same parent */
char *name; /* context name (just for debugging) */
bool isReset; /* T = no space alloced since last reset */
+ bool track_mem; /* whether to track memory usage */
+ int64 total_allocated; /* including child contexts */
+ int64 self_allocated; /* not including child contexts */
#ifdef USE_ASSERT_CHECKING
bool allowInCritSection; /* allow palloc in critical section */
#endif
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
***************
*** 57,62 **** extern bool enable_bitmapscan;
--- 57,63 ----
extern bool enable_tidscan;
extern bool enable_sort;
extern bool enable_hashagg;
+ extern bool enable_hashagg_disk;
extern bool enable_nestloop;
extern bool enable_material;
extern bool enable_mergejoin;
*** a/src/include/utils/memutils.h
--- b/src/include/utils/memutils.h
***************
*** 96,101 **** extern void MemoryContextDeleteChildren(MemoryContext context);
--- 96,102 ----
extern void MemoryContextResetAndDeleteChildren(MemoryContext context);
extern void MemoryContextSetParent(MemoryContext context,
MemoryContext new_parent);
+ extern int64 MemoryContextGetAllocated(MemoryContext context, bool total);
extern Size GetMemoryChunkSpace(void *pointer);
extern MemoryContext GetMemoryChunkContext(void *pointer);
extern MemoryContext MemoryContextGetParent(MemoryContext context);
***************
*** 117,123 **** extern bool MemoryContextContains(MemoryContext context, void *pointer);
extern MemoryContext MemoryContextCreate(NodeTag tag, Size size,
MemoryContextMethods *methods,
MemoryContext parent,
! const char *name);
/*
--- 118,125 ----
extern MemoryContext MemoryContextCreate(NodeTag tag, Size size,
MemoryContextMethods *methods,
MemoryContext parent,
! const char *name,
! bool track_mem);
/*
***************
*** 130,135 **** extern MemoryContext AllocSetContextCreate(MemoryContext parent,
--- 132,143 ----
Size minContextSize,
Size initBlockSize,
Size maxBlockSize);
+ extern MemoryContext AllocSetContextCreateTracked(MemoryContext parent,
+ const char *name,
+ Size minContextSize,
+ Size initBlockSize,
+ Size maxBlockSize,
+ bool track_mem);
/*
* Recommended default alloc parameters, suitable for "ordinary" contexts
*** a/src/test/regress/expected/rangefuncs.out
--- b/src/test/regress/expected/rangefuncs.out
***************
*** 3,8 **** SELECT name, setting FROM pg_settings WHERE name LIKE 'enable%';
--- 3,9 ----
----------------------+---------
enable_bitmapscan | on
enable_hashagg | on
+ enable_hashagg_disk | on
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
***************
*** 12,18 **** SELECT name, setting FROM pg_settings WHERE name LIKE 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
! (11 rows)
CREATE TABLE foo2(fooid int, f2 int);
INSERT INTO foo2 VALUES(1, 11);
--- 13,19 ----
enable_seqscan | on
enable_sort | on
enable_tidscan | on
! (12 rows)
CREATE TABLE foo2(fooid int, f2 int);
INSERT INTO foo2 VALUES(1, 11);
Hi,
it's 1AM here, so only a few comments after quickly reading the patch.
On 10.8.2014 23:26, Jeff Davis wrote:
This patch requires the Memory Accounting patch, or something
similar to track memory usage.
I think the patch you sent actually includes the accounting patch. Is
that on purpose, or by accident?
I'd suggest keeping these two patches separate.
Rough Design:
Change the hash aggregate algorithm to accept a generic "work item",
which consists of an input file as well as some other bookkeeping
information.

Initially prime the algorithm by adding a single work item where the
file is NULL, indicating that it should read from the outer plan.

If the memory is exhausted during execution of a work item, then
continue to allow existing groups to be aggregated, but do not allow
new groups to be created in the hash table. Tuples representing new
groups are saved in an output partition file referenced in the work
item that is currently being executed.

When the work item is done, emit any groups in the hash table, clear
the hash table, and turn each output partition file into a new work
item.

Each time through at least some groups are able to stay in the hash
table, so eventually none will need to be saved in output
partitions, no new work items will be created, and the algorithm will
terminate. This is true even if the number of output partitions is
always one.
So once a group gets into memory, it stays there? That's going to work
fine for aggregates with fixed-size state (int4, or generally state that
gets allocated and does not grow), but I'm afraid for aggregates with
growing state (as for example array_agg and similar) that's not really a
solution.
How difficult would it be to dump the current states into a file (and
remove them from the hash table)?
While hacking on the hash join, I envisioned the hash aggregate might
work in a very similar manner, i.e. something like this:
* nbatches=1, nbits=0
* when work_mem gets full => nbatches *= 2, nbits += 1
* get rid of half the groups, using nbits from the hash
=> dump the current states into 'states.batchno' file
=> dump further tuples to 'tuples.batchno' file
* continue until the end, or until work_mem gets full again
This is pretty much what the hashjoin does, except that the join needs
to batch the outer relation too (which hashagg does not need to do).
Otherwise most of the batching logic can be copied.
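To sketch the doubling step in code (everything here is made up for
illustration - Group, dump_group_state() and the flat array stand in for
whatever the hash table really looks like; only the batchno arithmetic
mirrors hashjoin):

typedef struct Group
{
    unsigned    hash;        /* ... plus the transition state ... */
} Group;

/* hypothetical: append the state to the 'states.<batchno>' file */
extern void dump_group_state(Group *group, int batchno);

static void
double_nbatches(Group **groups, int *ngroups, int *nbatches, int curbatch)
{
    int i, kept = 0;

    *nbatches *= 2;                          /* consume one more hash bit */

    for (i = 0; i < *ngroups; i++)
    {
        int batchno = groups[i]->hash & (*nbatches - 1);

        if (batchno == curbatch)
            groups[kept++] = groups[i];      /* stays in memory */
        else
            dump_group_state(groups[i], batchno);  /* ~half of the groups */
    }
    *ngroups = kept;

    /* from now on, tuples whose batchno != curbatch go to 'tuples.<batchno>' */
}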
It also seems to me that the logic of the patch is about this:
* try to lookup the group in the hash table
* found => call the transition function
* not found
* enough space => call transition function
* not enough space => tuple/group goes to a batch
Which pretty much means all tuples need to do the lookup first. The nice
thing on the hash-join approach is that you don't really need to do the
lookup - you just need to peek at the hash whether the group belongs to
the current batch (and if not, to which batch it should go).
Of course, that would require the ability to dump the current state of
the group, but for the aggregates using basic types as a state
(int4/int8, ...) with fixed-length state that's trivial.
For aggregates using 'internal' to pass pointers that requires some help
from the author - serialization/deserialization functions.
Open items:
* costing
Not sure how this is done for the hash-join, but I guess that might be a
good place for inspiration.
* EXPLAIN details for disk usage
* choose number of partitions intelligently
What is the purpose of HASH_DISK_MAX_PARTITIONS? I mean, when we decide
we need 2048 partitions, why should we use less if we believe it will
get us over work_mem?
* performance testing
Initial tests indicate that it can be competitive with sort+groupagg
when the disk is involved, but more testing is required.
For us, removing the sort is a big deal, because we're working with
100M rows regularly. It's more complicated though, because the sort is
usually enforced by COUNT(DISTINCT) and that's not going to disappear
because of this patch. But that's solvable with a custom aggregate.
Tomas
On Mon, 2014-08-11 at 01:29 +0200, Tomas Vondra wrote:
On 10.8.2014 23:26, Jeff Davis wrote:
This patch requires the Memory Accounting patch, or something
similar to track memory usage.

I think the patch you sent actually includes the accounting patch. Is
that on purpose, or by accident?

Accident, thank you.
So once a group gets into memory, it stays there? That's going to work
fine for aggregates with fixed-size state (int4, or generally state that
gets allocated and does not grow), but I'm afraid for aggregates with
growing state (as for example array_agg and similar) that's not really a
solution.
I agree in theory, but for now I'm just not handling that case at all
because there is other work that needs to be done first. For one thing,
we would need a way to save the transition state, and we don't really
have that. In the case of array_agg, the state is not serialized and
there's no generic way to ask it to serialize itself without finalizing.
I'm open to ideas. Do you think my patch is going generally in the right
direction, and we can address this problem later; or do you think we
need a different approach entirely?
While hacking on the hash join, I envisioned the hash aggregate might
work in a very similar manner, i.e. something like this:

* nbatches=1, nbits=0
* when work_mem gets full => nbatches *= 2, nbits += 1
* get rid of half the groups, using nbits from the hash
=> dump the current states into 'states.batchno' file
=> dump further tuples to 'tuples.batchno' file
* continue until the end, or until work_mem gets full again
It would get a little messy with HashAgg. Hashjoin is dealing entirely
with tuples; HashAgg deals with tuples and groups.
Also, if the transition state is fixed-size (or even nearly so), it
makes no sense to remove groups from the hash table before they are
finished. We'd need to detect that somehow, and it seems almost like two
different algorithms (though maybe not a bad idea to use a different
algorithm for things like array_agg).
Not saying that it can't be done, but (unless you have an idea) requires
quite a bit more work than what I did here.
It also seems to me that the logic of the patch is about this:
* try to lookup the group in the hash table
* found => call the transition function
* not found
* enough space => call transition function
* not enough space => tuple/group goes to a batch

Which pretty much means all tuples need to do the lookup first. The nice
thing on the hash-join approach is that you don't really need to do the
lookup - you just need to peek at the hash whether the group belongs to
the current batch (and if not, to which batch it should go).
That's an interesting point. I suspect that, in practice, the cost of
hashing the tuple is more expensive (or at least not much cheaper than)
doing a failed lookup.
For aggregates using 'internal' to pass pointers that requires some help
from the author - serialization/deserialization functions.
Ah, yes, this is what I was referring to earlier.
* EXPLAIN details for disk usage
* choose number of partitions intelligently

What is the purpose of HASH_DISK_MAX_PARTITIONS? I mean, when we decide
we need 2048 partitions, why should we use less if we believe it will
get us over work_mem?
Because I suspect there are costs in having an extra file around that
I'm not accounting for directly. We are implicitly assuming that the OS
will keep around enough buffers for each BufFile to do sequential writes
when needed. If we create a zillion partitions, then either we end up
with random I/O or we push the memory burden into the OS buffer cache.
We could try to model those costs explicitly to put some downward
pressure on the number of partitions we select, but I just chose to cap
it for now.
For us, removing the sort is a big deal, because we're working with
100M rows regularly. It's more complicated though, because the sort is
usually enforced by COUNT(DISTINCT) and that's not going to disappear
because of this patch. But that's solvable with a custom aggregate.
I hope this offers you a good alternative.
I'm not sure it will ever beat sort for very high cardinality cases, but
I hope it can beat sort when the group size averages something higher
than one. It will also be safer, so the optimizer can be more aggressive
about choosing HashAgg.
Thank you for taking a look so quickly!
Regards,
Jeff Davis
On 12 August 2014, 7:06, Jeff Davis wrote:
On Mon, 2014-08-11 at 01:29 +0200, Tomas Vondra wrote:
On 10.8.2014 23:26, Jeff Davis wrote:
This patch requires the Memory Accounting patch, or something
similar to track memory usage.

I think the patch you sent actually includes the accounting patch. Is
that on purpose, or by accident?

Accident, thank you.

So once a group gets into memory, it stays there? That's going to work
fine for aggregates with fixed-size state (int4, or generally state that
gets allocated and does not grow), but I'm afraid for aggregates with
growing state (as for example array_agg and similar) that's not really a
solution.

I agree in theory, but for now I'm just not handling that case at all
because there is other work that needs to be done first. For one thing,
we would need a way to save the transition state, and we don't really
have that. In the case of array_agg, the state is not serialized and
there's no generic way to ask it to serialize itself without finalizing.
Yes and no.
It's true we don't have this ability for aggregates passing state using
'internal', and arguably these are the cases that matter (because those
are the states that tend to "bloat" as more values are passed to the
aggregate).
We can do that for states with a known type (because we have
serialize/deserialize methods for them), but we can't really require all aggregates
to use only known types. The 'internal' is there for a reason.
So I think eventually we should to support something like this:
CREATE AGGREGATE myaggregate (
...
SERIALIZE_FUNC = 'dump_data',
DESERIALIZE_FUNC = 'read_data',
...
);
That being said, we can't require this from all existing aggregates.
There'll always be aggregates not providing this (for example some old
ones).
So even if we have this, we'll have to support the case when it's not
provided - possibly by using the batching algorithm you provided. What
I imagine is this:
hitting work_mem limit -> do we know how to dump the aggregate state?
yes (known type or serialize/deserialize)
=> use the batching algorithm from hash join
no (unknown type, ...)
=> use the batching algorithm described in the original message
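In code, the dispatch might look roughly like this (every function name
here is hypothetical; nothing like it exists in the patch yet):

/* hypothetical helpers, named only for this sketch */
extern bool agg_state_is_dumpable(void *peragg);        /* known type, or has
                                                         * serialize/deserialize */
extern void spill_half_of_groups_hashjoin_style(void);  /* dump states + tuples */
extern void spill_new_group_tuples_only(void);          /* algorithm from the
                                                         * original message */

static void
work_mem_exceeded(void *peragg)
{
    if (agg_state_is_dumpable(peragg))
        spill_half_of_groups_hashjoin_style();
    else
        spill_new_group_tuples_only();
}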
Now, I'm not trying to make you implement all this - I'm willing to work
on that. Implementing this CREATE AGGREGATE extension is however tightly
coupled with your patch, because that's the only place where it might be
used (that I'm aware of).
I'm open to ideas. Do you think my patch is going generally in the right
direction, and we can address this problem later; or do you think we
need a different approach entirely?
I certainly think having memory-bounded hashagg is a great improvement,
and yes - this patch can get us there. Maybe it won't get us all the way
to the "perfect solution" but so what? We can improve that by further
patches (and I'm certainly willing to spend some time on that).
So thanks a lot for working on this!
While hacking on the hash join, I envisioned the hash aggregate might
work in a very similar manner, i.e. something like this:

* nbatches=1, nbits=0
* when work_mem gets full => nbatches *= 2, nbits += 1
* get rid of half the groups, using nbits from the hash
=> dump the current states into 'states.batchno' file
=> dump further tuples to 'tuples.batchno' file
* continue until the end, or until work_mem gets full again

It would get a little messy with HashAgg. Hashjoin is dealing entirely
with tuples; HashAgg deals with tuples and groups.
I don't see why it should get messy? In the end, you have a chunk of
data and a hash for it.
Also, if the transition state is fixed-size (or even nearly so), it
makes no sense to remove groups from the hash table before they are
finished. We'd need to detect that somehow, and it seems almost like two
different algorithms (though maybe not a bad idea to use a different
algorithm for things like array_agg).
It just means you need to walk through the hash table, look at the
hashes and dump ~50% of the groups to a file. I'm not sure how difficult
that is with dynahash, though (hashjoin uses a custom hashtable, that
makes this very simple).
Not saying that it can't be done, but (unless you have an idea) requires
quite a bit more work than what I did here.

It also seems to me that the logic of the patch is about this:
* try to lookup the group in the hash table
* found => call the transition function
* not found
* enough space => call transition function
* not enough space => tuple/group goes to a batch

Which pretty much means all tuples need to do the lookup first. The nice
thing on the hash-join approach is that you don't really need to do the
lookup - you just need to peek at the hash whether the group belongs to
the current batch (and if not, to which batch it should go).

That's an interesting point. I suspect that, in practice, the cost of
hashing the tuple is more expensive (or at least not much cheaper than)
doing a failed lookup.
I think you're missing the point, here. You need to compute the hash in
both cases. And then you either can do a lookup or just peek at the first
few bits of the hash to see whether it's in the current batch or not.
Certainly, doing this:
batchno = hash & (nbatches - 1);
if (batchno > curbatch) {
... not current batch, dump to file ...
}
is much faster than a lookup. Also, as the hash table grows (beyond L3
cache size, which is a few MBs today), it becomes much slower in my
experience - that's one of the lessons I learnt while hacking on the
hashjoin. And we're dealing with hashagg not fitting into work_mem, so
this seems to be relevant.
For aggregates using 'internal' to pass pointers that requires some help
from the author - serialization/deserialization functions.

Ah, yes, this is what I was referring to earlier.
* EXPLAIN details for disk usage
* choose number of partitions intelligently

What is the purpose of HASH_DISK_MAX_PARTITIONS? I mean, when we decide
we need 2048 partitions, why should we use less if we believe it will
get us over work_mem?

Because I suspect there are costs in having an extra file around that
I'm not accounting for directly. We are implicitly assuming that the OS
will keep around enough buffers for each BufFile to do sequential writes
when needed. If we create a zillion partitions, then either we end up
with random I/O or we push the memory burden into the OS buffer cache.
Assuming I understand it correctly, I think this logic is broken. Are you
saying "We'll try to do memory-bounded hashagg, but not for the really
large datasets because of fear we might cause random I/O"?
While I certainly understand your concerns about generating excessive
amounts of random I/O, I think modern filesystems are handling that just
fine (coalescing the writes into mostly sequential writes, etc.). Also,
current hardware is really good at handling this (controllers with write
cache, SSDs etc.).
Also, if hash-join does not worry about number of batches, why should
hashagg worry about that? I expect the I/O patterns to be very similar.
And if you have many batches, it means you have tiny work_mem, compared
to the amount of data. Either you have unreasonably small work_mem
(better fix that) or a lot of data (better have a lot of RAM and good
storage, or you'll suffer anyway).
In any case, trying to fix this by limiting number of partitions seems
like a bad approach. I think factoring those concerns into a costing
model is more appropriate.
We could try to model those costs explicitly to put some downward
pressure on the number of partitions we select, but I just chose to cap
it for now.
OK, understood. We can't get all the goodies in the first version.
For us, removing the sort is a big deal, because we're working with
100M rows regularly. It's more complicated though, because the sort is
usually enforced by COUNT(DISTINCT) and that's not going to disappear
because of this patch. But that's solvable with a custom aggregate.

I hope this offers you a good alternative.
I'm not sure it will ever beat sort for very high cardinality cases, but
I hope it can beat sort when the group size averages something higher
than one. It will also be safer, so the optimizer can be more aggressive
about choosing HashAgg.
It's certainly an improvement, although the sort may get there for one
of two reasons:
(a) COUNT(DISTINCT) -> this is solved by a custom aggregate
(b) bad estimate of required memory -> this is common for aggregates
passing 'internal' state (planner uses some quite high defaults)
Tomas
On Tue, 2014-08-12 at 14:58 +0200, Tomas Vondra wrote:
CREATE AGGREGATE myaggregate (
...
SERIALIZE_FUNC = 'dump_data',
DESERIALIZE_FUNC = 'read_data',
...
);
Seems reasonable.
I don't see why it should get messy? In the end, you have a chunk of
data and a hash for it.
Perhaps it's fine; I'd have to see the approach.
It just means you need to walk through the hash table, look at the
hashes and dump ~50% of the groups to a file.
If you have fixed-size states, why would you *want* to remove the group?
What is gained?
One thing I like about my simple approach is that it returns a good
number of groups after each pass, and then those are completely finished
(returned to the operator above, even). That's impossible with HashJoin
because the hashing all needs to be done before the probe phase begins.
The weakness of my approach is the array_agg case that you mention,
because this approach doesn't offer a way to dump out transition states.
It seems like that could be added later, but let me know if you see a
problem there.
I think you're missing the point, here. You need to compute the hash in
both cases. And then you either can do a lookup or just peek at the first
few bits of the hash to see whether it's in the current batch or not.
I understood that. The point I was trying to make (which might or might
not be true) was that: (a) this only matters for a failed lookup,
because a successful lookup would just go in the hash table anyway; and
(b) a failed lookup probably doesn't cost much compared to all of the
other things that need to happen along that path.
I should have chosen a better example though. For instance: if the
lookup fails, we need to write the tuple, and writing the tuple is sure
to swamp the cost of a failed hash lookup.
is much faster than a lookup. Also, as the hash table grows (beyond L3
cache size, which is a few MBs today), it becomes much slower in my
experience - that's one of the lessons I learnt while hacking on the
hashjoin. And we're dealing with hashagg not fitting into work_mem, so
this seems to be relevant.
Could be, but this is also the path that goes to disk, so I'm not sure
how significant it is.
Because I suspect there are costs in having an extra file around that
I'm not accounting for directly. We are implicitly assuming that the OS
will keep around enough buffers for each BufFile to do sequential writes
when needed. If we create a zillion partitions, then either we end up
with random I/O or we push the memory burden into the OS buffer cache.

Assuming I understand it correctly, I think this logic is broken. Are you
saying "We'll try to do memory-bounded hashagg, but not for the really
large datasets because of fear we might cause random I/O"?
No, the memory is still bounded even for very high cardinality inputs
(ignoring array_agg case for now). When a partition is processed later,
it also might exhaust work_mem, and need to write out tuples to its own
set of partitions. This allows memory-bounded execution to succeed even
if the number of partitions each iteration is one, though it will result
in repeated I/O for the same tuple.
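To illustrate the control flow being described (a rough standalone C toy with invented names, not the patch's code): each pass finalizes whatever fits in the table and only the overflow becomes a new work item, so memory stays bounded even with a single output partition.

#include <stdio.h>
#include <stdlib.h>

#define TABLE_CAP 2             /* stand-in for "what fits in work_mem" */

static int
in_table(const int *table, int n, int key)
{
    for (int i = 0; i < n; i++)
        if (table[i] == key)
            return 1;
    return 0;
}

static void
process_work_item(const int *input, int ninput, int depth)
{
    int  table[TABLE_CAP];
    int  ntable = 0;
    int *spill = malloc(sizeof(int) * (size_t) ninput);
    int  nspill = 0;

    for (int i = 0; i < ninput; i++)
    {
        int key = input[i];

        if (in_table(table, ntable, key))
            continue;                   /* existing group: just advance it */
        else if (ntable < TABLE_CAP)
            table[ntable++] = key;      /* room left: create the group */
        else
            spill[nspill++] = key;      /* no room: goes to the output partition */
    }

    for (int i = 0; i < ntable; i++)
        printf("pass %d: group %d finalized\n", depth, table[i]);

    if (nspill > 0)                     /* the partition becomes a new work item */
        process_work_item(spill, nspill, depth + 1);
    free(spill);
}

int
main(void)
{
    int input[] = {1, 2, 3, 4, 5, 1, 2, 3, 4, 5};

    process_work_item(input, 10, 0);
    return 0;
}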
While I certainly understand your concerns about generating excessive
amount of random I/O, I think modern filesystems are handling that just
fine (coalescing the writes into mostly sequential writes, etc.). Also,
current hardware is really good at handling this (controllers with write
cache, SSDs etc.).
All of that requires memory. We shouldn't dodge a work_mem limit by
using the kernel's memory, instead.
Also, if hash-join does not worry about number of batches, why should
hashagg worry about that? I expect the I/O patterns to be very similar.
One difference with HashJoin is that, to create a large number of
batches, the inner side must be huge, which is not the expected
operating mode for HashJoin[1]. Regardless, every partition that is
active *does* have a memory cost. HashJoin might ignore that cost, but
that doesn't make it right.
I think the right analogy here is to Sort's poly-phase merge -- it
doesn't merge all of the runs at once; see the comments at the top of
tuplesort.c.
In other words, sometimes it's better to have fewer partitions (for
hashing) or merge fewer runs at once (for sorting). It does more
repeated I/O, but the I/O is more sequential.
In any case, trying to fix this by limiting number of partitions seems
like a bad approach. I think factoring those concerns into a costing
model is more appropriate.
Fair enough. I haven't modeled the cost yet; and I agree that an upper
limit is quite crude.
(a) COUNT(DISTINCT) -> this is solved by a custom aggregate
Is there a reason we can't offer a hash-based strategy for this one? It
would have to be separate hash tables for different aggregates, but it
seems like it could work.
(b) bad estimate of required memory -> this is common for aggregates
passing 'internal' state (planner uses some quite high defaults)
Maybe some planner hooks? Ideas?
Regards,
Jeff Davis
On 13 Srpen 2014, 7:02, Jeff Davis wrote:
On Tue, 2014-08-12 at 14:58 +0200, Tomas Vondra wrote:
CREATE AGGREGATE myaggregate (
...
SERIALIZE_FUNC = 'dump_data',
DESERIALIZE_FUNC = 'read_data',
...
);

Seems reasonable.

I don't see why it should get messy? In the end, you have a chunk of
data and a hash for it.

Perhaps it's fine; I'd have to see the approach.

It just means you need to walk through the hash table, look at the
hashes and dump ~50% of the groups to a file.

If you have fixed-size states, why would you *want* to remove the group?
What is gained?
You're right that for your batching algorithm (based on lookups), that's
not really needed, and keeping everything in memory is a good initial
approach.
My understanding of the batching algorithm (and I may be wrong on this
one) is that once you choose the number of batches, it's pretty much
fixed. Is that the case?
But what will happen in case of significant cardinality underestimate?
I.e. what will happen if you decide to use 16 batches, and then find
out 256 would be more appropriate? I believe you'll end up with batches
16x the size you'd want, most likely exceeding work_mem.
Do I understand that correctly?
But back to the removal of aggregate states from memory (irrespective
of the size) - this is what makes the hashjoin-style batching possible,
because it:
(a) makes the batching decision simple (peeking at hash)
(b) makes it possible to repeatedly increase the number of batches
(c) provides a simple trigger for the increase of batch count
Some of this might be achievable even with keeping the states in memory.
I mean, you can add more batches on the fly, and handle this similarly
to hash join, while reading tuples from the batch (moving the tuples to
the proper batch, if needed).
The problem is that once you have the memory full, there's no trigger
to alert you that you should increase the number of batches again.
One thing I like about my simple approach is that it returns a good
number of groups after each pass, and then those are completely finished
(returned to the operator above, even). That's impossible with HashJoin
because the hashing all needs to be done before the probe phase begins.
The hash-join approach returns ~1/N groups after each pass, so I fail to
see how this is better?
The weakness of my approach is the array_agg case that you mention,
because this approach doesn't offer a way to dump out transition states.
It seems like that could be added later, but let me know if you see a
problem there.
Right. Let's not solve this in the first version of the patch.
I think you're missing the point, here. You need to compute the hash in
both cases. And then you either can do a lookup or just peek at the first
few bits of the hash to see whether it's in the current batch or not.

I understood that. The point I was trying to make (which might or might
not be true) was that: (a) this only matters for a failed lookup,
because a successful lookup would just go in the hash table anyway; and
(b) a failed lookup probably doesn't cost much compared to all of the
other things that need to happen along that path.
OK. I don't have numbers proving otherwise at hand, and you're probably
right that once the batching kicks in, the other parts are likely more
expensive than this.
I should have chosen a better example though. For instance: if the
lookup fails, we need to write the tuple, and writing the tuple is sure
to swamp the cost of a failed hash lookup.

is much faster than a lookup. Also, as the hash table grows (beyond L3
cache size, which is a few MBs today), it becomes much slower in my
experience - that's one of the lessons I learnt while hacking on the
hashjoin. And we're dealing with hashagg not fitting into work_mem, so
this seems to be relevant.

Could be, but this is also the path that goes to disk, so I'm not sure
how significant it is.
It may or may not go to the disk, actually. The fact that you're doing
batching means it's written to a temporary file, but with large amounts
of RAM it may not get written to disk.
That's because the work_mem is only a very soft guarantee - a query may
use multiple work_mem buffers in a perfectly legal way. So users tend
to set this rather conservatively. For example we have >256GB of RAM in
each machine, usually <24 queries running at the same time and yet we
have only work_mem=800MB. On the few occasions when a hash join is
batched, it usually remains in page cache and never actually gets written
to disk. Or maybe it gets written, but it's still in the page cache so
the backend never notices that.
It's true there are other costs though - I/O calls, etc. So it's not free.
Because I suspect there are costs in having an extra file around that
I'm not accounting for directly. We are implicitly assuming that the OS
will keep around enough buffers for each BufFile to do sequential writes
when needed. If we create a zillion partitions, then either we end up
with random I/O or we push the memory burden into the OS buffer cache.

Assuming I understand it correctly, I think this logic is broken. Are you
saying "We'll try to do memory-bounded hashagg, but not for the really
large datasets because of fear we might cause random I/O"?

No, the memory is still bounded even for very high cardinality inputs
(ignoring array_agg case for now). When a partition is processed later,
it also might exhaust work_mem, and need to write out tuples to its own
set of partitions. This allows memory-bounded execution to succeed even
if the number of partitions each iteration is one, though it will result
in repeated I/O for the same tuple.
Aha! And the new batches are 'private' to the work item, making it a bit
recursive, right? Is there any reason not to just double the number of
batches globally? I mean, why not to just say
nbatches *= 2
which effectively splits each batch into two? Half the groups stays
in the current one, half is moved to a new one.
It makes it almost perfectly sequential, because you're reading
a single batch, keeping half the tuples and writing the other half to
a new batch. If you increase the number of batches a bit more, e.g.
nbatches *= 4
then you're keeping 1/4 and writing into 3 new batches.
That seems like a better solution to me.
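For illustration, a small standalone C toy of that doubling step (not PostgreSQL code; with a power-of-two batch count, hash % nbatches and hash & (nbatches - 1) select the same bits): every in-memory group currently has the low bit 0, and the newly significant bit decides which half moves to the new batch.

#include <stdint.h>
#include <stdio.h>

/* After nbatches *= 2, one more low-order bit of the hash matters. */
static int
batch_after_doubling(uint32_t hash, uint32_t old_nbatches)
{
    uint32_t new_nbatches = old_nbatches * 2;   /* both powers of two */

    return (int) (hash & (new_nbatches - 1));
}

int
main(void)
{
    /* four groups, all currently in batch 0 of 2 (low bit is 0) */
    uint32_t group_hashes[] = {0x0, 0x2, 0x4, 0x6};

    for (int i = 0; i < 4; i++)
        printf("hash %#x: batch %d of 2 -> batch %d of 4\n",
               (unsigned) group_hashes[i],
               (int) (group_hashes[i] & 1u),
               batch_after_doubling(group_hashes[i], 2));
    return 0;
}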
While I certainly understand your concerns about generating excessive
amount of random I/O, I think modern filesystems are handling that just
fine (coalescing the writes into mostly sequential writes, etc.). Also,
current hardware is really good at handling this (controllers with write
cache, SSDs etc.).

All of that requires memory. We shouldn't dodge a work_mem limit by
using the kernel's memory, instead.
Sure, saving memory at one place just to waste it somewhere else is
a poor solution. But I don't think work_mem is a memory-saving tool.
I see it as a memory-limiting protection.
Also, if hash-join does not worry about number of batches, why should
hashagg worry about that? I expect the I/O patterns to be very similar.

One difference with HashJoin is that, to create a large number of
batches, the inner side must be huge, which is not the expected
operating mode for HashJoin[1]. Regardless, every partition that is
active *does* have a memory cost. HashJoin might ignore that cost, but
that doesn't make it right.

I think the right analogy here is to Sort's poly-phase merge -- it
doesn't merge all of the runs at once; see the comments at the top of
tuplesort.c.

In other words, sometimes it's better to have fewer partitions (for
hashing) or merge fewer runs at once (for sorting). It does more
repeated I/O, but the I/O is more sequential.
OK. I don't have a clear opinion on this yet. I don't think the costs
are that high, but maybe I'm wrong in this.
It also seems to me that using HASH_DISK_MAX_PARTITIONS, and then allowing
each work item to create its own set of additional partitions effectively
renders the HASH_DISK_MAX_PARTITIONS futile.
In any case, trying to fix this by limiting number of partitions seems
like a bad approach. I think factoring those concerns into a costing
model is more appropriate.

Fair enough. I haven't modeled the cost yet; and I agree that an upper
limit is quite crude.
OK, let's keep the HASH_DISK_MAX_PARTITIONS for now and improve this later.
(a) COUNT(DISTINCT) -> this is solved by a custom aggregate
Is there a reason we can't offer a hash-based strategy for this one? It
would have to be separate hash tables for different aggregates, but it
seems like it could work.
I don't know what the exact reasoning is, but apparently it's how the
current planner works. Whenever it sees COUNT(DISTINCT) it enforces a
sort. I suspect this is because of fear of memory requirements (because
a distinct requires keeping all the items), which might have been
perfectly valid when this was designed.
(b) bad estimate of required memory -> this is common for aggregates
passing 'internal' state (planner uses some quite high defaults)

Maybe some planner hooks? Ideas?
My plan is to add this to the CREATE AGGREGATE somehow - either as a
constant parameter (allowing to set a custom constant size) or a callback
to a 'sizing' function (estimating the size based on number of items,
average width and ndistinct in the group). In any case, this is
independent of this patch.
I think that for this patch we may either keep the current batching
strategy (and proceed with the TODO items you listed in your first patch).
Or we may investigate the alternative (hash-join-like) batching strategy.
I suppose this may be done after the TODO items, but I'm afraid it may
impact some of them (e.g. the costing). This can be done with the
simple aggregates (using fixed-size types for states), but eventually
it will require adding the serialize/deserialize to CREATE AGGREGATE.
Now, I'm very much in favor of the #2 choice (because that's what works best
with the aggregates I need to use), but I'm also a big fan of the
'availability beats unavailable features 100% of the time' principle.
So if you decide to go for #1 now, I'm fine with that. I'm open to do
the next step - either as a follow-up patch, or maybe as an alternative
spin-off of your patch.
regards
Tomas
On 13.8.2014 12:31, Tomas Vondra wrote:
On 13 Srpen 2014, 7:02, Jeff Davis wrote:
On Tue, 2014-08-12 at 14:58 +0200, Tomas Vondra wrote:
(b) bad estimate of required memory -> this is common for aggregates
passing 'internal' state (planner uses some quite high defaults)

Maybe some planner hooks? Ideas?
My plan is to add this to the CREATE AGGREGATE somehow - either as a
constant parameter (allowing to set a custom constant size) or a callback
to a 'sizing' function (estimating the size based on number of items,
average width and ndistinct in the group). In any case, this is
independent of this patch.
FWIW, the constant parameter is already implemented for 9.4. Adding the
function seems possible - the most difficult part seems to be getting
all the necessary info before count_agg_clauses() is called. For example
now dNumGroups is evaluated after the call (and tuples/group seems like
a useful info for sizing).
While this seems unrelated to the patch discussed here, it's true that:
(a) good estimate of the memory is important for initial estimate of
batch count
(b) dynamic increase of batch count alleviates issues from
underestimating the amount of memory necessary for states
But let's leave this out of scope for the current patch.
regards
Tomas
I think the hash-join like approach is reasonable, but I also think
you're going to run into a lot of challenges that make it more complex
for HashAgg. For instance, let's say you have the query:
SELECT x, array_agg(y) FROM foo GROUP BY x;
Say the transition state is an array (for the sake of simplicity), so
the hash table has something like:
1000 => {7, 8, 9}
1001 => {12, 13, 14}
You run out of memory and need to split the hash table, so you scan the
hash table and find that group 1001 needs to be written to disk. So you
serialize the key and array and write them out.
Then the next tuple you get is (1001, 19). What do you do? Create a new
group 1001 => {19} (how do you combine it later with the first one)? Or
try to fetch the existing group 1001 from disk and advance it (horrible
random I/O)?
On Wed, 2014-08-13 at 12:31 +0200, Tomas Vondra wrote:
My understanding of the batching algorithm (and I may be wrong on this
one) is that once you choose the number of batches, it's pretty much
fixed. Is that the case?
It's only fixed for that one "work item" (iteration). A different K can
be selected if memory is exhausted again. But you're right: this is a
little less flexible than HashJoin.
But what will happen in case of significant cardinality underestimate?
I.e. what will happen if you decide to use 16 batches, and then find
out 256 would be more appropriate? I believe you'll end up with batches
16x the size you'd want, most likely exceeding work_mem.
Yes, except that work_mem would never be exceeded. If the partitions are
16X work_mem, then each would be added as another work_item, and
hopefully it would choose better the next time.
One thing I like about my simple approach is that it returns a good
number of groups after each pass, and then those are completely finished
(returned to the operator above, even). That's impossible with HashJoin
because the hashing all needs to be done before the probe phase begins.

The hash-join approach returns ~1/N groups after each pass, so I fail to
see how this is better?
You can't return any tuples until you begin the probe phase, and that
doesn't happen until you've hashed the entire inner side (which involves
splitting and other work). With my patch, it will return some tuples
after the first scan. Perhaps I'm splitting hairs here, but the idea of
finalizing some groups as early as possible seems appealing.
Aha! And the new batches are 'private' to the work item, making it a bit
recursive, right? Is there any reason not to just double the number of
batches globally?
I didn't quite follow this proposal.
It also seems to me that using HASH_DISK_MAX_PARTITIONS, and then allowing
each work item to create its own set of additional partitions effectively
renders the HASH_DISK_MAX_PARTITIONS futile.
It's the number of active partitions that matters, because that's what
causes the random I/O.
Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
I think the hash-join like approach is reasonable, but I also think
you're going to run into a lot of challenges that make it more complex
for HashAgg. For instance, let's say you have the query:
SELECT x, array_agg(y) FROM foo GROUP BY x;
Say the transition state is an array (for the sake of simplicity), so
the hash table has something like:
1000 => {7, 8, 9}
1001 => {12, 13, 14}
You run out of memory and need to split the hash table, so you scan the
hash table and find that group 1001 needs to be written to disk. So you
serialize the key and array and write them out.
Then the next tuple you get is (1001, 19). What do you do? Create a new
group 1001 => {19} (how do you combine it later with the first one)? Or
try to fetch the existing group 1001 from disk and advance it (horrible
random I/O)?
If you're following the HashJoin model, then what you do is the same thing
it does: you write the input tuple back out to the pending batch file for
the hash partition that now contains key 1001, whence it will be processed
when you get to that partition. I don't see that there's any special case
here.
The fly in the ointment is how to serialize a partially-computed aggregate
state value to disk, if it's not of a defined SQL type.
regards, tom lane
On 14 Srpen 2014, 9:22, Jeff Davis wrote:
I think the hash-join like approach is reasonable, but I also think
you're going to run into a lot of challenges that make it more complex
for HashAgg. For instance, let's say you have the query:

SELECT x, array_agg(y) FROM foo GROUP BY x;

Say the transition state is an array (for the sake of simplicity), so
the hash table has something like:

1000 => {7, 8, 9}
1001 => {12, 13, 14}

You run out of memory and need to split the hash table, so you scan the
hash table and find that group 1001 needs to be written to disk. So you
serialize the key and array and write them out.

Then the next tuple you get is (1001, 19). What do you do? Create a new
group 1001 => {19} (how do you combine it later with the first one)? Or
try to fetch the existing group 1001 from disk and advance it (horrible
random I/O)?
No, that's not how it works. The batching algorithm works with a hash of
the group. For example let's suppose you do this:
batchno = hash % nbatches;
which essentially keeps the last few bits of the hash. 0 bits for
nbatches=1, 1 bit for nbatches=2, 2 bits for nbatches=4 etc.
So let's say we have 2 batches, and we're working on the first batch.
That means we're using 1 bit:
batchno = hash % 2;
and for the first batch we're keeping only groups with batchno=0. So
only groups with 0 as the last bit are in batchno==0.
When running out of memory, you simply do
nbatches *= 2
and start considering one more bit from the hash. So if you had this
before:
group_a => batchno=0 => {7, 8, 9}
group_b => batchno=0 => {12, 13, 14}
group_c => batchno=0 => {23, 1, 45}
group_d => batchno=0 => {77, 37, 54}
(where batchno is a bit string), after doubling the number of batches
you get something like this:
group_a => batchno=10 => {7, 8, 9}
group_b => batchno=00 => {12, 13, 14}
group_c => batchno=00 => {23, 1, 45}
group_d => batchno=10 => {77, 37, 54}
So you have only two possible batchno values here, depending on the new
most-significant bit - either you got 0 (which means it's still in the
current batch) or 1 (and you need to move it to the temp file of the
new batch).
Then, when you get a new tuple, you get its hash and do a simple check
of the last few bits - effectively computing batchno just like before
batchno = hash % nbatches;
Either it belongs to the current batch (and either it's in the hash
table, or you add it there), or it's not - in that case write it to a
temp file.
It gets a bit more complex when you increase the number of batches
repeatedly (effectively you need to do the check/move when reading the
batches).
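A rough sketch of that check/move while re-reading a batch might look like the following (invented struct and function names, not from any patch; BufFileWrite() is the existing PostgreSQL temp-file call):

#include "postgres.h"
#include "storage/buffile.h"

typedef struct SpilledTuple
{
    uint32      hash;
    /* ... the flattened tuple data would follow in a real implementation ... */
} SpilledTuple;

/*
 * While re-reading the temp file of batch `curbatch`, nbatches may have
 * grown since the tuple was written, so its batch number is recomputed
 * and latecomers are forwarded to the batch they now belong to.
 */
static bool
reread_belongs_here(SpilledTuple *st, int curbatch, int nbatches,
                    BufFile **batch_files)
{
    int         batchno = st->hash & (nbatches - 1);

    if (batchno == curbatch)
        return true;            /* process it now */

    /* written when nbatches was smaller: push it along to its new batch */
    BufFileWrite(batch_files[batchno], st, sizeof(SpilledTuple));
    return false;
}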
For sure, it's not for free - it may write to quite a few files. Is it
more expensive than what you propose? I'm not sure about that. With
your batching scheme, you'll end up with a lower number of large batches,
and you'll need to read and split them, possibly repeatedly. The
batching scheme from hashjoin minimizes this.

IMHO the only way to find out is to do some actual tests.
On Wed, 2014-08-13 at 12:31 +0200, Tomas Vondra wrote:
My understanding of the batching algorithm (and I may be wrong on this
one) is that once you choose the number of batches, it's pretty much
fixed. Is that the case?

It's only fixed for that one "work item" (iteration). A different K can
be selected if memory is exhausted again. But you're right: this is a
little less flexible than HashJoin.

But what will happen in case of significant cardinality underestimate?
I.e. what will happen if you decide to use 16 batches, and then find
out 256 would be more appropriate? I believe you'll end up with batches
16x the size you'd want, most likely exceeding work_mem.

Yes, except that work_mem would never be exceeded. If the partitions are
16X work_mem, then each would be added as another work_item, and
hopefully it would choose better the next time.
Only for aggregates with fixed-length state. For aggregates with a growing
state, the states may eventually exceed work_mem.
One thing I like about my simple approach is that it returns a good
number of groups after each pass, and then those are completely finished
(returned to the operator above, even). That's impossible with HashJoin
because the hashing all needs to be done before the probe phase begins.

The hash-join approach returns ~1/N groups after each pass, so I fail to
see how this is better?

You can't return any tuples until you begin the probe phase, and that
doesn't happen until you've hashed the entire inner side (which involves
splitting and other work). With my patch, it will return some tuples
after the first scan. Perhaps I'm splitting hairs here, but the idea of
finalizing some groups as early as possible seems appealing.
I fail to see how this is different from your approach? How can you
output any tuples before processing the whole inner relation?
After the first scan, the hash-join approach is pretty much guaranteed
to output ~1/N tuples.
Aha! And the new batches are 'private' to the work item, making it a bit
recursive, right? Is there any reason not to just double the number of
batches globally?

I didn't quite follow this proposal.
Again, it's about a difference between your batching approach and the
hashjoin-style batching. The hashjoin batching keeps a single level of
batches, and when hitting work_mem just doubles the number of batches.
Your approach is to do multi-level batching, and I was thinking whether
it'd be possible to use the same approach (single level). But in
retrospect it probably does not make much sense, because the multi-level
batching is one of the points of the proposed approach.
It also seems to me that using HASH_DISK_MAX_PARTITIONS, and then allowing
each work item to create its own set of additional partitions effectively
renders the HASH_DISK_MAX_PARTITIONS futile.

It's the number of active partitions that matters, because that's what
causes the random I/O.
OK, point taken. While I share the general concern about random I/O,
I'm not sure this case is particularly problematic.
regards
Tomas
On Thu, 2014-08-14 at 10:06 -0400, Tom Lane wrote:
If you're following the HashJoin model, then what you do is the same thing
it does: you write the input tuple back out to the pending batch file for
the hash partition that now contains key 1001, whence it will be processed
when you get to that partition. I don't see that there's any special case
here.
HashJoin only deals with tuples. With HashAgg, you have to deal with a
mix of tuples and partially-computed aggregate state values. Not
impossible, but it is a little more awkward than HashJoin.
Regards,
Jeff Davis
On Thursday, August 14, 2014, Jeff Davis <pgsql@j-davis.com> wrote:
On Thu, 2014-08-14 at 10:06 -0400, Tom Lane wrote:
If you're following the HashJoin model, then what you do is the same thing
it does: you write the input tuple back out to the pending batch file for
the hash partition that now contains key 1001, whence it will be processed
when you get to that partition. I don't see that there's any special case
here.
HashJoin only deals with tuples. With HashAgg, you have to deal with a
mix of tuples and partially-computed aggregate state values. Not
impossible, but it is a little more awkward than HashJoin.
+1
Not to mention future cases if we start maintaining multiple state
values, in regard to grouping sets.
Regards,
Atri
--
Regards,
Atri
*l'apprenant*
Jeff Davis <pgsql@j-davis.com> writes:
HashJoin only deals with tuples. With HashAgg, you have to deal with a
mix of tuples and partially-computed aggregate state values. Not
impossible, but it is a little more awkward than HashJoin.
Not sure that I follow your point. You're going to have to deal with that
no matter what, no?
I guess in principle you could avoid the need to dump agg state to disk.
What you'd have to do is write out tuples to temp files even when you
think you've processed them entirely, so that if you later realize you
need to split the current batch, you can recompute the states of the
postponed aggregates from scratch (ie from the input tuples) when you get
around to processing the batch they got moved to. This would avoid
confronting the how-to-dump-agg-state problem, but it seems to have little
else to recommend it. Even if splitting a batch is a rare occurrence,
the killer objection here is that even a totally in-memory HashAgg would
have to write all its input to a temp file, on the small chance that it
would exceed work_mem and need to switch to batching.
regards, tom lane
On 14 Srpen 2014, 18:12, Tom Lane wrote:
Jeff Davis <pgsql@j-davis.com> writes:
HashJoin only deals with tuples. With HashAgg, you have to deal with a
mix of tuples and partially-computed aggregate state values. Not
impossible, but it is a little more awkward than HashJoin.

Not sure that I follow your point. You're going to have to deal with that
no matter what, no?
That is not how the patch works. Once the memory consumption hits work_mem,
it keeps the already existing groups in memory, and only stops creating
new groups. For each tuple, hashagg does a lookup - if the group is
already in memory, it performs the transition, otherwise it writes the
tuple to disk (and does some batching, but that's mostly irrelevant here).
This way it's not necessary to dump the partially-computed states, and for
fixed-size states it actually limits the amount of consumed memory. For
variable-length aggregates (array_agg et al.) not so much.
I guess in principle you could avoid the need to dump agg state to disk.
What you'd have to do is write out tuples to temp files even when you
think you've processed them entirely, so that if you later realize you
need to split the current batch, you can recompute the states of the
postponed aggregates from scratch (ie from the input tuples) when you get
around to processing the batch they got moved to. This would avoid
confronting the how-to-dump-agg-state problem, but it seems to have little
else to recommend it. Even if splitting a batch is a rare occurrence,
the killer objection here is that even a totally in-memory HashAgg would
have to write all its input to a temp file, on the small chance that it
would exceed work_mem and need to switch to batching.
Yeah, I think putting this burden on each hashagg is not a good thing.
What I was thinking about is an automatic fall-back - try to do an in-memory
hash-agg. When you hit work_mem limit, see how far we are (have we scanned
10% or 90% of tuples?), and decide whether to restart with batching.
But I think there's no single solution, fixing all the possible cases. I
think the patch proposed here is a solid starting point, that may be
improved and extended by further patches. Eventually, what I think might
work is this combination of approaches:
1) fixed-size states and states with serialize/deserialize methods
=> hashjoin-like batching (i.e. dumping both tuples and states)
2) variable-size states without serialize/deserialize
=> Jeff's approach (keep states in memory, dump tuples)
=> possibly with the rescan fall-back, for quickly growing states
Tomas
On 14 Srpen 2014, 18:02, Atri Sharma wrote:
On Thursday, August 14, 2014, Jeff Davis <pgsql@j-davis.com> wrote:
On Thu, 2014-08-14 at 10:06 -0400, Tom Lane wrote:
If you're following the HashJoin model, then what you do is the same thing
it does: you write the input tuple back out to the pending batch file for
the hash partition that now contains key 1001, whence it will be processed
when you get to that partition. I don't see that there's any special case
here.
HashJoin only deals with tuples. With HashAgg, you have to deal with a
mix of tuples and partially-computed aggregate state values. Not
impossible, but it is a little more awkward than HashJoin.

+1

Not to mention future cases if we start maintaining multiple state
values, in regard to grouping sets.
So what would you do for aggregates where the state is growing quickly?
Say, things like median() or array_agg()?
I think that "we can't do that for all aggregates" does not imply "we must
not do that at all."
There will always be aggregates not implementing dumping state for various
reasons, and in those cases the proposed approach is certainly a great
improvement. I like it, and I hope it will get committed.
But maybe for aggregates supporting serialize/deserialize of the state
(including all aggregates using known types, not just fixed-size types) a
hashjoin-like batching would be better? I can name a few custom aggregates
that'd benefit tremendously from this.
Just to be clear - this is certainly non-trivial to implement, and I'm not
trying to force anyone (e.g. Jeff) to implement the ideas I proposed. I'm
ready to spend time on reviewing the current patch, implement the approach
I proposed and compare the behaviour.
Kudos to Jeff for working on this.
Tomas
"Tomas Vondra" <tv@fuzzy.cz> writes:
On 14 Srpen 2014, 18:12, Tom Lane wrote:
Not sure that I follow your point. You're going to have to deal with that
no matter what, no?
That is not how the patch works. Once the memory consumption hits work_mem,
it keeps the already existing groups in memory, and only stops creating
new groups.
Oh? So if we have aggregates like array_agg whose memory footprint
increases over time, the patch completely fails to avoid bloat?
I might think a patch with such a limitation was useful, if it weren't
for the fact that aggregates of that nature are a large part of the
cases where the planner misestimates the table size in the first place.
Any complication that we add to nodeAgg should be directed towards
dealing with cases that the planner is likely to get wrong.
regards, tom lane
On Thu, 2014-08-14 at 16:17 +0200, Tomas Vondra wrote:
Either it belongs to the current batch (and either it's in the hash
table, or you add it there), or it's not - in that case write it to a
temp file.
I think the part you left out is that you need two files per batch: one
for the dumped-out partially-computed state values, and one for the
tuples.
In other words, you haven't really discussed the step where you reunite
the tuples with that partially-computed state.
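Concretely, a hedged sketch of what such a batch descriptor might carry (invented names, not from any patch; BufFile is PostgreSQL's temp-file abstraction) - the states would be deserialized back into the hash table first, and the tuple file replayed against it afterwards:

#include "postgres.h"
#include "storage/buffile.h"

/* One batch under the hashjoin-style scheme being discussed: spilled
 * transition states and spilled raw tuples are kept apart, and have to
 * be reunited when the batch is eventually processed. */
typedef struct AggBatch
{
    int         batchno;        /* low-order hash bits covered by this batch */
    BufFile    *state_file;     /* serialized partially-computed group states */
    BufFile    *tuple_file;     /* raw input tuples routed to this batch */
} AggBatch;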
For sure, it's not for free - it may write to quite a few files. Is it
more expensive than what you propose? I'm not sure about that. With
your batching scheme, you'll end up with lower number of large batches,
and you'll need to read and split them, possibly repeatedly. The
batching scheme from hashjoin minimizes this.
My approach only has fewer batches if it elects to have fewer batches,
which might happen for two reasons:
1. A cardinality misestimate. This certainly could happen, but we do
have useful numbers to work from (we know the number of tuples and
distincts that we've read so far), so it's far from a blind guess.
2. We're concerned about the random I/O from way too many partitions.
I fail to see how this is different from your approach? How can you
output any tuples before processing the whole inner relation?
Right, the only thing I avoid is scanning the hash table and dumping out
the groups.
This isn't a major distinction, more like "my approach does a little
less work before returning tuples", and I'm not even sure I can defend
that, so I'll retract this point.
Your approach is to do multi-level batching, and I was thinking whether
it'd be possible to use the same approach (single level). But in
retrospect it probably does not make much sense, because the multi-level
batching is one of the points of the proposed approach.
Now that I think about it, many of the points we discussed could
actually work with either approach:
* In my approach, if I need more partitions, I could create more in
much the same way as HashJoin to keep it single-level (as you suggest
above).
* In your approach, if there are too many partitions, you could avoid
random I/O by intentionally putting tuples from multiple partitions in a
single file and moving them while reading.
* If given a way to write out the partially-computed states, I could
evict some groups from the hash table to keep an array_agg() bounded.
Our approaches only differ on one fundamental trade-off that I see:
(A) My approach requires a hash lookup of an already-computed hash for
every incoming tuple, not only the ones going into the hash table.
(B) Your approach requires scanning the hash table and dumping out the
states every time the hash table fills up, which therefore requires a
way to dump out the partial states.
You could probably win the argument by pointing out that (A) is O(N) and
(B) is O(log2(N)). But I suspect that cost (A) is very low.
Unfortunately, it would take some effort to test your approach because
we'd actually need a way to write out the partially-computed state, and
the algorithm itself seems a little more complex. So I'm not really sure
how to proceed.
Regards,
Jeff Davis
On Thu, Aug 14, 2014 at 10:21 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 14 Srpen 2014, 18:02, Atri Sharma wrote:
On Thursday, August 14, 2014, Jeff Davis <pgsql@j-davis.com> wrote:
On Thu, 2014-08-14 at 10:06 -0400, Tom Lane wrote:
If you're following the HashJoin model, then what you do is the same thing
it does: you write the input tuple back out to the pending batch file for
the hash partition that now contains key 1001, whence it will be processed
when you get to that partition. I don't see that there's any special case
here.
HashJoin only deals with tuples. With HashAgg, you have to deal with a
mix of tuples and partially-computed aggregate state values. Not
impossible, but it is a little more awkward than HashJoin.

+1

Not to mention future cases if we start maintaining multiple state
values, in regard to grouping sets.

So what would you do for aggregates where the state is growing quickly?
Say, things like median() or array_agg()?

I think that "we can't do that for all aggregates" does not imply "we must
not do that at all."

There will always be aggregates not implementing dumping state for various
reasons, and in those cases the proposed approach is certainly a great
improvement. I like it, and I hope it will get committed.

But maybe for aggregates supporting serialize/deserialize of the state
(including all aggregates using known types, not just fixed-size types) a
hashjoin-like batching would be better? I can name a few custom aggregates
that'd benefit tremendously from this.
Yeah, could work, but is it worth adding additional paths (assuming this
patch gets committed) for some aggregates? I think we should do a further
analysis on the use case.
Just to be clear - this is certainly non-trivial to implement, and I'm not
trying to force anyone (e.g. Jeff) to implement the ideas I proposed. I'm
ready to spend time on reviewing the current patch, implement the approach
I proposed and compare the behaviour.
Totally agreed. It would be a different approach, albeit as you said, the
approach can be done off the current patch.
Kudos to Jeff for working on this.
Agreed :)
--
Regards,
Atri
*l'apprenant*
On Thu, 2014-08-14 at 12:53 -0400, Tom Lane wrote:
Oh? So if we have aggregates like array_agg whose memory footprint
increases over time, the patch completely fails to avoid bloat?
Yes, in its current form.
I might think a patch with such a limitation was useful, if it weren't
for the fact that aggregates of that nature are a large part of the
cases where the planner misestimates the table size in the first place.
Any complication that we add to nodeAgg should be directed towards
dealing with cases that the planner is likely to get wrong.
In my experience, the planner has a lot of difficulty estimating the
cardinality unless it's coming from a base table without any operators
above it (other than maybe a simple predicate). This is probably a lot
more common than array_agg problems, simply because array_agg is
relatively rare compared with GROUP BY in general.
Also, there are also cases where my patch should win against Sort even
when it does go to disk. For instance, high enough cardinality to exceed
work_mem, but also a large enough group size. Sort will have to deal
with all of the tuples before it can group any of them, whereas HashAgg
can group at least some of them along the way.
Consider the skew case where the cardinality is 2M, work_mem fits 1M
groups, and the input consists of the keys 1..1999999 mixed randomly
inside one billion zeros. (Aside: if the input is non-random, you may
not get the skew value before the hash table fills up, in which case
HashAgg is just as bad as Sort.)
That being said, we can hold out for an array_agg fix if desired. As I
pointed out in another email, my proposal is compatible with the idea of
dumping groups out of the hash table, and does take some steps in that
direction.
Regards,
Jeff Davis
On 14.8.2014 18:54, Jeff Davis wrote:
On Thu, 2014-08-14 at 16:17 +0200, Tomas Vondra wrote:
Either it belongs to the current batch (and either it's in the hash
table, or you add it there), or it's not - in that case write it to a
temp file.

I think the part you left out is that you need two files per batch: one
for the dumped-out partially-computed state values, and one for the
tuples.

In other words, you haven't really discussed the step where you reunite
the tuples with that partially-computed state.
No, that's not how the serialize/deserialize should work. The aggregate
needs to store the state as-is, so that after deserializing it gets
pretty much the same thing.
For example, for 'median' the state is the list of all the values
received so far, and when serializing it you have to write all the
values out. After deserializing it, you will get the same list of values.
Some aggregates may use complex data structures that may need more
elaborate serialize.
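As a toy illustration (plain C with stdio, not a PostgreSQL API; error handling omitted), a 'median'-style state that is just the list of values seen so far could be dumped and restored like this:

#include <stdio.h>
#include <stdlib.h>

typedef struct MedianState
{
    size_t  nvalues;
    double *values;             /* every value seen so far */
} MedianState;

/* Serialize: write the count followed by all the values. */
static void
serialize_state(const MedianState *state, FILE *out)
{
    fwrite(&state->nvalues, sizeof(state->nvalues), 1, out);
    fwrite(state->values, sizeof(double), state->nvalues, out);
}

/* Deserialize: read them back and get the same list of values. */
static MedianState *
deserialize_state(FILE *in)
{
    MedianState *state = malloc(sizeof(MedianState));

    fread(&state->nvalues, sizeof(state->nvalues), 1, in);
    state->values = malloc(sizeof(double) * state->nvalues);
    fread(state->values, sizeof(double), state->nvalues, in);
    return state;
}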
For sure, it's not for free - it may write to quite a few files. Is it
more expensive than what you propose? I'm not sure about that. With
your batching scheme, you'll end up with a lower number of large batches,
and you'll need to read and split them, possibly repeatedly. The
batching scheme from hashjoin minimizes this.

My approach only has fewer batches if it elects to have fewer batches,
which might happen for two reasons:
1. A cardinality misestimate. This certainly could happen, but we do
have useful numbers to work from (we know the number of tuples and
distincts that we've read so far), so it's far from a blind guess.
2. We're concerned about the random I/O from way too many partitions.
OK. We can't really do much with the cardinality estimate.
As for the random IO concerns, I did a quick test to see how this
behaves. I used a HP ProLiant DL380 G5 (i.e. a quite old machine, from
2006-09 if I'm not mistaken). 16GB RAM, RAID10 on 6 x 10k SAS drives,
512MB write cache. So a quite lousy machine, considering today's standards.
I used a simple C program (attached) that creates N files, and writes
into them in a round-robin fashion until a particular file size is
reached. I opted for 64GB total size, 1kB writes.
./iotest filecount filesize writesize
File size is in MB, writesize is in bytes. So for example this writes 64
files, each 1GB, using 512B writes.
./iotest 64 1024 512
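The attached program is not reproduced here, but a minimal round-robin writer along the lines described might look roughly like this (a sketch; the real attachment may differ, and error handling and the final fsync() pass are omitted):

#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
    if (argc != 4)
    {
        fprintf(stderr, "usage: %s filecount filesize_mb writesize\n", argv[0]);
        return 1;
    }

    int     filecount = atoi(argv[1]);
    long    filesize  = atol(argv[2]) * 1024L * 1024L;
    size_t  writesize = (size_t) atol(argv[3]);
    char   *buf = calloc(1, writesize);
    FILE  **files = malloc(sizeof(FILE *) * (size_t) filecount);
    long    written = 0;

    for (int i = 0; i < filecount; i++)
    {
        char name[64];

        snprintf(name, sizeof(name), "iotest.%d", i);
        files[i] = fopen(name, "wb");
    }

    /* cycle over the files until each one reaches the target size */
    while (written < filesize)
    {
        for (int i = 0; i < filecount; i++)
            fwrite(buf, 1, writesize, files[i]);
        written += (long) writesize;
    }

    for (int i = 0; i < filecount; i++)
        fclose(files[i]);
    return 0;
}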
Measured is duration before/after fsync (in seconds):
files | file size | before fsync | after fsync
---------------------------------------------------------
32 | 2048 | 290.16 | 294.33
64 | 1024 | 264.68 | 267.60
128 | 512 | 278.68 | 283.44
256 | 256 | 332.11 | 338.45
1024 | 64 | 419.91 | 425.48
2048 | 32 | 450.37 | 455.20
So while there is a difference, I don't think it's the 'random I/O wall'
as usually observed on rotational drives. Also, this is 2.6.32 kernel,
and my suspicion is that with a newer one the behaviour would be better.
I also have an SSD in that machine (Intel S3700), so I did the same test
with these results:
files | file size | before fsync | after fsync
---------------------------------------------------------
32 | 2048 | 445.05 | 464.73
64 | 1024 | 447.32 | 466.56
128 | 512 | 446.63 | 465.90
256 | 256 | 446.64 | 466.19
1024 | 64 | 511.85 | 523.24
2048 | 32 | 579.92 | 590.76
So yes, the number of files matters, but I don't think it's strong enough
to draw a clear line on how many batches we allow. Especially
considering how old this machine is (on 3.x kernels, we usually see much
better performance in I/O intensive conditions).
I fail to see how this is different from your approach? How can you
output any tuples before processing the whole inner relation?

Right, the only thing I avoid is scanning the hash table and dumping out
the groups.

This isn't a major distinction, more like "my approach does a little
less work before returning tuples", and I'm not even sure I can defend
that, so I'll retract this point.

Your approach is to do multi-level batching, and I was thinking whether
it'd be possible to use the same approach (single level). But in
retrospect it probably does not make much sense, because the multi-level
batching is one of the points of the proposed approach.

Now that I think about it, many of the points we discussed could
actually work with either approach:

* In my approach, if I need more partitions, I could create more in
much the same way as HashJoin to keep it single-level (as you suggest
above).

* In your approach, if there are too many partitions, you could avoid
random I/O by intentionally putting tuples from multiple partitions in a
single file and moving them while reading.

* If given a way to write out the partially-computed states, I could
evict some groups from the hash table to keep an array_agg() bounded.

Our approaches only differ on one fundamental trade-off that I see:

(A) My approach requires a hash lookup of an already-computed hash for
every incoming tuple, not only the ones going into the hash table.

(B) Your approach requires scanning the hash table and dumping out the
states every time the hash table fills up, which therefore requires a
way to dump out the partial states.

You could probably win the argument by pointing out that (A) is O(N) and
(B) is O(log2(N)). But I suspect that cost (A) is very low.

Unfortunately, it would take some effort to test your approach because
we'd actually need a way to write out the partially-computed state, and
the algorithm itself seems a little more complex. So I'm not really sure
how to proceed.
I plan to work on this a bit over the next week or two. In any case,
it'll be a limited implementation, but hopefully it will be usable for
some initial testing.
regards
Tomas
On 14.8.2014 21:47, Tomas Vondra wrote:
And just for fun, I did the same test on a workstation with 8GB of RAM,
S3700 SSD, i5-2500 CPU and kernel 3.12. That is, a more modern
hardware / kernel / ... compared to the machine above.
For a test writing 32GB of data (4x the RAM), I got these results:
files | file size | before fsync | after fsync
------------------------------------------------------
32 | 1024 | 171.27 | 175.96
64 | 512 | 165.57 | 170.12
128 | 256 | 165.29 | 169.95
256 | 128 | 164.69 | 169.62
512 | 64 | 163.98 | 168.90
1024 | 32 | 165.44 | 170.50
2048 | 16 | 165.97 | 171.35
4096 | 8 | 166.55 | 173.26
So, no sign of slowdown at all, in this case. I don't have a rotational
disk in the machine at this moment, so I can't repeat the test. But I
don't expect the impact to be much worse than for the old machine.
I'm not sure whether this proves we should not worry about the number of
batches at all - the old kernels / machines will be with us for some
time. However, I'm not a fan of artificially limiting the implementation
because of decade-old machines either.
Tomas
On Thu, Aug 14, 2014 at 2:21 PM, Jeff Davis <pgsql@j-davis.com> wrote:
On Thu, 2014-08-14 at 12:53 -0400, Tom Lane wrote:
Oh? So if we have aggregates like array_agg whose memory footprint
increases over time, the patch completely fails to avoid bloat?

Yes, in its current form.

I might think a patch with such a limitation was useful, if it weren't
for the fact that aggregates of that nature are a large part of the
cases where the planner misestimates the table size in the first place.
Any complication that we add to nodeAgg should be directed towards
dealing with cases that the planner is likely to get wrong.

In my experience, the planner has a lot of difficulty estimating the
cardinality unless it's coming from a base table without any operators
above it (other than maybe a simple predicate). This is probably a lot
more common than array_agg problems, simply because array_agg is
relatively rare compared with GROUP BY in general.
I think that's right, and I rather like your (Jeff's) approach. It's
definitely true that we could do better if we have a mechanism for
serializing and deserializing group states, but (1) I think an awful
lot of cases would get an awful lot better even just with the approach
proposed here and (2) I doubt we would make the
serialization/deserialization interfaces mandatory, so even if we had
that we'd probably want a fallback strategy anyway.
Furthermore, I don't really see that we're backing ourselves into a
corner here. If prohibiting creation of additional groups isn't
sufficient to control memory usage, but we have
serialization/deserialization functions, we can just pick an arbitrary
subset of the groups that we're storing in memory and spool their
transition states off to disk, thus reducing memory even further. I
understand Tomas' point to be that this is quite different from what
we do for hash joins, but I think it's a different problem. In the
case of a hash join, there are two streams of input tuples, and we've
got to batch them in compatible ways. If we were to, say, exclude an
arbitrary subset of tuples from the hash table instead of basing it on
the hash code, we'd have to test *every* outer tuple against the hash
table for *every* batch. That would incur a huge amount of additional
cost vs. being able to discard outer tuples once the batch to which
they pertain has been processed.
But the situation here isn't comparable, because there's only one
input stream. I'm pretty sure we'll want to keep track of which
transition states we've spilled due to lack of memory as opposed to
those which were never present in the table at all, so that we can
segregate the unprocessed tuples that pertain to spilled transition
states from the ones that pertain to a group we haven't begun yet.
And it might be that if we know (or learn as we go along) that we're
going to vastly blow out work_mem it makes sense to use batching to
segregate the tuples that we decide not to process onto N tapes binned
by hash code, so that we have a better chance that future batches will
be the right size to fit in memory. But I'm not convinced that
there's a compelling reason why the *first* batch has to be chosen by
hash code; we're actually best off picking any arbitrary set of groups
that does the best job reducing the amount of data remaining to be
processed, at least if the transition states are fixed size and maybe
even if they aren't.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 10.8.2014 23:26, Jeff Davis wrote:
[Jeff Davis's original proposal quoted in full - trimmed here; see the start of this thread.]
I've been working on this for a few hours - getting familiar with the
code, testing queries etc. Two comments.
1) Apparently there's something broken, because with this:
create table table_b (fk_id int, val_a int, val_b int);
insert into table_b
select i, mod(i,1000), mod(i,1000)
from generate_series(1,10000000) s(i);
analyze table_b;
I get this:
set work_mem = '8MB';
explain analyze select fk_id, count(*)
from table_b where val_a < 50 and val_b < 50 group by 1;
The connection to the server was lost. Attempting reset: Failed.
Stacktrace attached, but apparently there's a segfault in
advance_transition_function when accessing pergroupstate.
This happened for all queries that I tried, once they needed to do
the batching.
2) Using the same hash value both for dynahash and batching seems
really fishy to me. I'm not familiar with dynahash, but I'd bet
the way it's done now will lead to bad distribution in the hash
table (some buckets will always be empty in some batches, etc.).
This is why hashjoin tries so hard to use non-overlapping parts
of the hash for batchno/bucketno.
The hashjoin implements its own hash table, which makes it clear
how the bucket is derived from the hash value. I'm not sure how
dynahash does that, but I'm pretty sure we can't just reuse the hash
value like this.
I see two options - compute our own hash value, or somehow derive
a new one (e.g. by doing "hashvalue XOR random_seed"). I'm not sure
the latter would work, though.
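For reference, a minimal sketch of the kind of split hashjoin uses (modelled
on ExecHashGetBucketAndBatch in nodeHash.c; the function below is
illustrative, not the actual code): the bucket comes from the low bits and
the batch from the bits above them, so the two never overlap.

static void
get_bucket_and_batch(uint32 hashvalue, int log2_nbuckets,
                     int nbuckets, int nbatch,
                     int *bucketno, int *batchno)
{
    /* nbuckets and nbatch are both powers of two */
    *bucketno = hashvalue & (nbuckets - 1);                  /* low bits */
    *batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);  /* next bits up */
}

If bucketno and batchno were instead derived from overlapping bits of the
same value, then within any one batch only the buckets whose low bits match
that batch could ever be populated - which is exactly the concern above.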
regards
Tomas
Attachments:
On 15.8.2014 19:53, Robert Haas wrote:
On Thu, Aug 14, 2014 at 2:21 PM, Jeff Davis <pgsql@j-davis.com> wrote:
On Thu, 2014-08-14 at 12:53 -0400, Tom Lane wrote:
Oh? So if we have aggregates like array_agg whose memory footprint
increases over time, the patch completely fails to avoid bloat?
Yes, in its current form.
I might think a patch with such a limitation was useful, if it weren't
for the fact that aggregates of that nature are a large part of the
cases where the planner misestimates the table size in the first place.
Any complication that we add to nodeAgg should be directed towards
dealing with cases that the planner is likely to get wrong.
In my experience, the planner has a lot of difficulty estimating the
cardinality unless it's coming from a base table without any operators
above it (other than maybe a simple predicate). This is probably a lot
more common than array_agg problems, simply because array_agg is
relatively rare compared with GROUP BY in general.
I think that's right, and I rather like your (Jeff's) approach. It's
definitely true that we could do better if we have a mechanism for
serializing and deserializing group states, but (1) I think an awful
lot of cases would get an awful lot better even just with the approach
proposed here and (2) I doubt we would make the
serialization/deserialization interfaces mandatory, so even if we had
that we'd probably want a fallback strategy anyway.
I certainly agree that we need Jeff's approach even if we can do better
in some cases (when we are able to serialize/deserialize the states).
And yes, (mis)estimating the cardinalities is a big issue, and certainly
a source of many problems.
Furthermore, I don't really see that we're backing ourselves into a
corner here. If prohibiting creation of additional groups isn't
sufficient to control memory usage, but we have
serialization/deserialization functions, we can just pick an arbitrary
subset of the groups that we're storing in memory and spool their
transition states off to disk, thus reducing memory even further. I
understand Tomas' point to be that this is quite different from what
we do for hash joins, but I think it's a different problem. In the
case of a hash join, there are two streams of input tuples, and we've
got to batch them in compatible ways. If we were to, say, exclude an
arbitrary subset of tuples from the hash table instead of basing it on
the hash code, we'd have to test *every* outer tuple against the hash
table for *every* batch. That would incur a huge amount of additional
cost vs. being able to discard outer tuples once the batch to which
they pertain has been processed.
Being able to batch inner and outer relations in a matching way is
certainly one of the reasons why hashjoin uses that particular scheme.
There are other reasons, though - for example being able to answer 'Does
this group belong to this batch?' quickly, and automatically update
number of batches.
I'm not saying the lookup is extremely costly, but I'd be very surprised
if it was as cheap as modulo on a 32-bit integer. Not saying it's the
dominant cost here, but memory bandwidth is quickly becoming one of the
main bottlenecks.
But the situation here isn't comparable, because there's only one
input stream. I'm pretty sure we'll want to keep track of which
transition states we've spilled due to lack of memory as opposed to
those which were never present in the table at all, so that we can
segregate the unprocessed tuples that pertain to spilled transition
states from the ones that pertain to a group we haven't begun yet.
Why would that be necessary or useful? I don't see a reason for tracking
that / segregating the tuples.
And it might be that if we know (or learn as we go along) that we're
going to vastly blow out work_mem it makes sense to use batching to
segregate the tuples that we decide not to process onto N tapes binned
by hash code, so that we have a better chance that future batches will
be the right size to fit in memory. But I'm not convinced that
there's a compelling reason why the *first* batch has to be chosen by
hash code; we're actually best off picking any arbitrary set of groups
that does the best job reducing the amount of data remaining to be
processed, at least if the transition states are fixed size and maybe
even if they aren't.
If you don't choose the first batch by hash code, it's over, IMHO. You
can't just redo that later easily, because the HashWork items are
already treated separately.
regards
Tomas
On Fri, 2014-08-15 at 13:53 -0400, Robert Haas wrote:
I think that's right, and I rather like your (Jeff's) approach. It's
definitely true that we could do better if we have a mechanism for
serializing and deserializing group states, but (1) I think an awful
lot of cases would get an awful lot better even just with the approach
proposed here and (2) I doubt we would make the
serialization/deserialization interfaces mandatory, so even if we had
that we'd probably want a fallback strategy anyway.
Thank you for taking a look.
To solve the problem for array_agg, that would open up two potentially
lengthy discussions:
1. Trying to support non-serialized representations (like
ArrayBuildState for array_agg) as a real type rather than using
"internal".
2. What changes should we make to the aggregate API? As long as we're
changing/extending it, should we go the whole way and support partial
aggregation[1] (particularly useful for parallelism)?
Both of those discussions are worth having, and perhaps they can happen
in parallel as I wrap up this patch.
I'll see whether I can get consensus that my approach is (potentially)
commit-worthy, and your statement that it (potentially) solves a real
problem is a big help.
Regards,
Jeff Davis
[1]: http://blogs.msdn.com/b/craigfr/archive/2008/01/18/partial-aggregation.aspx
On 19 Srpen 2014, 9:52, Jeff Davis wrote:
On Fri, 2014-08-15 at 13:53 -0400, Robert Haas wrote:
I think that's right, and I rather like your (Jeff's) approach. It's
definitely true that we could do better if we have a mechanism for
serializing and deserializing group states, but (1) I think an awful
lot of cases would get an awful lot better even just with the approach
proposed here and (2) I doubt we would make the
serialization/deserialization interfaces mandatory, so even if we had
that we'd probably want a fallback strategy anyway.
Thank you for taking a look.
To solve the problem for array_agg, that would open up two potentially
lengthy discussions:
1. Trying to support non-serialized representations (like
ArrayBuildState for array_agg) as a real type rather than using
"internal".
That's certainly an option, and it's quite straightforward. The downside
of it is that you either prevent the aggregates from using the most
efficient state form (e.g. the array_agg might use a simple array as a
state) or you cause a proliferation of types with no other purpose.
2. What changes should we make to the aggregate API? As long as we're
changing/extending it, should we go the whole way and support partial
aggregation[1] (particularly useful for parallelism)?
Maybe, but not in this patch please. That's far wider scope, and while
considering it when designing API changes is probably a good idea, we
should resist the attempt to do those two things in the same patch.
Both of those discussions are worth having, and perhaps they can happen
in parallel as I wrap up this patch.
Exactly.
I'll see whether I can get consensus that my approach is (potentially)
commit-worthy, and your statement that it (potentially) solves a real
problem is a big help.
IMHO it's a step in the right direction. It may not go as far as I'd like,
but that's OK.
regards
Tomas
On Sun, Aug 17, 2014 at 1:17 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
Being able to batch inner and outer relations in a matching way is
certainly one of the reasons why hashjoin uses that particular scheme.
There are other reasons, though - for example being able to answer 'Does
this group belong to this batch?' quickly, and automatically update
number of batches.
I'm not saying the lookup is extremely costly, but I'd be very surprised
if it was as cheap as modulo on a 32-bit integer. Not saying it's the
dominant cost here, but memory bandwidth is quickly becoming one of the
main bottlenecks.
Well, I think you're certainly right that a hash table lookup is more
expensive than modulo on a 32-bit integer; so much is obvious. But if
the load factor is not too large, I think that it's not a *lot* more
expensive, so it could be worth it if it gives us other advantages.
As I see it, the advantage of Jeff's approach is that it doesn't
really matter whether our estimates are accurate or not. We don't
have to decide at the beginning how many batches to do, and then
possibly end up using too much or too little memory per batch if we're
wrong; we can let the amount of memory actually used during execution
determine the number of batches. That seems good. Of course, a hash
join can increase the number of batches on the fly, but only by
doubling it, so you might go from 4 batches to 8 when 5 would really
have been enough. And a hash join also can't *reduce* the number of
batches on the fly, which might matter a lot. Getting the number of
batches right avoids I/O, which is a lot more expensive than CPU.
But the situation here isn't comparable, because there's only one
input stream. I'm pretty sure we'll want to keep track of which
transition states we've spilled due to lack of memory as opposed to
those which were never present in the table at all, so that we can
segregate the unprocessed tuples that pertain to spilled transition
states from the ones that pertain to a group we haven't begun yet.
Why would that be necessary or useful? I don't see a reason for tracking
that / segregating the tuples.
Suppose there are going to be three groups: A, B, C. Each is an
array_agg(), and they're big, so only one of them will fit in work_mem at
a time. However, we don't know that at the beginning, either because
we don't write the code to try or because we do write that code but
our cardinality estimates are way off; instead, we're under the
impression that all of them will fit in work_mem. So we start reading
tuples. We see values for A and B, but we don't see any values for C
because those all occur later in the input. Eventually, we run short
of memory and cut off creation of new groups. Any tuples for C are
now going to get written to a tape from which we'll later reread them.
After a while, even that proves insufficient and we spill the
transition state for B to disk. Any further tuples that show up for C
will need to be written to tape as well. We continue processing and
finish group A.
Now it's time to do batch #2. Presumably, we begin by reloading the
serialized transition state for group B. To finish group B, we must
look at all the tuples that might possibly fall in that group. If all
of the remaining tuples are on a single tape, we'll have to read all
the tuples in group B *and* all the tuples in group C; we'll
presumably rewrite the tuples that are not part of this batch onto a
new tape, which we'll then process in batch #3. But if we took
advantage of the first pass through the input to put the tuples for
group B on one tape and the tuples for group C on another tape, we can
be much more efficient - just read the remaining tuples for group B,
not mixed with anything else, and then read a separate tape for group
C.
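To put rough numbers on that (a toy model, not a measurement): say groups B
and C each still have a million tuples on tape after the first pass.

#include <stdio.h>

int main(void)
{
    long nb = 1000000;  /* group B tuples left on tape after batch #1 */
    long nc = 1000000;  /* group C tuples left on tape after batch #1 */

    /* One mixed tape: batch #2 reads both B and C, rewrites the C tuples
     * onto a new tape, and batch #3 reads them again. */
    long mixed_reads = (nb + nc) + nc;
    long mixed_extra_writes = nc;

    /* Segregated tapes: batch #2 reads only B, batch #3 reads only C. */
    long segregated_reads = nb + nc;
    long segregated_extra_writes = 0;

    printf("mixed: %ld reads, %ld extra writes; segregated: %ld reads, %ld extra writes\n",
           mixed_reads, mixed_extra_writes,
           segregated_reads, segregated_extra_writes);
    return 0;
}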
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, 2014-08-20 at 14:32 -0400, Robert Haas wrote:
Well, I think you're certainly right that a hash table lookup is more
expensive than modulo on a 32-bit integer; so much is obvious. But if
the load factor is not too large, I think that it's not a *lot* more
expensive, so it could be worth it if it gives us other advantages.
As I see it, the advantage of Jeff's approach is that it doesn't
really matter whether our estimates are accurate or not. We don't
have to decide at the beginning how many batches to do, and then
possibly end up using too much or too little memory per batch if we're
wrong; we can let the amount of memory actually used during execution
determine the number of batches. That seems good. Of course, a hash
join can increase the number of batches on the fly, but only by
doubling it, so you might go from 4 batches to 8 when 5 would really
have been enough. And a hash join also can't *reduce* the number of
batches on the fly, which might matter a lot. Getting the number of
batches right avoids I/O, which is a lot more expensive than CPU.
My approach uses partition counts that are powers-of-two also, so I
don't think that's a big differentiator. In principle my algorithm could
adapt to other partition counts, but I'm not sure how big of an
advantage there is.
I think the big benefit of my approach is that it doesn't needlessly
evict groups from the hashtable. Consider input like 0, 1, 0, 2, ..., 0,
N. For large N, if you evict group 0, you have to write out about N
tuples; but if you leave it in, you only have to write out about N/2
tuples. The hashjoin approach doesn't give you any control over
eviction, so you only have about 1/P chance of saving the skew group
(where P is the ultimate number of partitions). With my approach, we'd
always keep the skew group in memory (unless we're very unlucky, and the
hash table fills up before we even see the skew value).
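To illustrate with a toy model (my numbers, assuming the hash table fills up
after the first half of the 0, 1, 0, 2, ..., 0, N input):

#include <stdio.h>

int main(void)
{
    long n = 1000000;      /* number of distinct non-skew values */
    long remaining = n;    /* tuples left after the hash table fills halfway */

    /* Keep the skew group 0 in memory: only the non-skew half of the
     * remaining tuples has to be written out to a partition. */
    long written_if_kept = remaining / 2;

    /* Evict group 0: every remaining tuple, skew or not, goes to disk. */
    long written_if_evicted = remaining;

    printf("kept: ~%ld tuples written, evicted: ~%ld tuples written\n",
           written_if_kept, written_if_evicted);
    return 0;
}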
Regards,
Jeff Davis
Summary of this thread so far:
There was a lot of discussion comparing this with Tomas Vondra's Hash
Join patch. The conclusion was that while it would be nice to be able to
dump transition state to disk, for aggregates like array_agg, the patch
is fine as it is. Dumping transition states would require much more
work, and this is already useful without it. Moreover, solving the
array_agg problem later won't require a rewrite; rather, it'll build on
top of this.
You listed a number of open items in the original post, and these are
still outstanding:
* costing
* EXPLAIN details for disk usage
* choose number of partitions intelligently
* performance testing
I think this is enough for this commitfest - we have consensus on the
design. For the next one, please address those open items, and resubmit.
- Heikki
On Tue, 2014-08-26 at 12:39 +0300, Heikki Linnakangas wrote:
I think this is enough for this commitfest - we have consensus on the
design. For the next one, please address those open items, and resubmit.
Agreed, return with feedback.
I need to get the accounting patch in first, which needs to address some
performance issues, but there's a chance of wrapping those up quickly.
Regards,
Jeff Davis
On 26.8.2014 21:38, Jeff Davis wrote:
On Tue, 2014-08-26 at 12:39 +0300, Heikki Linnakangas wrote:
I think this is enough for this commitfest - we have consensus on
the design. For the next one, please address those open items, and
resubmit.
Agreed, return with feedback.
I need to get the accounting patch in first, which needs to address
some performance issues, but there's a chance of wrapping those up
quickly.
Sounds good to me.
I'd like to coordinate our efforts on this a bit, if you're interested.
I've been working on the hashjoin-like batching approach PoC (because I
proposed it, so it's fair I do the work), and I came to the conclusion
that it's pretty much impossible to implement on top of dynahash. I
ended up replacing it with a hashtable (similar to the one in the
hashjoin patch, unsurprisingly), which supports the batching approach
well, and is more memory efficient and actually faster (I see ~25%
speedup in most cases, although YMMV).
I plan to address this in 4 patches:
(1) replacement of dynahash by the custom hash table (done)
(2) memory accounting (not sure what your plan is, I've used the
approach I proposed on 23/8 for now, with a few bugfixes/cleanups)
(3) applying your HashWork patch on top of this (I have this mostly
completed, but need to do more testing over the weekend)
(4) extending this with the batching I proposed, initially only for
aggregates with states that we can serialize/deserialize easily
(e.g. types passed by value) - I'd like to hack on this next week
So at this point I have (1) and (2) pretty much ready, (3) is almost
complete and I plan to start hacking on (4). Also, this does not address
the open items listed in your message.
But I agree this is more complex than the patch you proposed. So if you
choose to pursue your patch, I have no problem with that - I'll then
rebase my changes on top of your patch and submit them separately.
regards
Tomas
On 29.8.2014 00:02, Tomas Vondra wrote:
On 26.8.2014 21:38, Jeff Davis wrote:
On Tue, 2014-08-26 at 12:39 +0300, Heikki Linnakangas wrote:
I think this is enough for this commitfest - we have consensus on
the design. For the next one, please address those open items, and
resubmit.
Agreed, return with feedback.
I need to get the accounting patch in first, which needs to address
some performance issues, but there's a chance of wrapping those up
quickly.
Sounds good to me.
I'd like to coordinate our efforts on this a bit, if you're interested.
I've been working on the hashjoin-like batching approach PoC (because I
proposed it, so it's fair I do the work), and I came to the conclusion
that it's pretty much impossible to implement on top of dynahash. I
ended up replacing it with a hashtable (similar to the one in the
hashjoin patch, unsurprisingly), which supports the batching approach
well, and is more memory efficient and actually faster (I see ~25%
speedup in most cases, although YMMV).
I plan to address this in 4 patches:
(1) replacement of dynahash by the custom hash table (done)
(2) memory accounting (not sure what's your plan, I've used the
approach I proposed on 23/8 for now, with a few bugfixes/cleanups)
(3) applying your HashWork patch on top of this (I have this mostly
completed, but need to do more testing over the weekend)
(4) extending this with the batching I proposed, initially only for
aggregates with states that we can serialize/deserialize easily
(e.g. types passed by value) - I'd like to hack on this next week
So at this point I have (1) and (2) pretty much ready, (3) is almost
complete and I plan to start hacking on (4). Also, this does not address
the open items listed in your message.
Hi,
Attached are patches implementing this. In the end, I decided to keep
the two approaches separate for now, i.e. either the HashWork-based
batching, or hashjoin-like batching. It's easier to play with when it's
separate, and I think we need to figure out how the two approaches fit
together first (if they fit at all).
Shared patches:
(1) hashagg-dense-allocation-v1.patch
- replacement for dynahash, with dense allocation (essentially the
same idea as in the hashjoin patch)
- this is necessary for the hashjoin-like batching, because dynahash
does not free memory
- it also makes the hashagg less memory expensive and faster (see
the test results further down)
- IMHO this part is in pretty good shape, i.e. I don't expect bugs
or issues in this (although I do expect pushback to replacing
dynahash, which is code widely used throughout the codebase).
(2) memory-accounting-v1.patch
- based on the ideas discussed in the 'memory accounting thread',
with some improvements
- this really needs a lot of work, the current code works but there
are various subtle issues - essentially this should be replaced
with whatever comes from the memory accounting thread
These two patches need to be applied first, before using either (3a-b)
or (4), implementing the two batching approaches:
(3a) hashagg-batching-jeff-v1.patch
- essentially a 1:1 of Jeff's patch, applied on top of the dense-
allocated hash table, mentioned in (1)
- I also ran into a few bugs causing segfaults IIRC (I'll report
them in a separate message, if I remember them)
(3b) hashagg-batching-jeff-pt2-v1.patch
- this adds two things - basic estimation of how many partitions
to use, and basic info to explain
- the idea behind estimating number of partitions is quite simple:
We don't really need to decide until the first tuple needs to be
stored - when that happens, see how many more tuples we expect
relative to how many we have already consumed, and use that ratio
as the number of partitions (or rather the nearest power of 2);
a rough sketch of this heuristic follows after this list. In most
cases this number of partitions
is higher, because it assumes once we get the same number of
tuples, we'll get the same number of new groups. But that's most
likely untrue, as some of the groups are already present in the
hash table.
This may be further improved - first, at this stage we only know
the expected number of input tuples. Second, with various
aggregates the existing states may grow as more tuples are added
to the state.
So at the end we can look at how many tuples we actually got,
and how much memory we actually consumed, and use that to decide
on the size for the second-level HashWork items. For example, if
we expected N tuples, but actually got 2*N, and at the end of
the initial batch we ended up with 2*work_mem, we may choose
to do 4 partitions in the second step - that way we're more
likely not to exceed work_mem, and we can do that right away.
I believe this might effectively limit the necessary HashWork
levels to 2:
* initial scan
* 1st level : # of partitions determined on the first tuple
* 2nd level : # of partitions determined at the end of the
initial scan
Does that make sense?
- regarding the info added to explain, I came to the conclusion that
these values are interesting:
* number of batches - how many HashWork items were created
* number of rebatches - number of times a HashWork is split into
partitions
* rescan ratio - number of tuples that had to be stored into a
batch, and then read again
- this may exceed 100% if there are multiple
levels of HashWork items, so a single tuple may
be read/stored repeatedly because of using too
low number of partitions
* min/max partitions size (probably not as useful as I thought)
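For what it's worth, a rough sketch of the partition-count heuristic
described above (the function and variable names are mine, not from the
patch): at the moment the first tuple has to be spilled, take the ratio of
expected remaining tuples to tuples already consumed and round it up to a
power of two.

static int
choose_num_partitions(double tuples_consumed, double tuples_expected)
{
    double remaining = tuples_expected - tuples_consumed;
    double ratio;
    int npartitions = 1;

    if (tuples_consumed <= 0 || remaining <= 0)
        return 1;           /* nothing useful to base an estimate on */

    /* assumes the same number of tuples yields the same number of new
     * groups, which tends to overestimate (some groups already exist) */
    ratio = remaining / tuples_consumed;

    while (npartitions < ratio)
        npartitions *= 2;   /* round up to the nearest power of two */

    return npartitions;
}

The second-level decision at the end of the initial scan (based on the actual
tuple count and memory used) could reuse the same rounding.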
And the hashjoin-like batching (which is in a considerably less mature
state compared to the previous patch):
(4) hashagg-batching-hashjoin-v1.patch
- there's not much to say about the principle, it's pretty much the
same as in hashjoin, and uses a single level of batches (as
opposed to the tree-ish structure of HashWork items)
- I added similar info to explain (especially the rescan ratio)
- currently this only supports aggregates with states passed by
value (e.g. COUNT(*))
- extension to known types seems straightforward, supporting
'internal' will require more work
So either you apply (1), (2), (3a) and (3b), or (1), (2) and (4).
All the patches currently pass 'make installcheck', except for a few
failures that are caused by a different order of rows in the result (which
is really an issue in the test itself, not using an ORDER BY clause and
expecting sorted output).
Regarding memory contexts
-------------------------
Both patches measure only the memory used for the hash table, not the
whole aggcontext, which is really what ought to be measured. For
aggregates using passed-by-value states this does not make any
difference, but passed-by-ref states are allocated in aggcontext.
For example array_agg creates sub-contexts of aggcontext for each group.
So I think the hierarchy of contexts will require some rethinking,
because we want/need to throw away the states between partitions. As
this is currently located in aggcontext, it's difficult (we'd have to
redo the whole initialization).
Work_mem sizes
--------------
Another problem with the current memory accounting is that it tracks
blocks, not individual palloc/pfree calls. However AllocSet keeps some
of the blocks allocated for future use, which confuses the accounting.
This only happens with small work_mem values; values of 8MB or more
seem to work fine. I'm not sure what the final accounting will look like, but
I expect it to solve this issue.
Testing and benchmarking
------------------------
I also did some basic testing, with three datasets - the testing scripts
and results are attached in the hashagg-testing.tgz. See the
hashagg-bench.sql for details - it creates three tables: small (1M),
medium (10M) and large (50M) with columns with different cardinalities.
Then a series of GROUP BY queries is executed - query "a" has 1:1 groups
(i.e. 1 group per table row), "b" 1:10 (10 rows per group), "c" 1:100
and "d" only 100 groups in total. These queries are executed with
different work_mem values (64MB to 1GB), and the durations are measured.
See the hashagg-bench.sql script (in the .tgz) for details.
Attached are two CSV files containing both the raw results (4 runs per
query) and the aggregated results (average of the runs), plus logs with
the complete output and explain (analyze) plans of the queries for inspection.
Attached are two charts for the large dataset (50M), because it nicely
illustrates the differences - for work_mem=1024MB and work_mem=128MB.
In general, it shows that for this set of queries:
* Dense allocation gives ~20% speedup (and this is true for the
other datasets). The only case when this is not true is query "a"
but that's the query not using HashAggregate (so the dense
allocation has nothing to do with this, AFAIK).
* The difference between the two approaches is rather small.
Sometimes the Jeff's approach is faster, sometimes hashjoin-like
batching is faster.
* There may be cases when we actually slow down queries, because we
trigger batching (irrespective of the approach). This is a
feature, not a bug. Either we want to respect work_mem or not.
It's important to say, however, that this test is extremely simplistic
and makes it easy for the planner to get the number of groups reasonably
right, as the queries are grouping by a single column with a quite well
known cardinality. In practice, that's hardly the case. And incorrect
estimates are probably the place where the differences between the
approaches will be most significant.
Also, the 'large' dataset is not really as large as it should be. 50M
rows is not that much I guess.
I think we should create a wider set of tests, which should give us some
insight into proper costing etc.
Tomas
Attachments:
hashagg-dense-allocation-v1.patchtext/x-diff; name=hashagg-dense-allocation-v1.patchDownload
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 510d1c5..6455864 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -296,9 +296,18 @@ typedef struct AggHashEntryData *AggHashEntry;
typedef struct AggHashEntryData
{
- TupleHashEntryData shared; /* common header for hash table entries */
- /* per-aggregate transition status array - must be last! */
+
+ /* pointer to the next entry in the bucket */
+ AggHashEntry next;
+
+ /* hash computed from the group keys (stored in mintuple) */
+ uint32 hashvalue;
+
+ /* minimal tuple storing values for group keys */
+ MinimalTuple tuple;
+
AggStatePerGroupData pergroup[1]; /* VARIABLE LENGTH ARRAY */
+
} AggHashEntryData; /* VARIABLE LENGTH STRUCT */
@@ -321,7 +330,7 @@ static void finalize_aggregate(AggState *aggstate,
Datum *resultVal, bool *resultIsNull);
static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
-static void build_hash_table(AggState *aggstate);
+static void build_hash_table(AggState *aggstate, Size tuple_width);
static AggHashEntry lookup_hash_entry(AggState *aggstate,
TupleTableSlot *inputslot);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
@@ -329,6 +338,86 @@ static void agg_fill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
+static uint32 compute_hash_value(AggState * aggstate, TupleTableSlot * slot);
+static uint32 compute_bucket(AggState * aggstate, uint32 hashvalue);
+static bool groups_match(AggState * aggstate, TupleTableSlot *slot, AggHashEntry entry);
+static void increase_nbuckets(AggState * aggstate);
+static char * chunk_alloc(AggHashTable htab, int size);
+static void reset_hash_table(AggHashTable htab);
+
+static void IteratorReset(AggHashTable htab);
+static AggHashEntry IteratorGetNext(AggHashTable htab);
+
+/*
+ * The size of the chunks for dense allocation. This needs to be >8kB
+ * because the default (and only) memory context implementation uses
+ * 8kB as a boundary for keeping the blocks on a freelist. Which is
+ * exactly what we don't want here - we want to free the chunk when
+ * we don't need it (so that it can be reused for aggstate and so on).
+ *
+ * 16kB seems like a good default value.
+ */
+#define HASH_CHUNK_SIZE (16*1024L)
+
+typedef struct HashChunkData
+{
+ int ntuples; /* number of tuples stored in this chunk */
+ Size maxlen; /* length of the buffer */
+ Size used; /* number of chunk bytes already used */
+
+ struct HashChunkData *next; /* pointer to the next chunk (linked list) */
+
+ char data[1]; /* buffer allocated at the end */
+} HashChunkData;
+
+typedef struct HashChunkData* HashChunk;
+
+/*
+ * A simple hashtable, storing the data densely into larger chunks.
+ * Originally, HashAgg used dynahash (through methods in nodeGrouping.c)
+ * but that does not allow removing the entries and freeing memory. So
+ * this approach, already used in nodeHash.c was used here too (and
+ * wrapped a bit more nicely).
+ *
+ * The hash entries (containing the per-group data) and tuples (with
+ * keys of the group) are interleaved, i.e. the entry is always stored
+ * first, then the tuple (in a MinimalTuple format). The entries are
+ * always fixed size (either the aggregate state is passed by value and
+ * stored inline, or passed by reference and stored in a regularly
+ * palloced memory), the tuples are of arbitrary size.
+ */
+typedef struct AggHashTableData
+{
+
+ int nentries; /* number of hash table entries */
+ int nbuckets; /* current number of buckets */
+ int nbuckets_max; /* max number of buckets */
+
+ /* items copied from the TupleHashTable, because we still need them */
+ MemoryContext tmpctx; /* short-lived memory context (hash/eq funcs) */
+ AttrNumber *keyColIdx; /* attr numbers of key columns */
+ int numCols; /* number of columns */
+ TupleTableSlot *slot; /* tuple slot for groups_match */
+ Size entrysize; /* size of hash table entry (no tuple) */
+
+ MemoryContext htabctx; /* memory context for the chunks */
+
+ /* buckets of the hash table */
+ AggHashEntry *buckets;
+
+ /*
+ * Used for iterating through the hash table - it keeps track of the
+ * current chunk, and entry within the chunk. Use the provided
+ * methods to initialize and advance the iterator.
+ */
+ HashChunk cur_chunk;
+ AggHashEntry cur_entry;
+
+ /* list of chunks with dense-packed entries / minimal tuples */
+ HashChunk chunks_hash;
+
+} AggHashTableData;
+
/*
* Initialize all aggregates for a new group of input values.
@@ -928,26 +1017,111 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
* The hash table always lives in the aggcontext memory context.
*/
static void
-build_hash_table(AggState *aggstate)
+build_hash_table(AggState *aggstate, Size tuple_width)
{
Agg *node = (Agg *) aggstate->ss.ps.plan;
- MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
- Size entrysize;
+ Size entrysize; /* size of entry in the hash table */
+ Size groupsize; /* space used by the group (includes bucket) */
+ AggHashTable htab;
+
+ /* we assume 1024 buckets (i.e. 8kB of memory) is minimum */
+ int nbuckets = 1024;
+ int nbuckets_max = 1024;
Assert(node->aggstrategy == AGG_HASHED);
Assert(node->numGroups > 0);
- entrysize = sizeof(AggHashEntryData) +
- (aggstate->numaggs - 1) * sizeof(AggStatePerGroupData);
-
- aggstate->hashtable = BuildTupleHashTable(node->numCols,
- node->grpColIdx,
- aggstate->eqfunctions,
- aggstate->hashfunctions,
- node->numGroups,
- entrysize,
- aggstate->aggcontext,
- tmpmem);
+ /*
+ * Compute size of the hash table entry (this is actual size, but it
+ * does not include the MinimalTuple size, with values of the keys
+ * for the group). There's only a pointer to the minimal tuple.
+ */
+ entrysize = MAXALIGN(sizeof(AggHashEntryData) +
+ (aggstate->numaggs - 1) * sizeof(AggStatePerGroupData));
+
+ /*
+ * Estimate the size of the group, so that we can estimate how many
+ * of them fit into work_mem, and thus estimate what is the reasonable
+ * max number of buckets that we can use. To do that we add the entry
+ * size, a bucket (because we're shooting for <1 load factor), and
+ * estimated tuple width (because we'll keep the first tuple for each
+ * group because of group key values).
+ *
+ * XXX This does not include size of the aggregate states, passed by
+ * reference. First, we don't know how to determine that. However,
+ * if the states are small it won't make much difference and if
+ * the states get large the memory required for the buckets is
+ * going to be much less important.
+ */
+ groupsize = entrysize + sizeof(AggHashEntry)
+ + sizeof(MinimalTupleData) + tuple_width;
+
+ /*
+ * determine maximum number of buckets that can fit into work_mem (along with
+ * the entry)
+ *
+ * This assumes all the space is used by AggHashEntries, but many aggregates
+ * are keeping state separate (e.g. as a "pass by reference" Datums), which
+ * results in nbuckets_max values higher than possible in practice. But we
+ * don't know that at this point, and we don't need to worry too much about
+ * it because those aggregates do it to handle states that are significantly
+ * larger than 8B, which makes the 8B per-bucket negligible.
+ *
+ * And of course, as mentioned above, this does not include the actual data
+ * stored in the MinimalTuple.
+ *
+ * XXX We may re-evaluate this over time, as we'll know how many entries are
+ * there, and thus what is the average size of aggregate size. That is,
+ * as the state size grows, we may decrease the number of buckets. We'll
+ * save a bit of memory by that (although not much).
+ */
+ while (nbuckets_max * groupsize <= work_mem * 1024L)
+ nbuckets_max *= 2;
+
+ /*
+ * Update the initial number of buckets to match expected number of groups,
+ * but don't grow over nbuckets_max because in that case we'll start with
+ * the batching anyway.
+ */
+ while ((nbuckets < node->numGroups) && (nbuckets < nbuckets_max))
+ nbuckets *= 2;
+
+ /*
+ * XXX When batching, we might use (numGroups / nbuckets) as a starting
+ * nbatch value, but maybe we can start with nbatch=1 with the assumption
+ * that multiple tuples will be 'compressed' into the group (and thus
+ * we'll write less data in total).
+ */
+
+ htab = (AggHashTable)MemoryContextAllocZero(aggstate->aggcontext,
+ sizeof(AggHashTableData));
+
+ /* TODO create a memory context for the hash table */
+ htab->htabctx = AllocSetContextCreate(aggstate->aggcontext,
+ "HashAggHashTable",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+
+ /* buckets are just pointers to AggHashEntryData structures */
+ htab->buckets = (AggHashEntry*)MemoryContextAllocZero(htab->htabctx,
+ nbuckets * sizeof(AggHashEntry));
+
+ /* copy the column IDs from the node */
+ htab->keyColIdx = node->grpColIdx;
+
+ /* we'll use the per-tuple memory context for the hash/eq functions */
+ htab->tmpctx = aggstate->tmpcontext->ecxt_per_tuple_memory;
+
+ htab->nbuckets = nbuckets;
+ htab->nbuckets_max = nbuckets_max;
+ htab->nentries = 0;
+ htab->slot = NULL;
+ htab->numCols = node->numCols;
+ htab->entrysize = entrysize;
+
+ aggstate->hashtable = htab;
+
}
/*
@@ -1026,40 +1200,77 @@ hash_agg_entry_size(int numAggs)
static AggHashEntry
lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
{
- TupleTableSlot *hashslot = aggstate->hashslot;
- ListCell *l;
- AggHashEntry entry;
- bool isnew;
- /* if first time through, initialize hashslot by cloning input slot */
- if (hashslot->tts_tupleDescriptor == NULL)
+ AggHashEntry entry = NULL;
+ uint32 hashvalue;
+ uint32 bucketno;
+ MinimalTuple mintuple;
+
+ hashvalue = compute_hash_value(aggstate, inputslot);
+ bucketno = compute_bucket(aggstate, hashvalue);
+
+ entry = aggstate->hashtable->buckets[bucketno];
+
+ /* try to find a matching entry in the hash table (in the bucket) */
+ while (entry != NULL)
{
- ExecSetSlotDescriptor(hashslot, inputslot->tts_tupleDescriptor);
- /* Make sure all unused columns are NULLs */
- ExecStoreAllNullTuple(hashslot);
+
+ /* first check the hashes, only then check the keys (if hashes match) */
+ if ((entry->hashvalue == hashvalue) && (groups_match(aggstate, inputslot, entry)))
+ break;
+
+ /* these are not the entries you're looking for ... */
+ entry = entry->next;
}
- /* transfer just the needed columns into hashslot */
- slot_getsomeattrs(inputslot, linitial_int(aggstate->hash_needed));
- foreach(l, aggstate->hash_needed)
+ /* There's not a maching entry in the bucket, so create a new one and
+ * copy in data both for the aggregates, and the MinimalTuple containing
+ * keys for the group columns. */
+ if (entry == NULL)
{
- int varNumber = lfirst_int(l) - 1;
- hashslot->tts_values[varNumber] = inputslot->tts_values[varNumber];
- hashslot->tts_isnull[varNumber] = inputslot->tts_isnull[varNumber];
- }
+ MemoryContext old;
- /* find or create the hashtable entry using the filtered tuple */
- entry = (AggHashEntry) LookupTupleHashEntry(aggstate->hashtable,
- hashslot,
- &isnew);
+ /* only a reference to the mintuple - we'll copy it into a chunk */
+ mintuple = ExecFetchSlotMinimalTuple(inputslot);
+
+ /* FIXME probably create a separate context for the hash table, instead
+ * of using aggcontext for everything ... */
+ old = MemoryContextSwitchTo(aggstate->aggcontext);
+
+ /* we need enough space for the entry and tuple with key values */
+ entry = (AggHashEntry) chunk_alloc(aggstate->hashtable,
+ aggstate->hashtable->entrysize + mintuple->t_len);
+
+ entry->hashvalue = hashvalue;
+
+ /* add to the proper bucket */
+ entry->next = aggstate->hashtable->buckets[bucketno];
+ aggstate->hashtable->buckets[bucketno] = entry;
+
+ /* the tuple is placed right after the entry (maxaligned) */
+ entry->tuple = (MinimalTuple)((char*)entry + aggstate->hashtable->entrysize);
+
+ /*
+ * FIXME This seems to copy all the data, including columns that are not part
+ * of the key (i.e. are there only as inputs for the aggregates - that may be
+ * quite wasteful when there are many aggregates / the values are long etc.)
+ */
+ memcpy(entry->tuple, mintuple, mintuple->t_len);
+
+ MemoryContextSwitchTo(old);
- if (isnew)
- {
/* initialize aggregates for new tuple group */
initialize_aggregates(aggstate, aggstate->peragg, entry->pergroup);
+
+ aggstate->hashtable->nentries += 1;
+
}
+ /* once we exceed 1 entry / bucket, increase number of buckets */
+ if (aggstate->hashtable->nentries > aggstate->hashtable->nbuckets)
+ increase_nbuckets(aggstate);
+
return entry;
}
@@ -1363,8 +1574,10 @@ agg_fill_hash_table(AggState *aggstate)
}
aggstate->table_filled = true;
- /* Initialize to walk the hash table */
- ResetTupleHashIterator(aggstate->hashtable, &aggstate->hashiter);
+
+ /* Initialize for iteration through the table (first bucket / entry) */
+ IteratorReset(aggstate->hashtable);
+
}
/*
@@ -1381,6 +1594,7 @@ agg_retrieve_hash_table(AggState *aggstate)
AggHashEntry entry;
TupleTableSlot *firstSlot;
int aggno;
+ AggHashTable htab;
/*
* get state info from node
@@ -1391,23 +1605,11 @@ agg_retrieve_hash_table(AggState *aggstate)
aggnulls = econtext->ecxt_aggnulls;
peragg = aggstate->peragg;
firstSlot = aggstate->ss.ss_ScanTupleSlot;
+ htab = aggstate->hashtable;
- /*
- * We loop retrieving groups until we find one satisfying
- * aggstate->ss.ps.qual
- */
- while (!aggstate->agg_done)
+ /* loop over entries in buckets */
+ while ((entry = IteratorGetNext(htab)) != NULL)
{
- /*
- * Find the next entry in the hash table
- */
- entry = (AggHashEntry) ScanTupleHashTable(&aggstate->hashiter);
- if (entry == NULL)
- {
- /* No more entries in hashtable, so done */
- aggstate->agg_done = TRUE;
- return NULL;
- }
/*
* Clear the per-output-tuple context for each group
@@ -1419,19 +1621,19 @@ agg_retrieve_hash_table(AggState *aggstate)
ResetExprContext(econtext);
/*
- * Store the copied first input tuple in the tuple table slot reserved
- * for it, so that it can be used in ExecProject.
- */
- ExecStoreMinimalTuple(entry->shared.firstTuple,
- firstSlot,
- false);
+ * Store the copied first input tuple in the tuple table slot reserved
+ * for it, so that it can be used in ExecProject.
+ */
+ ExecStoreMinimalTuple(entry->tuple,
+ firstSlot,
+ false);
pergroup = entry->pergroup;
/*
- * Finalize each aggregate calculation, and stash results in the
- * per-output-tuple context.
- */
+ * Finalize each aggregate calculation, and stash results in the
+ * per-output-tuple context.
+ */
for (aggno = 0; aggno < aggstate->numaggs; aggno++)
{
AggStatePerAgg peraggstate = &peragg[aggno];
@@ -1439,25 +1641,25 @@ agg_retrieve_hash_table(AggState *aggstate)
Assert(peraggstate->numSortCols == 0);
finalize_aggregate(aggstate, peraggstate, pergroupstate,
- &aggvalues[aggno], &aggnulls[aggno]);
+ &aggvalues[aggno], &aggnulls[aggno]);
}
/*
- * Use the representative input tuple for any references to
- * non-aggregated input columns in the qual and tlist.
- */
+ * Use the representative input tuple for any references to
+ * non-aggregated input columns in the qual and tlist.
+ */
econtext->ecxt_outertuple = firstSlot;
/*
- * Check the qual (HAVING clause); if the group does not match, ignore
- * it and loop back to try to process another group.
- */
+ * Check the qual (HAVING clause); if the group does not match, ignore
+ * it and loop back to try to process another group.
+ */
if (ExecQual(aggstate->ss.ps.qual, econtext, false))
{
/*
- * Form and return a projection tuple using the aggregate results
- * and the representative input tuple.
- */
+ * Form and return a projection tuple using the aggregate results
+ * and the representative input tuple.
+ */
TupleTableSlot *result;
ExprDoneCond isDone;
@@ -1472,8 +1674,11 @@ agg_retrieve_hash_table(AggState *aggstate)
}
else
InstrCountFiltered1(aggstate, 1);
+
}
+ aggstate->agg_done = true;
+
/* No more groups */
return NULL;
}
@@ -1515,7 +1720,6 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->agg_done = false;
aggstate->pergroup = NULL;
aggstate->grp_firstTuple = NULL;
- aggstate->hashtable = NULL;
/*
* Create expression contexts. We need two, one for per-input-tuple
@@ -1546,7 +1750,9 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
*/
ExecInitScanTupleSlot(estate, &aggstate->ss);
ExecInitResultTupleSlot(estate, &aggstate->ss.ps);
- aggstate->hashslot = ExecInitExtraTupleSlot(estate);
+
+ /* FIXME maybe we could reuse this in groups_match for better efficiency (?) */
+ // aggstate->hashslot = ExecInitExtraTupleSlot(estate);
/*
* initialize child expressions
@@ -1636,7 +1842,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
if (node->aggstrategy == AGG_HASHED)
{
- build_hash_table(aggstate);
+ build_hash_table(aggstate, outerPlan->plan_width);
aggstate->table_filled = false;
/* Compute the columns we actually need to hash on */
aggstate->hash_needed = find_hash_columns(aggstate);
@@ -2073,7 +2279,7 @@ ExecReScanAgg(AggState *node)
*/
if (node->ss.ps.lefttree->chgParam == NULL)
{
- ResetTupleHashIterator(node->hashtable, &node->hashiter);
+ IteratorReset(node->hashtable);
return;
}
}
@@ -2112,8 +2318,9 @@ ExecReScanAgg(AggState *node)
if (((Agg *) node->ss.ps.plan)->aggstrategy == AGG_HASHED)
{
+ Plan * outerPlan = outerPlan((Agg *) node->ss.ps.plan);
/* Rebuild an empty hash table */
- build_hash_table(node);
+ build_hash_table(node, outerPlan->plan_width);
node->table_filled = false;
}
else
@@ -2269,3 +2476,385 @@ aggregate_dummy(PG_FUNCTION_ARGS)
fcinfo->flinfo->fn_oid);
return (Datum) 0; /* keep compiler quiet */
}
+
+/*
+ * Computes a hash value from the group keys - this is pretty much the
+ * same as TupleHashTableHash, except that it's simplified a bit, and
+ * does not pass the tuples through an input etc.
+ */
+static uint32
+compute_hash_value(AggState * aggstate, TupleTableSlot * slot)
+{
+
+ uint32 hashkey = 0;
+ FmgrInfo *hashfunctions = aggstate->hashfunctions;
+ int i = 0;
+
+ MemoryContext oldContext;
+
+ /* FIXME is it really OK to reset the per-tuple context here? */
+
+ /* Reset and switch into the temp context. */
+ MemoryContextReset(aggstate->hashtable->tmpctx);
+ oldContext = MemoryContextSwitchTo(aggstate->hashtable->tmpctx);
+
+ /* compute hash only from the needed column */
+ for (i = 0; i < aggstate->hashtable->numCols; i++)
+ {
+
+ AttrNumber att = aggstate->hashtable->keyColIdx[i];
+ Datum attr;
+ bool isNull;
+
+ /* rotate hashkey left 1 bit at each step */
+ hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
+
+ attr = slot_getattr(slot, att, &isNull);
+
+ if (!isNull) /* treat nulls as having hash key 0 */
+ {
+ uint32 hkey;
+
+ hkey = DatumGetUInt32(FunctionCall1(&hashfunctions[i],
+ attr));
+ hashkey ^= hkey;
+ }
+ }
+
+ MemoryContextSwitchTo(oldContext);
+
+ return hashkey;
+
+}
+
+/*
+ * Computes index of the bucket the group entry belongs to (same principles as
+ * in ExecHashGetBucketAndBatch in nodeHash.c)
+ */
+static uint32
+compute_bucket(AggState * aggstate, uint32 hashvalue)
+{
+ return hashvalue & (aggstate->hashtable->nbuckets - 1);
+}
+
+/*
+ * Compares that the group keys of the two groups actually match, using the
+ * equality functions. This is much more expensive than comparing uint32
+ * values (hashes), so always check hashes first.
+ */
+static bool
+groups_match(AggState * aggstate, TupleTableSlot *slot, AggHashEntry entry)
+{
+ bool result;
+ FmgrInfo *eqfunctions = aggstate->eqfunctions;
+ TupleDesc tupdesc;
+ int i = 0;
+
+ MemoryContext oldContext;
+
+ /*
+ * XXX Do we really need to do this slot gymnastics? can't we get the
+ * info from the minimal tuple directly? It init happens only once,
+ * so the overhead is not that bad, but it's annoying. And we still
+ * have to call ExecStoreMinimalTuple every time.
+ */
+ if (aggstate->hashtable->slot == NULL)
+ {
+ oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
+ tupdesc = CreateTupleDescCopy(slot->tts_tupleDescriptor);
+ aggstate->hashtable->slot = MakeSingleTupleTableSlot(tupdesc);
+ MemoryContextSwitchTo(oldContext);
+ }
+
+ /* FIXME is it really OK to reset the per-tuple memory context here? */
+
+ /* Reset and switch into the temp context. */
+ MemoryContextReset(aggstate->hashtable->tmpctx);
+ oldContext = MemoryContextSwitchTo(aggstate->hashtable->tmpctx);
+
+ ExecStoreMinimalTuple(entry->tuple, aggstate->hashtable->slot, false);
+
+ /*
+ * We cannot report a match without checking all the fields, but we can
+ * report a non-match as soon as we find unequal fields. So, start
+ * comparing at the last field (least significant sort key). That's the
+ * most likely to be different if we are dealing with sorted input.
+ */
+ result = true;
+
+ for (i = aggstate->hashtable->numCols; --i >= 0;)
+ {
+
+ AttrNumber att = aggstate->hashtable->keyColIdx[i];
+ Datum attr1,
+ attr2;
+ bool isNull1,
+ isNull2;
+
+ attr1 = slot_getattr(slot, att, &isNull1);
+ attr2 = slot_getattr(aggstate->hashtable->slot, att, &isNull2);
+
+ if (isNull1 != isNull2)
+ {
+ result = false; /* one null and one not; they aren't equal */
+ break;
+ }
+
+ if (isNull1)
+ continue; /* both are null, treat as equal */
+
+ /* Apply the type-specific equality function */
+
+ if (!DatumGetBool(FunctionCall2(&eqfunctions[i],
+ attr1, attr2)))
+ {
+ result = false; /* they aren't equal */
+ break;
+ }
+ }
+
+ MemoryContextSwitchTo(oldContext);
+
+ return result;
+}
+
+/*
+ * Resize the hash table for good performance. We're shooting for (nentries <= nbuckets)
+ * which should give us 1 group per bucket on average. We're working with groups and not
+ * tuples. And multiple tuples with the same hash are most likely in the same group, thus
+ * merged into a single entry. So we should not see many buckets with a long list of
+ * entries (which can happen in hashjoin quite easily).
+ */
+static void
+increase_nbuckets(AggState * aggstate)
+{
+
+ HashChunk chunk;
+ AggHashTable htab = aggstate->hashtable;
+
+ /* we've reached maximum number of buckets */
+ if (htab->nbuckets >= htab->nbuckets_max)
+ return;
+
+ htab->nbuckets *= 2;
+ htab->buckets
+ = (AggHashEntry*)repalloc(htab->buckets,
+ htab->nbuckets * sizeof(AggHashEntry));
+ memset(htab->buckets, 0, htab->nbuckets * sizeof(AggHashEntry));
+
+ chunk = htab->chunks_hash;
+ while (chunk != NULL)
+ {
+
+ /* position within the buffer (up to chunk->used) */
+ size_t idx = 0;
+
+ /* we have a whole number of entries */
+ Assert(chunk->used % htab->entrysize == 0);
+
+ /* process all tuples stored in this chunk (and then free it) */
+ while (idx < chunk->used)
+ {
+
+ AggHashEntry entry = (AggHashEntry)(chunk->data + idx);
+
+ int bucketno = compute_bucket(aggstate, entry->hashvalue);
+
+ entry->next = htab->buckets[bucketno];
+ htab->buckets[bucketno] = entry;
+
+ /* bytes occupied in memory HJ tuple overhead + actual tuple length */
+ idx += htab->entrysize + entry->tuple->t_len;
+
+ }
+
+ /* proceed to the next chunk */
+ chunk = chunk->next;
+
+ }
+
+}
+
+static
+char * chunk_alloc(AggHashTable htab, int size)
+{
+ /* XXX maybe we should use MAXALIGN(size) here ... */
+
+ /* we need >8kB to get immediate free in aset.c */
+ Assert(HASH_CHUNK_SIZE > 8192);
+
+ /*
+ * If the requested size is over 1/8 of chunk size, allocate a
+ * separate chunk. of this size.
+ *
+ * XXXX This may be problematic, because chunks like this may get
+ * below 8kB, and thus be considered 'regular' blocks by aset.c
+ * (and put on freelist, instead of freeing immediately).
+ */
+ if (size > (HASH_CHUNK_SIZE/8))
+ {
+
+ /*
+ * Allocate new chunk and put it at the beginning of the list.
+ *
+ * There's no point in making this 2^N size, because blocks over
+ * 8kB are handled as a special case in aset.c (exact size).
+ */
+ HashChunk newChunk
+ = (HashChunk)MemoryContextAllocZero(htab->htabctx,
+ offsetof(HashChunkData, data) + size);
+
+ newChunk->maxlen = size;
+ newChunk->used = 0;
+ newChunk->ntuples = 0;
+
+ /*
+ * If there already is a chunk, add the new one after it, so we
+ * can still use the space in the existing one.
+ */
+ if (htab->chunks_hash != NULL)
+ {
+ newChunk->next = htab->chunks_hash->next;
+ htab->chunks_hash->next = newChunk;
+ }
+ else
+ {
+ newChunk->next = htab->chunks_hash;
+ htab->chunks_hash = newChunk;
+ }
+
+ newChunk->used += size;
+ newChunk->ntuples += 1;
+
+ return newChunk->data;
+
+ }
+
+ /*
+ * Requested size is less than 1/8 of a chunk, so place it in the
+ * current chunk if there is enough free space. If not, allocate
+ * a new chunk and add it there.
+ */
+ if ((htab->chunks_hash == NULL) ||
+ (htab->chunks_hash->maxlen - htab->chunks_hash->used) < size)
+ {
+ /* allocate new chunk and put it at the beginning of the list */
+ HashChunk newChunk
+ = (HashChunk)MemoryContextAllocZero(htab->htabctx,
+ offsetof(HashChunkData, data) + HASH_CHUNK_SIZE);
+
+ newChunk->maxlen = HASH_CHUNK_SIZE;
+ newChunk->used = 0;
+ newChunk->ntuples = 0;
+
+ newChunk->next = htab->chunks_hash;
+ htab->chunks_hash = newChunk;
+ }
+
+ /* OK, we have enough space in the chunk, let's add the tuple */
+ htab->chunks_hash->used += size;
+ htab->chunks_hash->ntuples += 1;
+
+ /* return a pointer to the start of the newly reserved space */
+ return htab->chunks_hash->data + (htab->chunks_hash->used - size);
+
+}
+
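To make the dense-allocation policy above easier to follow outside the patch, here is a minimal standalone sketch of the same idea (made-up names, plain calloc instead of a memory context, no error handling); it mirrors the two cases handled by chunk_alloc: oversized requests get a dedicated chunk, everything else is bump-allocated from the head chunk:

#include <stdlib.h>

#define CHUNK_SIZE (32 * 1024)          /* stands in for HASH_CHUNK_SIZE */

typedef struct Chunk
{
    struct Chunk *next;
    size_t      maxlen;                 /* usable bytes in data[] */
    size_t      used;                   /* bytes already handed out */
    char        data[];                 /* entries are packed densely here */
} Chunk;

char *
chunk_alloc_sketch(Chunk **head, size_t size)
{
    Chunk      *c;

    if (size > CHUNK_SIZE / 8)
    {
        /* oversized: dedicated chunk, linked behind the current head so
         * the remaining space in the head chunk stays usable */
        c = calloc(1, sizeof(Chunk) + size);
        c->maxlen = c->used = size;
        if (*head)
        {
            c->next = (*head)->next;
            (*head)->next = c;
        }
        else
            *head = c;
        return c->data;
    }

    /* small request: start a new head chunk if the current one is full */
    if (*head == NULL || (*head)->maxlen - (*head)->used < size)
    {
        c = calloc(1, sizeof(Chunk) + CHUNK_SIZE);
        c->maxlen = CHUNK_SIZE;
        c->next = *head;
        *head = c;
    }

    (*head)->used += size;
    return (*head)->data + ((*head)->used - size);
}

The real code differs mainly in that the chunks live in a dedicated memory context (htabctx), so reset_hash_table() can throw them all away at once instead of freeing them one by one.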
+/*
+ * Resets the hash table iterator, so that it points to the first entry
+ * in the first chunk (the chunk created last, thus placed first in the
+ * list of chunks).
+ */
+static void
+IteratorReset(AggHashTable htab)
+{
+
+ htab->cur_chunk = htab->chunks_hash;
+
+ /* there may be no chunks at all (empty hash table) */
+ if (htab->cur_chunk != NULL)
+ htab->cur_entry = (AggHashEntry)htab->cur_chunk->data;
+ else
+ htab->cur_entry = NULL;
+
+}
+
+/*
+ * Returns the next hash table entry. Works by scanning the chunks, not
+ * by scanning the buckets etc. Returns NULL when there are no more
+ * entries.
+ */
+static AggHashEntry
+IteratorGetNext(AggHashTable htab)
+{
+
+ AggHashEntry entry = NULL;
+ Size len;
+
+ /* we've completed the last chunk (in the previous call) */
+ if (htab->cur_chunk == NULL)
+ return NULL;
+
+ /* we're not beyond the chunk data */
+ Assert((char*)htab->cur_entry < (htab->cur_chunk->data + htab->cur_chunk->used));
+
+ /*
+ * We're still in the current chunk (otherwise the current chunk
+ * would be set to NULL), so cur_entry points to a valid entry.
+ * So compute how many bytes we need to skip to the next entry.
+ */
+ entry = htab->cur_entry;
+ len = entry->tuple->t_len + htab->entrysize;
+
+ /*
+ * Proceed to the next entry and check if we've reached end of this
+ * chunk. If yes, skip to the next one and set the current entry
+ * accordingly (chunk=NULL means there's no valid entry).
+ */
+ htab->cur_entry = (AggHashEntry)((char*)entry + len);
+
+ if ((char*)htab->cur_entry >= (htab->cur_chunk->data + htab->cur_chunk->used))
+ {
+ htab->cur_chunk = htab->cur_chunk->next;
+ if (htab->cur_chunk != NULL)
+ htab->cur_entry = (AggHashEntry)htab->cur_chunk->data;
+ else
+ htab->cur_entry = NULL;
+ }
+
+ return entry;
+
+}
+
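IteratorReset/IteratorGetNext walk those same chunks instead of the buckets. Below is a matching standalone sketch (reusing the Chunk struct from the sketch above, and assuming each packed entry starts with its own total length, which is what entrysize + tuple->t_len provides in the patch):

typedef struct PackedEntry
{
    size_t      len;                    /* total size of this entry */
    /* payload follows immediately */
} PackedEntry;

typedef struct IterState
{
    Chunk      *chunk;                  /* current chunk, NULL when done */
    size_t      offset;                 /* read position in chunk->data */
} IterState;

void
iter_reset(IterState *it, Chunk *head)
{
    it->chunk = head;
    it->offset = 0;
}

PackedEntry *
iter_next(IterState *it)
{
    PackedEntry *e;

    if (it->chunk == NULL)
        return NULL;

    e = (PackedEntry *) (it->chunk->data + it->offset);
    it->offset += e->len;

    /* done with this chunk? step to the next one */
    if (it->offset >= it->chunk->used)
    {
        it->chunk = it->chunk->next;
        it->offset = 0;
    }

    return e;
}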
+/*
+ * Resets the contents of the hash table - removes all the entries and
+ * tuples, but keeps the 'size' of the hash table (nbuckets).
+ */
+static void
+reset_hash_table(AggHashTable htab)
+{
+ MemoryContext htabctx = htab->htabctx;
+ MemoryContext parent = htab->htabctx->parent;
+
+ htab->nentries = 0;
+ htab->chunks_hash = NULL;
+
+ /*
+ * XXX If we could reset the context instead of recreating it
+ * from scratch, that'd be nice. However currently the reset
+ * often does not free a lot of memory because it keeps the
+ * blocks for future allocations.
+ */
+ htab->htabctx = AllocSetContextCreateTracked(parent,
+ "HashAggHashTable",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE,
+ true);
+
+ MemoryContextDelete(htabctx);
+
+ htab->buckets = (AggHashEntry*)MemoryContextAllocZero(htab->htabctx,
+ htab->nbuckets * sizeof(AggHashEntry));
+
+}
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b271f21..995389b 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1701,6 +1701,8 @@ typedef struct GroupState
/* these structs are private in nodeAgg.c: */
typedef struct AggStatePerAggData *AggStatePerAgg;
typedef struct AggStatePerGroupData *AggStatePerGroup;
+typedef struct AggHashEntryData *AggHashEntry;
+typedef struct AggHashTableData *AggHashTable;
typedef struct AggState
{
@@ -1714,15 +1716,16 @@ typedef struct AggState
ExprContext *tmpcontext; /* econtext for input expressions */
AggStatePerAgg curperagg; /* identifies currently active aggregate */
bool agg_done; /* indicates completion of Agg scan */
+
/* these fields are used in AGG_PLAIN and AGG_SORTED modes: */
AggStatePerGroup pergroup; /* per-Aggref-per-group working state */
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
+
/* these fields are used in AGG_HASHED mode: */
- TupleHashTable hashtable; /* hash table with one entry per group */
- TupleTableSlot *hashslot; /* slot for loading hash table */
List *hash_needed; /* list of columns needed in hash table */
bool table_filled; /* hash table filled yet? */
- TupleHashIterator hashiter; /* for iterating through hash table */
+ AggHashTable hashtable; /* instance of the simple hash table */
+
} AggState;
/* ----------------
Attachment: memory-accounting-v1.patch (text/x-diff)
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 743455e..d556f0b 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -242,6 +242,8 @@ typedef struct AllocChunkData
#define AllocChunkGetPointer(chk) \
((AllocPointer)(((char *)(chk)) + ALLOC_CHUNKHDRSZ))
+static void update_allocation(MemoryContext context, Size size);
+
/*
* These functions implement the MemoryContext API for AllocSet contexts.
*/
@@ -430,6 +432,9 @@ randomize_mem(char *ptr, size_t size)
* minContextSize: minimum context size
* initBlockSize: initial allocation block size
* maxBlockSize: maximum allocation block size
+ *
+ * The flag determining whether this context tracks memory usage is inherited
+ * from the parent context.
*/
MemoryContext
AllocSetContextCreate(MemoryContext parent,
@@ -438,6 +443,26 @@ AllocSetContextCreate(MemoryContext parent,
Size initBlockSize,
Size maxBlockSize)
{
+ return AllocSetContextCreateTracked(
+ parent, name, minContextSize, initBlockSize, maxBlockSize,
+ false);
+}
+
+/*
+ * AllocSetContextCreateTracked
+ * Create a new AllocSet context.
+ *
+ * Implementation for AllocSetContextCreate, but also allows the caller to
+ * specify whether memory usage should be tracked or not.
+ */
+MemoryContext
+AllocSetContextCreateTracked(MemoryContext parent,
+ const char *name,
+ Size minContextSize,
+ Size initBlockSize,
+ Size maxBlockSize,
+ bool track_mem)
+{
AllocSet context;
/* Do the type-independent part of context creation */
@@ -445,7 +470,8 @@ AllocSetContextCreate(MemoryContext parent,
sizeof(AllocSetContext),
&AllocSetMethods,
parent,
- name);
+ name,
+ track_mem);
/*
* Make sure alloc parameters are reasonable, and save them.
@@ -500,6 +526,9 @@ AllocSetContextCreate(MemoryContext parent,
errdetail("Failed while creating memory context \"%s\".",
name)));
}
+
+ update_allocation((MemoryContext) context, blksize);
+
block->aset = context;
block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
block->endptr = ((char *) block) + blksize;
@@ -590,6 +619,7 @@ AllocSetReset(MemoryContext context)
else
{
/* Normal case, release the block */
+ update_allocation(context, -(block->endptr - ((char*) block)));
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
@@ -616,6 +646,8 @@ AllocSetDelete(MemoryContext context)
AllocSet set = (AllocSet) context;
AllocBlock block = set->blocks;
+ MemoryAccounting accounting;
+
AssertArg(AllocSetIsValid(set));
#ifdef MEMORY_CONTEXT_CHECKING
@@ -623,6 +655,17 @@ AllocSetDelete(MemoryContext context)
AllocSetCheck(context);
#endif
+ if (context->accounting != NULL) {
+
+ accounting = context->accounting->parent;
+
+ while (accounting != NULL)
+ {
+ accounting->total_allocated -= context->accounting->total_allocated;
+ accounting = accounting->parent;
+ }
+ }
+
/* Make it look empty, just in case... */
MemSetAligned(set->freelist, 0, sizeof(set->freelist));
set->blocks = NULL;
@@ -678,6 +721,9 @@ AllocSetAlloc(MemoryContext context, Size size)
errmsg("out of memory"),
errdetail("Failed on request of size %zu.", size)));
}
+
+ update_allocation(context, blksize);
+
block->aset = set;
block->freeptr = block->endptr = ((char *) block) + blksize;
@@ -873,6 +919,8 @@ AllocSetAlloc(MemoryContext context, Size size)
errdetail("Failed on request of size %zu.", size)));
}
+ update_allocation(context, blksize);
+
block->aset = set;
block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
block->endptr = ((char *) block) + blksize;
@@ -976,6 +1024,7 @@ AllocSetFree(MemoryContext context, void *pointer)
set->blocks = block->next;
else
prevblock->next = block->next;
+ update_allocation(context, -(block->endptr - ((char*) block)));
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
@@ -1088,6 +1137,7 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
AllocBlock prevblock = NULL;
Size chksize;
Size blksize;
+ Size oldblksize;
while (block != NULL)
{
@@ -1105,6 +1155,8 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
/* Do the realloc */
chksize = MAXALIGN(size);
blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+ oldblksize = block->endptr - ((char *)block);
+
block = (AllocBlock) realloc(block, blksize);
if (block == NULL)
{
@@ -1114,6 +1166,7 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
errmsg("out of memory"),
errdetail("Failed on request of size %zu.", size)));
}
+ update_allocation(context, blksize - oldblksize);
block->freeptr = block->endptr = ((char *) block) + blksize;
/* Update pointers since block has likely been moved */
@@ -1277,6 +1330,45 @@ AllocSetStats(MemoryContext context, int level)
}
+/*
+ * update_allocation
+ *
+ * Track newly-allocated or newly-freed memory (pass a negative size
+ * when memory is freed).
+ */
+static void
+update_allocation(MemoryContext context, Size size)
+{
+
+ MemoryAccounting accounting = context->accounting;
+
+ if (accounting == NULL)
+ return;
+
+ /*
+ * Update self_allocated only if this accounting info is specific
+ * for this context (i.e. if tracking was requested for the context).
+ */
+ if (context->track_mem)
+ accounting->self_allocated += size;
+
+ /*
+ * Update total_allocated for all contexts up the accounting tree
+ * (including this one).
+ */
+ while (accounting != NULL) {
+
+ accounting->total_allocated += size;
+
+ Assert(accounting->self_allocated >= 0);
+ Assert(accounting->total_allocated >= accounting->self_allocated);
+
+ accounting = accounting->parent;
+
+ }
+
+}
+
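As a concrete, made-up trace of how self_allocated and total_allocated diverge, consider a tracked context A with a tracked child B:

/*
 * A created with tracking:            A.self =  0 kB   A.total =  0 kB
 * A's allocator grabs a 32 kB block:  A.self = 32 kB   A.total = 32 kB
 * B created under A with tracking,
 * and grabs an 8 kB block:            B.self =  8 kB   B.total =  8 kB
 *                                     A.self = 32 kB   A.total = 40 kB
 *   (update_allocation updates B's accounting, then walks up to A)
 * B deleted: AllocSetDelete subtracts B.total from every ancestor, so
 *                                     A.self = 32 kB   A.total = 32 kB
 */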
#ifdef MEMORY_CONTEXT_CHECKING
/*
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index 4185a03..a70b296 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -202,7 +202,12 @@ MemoryContextDelete(MemoryContext context)
*/
MemoryContextSetParent(context, NULL);
+ /* pass the parent in case it's needed, however */
(*context->methods->delete_context) (context);
+
+ if (context->track_mem)
+ pfree(context->accounting);
+
VALGRIND_DESTROY_MEMPOOL(context);
pfree(context);
}
@@ -324,6 +329,26 @@ MemoryContextAllowInCriticalSection(MemoryContext context, bool allow)
}
/*
+ * MemoryContextGetAllocated
+ *
+ * Return memory allocated by the system to this context. If total is true,
+ * include child contexts. Context must have track_mem set.
+ */
+Size
+MemoryContextGetAllocated(MemoryContext context, bool total)
+{
+ Assert(context->track_mem);
+
+ if (! context->track_mem)
+ return 0;
+
+ if (total)
+ return context->accounting->total_allocated;
+ else
+ return context->accounting->self_allocated;
+}
+
+/*
* GetMemoryChunkSpace
* Given a currently-allocated chunk, determine the total space
* it occupies (including all memory-allocation overhead).
@@ -546,7 +571,8 @@ MemoryContext
MemoryContextCreate(NodeTag tag, Size size,
MemoryContextMethods *methods,
MemoryContext parent,
- const char *name)
+ const char *name,
+ bool track_mem)
{
MemoryContext node;
Size needed = size + strlen(name) + 1;
@@ -576,6 +602,8 @@ MemoryContextCreate(NodeTag tag, Size size,
node->firstchild = NULL;
node->nextchild = NULL;
node->isReset = true;
+ node->track_mem = track_mem;
+ node->accounting = NULL;
node->name = ((char *) node) + size;
strcpy(node->name, name);
@@ -596,6 +624,24 @@ MemoryContextCreate(NodeTag tag, Size size,
#endif
}
+ /*
+ * If accounting was requested for this context, create the struct
+ * and link it to the accounting from parent context. Otherwise just
+ * copy the accounting reference from parent.
+ *
+ * In both cases, the parent's accounting may be NULL, which means
+ * we don't need to update accounting in the upper memory contexts.
+ */
+ if (track_mem)
+ {
+ node->accounting = (MemoryAccounting)MemoryContextAlloc(TopMemoryContext,
+ sizeof(MemoryAccountingData));
+ if (parent)
+ node->accounting->parent = parent->accounting;
+ } else if (parent) {
+ node->accounting = parent->accounting;
+ }
+
VALGRIND_CREATE_MEMPOOL(node, 0, false);
/* Return to type-specific creation routine to finish up */
diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
index ad77509..a6d9f8c 100644
--- a/src/include/nodes/memnodes.h
+++ b/src/include/nodes/memnodes.h
@@ -50,6 +50,18 @@ typedef struct MemoryContextMethods
#endif
} MemoryContextMethods;
+typedef struct MemoryAccountingData {
+
+ Size total_allocated; /* including child contexts */
+ Size self_allocated; /* not including child contexts */
+
+ /* parent accounting (not parent context) */
+ struct MemoryAccountingData * parent;
+
+} MemoryAccountingData;
+
+typedef MemoryAccountingData * MemoryAccounting;
+
typedef struct MemoryContextData
{
@@ -60,6 +72,8 @@ typedef struct MemoryContextData
MemoryContext nextchild; /* next child of same parent */
char *name; /* context name (just for debugging) */
bool isReset; /* T = no space alloced since last reset */
+ bool track_mem; /* whether to track memory usage */
+ MemoryAccounting accounting;
#ifdef USE_ASSERT_CHECKING
bool allowInCritSection; /* allow palloc in critical section */
#endif
diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
index 2fede86..b82f923 100644
--- a/src/include/utils/memutils.h
+++ b/src/include/utils/memutils.h
@@ -96,6 +96,7 @@ extern void MemoryContextDeleteChildren(MemoryContext context);
extern void MemoryContextResetAndDeleteChildren(MemoryContext context);
extern void MemoryContextSetParent(MemoryContext context,
MemoryContext new_parent);
+extern Size MemoryContextGetAllocated(MemoryContext context, bool total);
extern Size GetMemoryChunkSpace(void *pointer);
extern MemoryContext GetMemoryChunkContext(void *pointer);
extern MemoryContext MemoryContextGetParent(MemoryContext context);
@@ -117,7 +118,8 @@ extern bool MemoryContextContains(MemoryContext context, void *pointer);
extern MemoryContext MemoryContextCreate(NodeTag tag, Size size,
MemoryContextMethods *methods,
MemoryContext parent,
- const char *name);
+ const char *name,
+ bool track_mem);
/*
@@ -130,6 +132,12 @@ extern MemoryContext AllocSetContextCreate(MemoryContext parent,
Size minContextSize,
Size initBlockSize,
Size maxBlockSize);
+extern MemoryContext AllocSetContextCreateTracked(MemoryContext parent,
+ const char *name,
+ Size minContextSize,
+ Size initBlockSize,
+ Size maxBlockSize,
+ bool track_mem);
/*
* Recommended default alloc parameters, suitable for "ordinary" contexts
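For context, this is roughly how a caller is expected to use the two new entry points (a sketch only, with a made-up function and context name; the hashagg patches below do essentially this for their hash-table context):

#include "postgres.h"
#include "miscadmin.h"          /* work_mem */
#include "utils/memutils.h"

static void
tracked_context_usage_sketch(MemoryContext parent)
{
    MemoryContext ctx;

    /* same as AllocSetContextCreate, but with accounting enabled */
    ctx = AllocSetContextCreateTracked(parent,
                                       "TrackedWorkspace",
                                       ALLOCSET_DEFAULT_MINSIZE,
                                       ALLOCSET_DEFAULT_INITSIZE,
                                       ALLOCSET_DEFAULT_MAXSIZE,
                                       true);

    /* ... palloc/pfree within ctx as usual ... */

    /* bytes obtained from malloc by ctx and its children */
    if (MemoryContextGetAllocated(ctx, true) > work_mem * 1024L)
    {
        /* over budget: stop creating new groups, start spilling, etc. */
    }

    MemoryContextDelete(ctx);
}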
Attachment: hashagg-batching-jeff-v1.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 49547ee..b651858 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2884,6 +2884,21 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-hashagg-disk" xreflabel="enable_hashagg_disk">
+ <term><varname>enable_hashagg_disk</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_hashagg_disk</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the planner expects the hash table size to exceed
+ <varname>work_mem</varname>. The default is <literal>on</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
<term><varname>enable_hashjoin</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 6455864..3ae9583 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -108,6 +108,7 @@
#include "optimizer/tlist.h"
#include "parser/parse_agg.h"
#include "parser/parse_coerce.h"
+#include "storage/buffile.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"
@@ -115,7 +116,11 @@
#include "utils/syscache.h"
#include "utils/tuplesort.h"
#include "utils/datum.h"
+#include "utils/dynahash.h"
+#define HASH_DISK_MIN_PARTITIONS 1
+#define HASH_DISK_DEFAULT_PARTITIONS 4
+#define HASH_DISK_MAX_PARTITIONS 256
/*
* AggStatePerAggData - per-aggregate working state for the Agg scan
@@ -310,6 +315,24 @@ typedef struct AggHashEntryData
} AggHashEntryData; /* VARIABLE LENGTH STRUCT */
+/*
+ * Used as a unit of work when batching. After reaching work_mem, no new
+ * groups are added to the hash table; tuples belonging to unseen groups
+ * are instead divided into multiple output partitions (using a range of
+ * bits from the hash value).
+ *
+ * At the end, each output partition (represented by a temporary file)
+ * is converted into a new HashWork item and the process is repeated.
+ */
+typedef struct HashWork
+{
+ BufFile *input_file; /* input partition, NULL for outer plan */
+ int input_bits; /* number of bits for input partition mask */
+
+ int n_output_partitions; /* number of output partitions */
+ BufFile **output_partitions; /* output partition files */
+ int *output_ntuples; /* number of tuples in each partition */
+ int output_bits; /* log2(n_output_partitions) + input_bits */
+} HashWork;
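To make input_bits/output_bits concrete, here is a small worked example with a made-up hash value; it uses the same expression that save_tuple() applies further down, (hashvalue << input_bits) >> (32 - output_bits):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    uint32_t    hashvalue = 0xDEADBEEF;
    int         input_bits;
    int         output_bits = 2;        /* 4 output partitions */
    uint32_t    p0,
                p1;

    /* level 0: reading the outer plan, no bits consumed yet */
    input_bits = 0;
    p0 = (hashvalue << input_bits) >> (32 - output_bits);

    /* level 1: re-reading partition p0; the 2 bits already used are
     * skipped and the next 2 bits select the partition */
    input_bits = 2;
    p1 = (hashvalue << input_bits) >> (32 - output_bits);

    /* prints: level 0 partition = 3, level 1 partition = 1 */
    printf("level 0 partition = %u, level 1 partition = %u\n", p0, p1);
    return 0;
}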
static void initialize_aggregates(AggState *aggstate,
AggStatePerAgg peragg,
@@ -331,10 +354,11 @@ static void finalize_aggregate(AggState *aggstate,
static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
static void build_hash_table(AggState *aggstate, Size tuple_width);
-static AggHashEntry lookup_hash_entry(AggState *aggstate,
- TupleTableSlot *inputslot);
+static AggHashEntry lookup_hash_entry(AggState *aggstate, HashWork * work,
+ uint32 hashvalue, TupleTableSlot *inputslot);
+static HashWork *hash_work(BufFile *input_file, int input_bits);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
-static void agg_fill_hash_table(AggState *aggstate);
+static bool agg_fill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
@@ -348,6 +372,12 @@ static void reset_hash_table(AggHashTable htab);
static void IteratorReset(AggHashTable htab);
static AggHashEntry IteratorGetNext(AggHashTable htab);
+static TupleTableSlot *
+read_saved_tuple(BufFile *file, uint32 *hashvalue, TupleTableSlot *tupleSlot);
+static void
+save_tuple(AggState *aggstate, HashWork *work, TupleTableSlot *slot,
+ uint32 hashvalue);
+
/*
* The size of the chunks for dense allocation. This needs to be >8kB
* because the default (and only) memory context implementation uses
@@ -412,6 +442,7 @@ typedef struct AggHashTableData
*/
HashChunk cur_chunk;
AggHashEntry cur_entry;
+ int niterated;
/* list of chunks with dense-packed entries / minimal tuples */
HashChunk chunks_hash;
@@ -1096,12 +1127,15 @@ build_hash_table(AggState *aggstate, Size tuple_width)
htab = (AggHashTable)MemoryContextAllocZero(aggstate->aggcontext,
sizeof(AggHashTableData));
+ htab->niterated = 0;
+
/* TODO create a memory context for the hash table */
- htab->htabctx = AllocSetContextCreate(aggstate->aggcontext,
+ htab->htabctx = AllocSetContextCreateTracked(aggstate->aggcontext,
"HashAggHashTable",
ALLOCSET_DEFAULT_MINSIZE,
ALLOCSET_DEFAULT_INITSIZE,
- ALLOCSET_DEFAULT_MAXSIZE);
+ ALLOCSET_DEFAULT_MAXSIZE,
+ true);
/* buckets are just pointers to AggHashEntryData structures */
htab->buckets = (AggHashEntry*)MemoryContextAllocZero(htab->htabctx,
@@ -1198,15 +1232,14 @@ hash_agg_entry_size(int numAggs)
* When called, CurrentMemoryContext should be the per-query context.
*/
static AggHashEntry
-lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
+lookup_hash_entry(AggState *aggstate, HashWork * work, uint32 hashvalue,
+TupleTableSlot *inputslot)
{
AggHashEntry entry = NULL;
- uint32 hashvalue;
uint32 bucketno;
MinimalTuple mintuple;
- hashvalue = compute_hash_value(aggstate, inputslot);
bucketno = compute_bucket(aggstate, hashvalue);
entry = aggstate->hashtable->buckets[bucketno];
@@ -1223,10 +1256,13 @@ lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
entry = entry->next;
}
- /* There's not a maching entry in the bucket, so create a new one and
- * copy in data both for the aggregates, and the MinimalTuple containing
- * keys for the group columns. */
- if (entry == NULL)
+ /*
+ * There's no matching entry in the bucket (and we haven't reached the
+ * work_mem limit), so create a new one and copy in data both for the
+ * aggregates and the MinimalTuple containing keys for the group columns.
+ */
+ if ((entry == NULL) &&
+ (MemoryContextGetAllocated(aggstate->hashtable->htabctx, true) < work_mem * 1024L))
{
MemoryContext old;
@@ -1318,9 +1354,16 @@ ExecAgg(AggState *node)
/* Dispatch based on strategy */
if (((Agg *) node->ss.ps.plan)->aggstrategy == AGG_HASHED)
{
- if (!node->table_filled)
- agg_fill_hash_table(node);
- return agg_retrieve_hash_table(node);
+ TupleTableSlot *slot = NULL;
+
+ while (slot == NULL)
+ {
+ if (!node->table_filled)
+ if (! agg_fill_hash_table(node))
+ break; /* no more HashWork items to process */
+ slot = agg_retrieve_hash_table(node);
+ }
+ return slot;
}
else
return agg_retrieve_direct(node);
@@ -1536,13 +1579,15 @@ agg_retrieve_direct(AggState *aggstate)
/*
* ExecAgg for hashed case: phase 1, read input and build hash table
*/
-static void
+static bool
agg_fill_hash_table(AggState *aggstate)
{
PlanState *outerPlan;
ExprContext *tmpcontext;
AggHashEntry entry;
- TupleTableSlot *outerslot;
+ TupleTableSlot *outerslot = NULL;
+ HashWork *work;
+ int i;
/*
* get state info from node
@@ -1551,33 +1596,120 @@ agg_fill_hash_table(AggState *aggstate)
/* tmpcontext is the per-input-tuple expression context */
tmpcontext = aggstate->tmpcontext;
+ /* if there's no HashWork item, we're done */
+ if (aggstate->hash_work == NIL)
+ {
+ aggstate->agg_done = true;
+ return false;
+ }
+
+ work = linitial(aggstate->hash_work);
+ aggstate->hash_work = list_delete_first(aggstate->hash_work);
+
+ /* if not the first time through, reinitialize */
+ if (!aggstate->hash_init_state)
+ {
+ /* FIXME get rid of all the previous aggregate states, either by
+ * resetting the aggcontext or by clearing the hash table.
+ * Resetting the context seems better. */
+
+ /* reset the hash table (free the chunks, zero buckets) */
+ reset_hash_table(aggstate->hashtable);
+ }
+
+ /* reinitialize on the next item */
+ aggstate->hash_init_state = false;
+
/*
* Process each outer-plan tuple, and then fetch the next one, until we
* exhaust the outer plan.
*/
for (;;)
{
- outerslot = ExecProcNode(outerPlan);
- if (TupIsNull(outerslot))
- break;
- /* set up for advance_aggregates call */
- tmpcontext->ecxt_outertuple = outerslot;
+
+ uint32 hashvalue;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* a NULL input file means we read directly from the outer plan */
+ if (work->input_file == NULL)
+ {
+ outerslot = ExecProcNode(outerPlan);
+ if (TupIsNull(outerslot))
+ break;
+
+ hashvalue = compute_hash_value(aggstate, outerslot);
+ }
+ else
+ {
+ /* first time through this HashWork item */
+ if (outerslot == NULL)
+ outerslot = MakeSingleTupleTableSlot(aggstate->hashtable->slot->tts_tupleDescriptor);
+
+ outerslot = read_saved_tuple(work->input_file, &hashvalue, outerslot);
+ if (TupIsNull(outerslot))
+ {
+ BufFileClose(work->input_file);
+ work->input_file = NULL;
+ break;
+ }
+ }
/* Find or build hashtable entry for this tuple's group */
- entry = lookup_hash_entry(aggstate, outerslot);
+ entry = lookup_hash_entry(aggstate, work, hashvalue, outerslot);
- /* Advance the aggregates */
- advance_aggregates(aggstate, entry->pergroup);
+ if (entry != NULL) {
- /* Reset per-input-tuple context after each tuple */
- ResetExprContext(tmpcontext);
+ /* set up for advance_aggregates call */
+ tmpcontext->ecxt_outertuple = outerslot;
+
+ /* Advance the aggregates */
+ advance_aggregates(aggstate, entry->pergroup);
+
+ /* Reset per-input-tuple context after each tuple */
+ ResetExprContext(tmpcontext);
+
+ } else {
+
+ /* no entry for this tuple, and we've reached work_mem */
+ save_tuple(aggstate, work, outerslot, hashvalue);
+
+ }
}
+ /* add each output partition as a new work item */
+ for (i = 0; i < work->n_output_partitions; i++)
+ {
+ BufFile *file = work->output_partitions[i];
+ MemoryContext oldContext;
+
+ /* partition is empty */
+ if (work->output_ntuples[i] == 0)
+ continue;
+
+ /* rewind file for reading */
+ if (BufFileSeek(file, 0, 0L, SEEK_SET))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rewind HashAgg temporary file: %m")));
+
+ oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
+ aggstate->hash_work = lappend(aggstate->hash_work,
+ hash_work(file,
+ work->output_bits + work->input_bits));
+ MemoryContextSwitchTo(oldContext);
+ }
+
+ pfree(work);
+
aggstate->table_filled = true;
/* Initialize for iteration through the table (first bucket / entry) */
IteratorReset(aggstate->hashtable);
+ /* ready to return groups from this hash table */
+ return true;
+
}
/*
@@ -1620,6 +1752,8 @@ agg_retrieve_hash_table(AggState *aggstate)
*/
ResetExprContext(econtext);
+ htab->niterated += 1;
+
/*
* Store the copied first input tuple in the tuple table slot reserved
* for it, so that it can be used in ExecProject.
@@ -1677,7 +1811,8 @@ agg_retrieve_hash_table(AggState *aggstate)
}
- aggstate->agg_done = true;
+ /* No more entries in hashtable, so done with this batch */
+ aggstate->table_filled = false;
/* No more groups */
return NULL;
@@ -1739,11 +1874,11 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* recover no-longer-wanted space.
*/
aggstate->aggcontext =
- AllocSetContextCreate(CurrentMemoryContext,
+ AllocSetContextCreateTracked(CurrentMemoryContext,
"AggContext",
ALLOCSET_DEFAULT_MINSIZE,
ALLOCSET_DEFAULT_INITSIZE,
- ALLOCSET_DEFAULT_MAXSIZE);
+ ALLOCSET_DEFAULT_MAXSIZE, true);
/*
* tuple table initialization
@@ -1842,10 +1977,22 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
if (node->aggstrategy == AGG_HASHED)
{
+ MemoryContext oldContext;
+
build_hash_table(aggstate, outerPlan->plan_width);
aggstate->table_filled = false;
+ aggstate->hash_init_state = true;
+ aggstate->hash_disk = false;
+
/* Compute the columns we actually need to hash on */
aggstate->hash_needed = find_hash_columns(aggstate);
+
+ /* prime with initial work item to read from outer plan */
+ oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
+ aggstate->hash_work = lappend(aggstate->hash_work,
+ hash_work(NULL, 0));
+ MemoryContextSwitchTo(oldContext);
+
}
else
{
@@ -2264,22 +2411,23 @@ ExecReScanAgg(AggState *node)
if (((Agg *) node->ss.ps.plan)->aggstrategy == AGG_HASHED)
{
/*
- * In the hashed case, if we haven't yet built the hash table then we
- * can just return; nothing done yet, so nothing to undo. If subnode's
- * chgParam is not NULL then it will be re-scanned by ExecProcNode,
- * else no reason to re-scan it at all.
+ * In the hashed case, if we haven't done any execution work yet, we
+ * can just return; nothing to undo. If subnode's chgParam is not NULL
+ * then it will be re-scanned by ExecProcNode, else no reason to
+ * re-scan it at all.
*/
- if (!node->table_filled)
+ if (node->hash_init_state)
return;
/*
- * If we do have the hash table and the subplan does not have any
- * parameter changes, then we can just rescan the existing hash table;
- * no need to build it again.
+ * If we do have the hash table, it never went to disk, and the
+ * subplan does not have any parameter changes, then we can just
+ * rescan the existing hash table; no need to build it again.
*/
- if (node->ss.ps.lefttree->chgParam == NULL)
+ if (node->ss.ps.lefttree->chgParam == NULL && !node->hash_disk)
{
IteratorReset(node->hashtable);
+ node->table_filled = true;
return;
}
}
@@ -2318,10 +2466,21 @@ ExecReScanAgg(AggState *node)
if (((Agg *) node->ss.ps.plan)->aggstrategy == AGG_HASHED)
{
+ MemoryContext oldContext;
Plan * outerPlan = outerPlan((Agg *) node->ss.ps.plan);
+
/* Rebuild an empty hash table */
build_hash_table(node, outerPlan->plan_width);
+ node->hash_init_state = true;
node->table_filled = false;
+ node->hash_disk = false;
+ node->hash_work = NIL;
+
+ /* prime with initial work item to read from outer plan */
+ oldContext = MemoryContextSwitchTo(node->aggcontext);
+ node->hash_work = lappend(node->hash_work,
+ hash_work(NULL, 0));
+ MemoryContextSwitchTo(oldContext);
}
else
{
@@ -2827,6 +2986,144 @@ AggHashEntry IteratorGetNext(AggHashTable htab)
}
/*
+ * hash_work
+ *
+ * Construct a HashWork item, which represents one iteration of HashAgg to be
+ * done. Should be called in the aggregate's memory context.
+ */
+static HashWork *
+hash_work(BufFile *input_file, int input_bits)
+{
+ HashWork *work = palloc(sizeof(HashWork));
+
+ work->input_file = input_file;
+ work->input_bits = input_bits;
+
+ /*
+ * Will be set only if we run out of memory and need to partition an
+ * additional level.
+ */
+ work->n_output_partitions = 0;
+ work->output_partitions = NULL;
+ work->output_ntuples = NULL;
+ work->output_bits = 0;
+
+ return work;
+}
+
+/*
+ * save_tuple
+ *
+ * Not enough memory to add tuple as new entry in hash table. Save for later
+ * in the appropriate partition.
+ */
+static void
+save_tuple(AggState *aggstate, HashWork *work, TupleTableSlot *slot,
+ uint32 hashvalue)
+{
+ int partition;
+ MinimalTuple tuple;
+ BufFile *file;
+ int written;
+
+ if (work->output_partitions == NULL)
+ {
+ int npartitions = HASH_DISK_DEFAULT_PARTITIONS; //TODO choose
+ int partition_bits;
+ int i;
+
+ if (npartitions < HASH_DISK_MIN_PARTITIONS)
+ npartitions = HASH_DISK_MIN_PARTITIONS;
+ if (npartitions > HASH_DISK_MAX_PARTITIONS)
+ npartitions = HASH_DISK_MAX_PARTITIONS;
+
+ partition_bits = my_log2(npartitions);
+
+ /* make sure that we don't exhaust the hash bits */
+ if (partition_bits + work->input_bits >= 32)
+ partition_bits = 32 - work->input_bits;
+
+ /* number of partitions will be a power of two */
+ npartitions = 1L << partition_bits;
+
+ work->output_bits = partition_bits;
+ work->n_output_partitions = npartitions;
+ work->output_partitions = palloc(sizeof(BufFile *) * npartitions);
+ work->output_ntuples = palloc0(sizeof(int) * npartitions);
+
+ for (i = 0; i < npartitions; i++)
+ work->output_partitions[i] = BufFileCreateTemp(false);
+ }
+
+ if (work->output_bits == 0)
+ partition = 0;
+ else
+ partition = (hashvalue << work->input_bits) >>
+ (32 - work->output_bits);
+
+ work->output_ntuples[partition]++;
+ file = work->output_partitions[partition];
+ tuple = ExecFetchSlotMinimalTuple(slot);
+
+ written = BufFileWrite(file, (void *) &hashvalue, sizeof(uint32));
+ if (written != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+
+ written = BufFileWrite(file, (void *) tuple, tuple->t_len);
+ if (written != tuple->t_len)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+}
+
+
+/*
+ * read_saved_tuple
+ * read the next tuple from a batch file. Return NULL if no more.
+ *
+ * On success, *hashvalue is set to the tuple's hash value, and the tuple
+ * itself is stored in the given slot.
+ *
+ * Copied with minor modifications from ExecHashJoinGetSavedTuple.
+ */
+static TupleTableSlot *
+read_saved_tuple(BufFile *file, uint32 *hashvalue, TupleTableSlot *tupleSlot)
+{
+ uint32 header[2];
+ size_t nread;
+ MinimalTuple tuple;
+
+ /*
+ * Since both the hash value and the MinimalTuple length word are uint32,
+ * we can read them both in one BufFileRead() call without any type
+ * cheating.
+ */
+ nread = BufFileRead(file, (void *) header, sizeof(header));
+ if (nread == 0) /* end of file */
+ {
+ ExecClearTuple(tupleSlot);
+ return NULL;
+ }
+ if (nread != sizeof(header))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+ *hashvalue = header[0];
+ tuple = (MinimalTuple) palloc(header[1]);
+ tuple->t_len = header[1];
+ nread = BufFileRead(file,
+ (void *) ((char *) tuple + sizeof(uint32)),
+ header[1] - sizeof(uint32));
+ if (nread != header[1] - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+ return ExecStoreMinimalTuple(tuple, tupleSlot, true);
+}
+
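The spill-file format written by save_tuple() and consumed by read_saved_tuple() is just a sequence of records, each consisting of the 32-bit hash value followed by the MinimalTuple, whose own first word is its total length. Here is a standalone sketch of the equivalent layout (ordinary stdio in place of BufFile, an opaque length-prefixed blob in place of MinimalTuple; names are made up):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* record layout: [uint32 hash][uint32 len][len - 4 bytes of payload] */
void
write_record(FILE *f, uint32_t hash, const void *payload, uint32_t payload_len)
{
    uint32_t    len = payload_len + sizeof(uint32_t);   /* includes length word */

    fwrite(&hash, sizeof(uint32_t), 1, f);
    fwrite(&len, sizeof(uint32_t), 1, f);
    fwrite(payload, payload_len, 1, f);
}

void *
read_record(FILE *f, uint32_t *hash, uint32_t *payload_len)
{
    uint32_t    header[2];
    char       *buf;

    if (fread(header, sizeof(header), 1, f) != 1)
        return NULL;                    /* end of file */

    *hash = header[0];
    *payload_len = header[1] - sizeof(uint32_t);

    buf = malloc(*payload_len);
    if (fread(buf, *payload_len, 1, f) != 1)
    {
        free(buf);
        return NULL;                    /* truncated file */
    }
    return buf;
}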
+/*
* Resets the contents of the hash table - removes all the entries and
* tuples, but keeps the 'size' of the hash table (nbuckets).
*/
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 0cdb790..926abad 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -113,6 +113,7 @@ bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
bool enable_hashagg = true;
+bool enable_hashagg_disk = true;
bool enable_nestloop = true;
bool enable_material = true;
bool enable_mergejoin = true;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e1480cd..7b8135d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2741,7 +2741,8 @@ choose_hashed_grouping(PlannerInfo *root,
/* plus the per-hash-entry overhead */
hashentrysize += hash_agg_entry_size(agg_costs->numAggs);
- if (hashentrysize * dNumGroups > work_mem * 1024L)
+ if (!enable_hashagg_disk &&
+ hashentrysize * dNumGroups > work_mem * 1024L)
return false;
/*
@@ -2907,7 +2908,8 @@ choose_hashed_distinct(PlannerInfo *root,
/* plus the per-hash-entry overhead */
hashentrysize += hash_agg_entry_size(0);
- if (hashentrysize * dNumDistinctRows > work_mem * 1024L)
+ if (!enable_hashagg_disk &&
+ hashentrysize * dNumDistinctRows > work_mem * 1024L)
return false;
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8c57803..5128e20 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -749,6 +749,15 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
+ {"enable_hashagg_disk", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of disk-based hashed aggregation plans."),
+ NULL
+ },
+ &enable_hashagg_disk,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of materialization."),
NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index df98b02..8f5b73b 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -266,6 +266,7 @@
#enable_bitmapscan = on
#enable_hashagg = on
+#enable_hashagg_disk = on
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index a70b296..97034f1 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -634,7 +634,7 @@ MemoryContextCreate(NodeTag tag, Size size,
*/
if (track_mem)
{
- node->accounting = (MemoryAccounting)MemoryContextAlloc(TopMemoryContext,
+ node->accounting = (MemoryAccounting)MemoryContextAllocZero(TopMemoryContext,
sizeof(MemoryAccountingData));
if (parent)
node->accounting->parent = parent->accounting;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 995389b..1a61ac7 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1726,6 +1726,11 @@ typedef struct AggState
bool table_filled; /* hash table filled yet? */
AggHashTable hashtable; /* instance of the simple hash table */
+ /* simple batching */
+ bool hash_init_state; /* in initial state before execution? */
+ bool hash_disk; /* have we exceeded memory yet? */
+ List *hash_work; /* remaining work to be done */
+
} AggState;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 75e2afb..d363e65 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -57,6 +57,7 @@ extern bool enable_bitmapscan;
extern bool enable_tidscan;
extern bool enable_sort;
extern bool enable_hashagg;
+extern bool enable_hashagg_disk;
extern bool enable_nestloop;
extern bool enable_material;
extern bool enable_mergejoin;
diff --git a/src/test/regress/expected/rangefuncs.out b/src/test/regress/expected/rangefuncs.out
index 774e75e..e88c83c 100644
--- a/src/test/regress/expected/rangefuncs.out
+++ b/src/test/regress/expected/rangefuncs.out
@@ -3,6 +3,7 @@ SELECT name, setting FROM pg_settings WHERE name LIKE 'enable%';
----------------------+---------
enable_bitmapscan | on
enable_hashagg | on
+ enable_hashagg_disk | on
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
@@ -12,7 +13,7 @@ SELECT name, setting FROM pg_settings WHERE name LIKE 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(11 rows)
+(12 rows)
CREATE TABLE foo2(fooid int, f2 int);
INSERT INTO foo2 VALUES(1, 11);
Attachment: hashagg-batching-jeff-pt2-v1.patch (text/x-diff)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..54caade 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -78,6 +78,8 @@ static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
ExplainState *es);
+static void show_agg_batching(AggState *astate, List *ancestors,
+ ExplainState *es);
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
@@ -1391,6 +1393,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_Agg:
+ show_agg_batching((AggState *) planstate, ancestors, es);
show_agg_keys((AggState *) planstate, ancestors, es);
show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
if (plan->qual)
@@ -1790,6 +1793,36 @@ show_agg_keys(AggState *astate, List *ancestors,
}
/*
+ * Show the batching info for an Agg node.
+ */
+static void
+show_agg_batching(AggState *astate, List *ancestors,
+ ExplainState *es)
+{
+ Agg *plan = (Agg *) astate->ss.ps.plan;
+
+ if ((es->analyze) && (plan->aggstrategy == AGG_HASHED))
+ {
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Batch Count: %d Rebatches: %d Smallest: %ldkB Largest: %ldkB Rescanned: %.0f%%\n",
+ astate->batch_count, astate->rebatch_count,
+ astate->batch_min_size / 1024,
+ astate->batch_max_size / 1024,
+ astate->ntuples_rescanned * 100.0 / astate->ntuples_scanned);
+ }
+ else
+ {
+ ExplainPropertyLong("Batch Count", astate->batch_count, es);
+ ExplainPropertyLong("Batch Smallest", astate->batch_min_size/1024, es);
+ ExplainPropertyLong("Batch Largest", astate->batch_max_size/1024, es);
+ ExplainPropertyLong("Batch Rescan Rate", (astate->ntuples_rescanned * 100) / astate->ntuples_scanned, es);
+ }
+ }
+}
+
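With the above, a text-format EXPLAIN (ANALYZE) of a spilled hash aggregate gains one extra line per Agg node. The numbers below are invented; only the shape follows the format string above:

    Batch Count: 5 Rebatches: 1 Smallest: 3968kB Largest: 4224kB Rescanned: 42%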
+/*
* Show the grouping keys for a Group node.
*/
static void
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 3ae9583..1266faf 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -118,7 +118,7 @@
#include "utils/datum.h"
#include "utils/dynahash.h"
-#define HASH_DISK_MIN_PARTITIONS 1
+#define HASH_DISK_MIN_PARTITIONS 2
#define HASH_DISK_DEFAULT_PARTITIONS 4
#define HASH_DISK_MAX_PARTITIONS 256
@@ -328,6 +328,7 @@ typedef struct HashWork
BufFile *input_file; /* input partition, NULL for outer plan */
int input_bits; /* number of bits for input partition mask */
+ double ntuples_expected; /* number of tuples expected on input */
int n_output_partitions; /* number of output partitions */
BufFile **output_partitions; /* output partition files */
int *output_ntuples; /* number of tuples in each partition */
@@ -356,7 +357,7 @@ static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
static void build_hash_table(AggState *aggstate, Size tuple_width);
static AggHashEntry lookup_hash_entry(AggState *aggstate, HashWork * work,
uint32 hashvalue, TupleTableSlot *inputslot);
-static HashWork *hash_work(BufFile *input_file, int input_bits);
+static HashWork *hash_work(BufFile *input_file, int input_bits, double ntuples_expected);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static bool agg_fill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
@@ -376,7 +377,7 @@ static TupleTableSlot *
read_saved_tuple(BufFile *file, uint32 *hashvalue, TupleTableSlot *tupleSlot);
static void
save_tuple(AggState *aggstate, HashWork *work, TupleTableSlot *slot,
- uint32 hashvalue);
+ uint32 hashvalue, int64 ntuples_current);
/*
* The size of the chunks for dense allocation. This needs to be >8kB
@@ -1588,6 +1589,8 @@ agg_fill_hash_table(AggState *aggstate)
TupleTableSlot *outerslot = NULL;
HashWork *work;
int i;
+ int64 ntuples = 0;
+ Size allocated;
/*
* get state info from node
@@ -1639,6 +1642,7 @@ agg_fill_hash_table(AggState *aggstate)
break;
hashvalue = compute_hash_value(aggstate, outerslot);
+ aggstate->ntuples_scanned += 1;
}
else
{
@@ -1653,8 +1657,12 @@ agg_fill_hash_table(AggState *aggstate)
work->input_file = NULL;
break;
}
+
+ aggstate->ntuples_rescanned += 1;
}
+ ntuples++;
+
/* Find or build hashtable entry for this tuple's group */
entry = lookup_hash_entry(aggstate, work, hashvalue, outerslot);
@@ -1672,11 +1680,81 @@ agg_fill_hash_table(AggState *aggstate)
} else {
/* no entry for this tuple, and we've reached work_mem */
- save_tuple(aggstate, work, outerslot, hashvalue);
+ save_tuple(aggstate, work, outerslot, hashvalue, ntuples);
}
}
+ /*
+ * XXX Idea on estimating the number of partitions necessary in the
+ * next step, based on estimating group state size etc.
+ *
+ * See how much memory is allocated in the memory context, so that
+ * we can use it to compute the average state size (including overhead).
+ * The initial partitioning happened when we first hit work_mem, but
+ * the states may have grown further since then.
+ *
+ * Assuming the groups are not of wildly different size, we can
+ * optimize the partitioning in the following work items like this:
+ *
+ * 1) computing average group size
+ *
+ * avg_group_size = allocated_bytes / nentries
+ *
+ * 2) computing number of groups that fit into work_mem
+ *
+ * groups_work_mem = (work_mem * 1024L) / avg_group_size
+ *
+ * 3) computing tuples per group
+ *
+ * ntuples_per_group = (ntuples / nentries)
+ *
+ * 4) computing ntuples_in_partition
+ *
+ * optimal_partition_tuples = ntuples_per_group * groups_work_mem
+ *
+ * 5) we know how many tuples we wrote into each partition - we can
+ * either compute it as (ntuples/npartitions) which is easy, or
+ * track the number per partition (more correct), so we can
+ * decide into how many partitions should we split it in the next
+ * step
+ *
+ * npartitions = ntuples_per_partition / optimal_partition_tuples
+ *
+ * and we can do this immediately at the beginning, using the
+ * hash value (assuming it's a power of 2).
+ *
+ * This should minimize the number of times a tuple is read from the
+ * temporary file, only to be written again because there's not enough
+ * free memory.
+ *
+ * The question is how this will deal with exceptionally large
+ * groups. Technically, all partitions should receive about the
+ * same number of groups, but if there's a very frequent group the
+ * partition may be much larger (many more tuples, belonging to the
+ * very large group). What we need to prevent is splitting the data
+ * into needlessly small partitions.
+ */
+
+ allocated = MemoryContextGetAllocated(aggstate->hashtable->htabctx, true);
+
+ /* keep track of the largest/smallest batch size */
+ if (aggstate->batch_count == 1)
+ {
+ aggstate->batch_min_size = allocated;
+ aggstate->batch_max_size = allocated;
+ }
+ else
+ {
+ if (allocated < aggstate->batch_min_size)
+ aggstate->batch_min_size = allocated;
+ if (allocated > aggstate->batch_max_size)
+ aggstate->batch_max_size = allocated;
+ }
+
+ if (work->n_output_partitions > 0)
+ aggstate->rebatch_count += 1;
+
/* add each output partition as a new work item */
for (i = 0; i < work->n_output_partitions; i++)
{
@@ -1693,11 +1771,27 @@ agg_fill_hash_table(AggState *aggstate)
(errcode_for_file_access(),
errmsg("could not rewind HashAgg temporary file: %m")));
+ /*
+ * XXX This assumes all the batches are equally sized, but that
+ * can easily not be the case - imagine a frequent group. All the
+ * tuples will get saved into the same batch, thus making it
+ * much larger than the rest.
+ *
+ * At this point, we can also estimate the average group size as
+ * (allocated memory / nentries), and number of tuples per group
+ * (ntuples / nentries). Granted, those are pretty rough estimates
+ * that can go wrong in many ways, but it's better than nothing.
+ *
+ * TODO Keep track of number of tuples saved in each group.
+ */
oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
- aggstate->hash_work = lappend(aggstate->hash_work,
- hash_work(file,
- work->output_bits + work->input_bits));
+ aggstate->hash_work
+ = lappend(aggstate->hash_work,
+ hash_work(file,
+ work->output_bits + work->input_bits,
+ work->output_ntuples[i]));
MemoryContextSwitchTo(oldContext);
+ aggstate->batch_count += 1;
}
pfree(work);
@@ -1984,13 +2078,19 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->hash_init_state = true;
aggstate->hash_disk = false;
+ /* explain (analyze) counters */
+ aggstate->batch_count = 1;
+ aggstate->rebatch_count = 0;
+ aggstate->ntuples_scanned = 0;
+ aggstate->ntuples_rescanned = 0;
+
/* Compute the columns we actually need to hash on */
aggstate->hash_needed = find_hash_columns(aggstate);
/* prime with initial work item to read from outer plan */
oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
aggstate->hash_work = lappend(aggstate->hash_work,
- hash_work(NULL, 0));
+ hash_work(NULL, 0, outerPlan->plan_rows));
MemoryContextSwitchTo(oldContext);
}
@@ -2467,7 +2567,7 @@ ExecReScanAgg(AggState *node)
if (((Agg *) node->ss.ps.plan)->aggstrategy == AGG_HASHED)
{
MemoryContext oldContext;
- Plan * outerPlan = outerPlan((Agg *) node->ss.ps.plan);
+ Plan *outerPlan = outerPlan((Agg *) node->ss.ps.plan);
/* Rebuild an empty hash table */
build_hash_table(node, outerPlan->plan_width);
@@ -2476,10 +2576,16 @@ ExecReScanAgg(AggState *node)
node->hash_disk = false;
node->hash_work = NIL;
+ /* explain (analyze) counters */
+ node->batch_count = 1;
+ node->rebatch_count = 0;
+ node->ntuples_scanned = 0;
+ node->ntuples_rescanned = 0;
+
/* prime with initial work item to read from outer plan */
oldContext = MemoryContextSwitchTo(node->aggcontext);
node->hash_work = lappend(node->hash_work,
- hash_work(NULL, 0));
+ hash_work(NULL, 0, outerPlan->plan_rows));
MemoryContextSwitchTo(oldContext);
}
else
@@ -2992,12 +3098,13 @@ AggHashEntry IteratorGetNext(AggHashTable htab)
* done. Should be called in the aggregate's memory context.
*/
static HashWork *
-hash_work(BufFile *input_file, int input_bits)
+hash_work(BufFile *input_file, int input_bits, double ntuples_expected)
{
HashWork *work = palloc(sizeof(HashWork));
work->input_file = input_file;
work->input_bits = input_bits;
+ work->ntuples_expected = ntuples_expected;
/*
* Will be set only if we run out of memory and need to partition an
@@ -3019,7 +3126,7 @@ hash_work(BufFile *input_file, int input_bits)
*/
static void
save_tuple(AggState *aggstate, HashWork *work, TupleTableSlot *slot,
- uint32 hashvalue)
+ uint32 hashvalue, int64 ntuples_current)
{
int partition;
MinimalTuple tuple;
@@ -3028,7 +3135,35 @@ save_tuple(AggState *aggstate, HashWork *work, TupleTableSlot *slot,
if (work->output_partitions == NULL)
{
- int npartitions = HASH_DISK_DEFAULT_PARTITIONS; //TODO choose
+ /*
+ * Choose the number of partitions based on when we reached
+ * work_mem. We expect work->ntuples_expected tuples in total,
+ * and we've reached work_mem after ntuples_current of them, so
+ * we expect (work->ntuples_expected - ntuples_current) more.
+ *
+ * Assuming we fill work_mem every ntuples_current tuples, we can
+ * estimate the number of additional batches as
+ *
+ * ceil((work->ntuples_expected - ntuples_current) / ntuples_current)
+ *
+ * which is
+ *
+ * (int) (1 + (work->ntuples_expected - ntuples_current) / ntuples_current)
+ *
+ * i.e. just (work->ntuples_expected / ntuples_current). We'll
+ * impose some basic safety limits on that value.
+ *
+ * It's probably better to over-estimate here; under-estimating
+ * means we'll have to read the tuples and write them out again
+ * into another set of batches, which is not efficient.
+ *
+ * Also, even if there's a huge group in this batch, the state
+ * size usually grows along with the number of tuples passed
+ * to the transition function, so even in this case it should
+ * be a good estimate (i.e. batching should not be triggered
+ * too early).
+ */
+ int npartitions = (work->ntuples_expected / ntuples_current);
int partition_bits;
int i;
@@ -3037,7 +3172,9 @@ save_tuple(AggState *aggstate, HashWork *work, TupleTableSlot *slot,
if (npartitions > HASH_DISK_MAX_PARTITIONS)
npartitions = HASH_DISK_MAX_PARTITIONS;
+ /* make it a power of 2 */
partition_bits = my_log2(npartitions);
+ npartitions = (1 << partition_bits);
/* make sure that we don't exhaust the hash bits */
if (partition_bits + work->input_bits >= 32)
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1a61ac7..9b2558e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1731,6 +1731,20 @@ typedef struct AggState
bool hash_disk; /* have we exceeded memory yet? */
List *hash_work; /* remaining work to be done */
+ /* counters used mostly for explain (analyze) */
+ int batch_count; /* number of batches generated in total */
+ int rebatch_count; /* number of rebatches (splitting of a batch) */
+ Size batch_min_size; /* minimum batch size (bytes) */
+ Size batch_max_size; /* maximum batch size (bytes) */
+
+ /*
+ * These two counters allow evaluating how many times tuples were
+ * saved and re-read. With no batching, rescanned = 0; with a single
+ * level of batching, rescanned/scanned < 1.00; with multi-level
+ * batching it may happen that rescanned/scanned > 1.00. */
+ int64 ntuples_scanned; /* number of input tuples scanned */
+ int64 ntuples_rescanned; /* number of tuples saved/read */
+
} AggState;
/* ----------------
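As a worked example of the partition-count estimate added in this patch (made-up numbers): suppose a work item is expected to supply 10,000,000 tuples and work_mem fills up after the first 2,500,000. Then:

    npartitions    = ntuples_expected / ntuples_current = 10000000 / 2500000 = 4
    partition_bits = my_log2(4) = 2
    npartitions    = 1 << 2 = 4          (already a power of two)

Had the ratio come out as, say, 5, my_log2 (which rounds up) would give 3 bits and therefore 8 partitions, erring on the side of over-partitioning, which is what the comment in save_tuple() argues for.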
Attachment: hashagg-batching-hashjoin-v1.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 49547ee..b651858 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2884,6 +2884,21 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-hashagg-disk" xreflabel="enable_hashagg_disk">
+ <term><varname>enable_hashagg_disk</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_hashagg_disk</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the planner expects the hash table size to exceed
+ <varname>work_mem</varname>. The default is <literal>on</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
<term><varname>enable_hashjoin</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..ca9f026 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -78,6 +78,8 @@ static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
ExplainState *es);
+static void show_agg_batching(AggState *astate, List *ancestors,
+ ExplainState *es);
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
@@ -1391,6 +1393,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_Agg:
+ show_agg_batching((AggState *) planstate, ancestors, es);
show_agg_keys((AggState *) planstate, ancestors, es);
show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
if (plan->qual)
@@ -1790,6 +1793,37 @@ show_agg_keys(AggState *astate, List *ancestors,
}
/*
+ * Show the batching info for an Agg node.
+ */
+static void
+show_agg_batching(AggState *astate, List *ancestors,
+ ExplainState *es)
+{
+ Agg *plan = (Agg *) astate->ss.ps.plan;
+
+ if ((es->analyze) && (plan->aggstrategy == AGG_HASHED))
+ {
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Batch Count: %d Original: %d Smallest: %ldkB Largest: %ldkB Rescanned: %.0f%%\n",
+ astate->nbatch, astate->nbatch_original,
+ astate->batch_min_size / 1024,
+ astate->batch_max_size / 1024,
+ astate->ntuples_rescanned * 100.0 / astate->ntuples_scanned);
+ }
+ else
+ {
+ ExplainPropertyLong("Batch Count", astate->nbatch, es);
+ ExplainPropertyLong("Batch Count Original", astate->nbatch_original, es);
+ ExplainPropertyLong("Batch Smallest", astate->batch_min_size/1024, es);
+ ExplainPropertyLong("Batch Largest", astate->batch_max_size/1024, es);
+ ExplainPropertyLong("Batch Rescan Rate", (astate->ntuples_rescanned * 100) / astate->ntuples_scanned, es);
+ }
+ }
+}
+
+/*
* Show the grouping keys for a Group node.
*/
static void
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 6455864..d0e30b1 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -96,6 +96,8 @@
#include "postgres.h"
+#include <limits.h>
+
#include "access/htup_details.h"
#include "catalog/objectaccess.h"
#include "catalog/pg_aggregate.h"
@@ -108,6 +110,7 @@
#include "optimizer/tlist.h"
#include "parser/parse_agg.h"
#include "parser/parse_coerce.h"
+#include "storage/buffile.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"
@@ -115,7 +118,11 @@
#include "utils/syscache.h"
#include "utils/tuplesort.h"
#include "utils/datum.h"
+#include "utils/dynahash.h"
+#define HASH_DISK_MIN_PARTITIONS 2
+#define HASH_DISK_DEFAULT_PARTITIONS 4
+#define HASH_DISK_MAX_PARTITIONS 256
/*
* AggStatePerAggData - per-aggregate working state for the Agg scan
@@ -310,7 +317,6 @@ typedef struct AggHashEntryData
} AggHashEntryData; /* VARIABLE LENGTH STRUCT */
-
static void initialize_aggregates(AggState *aggstate,
AggStatePerAgg peragg,
AggStatePerGroup pergroup);
@@ -332,22 +338,44 @@ static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
static void build_hash_table(AggState *aggstate, Size tuple_width);
static AggHashEntry lookup_hash_entry(AggState *aggstate,
- TupleTableSlot *inputslot);
+ uint32 hashvalue, TupleTableSlot *inputslot);
+static void create_hash_entry(AggState *aggstate, AggHashEntry entry);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
-static void agg_fill_hash_table(AggState *aggstate);
+static bool agg_fill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static uint32 compute_hash_value(AggState * aggstate, TupleTableSlot * slot);
static uint32 compute_bucket(AggState * aggstate, uint32 hashvalue);
+static uint32 compute_batchno(AggState * aggstate, uint32 hashvalue);
static bool groups_match(AggState * aggstate, TupleTableSlot *slot, AggHashEntry entry);
static void increase_nbuckets(AggState * aggstate);
+static void increase_nbatches(AggState * aggstate);
static char * chunk_alloc(AggHashTable htab, int size);
static void reset_hash_table(AggHashTable htab);
+static int choose_nbatch(AggState *aggstate, int nbuckets, Size tuple_width);
+static void init_batch_files(AggState * aggstate);
+static void close_batch_files(AggState * aggstate);
+static void reinit_batch_files(AggState * aggstate);
+
static void IteratorReset(AggHashTable htab);
static AggHashEntry IteratorGetNext(AggHashTable htab);
+static TupleTableSlot *
+read_saved_tuple(BufFile *file, uint32 *hashvalue, TupleTableSlot *tupleSlot);
+AggHashEntry
+read_saved_group(AggState * aggstate, BufFile *file, AggHashEntry entry);
+
+static void
+save_tuple(AggState *aggstate, int batchno, TupleTableSlot *slot,
+ uint32 hashvalue);
+static void
+save_group(AggState *aggstate, int batchno, AggHashEntry entry);
+
+static bool
+batching_supported(AggState * aggstate);
+
/*
* The size of the chunks for dense allocation. This needs to be >8kB
* because the default (and only) memory context implementation uses
@@ -392,6 +420,7 @@ typedef struct AggHashTableData
int nentries; /* number of hash table entries */
int nbuckets; /* current number of buckets */
int nbuckets_max; /* max number of buckets */
+ int nbuckets_bits; /* bits for nbuckets_max (used for batching) */
/* items copied from the TupleHashTable, because we still need them */
MemoryContext tmpctx; /* short-lived memory context (hash/eq funcs) */
@@ -412,6 +441,7 @@ typedef struct AggHashTableData
*/
HashChunk cur_chunk;
AggHashEntry cur_entry;
+ int niterated;
/* list of chunks with dense-packed entries / minimal tuples */
HashChunk chunks_hash;
@@ -1027,6 +1057,7 @@ build_hash_table(AggState *aggstate, Size tuple_width)
/* we assume 1024 buckets (i.e. 8kB of memory) is minimum */
int nbuckets = 1024;
int nbuckets_max = 1024;
+ int nbuckets_bits = 10;
Assert(node->aggstrategy == AGG_HASHED);
Assert(node->numGroups > 0);
@@ -1076,7 +1107,10 @@ build_hash_table(AggState *aggstate, Size tuple_width)
* save a bit of memory by that (although not much).
*/
while (nbuckets_max * groupsize <= work_mem * 1024L)
+ {
nbuckets_max *= 2;
+ nbuckets_bits += 1;
+ }
/*
* Update the initial number of buckets to match expected number of groups,
@@ -1096,12 +1130,15 @@ build_hash_table(AggState *aggstate, Size tuple_width)
htab = (AggHashTable)MemoryContextAllocZero(aggstate->aggcontext,
sizeof(AggHashTableData));
+ htab->niterated = 0;
+
/* TODO create a memory context for the hash table */
- htab->htabctx = AllocSetContextCreate(aggstate->aggcontext,
+ htab->htabctx = AllocSetContextCreateTracked(aggstate->aggcontext,
"HashAggHashTable",
ALLOCSET_DEFAULT_MINSIZE,
ALLOCSET_DEFAULT_INITSIZE,
- ALLOCSET_DEFAULT_MAXSIZE);
+ ALLOCSET_DEFAULT_MAXSIZE,
+ true);
/* buckets are just pointers to AggHashEntryData structures */
htab->buckets = (AggHashEntry*)MemoryContextAllocZero(htab->htabctx,
@@ -1115,6 +1152,7 @@ build_hash_table(AggState *aggstate, Size tuple_width)
htab->nbuckets = nbuckets;
htab->nbuckets_max = nbuckets_max;
+ htab->nbuckets_bits = nbuckets_bits;
htab->nentries = 0;
htab->slot = NULL;
htab->numCols = node->numCols;
@@ -1198,15 +1236,14 @@ hash_agg_entry_size(int numAggs)
* When called, CurrentMemoryContext should be the per-query context.
*/
static AggHashEntry
-lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
+lookup_hash_entry(AggState *aggstate, uint32 hashvalue,
+TupleTableSlot *inputslot)
{
AggHashEntry entry = NULL;
- uint32 hashvalue;
uint32 bucketno;
MinimalTuple mintuple;
- hashvalue = compute_hash_value(aggstate, inputslot);
bucketno = compute_bucket(aggstate, hashvalue);
entry = aggstate->hashtable->buckets[bucketno];
@@ -1223,9 +1260,11 @@ lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
entry = entry->next;
}
- /* There's not a maching entry in the bucket, so create a new one and
- * copy in data both for the aggregates, and the MinimalTuple containing
- * keys for the group columns. */
+ /*
+ * There's not a maching entry in the bucket, create a new one and
+ * copy in data both for the aggregates, and the MinimalTuple
+ * containing keys for the group columns.
+ */
if (entry == NULL)
{
@@ -1265,16 +1304,48 @@ lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
aggstate->hashtable->nentries += 1;
- }
+ /* once we exceed 1 entry / bucket, increase number of buckets */
+ if (aggstate->hashtable->nentries > aggstate->hashtable->nbuckets)
+ increase_nbuckets(aggstate);
- /* once we exceed 1 entry / bucket, increase number of buckets */
- if (aggstate->hashtable->nentries > aggstate->hashtable->nbuckets)
- increase_nbuckets(aggstate);
+ }
return entry;
}
/*
+ * Creates a new hash entry in the hash table, containing the provided
+ * data. This assumes there's not a matching entry (this is not checked,
+ * and it's expected the caller not to break this).
+ *
+ * This is used when adding entries with aggregate states read from
+ * a batch file.
+ */
+static void create_hash_entry(AggState *aggstate, AggHashEntry entry)
+{
+ AggHashEntry entry_new
+ = (AggHashEntry) chunk_alloc(aggstate->hashtable,
+ aggstate->hashtable->entrysize + entry->tuple->t_len);
+
+ AggHashTable htab = aggstate->hashtable;
+
+ int bucketno = compute_bucket(aggstate, entry->hashvalue);
+
+ Assert((bucketno >= 0) && (bucketno < htab->nbuckets));
+ Assert(aggstate->cur_batch == compute_batchno(aggstate, entry->hashvalue));
+
+ memcpy(entry_new, entry, htab->entrysize);
+
+ entry_new->tuple = (MinimalTuple)((char*)entry_new + htab->entrysize);
+
+ memcpy(entry_new->tuple, entry->tuple, entry->tuple->t_len);
+
+ entry_new->next = htab->buckets[bucketno];
+ htab->buckets[bucketno] = entry_new;
+
+}
+
+/*
* ExecAgg -
*
* ExecAgg receives tuples from its outer subplan and aggregates over
@@ -1318,9 +1389,16 @@ ExecAgg(AggState *node)
/* Dispatch based on strategy */
if (((Agg *) node->ss.ps.plan)->aggstrategy == AGG_HASHED)
{
- if (!node->table_filled)
- agg_fill_hash_table(node);
- return agg_retrieve_hash_table(node);
+ TupleTableSlot *slot = NULL;
+
+ while (slot == NULL)
+ {
+ if (!node->table_filled)
+ if (! agg_fill_hash_table(node))
+ break; /* no more batches to process */
+ slot = agg_retrieve_hash_table(node);
+ }
+ return slot;
}
else
return agg_retrieve_direct(node);
@@ -1536,13 +1614,16 @@ agg_retrieve_direct(AggState *aggstate)
/*
* ExecAgg for hashed case: phase 1, read input and build hash table
*/
-static void
+static bool
agg_fill_hash_table(AggState *aggstate)
{
PlanState *outerPlan;
ExprContext *tmpcontext;
AggHashEntry entry;
- TupleTableSlot *outerslot;
+ TupleTableSlot *outerslot = NULL;
+ int64 ntuples = 0;
+ Size allocated;
+ BufFile *infile = NULL;
/*
* get state info from node
@@ -1551,33 +1632,172 @@ agg_fill_hash_table(AggState *aggstate)
/* tmpcontext is the per-input-tuple expression context */
tmpcontext = aggstate->tmpcontext;
+ /* if there're no more batches, we're done */
+ if (aggstate->cur_batch == aggstate->nbatch)
+ {
+ aggstate->agg_done = true;
+ return false;
+ }
+
+ /* if not the first time through, reinitialize */
+ if (aggstate->cur_batch > 0)
+ {
+
+ BufFile *file_groups = aggstate->batched_groups[aggstate->cur_batch];
+
+ /* used for all the read_saved_group calls, to minimize palloc
+ * overhead (and released in the last one automatically) */
+ AggHashEntry entry = (AggHashEntry)palloc0(aggstate->hashtable->entrysize);
+
+ /* reset the hash table (free the chunks, zero buckets, ...) */
+ reset_hash_table(aggstate->hashtable);
+
+ /* read all the aggregate states and either insert them into the
+ * hash table, or move them to the proper batch */
+ BufFileSeek(file_groups, 0, 0L, SEEK_SET);
+
+ while ((entry = read_saved_group(aggstate, file_groups, entry)) != NULL)
+ {
+ /* XXX hashjoin uses a single call to compute both bucket
+ * and batch, maybe we could do the same here (and pass
+ * bucketno to create_hash_entry) */
+ int batchno = compute_batchno(aggstate, entry->hashvalue);
+
+ if (batchno == aggstate->cur_batch)
+ /* keep in the current batch */
+ create_hash_entry(aggstate, entry);
+ else
+ /* move to the proper batch */
+ save_group(aggstate, batchno, entry);
+ }
+
+ /* we're done with the temp file */
+ BufFileClose(file_groups);
+ aggstate->batched_groups[aggstate->cur_batch] = NULL;
+
+ /* prepare to read the saved tuples */
+ BufFileSeek(aggstate->batched_tuples[aggstate->cur_batch], 0, 0L, SEEK_SET);
+ infile = aggstate->batched_tuples[aggstate->cur_batch];
+ }
+
/*
* Process each outer-plan tuple, and then fetch the next one, until we
* exhaust the outer plan.
*/
for (;;)
{
- outerslot = ExecProcNode(outerPlan);
- if (TupIsNull(outerslot))
- break;
- /* set up for advance_aggregates call */
- tmpcontext->ecxt_outertuple = outerslot;
- /* Find or build hashtable entry for this tuple's group */
- entry = lookup_hash_entry(aggstate, outerslot);
+ uint32 hashvalue;
+ int batchno = 0;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* the first batch means we need to fetch the tuples */
+ if (aggstate->cur_batch == 0)
+ {
+ outerslot = ExecProcNode(outerPlan);
+
+ if (TupIsNull(outerslot))
+ break;
+
+ /* copy the tuple descriptor for the following batches */
+ if (aggstate->tupdesc == NULL)
+ {
+ MemoryContext oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
+ aggstate->tupdesc = CreateTupleDescCopy(outerslot->tts_tupleDescriptor);
+ MemoryContextSwitchTo(oldContext);
+ }
+
+ hashvalue = compute_hash_value(aggstate, outerslot);
+ aggstate->ntuples_scanned += 1;
+ }
+ else
+ {
+
+ if (outerslot == NULL)
+ outerslot = MakeSingleTupleTableSlot(aggstate->tupdesc);
+
+ outerslot = read_saved_tuple(infile, &hashvalue, outerslot);
+ if (TupIsNull(outerslot))
+ {
+ BufFileClose(infile);
+ aggstate->batched_tuples[aggstate->cur_batch] = NULL;
+ break;
+ }
+
+ aggstate->ntuples_rescanned += 1;
+ }
+
+ ntuples++;
+
+ batchno = compute_batchno(aggstate, hashvalue);
+
+ Assert(batchno >= aggstate->cur_batch);
+
+ if (batchno == aggstate->cur_batch) {
+
+ /* Find or build hashtable entry for this tuple's group */
+ entry = lookup_hash_entry(aggstate, hashvalue, outerslot);
+
+ /* set up for advance_aggregates call */
+ tmpcontext->ecxt_outertuple = outerslot;
- /* Advance the aggregates */
- advance_aggregates(aggstate, entry->pergroup);
+ /* Advance the aggregates */
+ advance_aggregates(aggstate, entry->pergroup);
- /* Reset per-input-tuple context after each tuple */
- ResetExprContext(tmpcontext);
+ /* Reset per-input-tuple context after each tuple */
+ ResetExprContext(tmpcontext);
+
+ /* have we exceeded work_mem? if yes, increase number of batches
+ *
+ * FIXME This uses htabctx, which is OK for states using
+ * pass-by-value types, but it's not really correct
+ * in general (use aggcontext instead).
+ */
+ if (MemoryContextGetAllocated(aggstate->hashtable->htabctx, true) >= work_mem * 1024L)
+ increase_nbatches(aggstate);
+
+ } else {
+
+ /* no entry for this tuple, and we've reached work_mem */
+ save_tuple(aggstate, batchno, outerslot, hashvalue);
+
+ }
}
+ allocated = MemoryContextGetAllocated(aggstate->hashtable->htabctx, true);
+
+ /* keep track of the largest/smallest batch size */
+ if (aggstate->cur_batch == 0)
+ {
+ aggstate->batch_min_size = allocated;
+ aggstate->batch_max_size = allocated;
+ }
+ else
+ {
+ if (allocated < aggstate->batch_min_size)
+ aggstate->batch_min_size = allocated;
+ if (allocated > aggstate->batch_max_size)
+ aggstate->batch_max_size = allocated;
+ }
+
+ /* if we're in the first batch, and there were 0 tuples, we're done */
+ if ((aggstate->cur_batch == 0) && (aggstate->ntuples_scanned == 0))
+ {
+ aggstate->agg_done = true;
+ return false;
+ }
+
+ /* the hash table is filled, and we're ready for the next batch */
aggstate->table_filled = true;
+ aggstate->cur_batch += 1;
/* Initialize for iteration through the table (first bucket / entry) */
IteratorReset(aggstate->hashtable);
+ /* ready to return groups from this hash table */
+ return true;
+
}
/*
@@ -1620,6 +1840,8 @@ agg_retrieve_hash_table(AggState *aggstate)
*/
ResetExprContext(econtext);
+ htab->niterated += 1;
+
/*
* Store the copied first input tuple in the tuple table slot reserved
* for it, so that it can be used in ExecProject.
@@ -1677,7 +1899,8 @@ agg_retrieve_hash_table(AggState *aggstate)
}
- aggstate->agg_done = true;
+ /* No more entries in hashtable, so done with this batch */
+ aggstate->table_filled = false;
/* No more groups */
return NULL;
@@ -1739,11 +1962,11 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* recover no-longer-wanted space.
*/
aggstate->aggcontext =
- AllocSetContextCreate(CurrentMemoryContext,
+ AllocSetContextCreateTracked(CurrentMemoryContext,
"AggContext",
ALLOCSET_DEFAULT_MINSIZE,
ALLOCSET_DEFAULT_INITSIZE,
- ALLOCSET_DEFAULT_MAXSIZE);
+ ALLOCSET_DEFAULT_MAXSIZE, true);
/*
* tuple table initialization
@@ -1842,10 +2065,29 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
if (node->aggstrategy == AGG_HASHED)
{
+ MemoryContext oldContext;
+
build_hash_table(aggstate, outerPlan->plan_width);
aggstate->table_filled = false;
+
+ aggstate->tupdesc = NULL;
+ aggstate->nbatch = choose_nbatch(aggstate, aggstate->hashtable->nbuckets, outerPlan->plan_width);
+ aggstate->nbatch_original = aggstate->nbatch;
+ aggstate->cur_batch = 0;
+
+ /* initialize temporary files for batched tuples/groups */
+ init_batch_files(aggstate);
+
+ /* explain (analyze) counters */
+ aggstate->ntuples_scanned = 0;
+ aggstate->ntuples_rescanned = 0;
+
/* Compute the columns we actually need to hash on */
aggstate->hash_needed = find_hash_columns(aggstate);
+
+ /* prime with initial work item to read from outer plan */
+ oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
+ MemoryContextSwitchTo(oldContext);
}
else
{
@@ -2198,6 +2440,18 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
/* Update numaggs to match number of unique aggregates found */
aggstate->numaggs = aggno + 1;
+ /* check whether we can do batching */
+ aggstate->batching_enabled = batching_supported(aggstate);
+
+ /*
+ * If in hashed mode, with no batching, disable nbuckets_max limit,
+ * because if we're gonna exhaust memory, there's not much
+ * difference between doing that fast and slow. It's equally bad
+ * either way :-/
+ */
+ if ((node->aggstrategy == AGG_HASHED) && (! aggstate->batching_enabled))
+ aggstate->hashtable->nbuckets_max = INT_MAX/2;
+
return aggstate;
}
@@ -2245,6 +2499,10 @@ ExecEndAgg(AggState *node)
/* clean up tuple table */
ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ /* properly close the batch files (in batching mode) */
+ if (node->nbatch != 0)
+ close_batch_files(node);
+
MemoryContextDelete(node->aggcontext);
outerPlan = outerPlanState(node);
@@ -2264,22 +2522,24 @@ ExecReScanAgg(AggState *node)
if (((Agg *) node->ss.ps.plan)->aggstrategy == AGG_HASHED)
{
/*
- * In the hashed case, if we haven't yet built the hash table then we
- * can just return; nothing done yet, so nothing to undo. If subnode's
- * chgParam is not NULL then it will be re-scanned by ExecProcNode,
- * else no reason to re-scan it at all.
+ * In the hashed case, if we haven't done any execution work yet, we
+ * can just return; nothing to undo. If subnode's chgParam is not NULL
+ * then it will be re-scanned by ExecProcNode, else no reason to
+ * re-scan it at all.
*/
- if (!node->table_filled)
- return;
+ // FIXME maybe it was a good idea to have hash_init_state ...
+ // if (node->hash_init_state)
+ // return;
/*
- * If we do have the hash table and the subplan does not have any
- * parameter changes, then we can just rescan the existing hash table;
- * no need to build it again.
+ * If we do have the hash table, it never went to disk, and the
+ * subplan does not have any parameter changes, then we can just
+ * rescan the existing hash table; no need to build it again.
*/
- if (node->ss.ps.lefttree->chgParam == NULL)
+ if (node->ss.ps.lefttree->chgParam == NULL && (node->nbatch == 1))
{
IteratorReset(node->hashtable);
+ node->table_filled = true;
return;
}
}
@@ -2318,10 +2578,43 @@ ExecReScanAgg(AggState *node)
if (((Agg *) node->ss.ps.plan)->aggstrategy == AGG_HASHED)
{
- Plan * outerPlan = outerPlan((Agg *) node->ss.ps.plan);
+ Plan *outerPlan = outerPlan((Agg *) node->ss.ps.plan);
+
/* Rebuild an empty hash table */
build_hash_table(node, outerPlan->plan_width);
node->table_filled = false;
+ node->tupdesc = NULL;
+
+ /* reset the current batch, but remember the nbatch */
+ node->cur_batch = 0;
+
+ /* XXX The way we work with the temporary files right now is that
+ * on rescan we throw them away, and start over. The problem is
+ * that when the rescan triggers after somewhere after the initial
+ * batch and before completing all the batches, we don't know
+ * which groups/tuples were already moved (copied) to the following
+ * batches (so we can't just move them again). Also, we close the
+ * files as soon as we complete reading them.
+ *
+ * We could however improve this by keeping the files open until
+ * ExecEndAgg, and remembering which tuples / groups we've
+ * already moved to the appropriate batch (a batchno/tupleno
+ * pair should be enough), and only move the tuples after that.
+ *
+ * The problem is with the initial batch, which is only in memory
+ * by default. We could serialize this to a file once it's
+ * complete (only the groups, should be less than work_mem),
+ * but that's likely to impact even plans that don't require
+ * the rescan. Not sure if it's know in advance whether a rescan
+ * is likely to happen.
+ */
+
+ /* reinitialize the files with batched tuples/groups */
+ reinit_batch_files(node);
+
+ /* explain (analyze) counters */
+ node->ntuples_scanned = 0;
+ node->ntuples_rescanned = 0;
}
else
{
@@ -2649,9 +2942,6 @@ increase_nbuckets(AggState * aggstate)
/* position within the buffer (up to chunk->used) */
size_t idx = 0;
- /* we have a whole number of entries */
- Assert(chunk->used % htab->entrysize == 0);
-
/* process all tuples stored in this chunk (and then free it) */
while (idx < chunk->used)
{
@@ -2675,6 +2965,120 @@ increase_nbuckets(AggState * aggstate)
}
+
+/*
+ * Increase the number of batches - we'll double it by default, but we
+ * may grow faster if needed. Contrary to increasing the number of
+ * buckets, this needs to remove ~50% of the entries (when doubling
+ * the number of batches).
+ *
+ * We keep the number of buckets etc. because we expect the table to
+ * grow further.
+ */
+static void
+increase_nbatches(AggState * aggstate)
+{
+
+ HashChunk chunk, chunk_prev;
+ AggHashTable htab = aggstate->hashtable;
+ int i, nbatch_old = aggstate->nbatch;
+ MemoryContext oldctx;
+
+ /* remember the old chunks (and reset to NULL, to allocate new ones) */
+ HashChunk oldchunks = htab->chunks_hash;
+ htab->chunks_hash = NULL;
+
+ aggstate->nbatch *= 2;
+ aggstate->hashtable->nentries = 0;
+
+ oldctx = MemoryContextSwitchTo(aggstate->aggcontext);
+
+ /* initialize appropriate number of temporary files */
+ if (aggstate->nbatch == 2)
+ {
+ aggstate->batched_groups = (BufFile**)palloc0(2*sizeof(BufFile*));
+ aggstate->batched_tuples = (BufFile**)palloc0(2*sizeof(BufFile*));
+ }
+ else
+ {
+ aggstate->batched_groups = (BufFile**)repalloc(aggstate->batched_groups, aggstate->nbatch * sizeof(BufFile*));
+ aggstate->batched_tuples = (BufFile**)repalloc(aggstate->batched_tuples, aggstate->nbatch * sizeof(BufFile*));
+ }
+
+ for (i = nbatch_old; i < aggstate->nbatch; i++)
+ {
+ aggstate->batched_groups[i] = BufFileCreateTemp(false);
+ aggstate->batched_tuples[i] = BufFileCreateTemp(false);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+
+ /* reset the buckets (we'll rebuild them from scratch) */
+ memset(htab->buckets, 0, htab->nbuckets * sizeof(AggHashEntry));
+
+ chunk = oldchunks;
+ while (chunk != NULL)
+ {
+
+ /* position within the buffer (up to chunk->used) */
+ size_t idx = 0;
+
+ /* current chunk (so that we can pfree it at the end) */
+ chunk_prev = chunk;
+
+ /* process all tuples stored in this chunk (and then free it) */
+ while (idx < chunk->used)
+ {
+
+ AggHashEntry entry = (AggHashEntry)(chunk->data + idx);
+
+ /* this already uses the updated nbatch values */
+ int batchno = compute_batchno(aggstate, entry->hashvalue);
+
+ Assert(batchno >= aggstate->cur_batch);
+
+ if (batchno == aggstate->cur_batch) {
+
+ /*
+ * If the batch number is still cur_batch, copy it to
+ * a new chunk, and put it into the proper bucket.
+ */
+
+ int bucketno = compute_bucket(aggstate, entry->hashvalue);
+
+ AggHashEntry entry_new = (AggHashEntry)chunk_alloc(htab,
+ htab->entrysize + entry->tuple->t_len);
+
+ memcpy(entry_new, entry, htab->entrysize + entry->tuple->t_len);
+
+ entry_new->tuple = (MinimalTuple)((char*)entry_new + htab->entrysize);
+
+ /* fine, just put the entry into */
+ entry_new->next = htab->buckets[bucketno];
+ htab->buckets[bucketno] = entry_new;
+
+ aggstate->hashtable->nentries += 1;
+
+ } else {
+
+ /* different batch - save the group */
+ save_group(aggstate, batchno, entry);
+
+ }
+
+ /* bytes occupied in memory HJ tuple overhead + actual tuple length */
+ idx += htab->entrysize + entry->tuple->t_len;
+
+ }
+
+ /* proceed to the next chunk */
+ chunk = chunk->next;
+
+ pfree(chunk_prev);
+ }
+
+}
+
static
char * chunk_alloc(AggHashTable htab, int size)
{
@@ -2804,6 +3208,7 @@ AggHashEntry IteratorGetNext(AggHashTable htab)
* So compute how many bytes we need to skip to the next entry.
*/
entry = htab->cur_entry;
+
len = entry->tuple->t_len + htab->entrysize;
/*
@@ -2826,6 +3231,166 @@ AggHashEntry IteratorGetNext(AggHashTable htab)
}
+
+/*
+ * save_tuple
+ *
+ * Not enough memory to add tuple as new entry in hash table. Save for later
+ * in the appropriate partition.
+ */
+static void
+save_tuple(AggState *aggstate, int batchno, TupleTableSlot *slot,
+ uint32 hashvalue)
+{
+ MinimalTuple tuple;
+ BufFile *file;
+ int written;
+
+ Assert(batchno > aggstate->cur_batch);
+
+ file = aggstate->batched_tuples[batchno];
+
+ tuple = ExecFetchSlotMinimalTuple(slot);
+
+ written = BufFileWrite(file, (void *) &hashvalue, sizeof(uint32));
+ if (written != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+
+ written = BufFileWrite(file, (void *) tuple, tuple->t_len);
+ if (written != tuple->t_len)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+}
+
+
+/*
+ * save_group
+ *
+ */
+static void
+save_group(AggState *aggstate, int batchno, AggHashEntry entry)
+{
+ MinimalTuple tuple;
+ BufFile *file;
+ int written;
+
+ Assert(batchno > aggstate->cur_batch);
+
+ file = aggstate->batched_groups[batchno];
+ tuple = entry->tuple;
+
+ written = BufFileWrite(file, (void *) entry, aggstate->hashtable->entrysize);
+ if (written != aggstate->hashtable->entrysize)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+
+ written = BufFileWrite(file, (void *) tuple, tuple->t_len);
+ if (written != tuple->t_len)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+}
+
+
+/*
+ * read_saved_tuple
+ * read the next tuple from a batch file. Return NULL if no more.
+ *
+ * On success, *hashvalue is set to the tuple's hash value, and the tuple
+ * itself is stored in the given slot.
+ *
+ * Copied with minor modifications from ExecHashJoinGetSavedTuple.
+ */
+static TupleTableSlot *
+read_saved_tuple(BufFile *file, uint32 *hashvalue, TupleTableSlot *tupleSlot)
+{
+ uint32 header[2];
+ size_t nread;
+ MinimalTuple tuple;
+
+ /*
+ * Since both the hash value and the MinimalTuple length word are uint32,
+ * we can read them both in one BufFileRead() call without any type
+ * cheating.
+ */
+ nread = BufFileRead(file, (void *) header, sizeof(header));
+ if (nread == 0) /* end of file */
+ {
+ ExecClearTuple(tupleSlot);
+ return NULL;
+ }
+ if (nread != sizeof(header))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+ *hashvalue = header[0];
+ tuple = (MinimalTuple) palloc(header[1]);
+ tuple->t_len = header[1];
+ nread = BufFileRead(file,
+ (void *) ((char *) tuple + sizeof(uint32)),
+ header[1] - sizeof(uint32));
+ if (nread != header[1] - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+ return ExecStoreMinimalTuple(tuple, tupleSlot, true);
+}
+
+/*
+ * read_saved_group
+ */
+AggHashEntry
+read_saved_group(AggState * aggstate, BufFile *file, AggHashEntry entry)
+{
+ uint32 tlen;
+ size_t nread;
+
+ /* we know the size of the entry, we don't know the tuple size yet */
+
+ Assert(entry != NULL);
+
+ /* always release the tuple (well, maybe we could keep track of the
+ * allocated space and reuse the tuple buffer) */
+ if (entry->tuple != NULL)
+ pfree(entry->tuple);
+
+ nread = BufFileRead(file, (void *)entry, aggstate->hashtable->entrysize);
+ if (nread == 0) /* end of file */
+ {
+ pfree(entry);
+ return NULL;
+ }
+
+ if (nread != aggstate->hashtable->entrysize)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+
+ /* now, we need to read the actual tuple - first, read the length */
+ nread = BufFileRead(file, (void *)&tlen, sizeof(uint32));
+ if (nread != sizeof(int32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+
+ /* now, allocate space for the tuple and read the rest */
+ entry->tuple = (MinimalTuple) palloc(tlen);
+ entry->tuple->t_len = tlen;
+ nread = BufFileRead(file,
+ (void *) ((char *) entry->tuple + sizeof(uint32)),
+ tlen - sizeof(uint32));
+ if (nread != tlen - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+
+ return entry;
+}
+
/*
* Resets the contents of the hash table - removes all the entries and
* tuples, but keeps the 'size' of the hash table (nbuckets).
@@ -2858,3 +3423,114 @@ void reset_hash_table(AggHashTable htab) {
htab->nbuckets * sizeof(AggHashEntry));
}
+
+static uint32 compute_batchno(AggState * aggstate, uint32 hashvalue) {
+ if ((! aggstate->batching_enabled) || (aggstate->nbatch == 1))
+ return 0;
+ else
+ /*
+ * When computing the batch number, skip the bits that might be
+ * used for buckets.
+ *
+ * XXX We should probably make sure that we don't exceed the 32
+ * bits we have available in the hash. This is pretty much
+ * the same issue as in hash join.
+ */
+ return (hashvalue >> aggstate->hashtable->nbuckets_bits) & (aggstate->nbatch - 1);
+}
+
+static bool
+batching_supported(AggState * aggstate)
+{
+ int aggno;
+
+ /* check that all the aggregates use state passed by value */
+ for (aggno = 0; aggno < aggstate->numaggs; aggno++)
+ if (! aggstate->peragg[aggno].transtypeByVal)
+ return false;
+
+ return true;
+}
+
+static void
+init_batch_files(AggState * aggstate)
+{
+
+ int i;
+
+ if (aggstate->nbatch > 1) {
+ aggstate->batched_groups = (BufFile**)palloc0(aggstate->nbatch * sizeof(BufFile*));
+ aggstate->batched_tuples = (BufFile**)palloc0(aggstate->nbatch * sizeof(BufFile*));
+ }
+
+ for (i = 1; i < aggstate->nbatch; i++) {
+ aggstate->batched_groups[i] = BufFileCreateTemp(false);
+ aggstate->batched_tuples[i] = BufFileCreateTemp(false);
+ }
+
+}
+
+static void
+close_batch_files(AggState * aggstate)
+{
+ int i;
+
+ for (i = 1; i < aggstate->nbatch; i++) {
+
+ /* if we're halfway through the batches, the files might be
+ * already closed (and set to NULL) */
+ if (aggstate->batched_groups[i] != NULL)
+ BufFileClose(aggstate->batched_groups[i]);
+
+ if (aggstate->batched_tuples[i] != NULL)
+ BufFileClose(aggstate->batched_tuples[i]);
+
+ aggstate->batched_groups[i] = NULL;
+ aggstate->batched_tuples[i] = NULL;
+
+ }
+}
+
+static void
+reinit_batch_files(AggState * aggstate)
+{
+ int i;
+
+ /* make sure all the files are properly closed */
+ close_batch_files(aggstate);
+
+ /* reinit all the files (skip the first one, which is batchno=0) */
+ for (i = 1; i < aggstate->nbatch; i++)
+ {
+ aggstate->batched_groups[i] = BufFileCreateTemp(false);
+ aggstate->batched_tuples[i] = BufFileCreateTemp(false);
+ }
+}
+
+static int
+choose_nbatch(AggState *aggstate, int nbuckets, Size tuple_width)
+{
+ Agg *node = (Agg *) aggstate->ss.ps.plan;
+
+ /* space used by the group (includes bucket) */
+ Size groupsize, bucketssize, groupssize;
+ int nbatch = 1;
+
+ Assert(node->aggstrategy == AGG_HASHED);
+ Assert(node->numGroups > 0);
+
+ /* XXX see how build_hash_table estimates entrysize and groupsize */
+ groupsize = MAXALIGN(sizeof(AggHashEntryData) +
+ (aggstate->numaggs - 1) * sizeof(AggStatePerGroupData)) +
+ + MAXALIGN(sizeof(MinimalTupleData)) + tuple_width;
+
+ bucketssize = nbuckets * sizeof(AggHashEntry);
+ groupssize = groupsize * node->numGroups;
+
+ /* double nbatch until we're expected to fit in work_mem */
+ while (groupssize / nbatch + bucketssize >= work_mem * 1024L)
+ nbatch *= 2;
+
+ return nbatch;
+
+}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 0cdb790..926abad 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -113,6 +113,7 @@ bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
bool enable_hashagg = true;
+bool enable_hashagg_disk = true;
bool enable_nestloop = true;
bool enable_material = true;
bool enable_mergejoin = true;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e1480cd..7b8135d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2741,7 +2741,8 @@ choose_hashed_grouping(PlannerInfo *root,
/* plus the per-hash-entry overhead */
hashentrysize += hash_agg_entry_size(agg_costs->numAggs);
- if (hashentrysize * dNumGroups > work_mem * 1024L)
+ if (!enable_hashagg_disk &&
+ hashentrysize * dNumGroups > work_mem * 1024L)
return false;
/*
@@ -2907,7 +2908,8 @@ choose_hashed_distinct(PlannerInfo *root,
/* plus the per-hash-entry overhead */
hashentrysize += hash_agg_entry_size(0);
- if (hashentrysize * dNumDistinctRows > work_mem * 1024L)
+ if (!enable_hashagg_disk &&
+ hashentrysize * dNumDistinctRows > work_mem * 1024L)
return false;
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8c57803..5128e20 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -749,6 +749,15 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
+ {"enable_hashagg_disk", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of disk-based hashed aggregation plans."),
+ NULL
+ },
+ &enable_hashagg_disk,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of materialization."),
NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index df98b02..8f5b73b 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -266,6 +266,7 @@
#enable_bitmapscan = on
#enable_hashagg = on
+#enable_hashagg_disk = on
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index a70b296..97034f1 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -634,7 +634,7 @@ MemoryContextCreate(NodeTag tag, Size size,
*/
if (track_mem)
{
- node->accounting = (MemoryAccounting)MemoryContextAlloc(TopMemoryContext,
+ node->accounting = (MemoryAccounting)MemoryContextAllocZero(TopMemoryContext,
sizeof(MemoryAccountingData));
if (parent)
node->accounting->parent = parent->accounting;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 995389b..f2286fe 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -22,7 +22,7 @@
#include "utils/reltrigger.h"
#include "utils/sortsupport.h"
#include "utils/tuplestore.h"
-
+#include "storage/buffile.h"
/* ----------------
* IndexInfo information
@@ -1725,6 +1725,31 @@ typedef struct AggState
List *hash_needed; /* list of columns needed in hash table */
bool table_filled; /* hash table filled yet? */
AggHashTable hashtable; /* instance of the simple hash table */
+ TupleDesc tupdesc;
+
+ /* simple batching */
+ bool batching_enabled; /* can we serialize group states? */
+ int nbatch; /* number of batches */
+ int nbatch_bits; /* bits from the hash */
+ int nbatch_original; /* number of batches (original) */
+ int cur_batch; /* current batch (up to nbatch-1) */
+
+ /* temporary files with data for further batches */
+ BufFile **batched_groups; /* serialized aggregate states */
+ BufFile **batched_tuples; /* serialized tuples */
+
+ /* counters for explain (analyze) */
+ Size batch_min_size; /* minimum batch size (bytes) */
+ Size batch_max_size; /* maximum batch size (bytes) */
+
+ /*
+ * These two counters allow evaluation of how many times the tuples
+ * were saved/read. With no batching, rescanned=0. With a single
+ * level of batching (rescanned/scanned < 1.00) and with multi-level
+ * batching it may happen that (rescanned/scanned > 1.00).
+ */
+ int64 ntuples_scanned; /* number of input tuples scanned */
+ int64 ntuples_rescanned; /* number of tuples saved/read */
} AggState;
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 75e2afb..d363e65 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -57,6 +57,7 @@ extern bool enable_bitmapscan;
extern bool enable_tidscan;
extern bool enable_sort;
extern bool enable_hashagg;
+extern bool enable_hashagg_disk;
extern bool enable_nestloop;
extern bool enable_material;
extern bool enable_mergejoin;
diff --git a/src/test/regress/expected/rangefuncs.out b/src/test/regress/expected/rangefuncs.out
index 774e75e..e88c83c 100644
--- a/src/test/regress/expected/rangefuncs.out
+++ b/src/test/regress/expected/rangefuncs.out
@@ -3,6 +3,7 @@ SELECT name, setting FROM pg_settings WHERE name LIKE 'enable%';
----------------------+---------
enable_bitmapscan | on
enable_hashagg | on
+ enable_hashagg_disk | on
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
@@ -12,7 +13,7 @@ SELECT name, setting FROM pg_settings WHERE name LIKE 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(11 rows)
+(12 rows)
CREATE TABLE foo2(fooid int, f2 int);
INSERT INTO foo2 VALUES(1, 11);
On 4.9.2014 00:42, Tomas Vondra wrote:
Attached are two CSV files containing both raw results (4 runs per query)
and aggregated results (average of the runs), along with complete logs
and explain (analyze) plans of the queries for inspection.
Of course, I forgot to attach the CSV files ... here they are.
Tomas
On 20.8.2014 20:32, Robert Haas wrote:
On Sun, Aug 17, 2014 at 1:17 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
Being able to batch inner and outer relations in a matching way is
certainly one of the reasons why hashjoin uses that particular scheme.
There are other reasons, though - for example being able to answer 'Does
this group belong to this batch?' quickly, and automatically update
number of batches.

I'm not saying the lookup is extremely costly, but I'd be very surprised
if it was as cheap as modulo on a 32-bit integer. Not saying it's the
dominant cost here, but memory bandwidth is quickly becoming one of the
main bottlenecks.

Well, I think you're certainly right that a hash table lookup is more
expensive than modulo on a 32-bit integer; so much is obvious. But if
the load factor is not too large, I think that it's not a *lot* more
expensive, so it could be worth it if it gives us other advantages.
Yes, that may be true. I'm not opposed to Jeff's approach in general -
it's certainly a nice solution for cases with fixed size of the
aggregate states.
But I still don't see how it could handle the aggregates with growing
aggregate state (which is the case that troubles me, because that's what
we see in our workloads).
As I see it, the advantage of Jeff's approach is that it doesn't
really matter whether our estimates are accurate or not. We don't
have to decide at the beginning how many batches to do, and then
possibly end up using too much or too little memory per batch if we're
wrong; we can let the amount of memory actually used during execution
determine the number of batches. That seems good. Of course, a hash
Yes. I think that maybe we could use Jeff's approach even for 'growing
aggregate state' case, assuming we can serialize the aggregate states
and release the memory properly.
First, the problem with the current hash table used in HashAggregate
(i.e. dynahash) is that it never actually frees memory - when you do
HASH_REMOVE it only moves it to a list of entries for future use.
Imagine a workload where you initially see only 1 tuple for each group
before work_mem gets full. At that point you stop adding new groups, but
the current ones will grow. Even if you know how to serialize the
aggregate states (which we don't), you're in trouble because the initial
state is small (only 1 tuple was passed to the group) and most of the
memory is stuck in dynahash.
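To illustrate that last point - a minimal sketch only, where hashcxt, hashtable and key are placeholders rather than names from any patch, and which relies on the MemoryContextGetAllocated() call from the memory-accounting patch used elsewhere in this thread:

    Size    before,
            after;
    bool    found;

    before = MemoryContextGetAllocated(hashcxt, true);

    /* "evict" one group by removing its entry from the dynahash table */
    hash_search(hashtable->hashtab, &key, HASH_REMOVE, &found);

    after = MemoryContextGetAllocated(hashcxt, true);

    /*
     * before == after: the removed entry only moved to the table's
     * internal freelist, so the space can be reused by this hash table
     * but is not returned until hashcxt itself is reset or destroyed.
     */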
join can increase the number of batches on the fly, but only by
doubling it, so you might go from 4 batches to 8 when 5 would really
have been enough. And a hash join also can't *reduce* the number of
batches on the fly, which might matter a lot. Getting the number of
batches right avoids I/O, which is a lot more expensive than CPU.
Regarding the estimates, I don't see much difference between the two
approaches when handling this issue.
It's true you can wait with deciding how many partitions (aka batches)
to create until work_mem is full, at which point you have more
information than at the very beginning. You know how many tuples you've
already seen, how many tuples you expect (which is however only an
estimate etc.). And you may use that to estimate the number of
partitions to create.
That however comes at a cost - it's not really a memory-bounded hash
aggregate, because you explicitly allow exceeding work_mem as more
tuples for existing groups arrive.
Also, no one really says the initial estimate of how many tuples will be
aggregated is correct. It's about as unreliable as the group count
estimate. So how exactly are you going to estimate the partitions?
Considering this, I doubt that being able to choose an arbitrary number of
partitions (instead of only powers of 2) is really an advantage.
Reducing the number of partitions might matter, but in my experience
most estimation errors are underestimations, because we assume
independence where in practice columns are dependent, etc.
I agree that getting the batches right is important, but OTOH with hash
joins, using more, smaller batches is often significantly faster than
using one large one. So it depends.
What I think we should prevent is under-estimating the number of batches,
because in that case you have to read the whole batch, write part of it
out again and then read it again, instead of just writing it once (into
two files). Reading a tuple from a batch only to write it to another
batch is not really efficient.
But the situation here isn't comparable, because there's only one
input stream. I'm pretty sure we'll want to keep track of which
transition states we've spilled due to lack of memory as opposed to
those which were never present in the table at all, so that we can
segregate the unprocessed tuples that pertain to spilled transition
states from the ones that pertain to a group we haven't begun yet.

Why would that be necessary or useful? I don't see a reason for tracking
that / segregating the tuples.

Suppose there are going to be three groups: A, B, C. Each is an
array_agg(), and they're big, so only one of them will fit in work_mem at
a time. However, we don't know that at the beginning, either because
we don't write the code to try or because we do write that code but
our cardinality estimates are way off; instead, we're under the
impression that all three will fit in work_mem. So we start reading
tuples. We see values for A and B, but we don't see any values for C
because those all occur later in the input. Eventually, we run short
of memory and cut off creation of new groups. Any tuples for C are
now going to get written to a tape from which we'll later reread them.
After a while, even that proves insufficient and we spill the
transition state for B to disk. Any further tuples that show up for C
will need to be written to tape as well. We continue processing and
finish group A.

Now it's time to do batch #2. Presumably, we begin by reloading the
serialized transition state for group B. To finish group B, we must
look at all the tuples that might possibly fall in that group. If all
of the remaining tuples are on a single tape, we'll have to read all
the tuples in group B *and* all the tuples in group C; we'll
presumably rewrite the tuples that are not part of this batch onto a
new tape, which we'll then process in batch #3. But if we took
advantage of the first pass through the input to put the tuples for
group B on one tape and the tuples for group C on another tape, we can
be much more efficient - just read the remaining tuples for group B,
not mixed with anything else, and then read a separate tape for group
C.
OK, I understand the idea. However I don't think it makes much sense to
segregate every little group - that's a perfect fit for batching.
What might be worth segregating are exceptionally large groups, because
those are what makes batching inefficient - for example, when a group is
larger than work_mem, it will result in a batch per group (even if the
remaining groups are tiny). But we have no way to identify such a group,
because we have no way to determine the size of the state.
What we might do is assume that the size is proportional to the number of
tuples, and segregate only the largest groups. This can easily be done
with hashjoin-like batching - adding ntuples, isSegregated and
skewBatchId to the AggHashEntry. The entry would then act as a placeholder
(only the hash entry will be stored in the batch, but the actual state
etc. will be stored separately). This is a bit similar to how hashjoin
handles skew buckets.
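A minimal sketch of what that might look like, using roughly the fields AggHashEntryData already has in the patch above plus the hypothetical skew-tracking fields (names and layout are illustrative only, not from any posted patch):

    typedef struct AggHashEntryData
    {
        AggHashEntry    next;           /* hash chain */
        uint32          hashvalue;      /* hash of the grouping keys */
        MinimalTuple    tuple;          /* grouping keys */

        /* hypothetical additions for segregating large (skew) groups */
        int64           ntuples;        /* tuples seen for this group */
        bool            isSegregated;   /* treat as a skew group? */
        int             skewBatchId;    /* where the real state lives */

        AggStatePerGroupData pergroup[1];   /* VARIABLE LENGTH ARRAY */
    } AggHashEntryData;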
It's true that Jeff's approach handles this somewhat better, but at the
cost of not really bounding the memory consumed by HashAggregate.
Tomas
On 4.9.2014 01:34, Tomas Vondra wrote:
On 20.8.2014 20:32, Robert Haas wrote:
As I see it, the advantage of Jeff's approach is that it doesn't
really matter whether our estimates are accurate or not. We don't
have to decide at the beginning how many batches to do, and then
possibly end up using too much or too little memory per batch if we're
wrong; we can let the amount of memory actually used during execution
determine the number of batches. That seems good. Of course, a hash
Also, you don't actually have to decide the number of batches at the
very beginning. You can start with nbatch=1 and decide how many
batches to use when work_mem is reached, i.e. at exactly the same
moment / using the same amount of info as with Jeff's approach. No?
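A sketch of what that decision could look like at the moment work_mem is first exceeded - loosely modelled on choose_nbatch() from the patch earlier in this thread, but with invented names and no overflow/limit handling; it is not code from any posted patch:

    static int
    choose_nbatch_at_spill(int64 ngroups_seen, int64 ntuples_seen,
                           double ntuples_expected, Size mem_used)
    {
        /* observed (not estimated) average memory per group */
        double  bytes_per_group = (double) mem_used / ngroups_seen;

        /* scale the groups seen so far by the fraction of input consumed */
        double  ngroups_expected =
            ngroups_seen * (ntuples_expected / ntuples_seen);

        int     nbatch = 1;

        /* double nbatch until one batch is expected to fit in work_mem */
        while (ngroups_expected * bytes_per_group / nbatch > work_mem * 1024.0)
            nbatch *= 2;

        return nbatch;
    }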
Tomas
On Wed, Sep 3, 2014 at 7:34 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
Well, I think you're certainly right that a hash table lookup is more
expensive than modulo on a 32-bit integer; so much is obvious. But if
join can increase the number of batches on the fly, but only by
doubling it, so you might go from 4 batches to 8 when 5 would really
have been enough. And a hash join also can't *reduce* the number of
batches on the fly, which might matter a lot. Getting the number of
batches right avoids I/O, which is a lot more expensive than CPU.

Regarding the estimates, I don't see much difference between the two
approaches when handling this issue.

It's true you can wait with deciding how many partitions (aka batches)
to create until work_mem is full, at which point you have more
information than at the very beginning. You know how many tuples you've
already seen, how many tuples you expect (which is however only an
estimate etc.). And you may use that to estimate the number of
partitions to create.
I think it's significantly better than that. The first point I'd make
is that if work_mem never fills up, you don't need to batch anything
at all. That's a potentially huge win over batching a join we thought
was going to overrun work_mem, but didn't.
But even if work_mem does fill up, I think we still come out ahead,
because we don't necessarily need to dump the *entirety* of each batch
to disk. For example, suppose there are 900 distinct values and only
300 of them can fit in memory at a time. We read the input until
work_mem is full and we see a previously-unseen value, so we decide to
split the input up into 4 batches. We now finish reading the input.
Each previously-seen value gets added to an existing in-memory group,
and each new value gets written into one of four disk files. At
the end of the input, 300 groups are complete, and we have four files
on disk each of which contains the data for 150 of the remaining 600
groups.
Now, the alternative strategy is to batch from the beginning. Here,
we decide right from the get-go that we're using 4 batches, so batch
#1 goes into memory and the remaining 3 batches get written to three
different disk files. At the end of the input, 225 groups are
complete, and we have three files on disk each of which contains the
data for 225 of the remaining 675 groups. This seems clearly
inferior, because we have written 675 groups to disk when it would
have been possible to write only 600.
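The same comparison, tabulated (these are just the numbers from the two paragraphs above):

    spill on overflow:    300 groups finished in memory,
                          600 groups spilled across 4 files (~150 each)
    batch from the start: 900 / 4 = 225 groups finished in memory,
                          675 groups spilled across 3 files (225 each)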
The gains can be even more significant when the input data is skewed.
For example, suppose things are as above, but ten values account for
90% of all the inputs, and the remaining 890 values account for the
other 10% of the inputs. Furthermore, let's suppose we have no table
statistics or they are totally wrong. In Jeff's approach, as long as
each of those values occurs at least once before work_mem fills up,
they'll all be processed in the initial pass through the data, which
means we will write at most 10% of the data to disk. In fact it will
be a little bit less, because batch 1 will have not only the 10
frequently-occurring values but also 290 others, so our initial pass
through the data will complete 300 groups covering (if the
less-frequent values occur with uniform frequency) 93.258% of the
data. The remaining ~6.8% will be split up into 4 files which we can
then reread and process. But if we use the other approach, we'll only
get 2 or 3 of the 10 commonly-occurring values in the first batch, so
we expect to write about 75% of the data out to one of our three batch
files. That's a BIG difference - more than 10x the I/O load that
Jeff's approach would have incurred. Now, admittedly, we could use a
skew optimization similar to the one we use for hash joins to try to
get the MCVs into the first batch, and that would help a lot when the
statistics are right - but sometimes the statistics are wrong, and
Jeff's approach doesn't care. It just keeps on working.
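For reference, the 93.258% figure works out as follows, assuming (as stated above) that the 890 less-frequent values share the remaining 10% of the input uniformly:

    10 frequent values in batch 1   ->  90% of the input
    290 of the 890 other values     ->  (290 / 890) * 10% = 3.258%
    total handled in the first pass ->  93.258%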
That however comes at a cost - it's not really a memory-bounded hash
aggregate, because you explicitly allow exceeding work_mem as more
tuples for existing groups arrive.
Well, that would be true for now, but as has been mentioned, we can
add new methods to the aggregate infrastructure to serialize and
de-serialize transition states. I guess I agree that, in the absence
of such infrastructure, your patch might be a better way to handle
cases like array_agg, but I'm pretty happy to see that infrastructure
get added.
Hmm. It occurs to me that it could also be really good to add a
"merge transition states" operator to the aggregate infrastructure.
That would allow further improvements to Jeff's approach for cases
like array_agg. If we serialize a transition state to disk because
it's not fitting in memory, we don't need to reload it before
continuing to process the group, or at least not right away. We can
instead just start a new transitions state and then merge all of the
accumulated states at the end of the hash join. That's good, because
it means we're not using up precious work_mem for transition state
data that really isn't needed until it's time to start finalizing
groups. And it would be useful for parallelism eventually, too. :-)
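For a simple case such as the average of integers, whose transition state is essentially a (count, sum) pair, a merge operator of that kind is trivial. A sketch with invented type and function names - not an actual API proposal:

    typedef struct AvgTransState
    {
        int64   count;      /* number of input values accumulated */
        int64   sum;        /* running sum of those values */
    } AvgTransState;

    /*
     * Merge two partial transition states for the same group.  With such
     * an operator, a state spilled to disk would not need to be reloaded
     * right away: a fresh state can be accumulated in memory and the two
     * merged when the group is finalized.
     */
    static void
    avg_merge_states(AvgTransState *dst, const AvgTransState *src)
    {
        dst->count += src->count;
        dst->sum += src->sum;
    }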
Also, no one really says the initial estimate of how many tuples will be
aggregated is correct. It's about as unreliable as the group count
estimate. So how exactly are you going to estimate the partitions?

Considering this, I doubt that being able to choose an arbitrary number of
partitions (instead of only powers of 2) is really an advantage.
You're right. I was using the terminology in an imprecise and
misleading way. What I meant was more along the lines of what's in
the first four paragraphs of this email - namely, that with Jeff's
approach, it seems that you can be certain of using all the memory you
have available on the first pass through, whereas with your approach
there seems to be a risk of dumping data to disk that could have been
kept in memory and processed. Also, it's very likely that all of the
frequently-occurring values will get handled in the initial pass.
To put this another way, and I think we all agree on this, I think we
should be very concerned with minimizing the number of times the data
gets rewritten. If the data doesn't fit in memory, we're going to
have to rewrite at least some of it. But the algorithm we choose
could cause us to rewrite more of it than necessary, and that's bad.
What I think we should prevent is under-estimating the number of batches,
because in that case you have to read the whole batch, write part of it
out again and then read it again, instead of just writing it once (into
two files). Reading a tuple from a batch only to write it to another
batch is not really efficient.
Completely agreed. Choosing a partition count that is higher than
necessary doesn't hurt much. The expensive part is spilling the
tuples to disk for processing in a future batch rather than processing
them immediately. Once we've decided we're going to do that one way
or the other, the cost of distributing the tuples we decide to write
among (say) 16 tapes vs. 4 tapes is probably relatively small. (At
some point this breaks down; 1024 tapes will overflow the FD table.)
But picking a partition count that is too low could be extremely
expensive, in that, as you say, we'd need to rewrite the data a second
time.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, 2014-08-10 at 14:26 -0700, Jeff Davis wrote:
This patch is requires the Memory Accounting patch, or something similar
to track memory usage.The attached patch enables hashagg to spill to disk, which means that
hashagg will contain itself to work_mem even if the planner makes a
bad misestimate of the cardinality.
New patch attached. All open items are complete, though the patch may
have a few rough edges.
Summary of changes:
* rebased on top of latest memory accounting patch
/messages/by-id/1417497257.5584.5.camel@jeff-desktop
* added a flag to hash_create to prevent it from creating an extra
level of memory context
- without this, the memory accounting would have a measurable impact
on performance
* cost model for the disk usage
* intelligently choose the number of partitions for each pass of the
data
* explain support
* in build_hash_table(), be more intelligent about the value of
nbuckets to pass to BuildTupleHashTable()
- BuildTupleHashTable tries to choose a value to keep the table in
work_mem, but it isn't very accurate.
* some very rudimentary testing (sanity checks, really) shows good
results
Summary of previous discussion (my summary; I may have missed some
points):
Tom Lane requested that the patch also handle the case where transition
values grow (e.g. array_agg) beyond work_mem. I feel this patch provides
a lot of benefit as it is, and trying to handle that case would be a lot
more work (we need a way to write the transition values out to disk at a
minimum, and perhaps combine them with other transition values). I also
don't think my patch would interfere with a fix there in the future.
Tomas Vondra suggested an alternative design that more closely resembles
HashJoin: instead of filling up the hash table and then spilling any new
groups, the idea would be to split the current data into two partitions,
keep one in the hash table, and spill the other (see
ExecHashIncreaseNumBatches()). This has the advantage that it's very
fast to identify whether the tuple is part of the in-memory batch or
not; and we can avoid even looking in the memory hashtable if not.
The batch-splitting approach has a major downside, however: you are
likely to evict a skew value from the in-memory batch, which will result
in all subsequent tuples with that skew value going to disk. My approach
never evicts from the in-memory table until we actually finalize the
groups, so the skew values are likely to be completely processed in the
first pass.
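In rough pseudo-C, the per-tuple behaviour being contrasted here is the following; the helper names are invented for illustration only, and the real logic lives in agg_fill_hash_table() and lookup_hash_entry() in the attached patch:

    /* for each tuple of the current pass */
    entry = lookup_existing_group(hashtable, hashvalue, slot);

    if (entry != NULL)
    {
        /* group already in memory (including skew groups seen early):
         * keep aggregating, it is never evicted */
        advance_aggregates(aggstate, entry->pergroup);
    }
    else if (!hash_table_exceeds_work_mem(aggstate))
    {
        /* still under work_mem: create the group and aggregate into it */
        entry = create_new_group(hashtable, hashvalue, slot);
        advance_aggregates(aggstate, entry->pergroup);
    }
    else
    {
        /* no room for new groups: write the tuple to the output
         * partition chosen by its hash, for a later pass */
        save_tuple_to_partition(aggstate, hashvalue, slot);
    }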
So, the attached patch implements my original approach, which I still
feel is the best solution.
Regards,
Jeff Davis
Attachments:
hashagg-disk-20141211.patch (text/x-patch; charset=UTF-8)
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 3017,3022 **** include_dir 'conf.d'
--- 3017,3037 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-hashagg-disk" xreflabel="enable_hashagg_disk">
+ <term><varname>enable_hashagg_disk</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_hashagg_disk</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the planner expects the hash table size to exceed
+ <varname>work_mem</varname>. The default is <literal>on</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
<term><varname>enable_hashjoin</varname> (<type>boolean</type>)
<indexterm>
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
***************
*** 86,91 **** static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
--- 86,92 ----
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
+ static void show_hashagg_info(AggState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
static void show_instrumentation_count(const char *qlabel, int which,
***************
*** 1423,1428 **** ExplainNode(PlanState *planstate, List *ancestors,
--- 1424,1430 ----
case T_Agg:
show_agg_keys((AggState *) planstate, ancestors, es);
show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+ show_hashagg_info((AggState *) planstate, es);
if (plan->qual)
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
***************
*** 1913,1918 **** show_sort_info(SortState *sortstate, ExplainState *es)
--- 1915,1956 ----
}
/*
+ * Show information on hash aggregate buckets and batches
+ */
+ static void
+ show_hashagg_info(AggState *aggstate, ExplainState *es)
+ {
+ Agg *agg = (Agg *)aggstate->ss.ps.plan;
+
+ Assert(IsA(aggstate, AggState));
+
+ if (agg->aggstrategy != AGG_HASHED)
+ return;
+
+ if (!aggstate->hash_init_state)
+ {
+ long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+ long diskKb = (aggstate->hash_disk + 1023) / 1024;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(
+ es->str,
+ "Batches: %d Memory Usage: %ldkB Disk Usage:%ldkB\n",
+ aggstate->hash_num_batches, memPeakKb, diskKb);
+ }
+ else
+ {
+ ExplainPropertyLong("HashAgg Batches",
+ aggstate->hash_num_batches, es);
+ ExplainPropertyLong("Peak Memory Usage", memPeakKb, es);
+ ExplainPropertyLong("Disk Usage", diskKb, es);
+ }
+ }
+ }
+
+ /*
* Show information on hash buckets/batches.
*/
static void
*** a/src/backend/executor/execGrouping.c
--- b/src/backend/executor/execGrouping.c
***************
*** 310,316 **** BuildTupleHashTable(int numCols, AttrNumber *keyColIdx,
hash_ctl.hcxt = tablecxt;
hashtable->hashtab = hash_create("TupleHashTable", nbuckets,
&hash_ctl,
! HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
return hashtable;
}
--- 310,317 ----
hash_ctl.hcxt = tablecxt;
hashtable->hashtab = hash_create("TupleHashTable", nbuckets,
&hash_ctl,
! HASH_ELEM | HASH_FUNCTION | HASH_COMPARE |
! HASH_CONTEXT | HASH_NOCHILDCXT);
return hashtable;
}
***************
*** 331,336 **** TupleHashEntry
--- 332,386 ----
LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
bool *isnew)
{
+ uint32 hashvalue;
+
+ hashvalue = TupleHashEntryHash(hashtable, slot);
+ return LookupTupleHashEntryHash(hashtable, slot, hashvalue, isnew);
+ }
+
+ /*
+ * TupleHashEntryHash
+ *
+ * Calculate the hash value of the tuple.
+ */
+ uint32
+ TupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot)
+ {
+ TupleHashEntryData dummy;
+ TupleHashTable saveCurHT;
+ uint32 hashvalue;
+
+ /*
+ * Set up data needed by hash function.
+ *
+ * We save and restore CurTupleHashTable just in case someone manages to
+ * invoke this code re-entrantly.
+ */
+ hashtable->inputslot = slot;
+ hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ hashtable->cur_eq_funcs = hashtable->tab_eq_funcs;
+
+ saveCurHT = CurTupleHashTable;
+ CurTupleHashTable = hashtable;
+
+ dummy.firstTuple = NULL; /* flag to reference inputslot */
+ hashvalue = TupleHashTableHash(&dummy, sizeof(TupleHashEntryData));
+
+ CurTupleHashTable = saveCurHT;
+
+ return hashvalue;
+ }
+
+ /*
+ * LookupTupleHashEntryHash
+ *
+ * Like LookupTupleHashEntry, but allows the caller to specify the tuple's
+ * hash value, to avoid recalculating it.
+ */
+ TupleHashEntry
+ LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot,
+ uint32 hashvalue, bool *isnew)
+ {
TupleHashEntry entry;
MemoryContext oldContext;
TupleHashTable saveCurHT;
***************
*** 371,380 **** LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
/* Search the hash table */
dummy.firstTuple = NULL; /* flag to reference inputslot */
! entry = (TupleHashEntry) hash_search(hashtable->hashtab,
! &dummy,
! isnew ? HASH_ENTER : HASH_FIND,
! &found);
if (isnew)
{
--- 421,429 ----
/* Search the hash table */
dummy.firstTuple = NULL; /* flag to reference inputslot */
! entry = (TupleHashEntry) hash_search_with_hash_value(
! hashtable->hashtab, &dummy, hashvalue, isnew ? HASH_ENTER : HASH_FIND,
! &found);
if (isnew)
{
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
***************
*** 96,101 ****
--- 96,103 ----
#include "postgres.h"
+ #include <math.h>
+
#include "access/htup_details.h"
#include "catalog/objectaccess.h"
#include "catalog/pg_aggregate.h"
***************
*** 108,121 ****
--- 110,127 ----
#include "optimizer/tlist.h"
#include "parser/parse_agg.h"
#include "parser/parse_coerce.h"
+ #include "storage/buffile.h"
#include "utils/acl.h"
#include "utils/builtins.h"
+ #include "utils/dynahash.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
#include "utils/datum.h"
+ #define HASH_DISK_MIN_PARTITIONS 1
+ #define HASH_DISK_MAX_PARTITIONS 256
/*
* AggStatePerAggData - per-aggregate working state for the Agg scan
***************
*** 301,306 **** typedef struct AggHashEntryData
--- 307,323 ----
AggStatePerGroupData pergroup[1]; /* VARIABLE LENGTH ARRAY */
} AggHashEntryData; /* VARIABLE LENGTH STRUCT */
+ typedef struct HashWork
+ {
+ BufFile *input_file; /* input partition, NULL for outer plan */
+ int input_bits; /* number of bits for input partition mask */
+ int64 input_groups; /* estimated number of input groups */
+
+ int n_output_partitions; /* number of output partitions */
+ BufFile **output_partitions; /* output partition files */
+ int64 *output_ntuples; /* number of tuples in each partition */
+ int output_bits; /* log2(n_output_partitions) + input_bits */
+ } HashWork;
static void initialize_aggregates(AggState *aggstate,
AggStatePerAgg peragg,
***************
*** 321,331 **** static void finalize_aggregate(AggState *aggstate,
Datum *resultVal, bool *resultIsNull);
static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
! static void build_hash_table(AggState *aggstate);
! static AggHashEntry lookup_hash_entry(AggState *aggstate,
! TupleTableSlot *inputslot);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
! static void agg_fill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
--- 338,352 ----
Datum *resultVal, bool *resultIsNull);
static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
! static void build_hash_table(AggState *aggstate, long nbuckets);
! static AggHashEntry lookup_hash_entry(AggState *aggstate, HashWork *work,
! TupleTableSlot *inputslot, uint32 hashvalue);
! static HashWork *hash_work(BufFile *input_file, int64 input_groups,
! int input_bits);
! static void save_tuple(AggState *aggstate, HashWork *work,
! TupleTableSlot *slot, uint32 hashvalue);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
! static bool agg_fill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
***************
*** 923,942 **** find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
}
/*
* Initialize the hash table to empty.
*
* The hash table always lives in the aggcontext memory context.
*/
static void
! build_hash_table(AggState *aggstate)
{
Agg *node = (Agg *) aggstate->ss.ps.plan;
MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
Size entrysize;
Assert(node->aggstrategy == AGG_HASHED);
Assert(node->numGroups > 0);
entrysize = sizeof(AggHashEntryData) +
(aggstate->numaggs - 1) * sizeof(AggStatePerGroupData);
--- 944,989 ----
}
/*
+ * Estimate all memory used by a group in the hash table.
+ */
+ Size
+ hash_group_size(int numAggs, int inputWidth, Size transitionSpace)
+ {
+ Size size;
+
+ /* tuple overhead */
+ size = MAXALIGN(sizeof(MinimalTupleData));
+ /* group key */
+ size += MAXALIGN(inputWidth);
+ /* hash table overhead */
+ size += hash_agg_entry_size(numAggs);
+ /* by-ref transition space */
+ size += transitionSpace;
+
+ return size;
+ }
+
+ /*
* Initialize the hash table to empty.
*
* The hash table always lives in the aggcontext memory context.
*/
static void
! build_hash_table(AggState *aggstate, long nbuckets)
{
Agg *node = (Agg *) aggstate->ss.ps.plan;
MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
Size entrysize;
+ Size hash_group_mem = hash_group_size(aggstate->numaggs,
+ node->plan_width,
+ node->transitionSpace);
Assert(node->aggstrategy == AGG_HASHED);
Assert(node->numGroups > 0);
+ /* don't exceed work_mem */
+ nbuckets = Min(nbuckets, (long) ((work_mem * 1024L) / hash_group_mem));
+
entrysize = sizeof(AggHashEntryData) +
(aggstate->numaggs - 1) * sizeof(AggStatePerGroupData);
***************
*** 944,953 **** build_hash_table(AggState *aggstate)
node->grpColIdx,
aggstate->eqfunctions,
aggstate->hashfunctions,
! node->numGroups,
entrysize,
! aggstate->aggcontext,
tmpmem);
}
/*
--- 991,1006 ----
node->grpColIdx,
aggstate->eqfunctions,
aggstate->hashfunctions,
! nbuckets,
entrysize,
! aggstate->hashcontext,
tmpmem);
+
+ aggstate->hash_mem_min = MemoryContextMemAllocated(
+ aggstate->hashcontext, true);
+
+ if (aggstate->hash_mem_min > aggstate->hash_mem_peak)
+ aggstate->hash_mem_peak = aggstate->hash_mem_min;
}
/*
***************
*** 1024,1035 **** hash_agg_entry_size(int numAggs)
* When called, CurrentMemoryContext should be the per-query context.
*/
static AggHashEntry
! lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
{
TupleTableSlot *hashslot = aggstate->hashslot;
ListCell *l;
AggHashEntry entry;
! bool isnew;
/* if first time through, initialize hashslot by cloning input slot */
if (hashslot->tts_tupleDescriptor == NULL)
--- 1077,1091 ----
* When called, CurrentMemoryContext should be the per-query context.
*/
static AggHashEntry
! lookup_hash_entry(AggState *aggstate, HashWork *work,
! TupleTableSlot *inputslot, uint32 hashvalue)
{
TupleTableSlot *hashslot = aggstate->hashslot;
ListCell *l;
AggHashEntry entry;
! int64 hash_mem;
! bool isnew = false;
! bool *p_isnew;
/* if first time through, initialize hashslot by cloning input slot */
if (hashslot->tts_tupleDescriptor == NULL)
***************
*** 1049,1058 **** lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
hashslot->tts_isnull[varNumber] = inputslot->tts_isnull[varNumber];
}
/* find or create the hashtable entry using the filtered tuple */
! entry = (AggHashEntry) LookupTupleHashEntry(aggstate->hashtable,
! hashslot,
! &isnew);
if (isnew)
{
--- 1105,1124 ----
hashslot->tts_isnull[varNumber] = inputslot->tts_isnull[varNumber];
}
+ hash_mem = MemoryContextMemAllocated(aggstate->hashcontext, true);
+ if (hash_mem > aggstate->hash_mem_peak)
+ aggstate->hash_mem_peak = hash_mem;
+
+ if (hash_mem <= aggstate->hash_mem_min ||
+ hash_mem < work_mem * 1024L)
+ p_isnew = &isnew;
+ else
+ p_isnew = NULL;
+
/* find or create the hashtable entry using the filtered tuple */
! entry = (AggHashEntry) LookupTupleHashEntryHash(aggstate->hashtable,
! hashslot, hashvalue,
! p_isnew);
if (isnew)
{
***************
*** 1060,1068 **** lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
--- 1126,1291 ----
initialize_aggregates(aggstate, aggstate->peragg, entry->pergroup);
}
+ if (entry == NULL)
+ save_tuple(aggstate, work, inputslot, hashvalue);
+
return entry;
}
+
+ /*
+ * hash_work
+ *
+ * Construct a HashWork item, which represents one iteration of HashAgg to be
+ * done. Should be called in the aggregate's memory context.
+ */
+ static HashWork *
+ hash_work(BufFile *input_file, int64 input_groups, int input_bits)
+ {
+ HashWork *work = palloc(sizeof(HashWork));
+
+ work->input_file = input_file;
+ work->input_bits = input_bits;
+ work->input_groups = input_groups;
+
+ /*
+ * Will be set only if we run out of memory and need to partition an
+ * additional level.
+ */
+ work->n_output_partitions = 0;
+ work->output_partitions = NULL;
+ work->output_ntuples = NULL;
+ work->output_bits = 0;
+
+ return work;
+ }
+
+ /*
+ * save_tuple
+ *
+ * Not enough memory to add tuple as new entry in hash table. Save for later
+ * in the appropriate partition.
+ */
+ static void
+ save_tuple(AggState *aggstate, HashWork *work, TupleTableSlot *slot,
+ uint32 hashvalue)
+ {
+ int partition;
+ MinimalTuple tuple;
+ BufFile *file;
+ int written;
+
+ if (work->output_partitions == NULL)
+ {
+ Agg *agg = (Agg *) aggstate->ss.ps.plan;
+ Size group_size = hash_group_size(aggstate->numaggs,
+ agg->plan_width,
+ agg->transitionSpace);
+ double total_size = group_size * work->input_groups;
+ int npartitions;
+ int partition_bits;
+
+ /*
+ * Try to make enough partitions so that each one fits in work_mem,
+ * with a little slop.
+ */
+ npartitions = ceil ( (1.5 * total_size) / (work_mem * 1024L) );
+
+ if (npartitions < HASH_DISK_MIN_PARTITIONS)
+ npartitions = HASH_DISK_MIN_PARTITIONS;
+ if (npartitions > HASH_DISK_MAX_PARTITIONS)
+ npartitions = HASH_DISK_MAX_PARTITIONS;
+
+ partition_bits = my_log2(npartitions);
+
+ /* make sure that we don't exhaust the hash bits */
+ if (partition_bits + work->input_bits >= 32)
+ partition_bits = 32 - work->input_bits;
+
+ /* number of partitions will be a power of two */
+ npartitions = 1L << partition_bits;
+
+ work->output_bits = partition_bits;
+ work->n_output_partitions = npartitions;
+ work->output_partitions = palloc0(sizeof(BufFile *) * npartitions);
+ work->output_ntuples = palloc0(sizeof(int64) * npartitions);
+ }
+
+ if (work->output_bits == 0)
+ partition = 0;
+ else
+ partition = (hashvalue << work->input_bits) >>
+ (32 - work->output_bits);
+
+ work->output_ntuples[partition]++;
+
+ if (work->output_partitions[partition] == NULL)
+ work->output_partitions[partition] = BufFileCreateTemp(false);
+ file = work->output_partitions[partition];
+
+ tuple = ExecFetchSlotMinimalTuple(slot);
+
+ written = BufFileWrite(file, (void *) &hashvalue, sizeof(uint32));
+ if (written != sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+ aggstate->hash_disk += written;
+
+ written = BufFileWrite(file, (void *) tuple, tuple->t_len);
+ if (written != tuple->t_len)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to HashAgg temporary file: %m")));
+ aggstate->hash_disk += written;
+ }
+
+
+ /*
+ * read_saved_tuple
+ * read the next tuple from a batch file. Return NULL if no more.
+ *
+ * On success, *hashvalue is set to the tuple's hash value, and the tuple
+ * itself is stored in the given slot.
+ *
+ * Copied with minor modifications from ExecHashJoinGetSavedTuple.
+ */
+ static TupleTableSlot *
+ read_saved_tuple(BufFile *file, uint32 *hashvalue, TupleTableSlot *tupleSlot)
+ {
+ uint32 header[2];
+ size_t nread;
+ MinimalTuple tuple;
+
+ /*
+ * Since both the hash value and the MinimalTuple length word are uint32,
+ * we can read them both in one BufFileRead() call without any type
+ * cheating.
+ */
+ nread = BufFileRead(file, (void *) header, sizeof(header));
+ if (nread == 0) /* end of file */
+ {
+ ExecClearTuple(tupleSlot);
+ return NULL;
+ }
+ if (nread != sizeof(header))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+ *hashvalue = header[0];
+ tuple = (MinimalTuple) palloc(header[1]);
+ tuple->t_len = header[1];
+ nread = BufFileRead(file,
+ (void *) ((char *) tuple + sizeof(uint32)),
+ header[1] - sizeof(uint32));
+ if (nread != header[1] - sizeof(uint32))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from HashAgg temporary file: %m")));
+ return ExecStoreMinimalTuple(tuple, tupleSlot, true);
+ }
+
+
/*
* ExecAgg -
*
***************
*** 1107,1115 **** ExecAgg(AggState *node)
/* Dispatch based on strategy */
if (((Agg *) node->ss.ps.plan)->aggstrategy == AGG_HASHED)
{
! if (!node->table_filled)
! agg_fill_hash_table(node);
! return agg_retrieve_hash_table(node);
}
else
return agg_retrieve_direct(node);
--- 1330,1345 ----
/* Dispatch based on strategy */
if (((Agg *) node->ss.ps.plan)->aggstrategy == AGG_HASHED)
{
! TupleTableSlot *slot = NULL;
!
! while (slot == NULL)
! {
! if (!node->table_filled)
! if (!agg_fill_hash_table(node))
! break;
! slot = agg_retrieve_hash_table(node);
! }
! return slot;
}
else
return agg_retrieve_direct(node);
***************
*** 1325,1337 **** agg_retrieve_direct(AggState *aggstate)
/*
* ExecAgg for hashed case: phase 1, read input and build hash table
*/
! static void
agg_fill_hash_table(AggState *aggstate)
{
PlanState *outerPlan;
ExprContext *tmpcontext;
AggHashEntry entry;
TupleTableSlot *outerslot;
/*
* get state info from node
--- 1555,1569 ----
/*
* ExecAgg for hashed case: phase 1, read input and build hash table
*/
! static bool
agg_fill_hash_table(AggState *aggstate)
{
PlanState *outerPlan;
ExprContext *tmpcontext;
AggHashEntry entry;
TupleTableSlot *outerslot;
+ HashWork *work;
+ int i;
/*
* get state info from node
***************
*** 1340,1359 **** agg_fill_hash_table(AggState *aggstate)
/* tmpcontext is the per-input-tuple expression context */
tmpcontext = aggstate->tmpcontext;
/*
* Process each outer-plan tuple, and then fetch the next one, until we
* exhaust the outer plan.
*/
for (;;)
{
! outerslot = ExecProcNode(outerPlan);
! if (TupIsNull(outerslot))
! break;
/* set up for advance_aggregates call */
tmpcontext->ecxt_outertuple = outerslot;
/* Find or build hashtable entry for this tuple's group */
! entry = lookup_hash_entry(aggstate, outerslot);
/* Advance the aggregates */
advance_aggregates(aggstate, entry->pergroup);
--- 1572,1640 ----
/* tmpcontext is the per-input-tuple expression context */
tmpcontext = aggstate->tmpcontext;
+ if (aggstate->hash_work == NIL)
+ {
+ aggstate->agg_done = true;
+ return false;
+ }
+
+ work = linitial(aggstate->hash_work);
+ aggstate->hash_work = list_delete_first(aggstate->hash_work);
+
+ /* if not the first time through, reinitialize */
+ if (!aggstate->hash_init_state)
+ {
+ long nbuckets;
+ Agg *node = (Agg *) aggstate->ss.ps.plan;
+
+ MemoryContextResetAndDeleteChildren(aggstate->hashcontext);
+
+ /*
+ * If this table will hold only a partition of the input, then use a
+ * proportionally smaller estimate for nbuckets.
+ */
+ nbuckets = node->numGroups >> work->input_bits;
+
+ build_hash_table(aggstate, nbuckets);
+ }
+
+ aggstate->hash_init_state = false;
+
/*
* Process each outer-plan tuple, and then fetch the next one, until we
* exhaust the outer plan.
*/
for (;;)
{
! uint32 hashvalue;
!
! CHECK_FOR_INTERRUPTS();
!
! if (work->input_file == NULL)
! {
! outerslot = ExecProcNode(outerPlan);
! if (TupIsNull(outerslot))
! break;
!
! hashvalue = TupleHashEntryHash(aggstate->hashtable, outerslot);
! }
! else
! {
! outerslot = read_saved_tuple(work->input_file, &hashvalue,
! aggstate->hashslot);
! if (TupIsNull(outerslot))
! {
! BufFileClose(work->input_file);
! work->input_file = NULL;
! break;
! }
! }
!
/* set up for advance_aggregates call */
tmpcontext->ecxt_outertuple = outerslot;
/* Find or build hashtable entry for this tuple's group */
! entry = lookup_hash_entry(aggstate, work, outerslot, hashvalue);
/* Advance the aggregates */
advance_aggregates(aggstate, entry->pergroup);
***************
*** 1362,1370 **** agg_fill_hash_table(AggState *aggstate)
--- 1643,1697 ----
ResetExprContext(tmpcontext);
}
+ if (work->input_file)
+ BufFileClose(work->input_file);
+
+ /* add each output partition as a new work item */
+ for (i = 0; i < work->n_output_partitions; i++)
+ {
+ BufFile *file = work->output_partitions[i];
+ MemoryContext oldContext;
+ HashWork *new_work;
+ int64 input_ngroups;
+
+ /* partition is empty */
+ if (file == NULL)
+ continue;
+
+ /* rewind file for reading */
+ if (BufFileSeek(file, 0, 0L, SEEK_SET))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rewind HashAgg temporary file: %m")));
+
+ /*
+ * Estimate the number of input groups for this new work item as the
+ * total number of tuples in its input file. Although that's a worst
+ * case, it's not bad here for two reasons: (1) overestimating is
+ * better than underestimating; and (2) we've already scanned the
+ * relation once, so it's likely that we've already finalized many of
+ * the common values.
+ */
+ input_ngroups = work->output_ntuples[i];
+
+ oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
+ new_work = hash_work(file,
+ input_ngroups,
+ work->output_bits + work->input_bits);
+ aggstate->hash_work = lappend(
+ aggstate->hash_work,
+ new_work);
+ aggstate->hash_num_batches++;
+ MemoryContextSwitchTo(oldContext);
+ }
+
+ pfree(work);
+
aggstate->table_filled = true;
/* Initialize to walk the hash table */
ResetTupleHashIterator(aggstate->hashtable, &aggstate->hashiter);
+
+ return true;
}
/*
***************
*** 1396,1411 **** agg_retrieve_hash_table(AggState *aggstate)
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
! while (!aggstate->agg_done)
{
/*
* Find the next entry in the hash table
*/
entry = (AggHashEntry) ScanTupleHashTable(&aggstate->hashiter);
if (entry == NULL)
{
! /* No more entries in hashtable, so done */
! aggstate->agg_done = TRUE;
return NULL;
}
--- 1723,1740 ----
* We loop retrieving groups until we find one satisfying
* aggstate->ss.ps.qual
*/
! for (;;)
{
+ CHECK_FOR_INTERRUPTS();
+
/*
* Find the next entry in the hash table
*/
entry = (AggHashEntry) ScanTupleHashTable(&aggstate->hashiter);
if (entry == NULL)
{
! /* No more entries in hashtable, so done with this batch */
! aggstate->table_filled = false;
return NULL;
}
***************
*** 1636,1645 **** ExecInitAgg(Agg *node, EState *estate, int eflags)
if (node->aggstrategy == AGG_HASHED)
{
! build_hash_table(aggstate);
aggstate->table_filled = false;
/* Compute the columns we actually need to hash on */
aggstate->hash_needed = find_hash_columns(aggstate);
}
else
{
--- 1965,1997 ----
if (node->aggstrategy == AGG_HASHED)
{
! MemoryContext oldContext;
!
! aggstate->hash_mem_min = 0;
! aggstate->hash_mem_peak = 0;
! aggstate->hash_num_batches = 0;
! aggstate->hash_init_state = true;
aggstate->table_filled = false;
+ aggstate->hash_disk = 0;
+
+ aggstate->hashcontext =
+ AllocSetContextCreate(aggstate->aggcontext,
+ "HashAgg Hash Table Context",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+
+ build_hash_table(aggstate, node->numGroups);
+
/* Compute the columns we actually need to hash on */
aggstate->hash_needed = find_hash_columns(aggstate);
+
+ /* prime with initial work item to read from outer plan */
+ oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
+ aggstate->hash_work = lappend(aggstate->hash_work,
+ hash_work(NULL, node->numGroups, 0));
+ aggstate->hash_num_batches++;
+ MemoryContextSwitchTo(oldContext);
}
else
{
***************
*** 2048,2079 **** ExecEndAgg(AggState *node)
void
ExecReScanAgg(AggState *node)
{
ExprContext *econtext = node->ss.ps.ps_ExprContext;
! int aggno;
node->agg_done = false;
node->ss.ps.ps_TupFromTlist = false;
! if (((Agg *) node->ss.ps.plan)->aggstrategy == AGG_HASHED)
{
/*
! * In the hashed case, if we haven't yet built the hash table then we
! * can just return; nothing done yet, so nothing to undo. If subnode's
! * chgParam is not NULL then it will be re-scanned by ExecProcNode,
! * else no reason to re-scan it at all.
*/
! if (!node->table_filled)
return;
/*
! * If we do have the hash table and the subplan does not have any
! * parameter changes, then we can just rescan the existing hash table;
! * no need to build it again.
*/
! if (node->ss.ps.lefttree->chgParam == NULL)
{
ResetTupleHashIterator(node->hashtable, &node->hashiter);
return;
}
}
--- 2400,2433 ----
void
ExecReScanAgg(AggState *node)
{
+ Agg *agg = (Agg *) node->ss.ps.plan;
ExprContext *econtext = node->ss.ps.ps_ExprContext;
! int aggno;
node->agg_done = false;
node->ss.ps.ps_TupFromTlist = false;
! if (agg->aggstrategy == AGG_HASHED)
{
/*
! * In the hashed case, if we haven't done any execution work yet, we
! * can just return; nothing to undo. If subnode's chgParam is not NULL
! * then it will be re-scanned by ExecProcNode, else no reason to
! * re-scan it at all.
*/
! if (node->hash_init_state)
return;
/*
! * If we do have the hash table, it never went to disk, and the
! * subplan does not have any parameter changes, then we can just
! * rescan the existing hash table; no need to build it again.
*/
! if (node->ss.ps.lefttree->chgParam == NULL && node->hash_disk == 0)
{
ResetTupleHashIterator(node->hashtable, &node->hashiter);
+ node->table_filled = true;
return;
}
}
***************
*** 2110,2120 **** ExecReScanAgg(AggState *node)
*/
MemoryContextResetAndDeleteChildren(node->aggcontext);
! if (((Agg *) node->ss.ps.plan)->aggstrategy == AGG_HASHED)
{
/* Rebuild an empty hash table */
! build_hash_table(node);
node->table_filled = false;
}
else
{
--- 2464,2493 ----
*/
MemoryContextResetAndDeleteChildren(node->aggcontext);
! if (agg->aggstrategy == AGG_HASHED)
{
+ MemoryContext oldContext;
+
+ node->hashcontext =
+ AllocSetContextCreate(node->aggcontext,
+ "HashAgg Hash Table Context",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+
/* Rebuild an empty hash table */
! build_hash_table(node, agg->numGroups);
! node->hash_init_state = true;
node->table_filled = false;
+ node->hash_disk = 0;
+ node->hash_work = NIL;
+
+ /* prime with initial work item to read from outer plan */
+ oldContext = MemoryContextSwitchTo(node->aggcontext);
+ node->hash_work = lappend(node->hash_work,
+ hash_work(NULL, agg->numGroups, 0));
+ node->hash_num_batches++;
+ MemoryContextSwitchTo(oldContext);
}
else
{
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
***************
*** 75,80 ****
--- 75,81 ----
#include "access/htup_details.h"
#include "executor/executor.h"
+ #include "executor/nodeAgg.h"
#include "executor/nodeHash.h"
#include "miscadmin.h"
#include "nodes/nodeFuncs.h"
***************
*** 113,118 **** bool enable_bitmapscan = true;
--- 114,120 ----
bool enable_tidscan = true;
bool enable_sort = true;
bool enable_hashagg = true;
+ bool enable_hashagg_disk = true;
bool enable_nestloop = true;
bool enable_material = true;
bool enable_mergejoin = true;
***************
*** 1468,1474 **** cost_agg(Path *path, PlannerInfo *root,
AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
int numGroupCols, double numGroups,
Cost input_startup_cost, Cost input_total_cost,
! double input_tuples)
{
double output_tuples;
Cost startup_cost;
--- 1470,1476 ----
AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
int numGroupCols, double numGroups,
Cost input_startup_cost, Cost input_total_cost,
! int input_width, double input_tuples)
{
double output_tuples;
Cost startup_cost;
***************
*** 1531,1536 **** cost_agg(Path *path, PlannerInfo *root,
--- 1533,1542 ----
else
{
/* must be AGG_HASHED */
+ double group_size = hash_group_size(aggcosts->numAggs,
+ input_width,
+ aggcosts->transitionSpace);
+
startup_cost = input_total_cost;
startup_cost += aggcosts->transCost.startup;
startup_cost += aggcosts->transCost.per_tuple * input_tuples;
***************
*** 1538,1543 **** cost_agg(Path *path, PlannerInfo *root,
--- 1544,1578 ----
total_cost = startup_cost;
total_cost += aggcosts->finalCost * numGroups;
total_cost += cpu_tuple_cost * numGroups;
+
+ if (group_size * numGroups > (work_mem * 1024L))
+ {
+ double groups_per_batch = (work_mem * 1024L) / group_size;
+
+ /* first batch doesn't go to disk */
+ double groups_disk = numGroups - groups_per_batch;
+
+ /*
+ * Assume that the groups that go to disk are of an average number
+ * of tuples. This is pessimistic -- the largest groups are more
+ * likely to be processed in the first pass and never go to disk.
+ */
+ double tuples_disk = groups_disk * (input_tuples / numGroups);
+
+ int tuple_size = sizeof(uint32) /* stored hash value */
+ + MAXALIGN(sizeof(MinimalTupleData))
+ + MAXALIGN(input_width);
+ double pages_to_disk = (tuples_disk * tuple_size) / BLCKSZ;
+
+ /*
+ * Write and then read back the data that's not processed in the
+ * first pass. Data could be read and written more times than that
+ * if not enough partitions are created, but the depth will be a
+ * very small number even for a very large amount of data, so
+ * ignore it here.
+ */
+ total_cost += seq_page_cost * 2 * pages_to_disk;
+ }
output_tuples = numGroups;
}
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
***************
*** 4369,4374 **** make_agg(PlannerInfo *root, List *tlist, List *qual,
--- 4369,4377 ----
node->grpColIdx = grpColIdx;
node->grpOperators = grpOperators;
node->numGroups = numGroups;
+ if (aggcosts != NULL)
+ node->transitionSpace = aggcosts->transitionSpace;
+ node->plan_width = lefttree->plan_width;
copy_plan_costsize(plan, lefttree); /* only care about copying size */
cost_agg(&agg_path, root,
***************
*** 4376,4381 **** make_agg(PlannerInfo *root, List *tlist, List *qual,
--- 4379,4385 ----
numGroupCols, numGroups,
lefttree->startup_cost,
lefttree->total_cost,
+ lefttree->plan_width,
lefttree->plan_rows);
plan->startup_cost = agg_path.startup_cost;
plan->total_cost = agg_path.total_cost;
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 234,240 **** optimize_minmax_aggregates(PlannerInfo *root, List *tlist,
cost_agg(&agg_p, root, AGG_PLAIN, aggcosts,
0, 0,
best_path->startup_cost, best_path->total_cost,
! best_path->parent->rows);
if (total_cost > agg_p.total_cost)
return NULL; /* too expensive */
--- 234,240 ----
cost_agg(&agg_p, root, AGG_PLAIN, aggcosts,
0, 0,
best_path->startup_cost, best_path->total_cost,
! best_path->parent->width, best_path->parent->rows);
if (total_cost > agg_p.total_cost)
return NULL; /* too expensive */
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
***************
*** 2744,2750 **** choose_hashed_grouping(PlannerInfo *root,
/* plus the per-hash-entry overhead */
hashentrysize += hash_agg_entry_size(agg_costs->numAggs);
! if (hashentrysize * dNumGroups > work_mem * 1024L)
return false;
/*
--- 2744,2751 ----
/* plus the per-hash-entry overhead */
hashentrysize += hash_agg_entry_size(agg_costs->numAggs);
! if (!enable_hashagg_disk &&
! hashentrysize * dNumGroups > work_mem * 1024L)
return false;
/*
***************
*** 2779,2785 **** choose_hashed_grouping(PlannerInfo *root,
cost_agg(&hashed_p, root, AGG_HASHED, agg_costs,
numGroupCols, dNumGroups,
cheapest_path->startup_cost, cheapest_path->total_cost,
! path_rows);
/* Result of hashed agg is always unsorted */
if (target_pathkeys)
cost_sort(&hashed_p, root, target_pathkeys, hashed_p.total_cost,
--- 2780,2786 ----
cost_agg(&hashed_p, root, AGG_HASHED, agg_costs,
numGroupCols, dNumGroups,
cheapest_path->startup_cost, cheapest_path->total_cost,
! path_width, path_rows);
/* Result of hashed agg is always unsorted */
if (target_pathkeys)
cost_sort(&hashed_p, root, target_pathkeys, hashed_p.total_cost,
***************
*** 2810,2816 **** choose_hashed_grouping(PlannerInfo *root,
cost_agg(&sorted_p, root, AGG_SORTED, agg_costs,
numGroupCols, dNumGroups,
sorted_p.startup_cost, sorted_p.total_cost,
! path_rows);
else
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
sorted_p.startup_cost, sorted_p.total_cost,
--- 2811,2817 ----
cost_agg(&sorted_p, root, AGG_SORTED, agg_costs,
numGroupCols, dNumGroups,
sorted_p.startup_cost, sorted_p.total_cost,
! path_width, path_rows);
else
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
sorted_p.startup_cost, sorted_p.total_cost,
***************
*** 2910,2916 **** choose_hashed_distinct(PlannerInfo *root,
/* plus the per-hash-entry overhead */
hashentrysize += hash_agg_entry_size(0);
! if (hashentrysize * dNumDistinctRows > work_mem * 1024L)
return false;
/*
--- 2911,2918 ----
/* plus the per-hash-entry overhead */
hashentrysize += hash_agg_entry_size(0);
! if (!enable_hashagg_disk &&
! hashentrysize * dNumDistinctRows > work_mem * 1024L)
return false;
/*
***************
*** 2929,2935 **** choose_hashed_distinct(PlannerInfo *root,
cost_agg(&hashed_p, root, AGG_HASHED, NULL,
numDistinctCols, dNumDistinctRows,
cheapest_startup_cost, cheapest_total_cost,
! path_rows);
/*
* Result of hashed agg is always unsorted, so if ORDER BY is present we
--- 2931,2937 ----
cost_agg(&hashed_p, root, AGG_HASHED, NULL,
numDistinctCols, dNumDistinctRows,
cheapest_startup_cost, cheapest_total_cost,
! path_width, path_rows);
/*
* Result of hashed agg is always unsorted, so if ORDER BY is present we
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
***************
*** 851,857 **** choose_hashed_setop(PlannerInfo *root, List *groupClauses,
cost_agg(&hashed_p, root, AGG_HASHED, NULL,
numGroupCols, dNumGroups,
input_plan->startup_cost, input_plan->total_cost,
! input_plan->plan_rows);
/*
* Now for the sorted case. Note that the input is *always* unsorted,
--- 851,857 ----
cost_agg(&hashed_p, root, AGG_HASHED, NULL,
numGroupCols, dNumGroups,
input_plan->startup_cost, input_plan->total_cost,
! input_plan->plan_width, input_plan->plan_rows);
/*
* Now for the sorted case. Note that the input is *always* unsorted,
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
***************
*** 1379,1385 **** create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
numCols, pathnode->path.rows,
subpath->startup_cost,
subpath->total_cost,
! rel->rows);
}
if (all_btree && all_hash)
--- 1379,1385 ----
numCols, pathnode->path.rows,
subpath->startup_cost,
subpath->total_cost,
! rel->width, rel->rows);
}
if (all_btree && all_hash)
*** a/src/backend/utils/hash/dynahash.c
--- b/src/backend/utils/hash/dynahash.c
***************
*** 291,301 **** hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
CurrentDynaHashCxt = info->hcxt;
else
CurrentDynaHashCxt = TopMemoryContext;
! CurrentDynaHashCxt = AllocSetContextCreate(CurrentDynaHashCxt,
! tabname,
! ALLOCSET_DEFAULT_MINSIZE,
! ALLOCSET_DEFAULT_INITSIZE,
! ALLOCSET_DEFAULT_MAXSIZE);
}
/* Initialize the hash header, plus a copy of the table name */
--- 291,303 ----
CurrentDynaHashCxt = info->hcxt;
else
CurrentDynaHashCxt = TopMemoryContext;
!
! if ((flags & HASH_NOCHILDCXT) == 0)
! CurrentDynaHashCxt = AllocSetContextCreate(CurrentDynaHashCxt,
! tabname,
! ALLOCSET_DEFAULT_MINSIZE,
! ALLOCSET_DEFAULT_INITSIZE,
! ALLOCSET_DEFAULT_MAXSIZE);
}
/* Initialize the hash header, plus a copy of the table name */
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 771,776 **** static struct config_bool ConfigureNamesBool[] =
--- 771,785 ----
NULL, NULL, NULL
},
{
+ {"enable_hashagg_disk", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of disk-based hashed aggregation plans."),
+ NULL
+ },
+ &enable_hashagg_disk,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of materialization."),
NULL
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 270,275 ****
--- 270,276 ----
#enable_bitmapscan = on
#enable_hashagg = on
+ #enable_hashagg_disk = on
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
*** a/src/include/executor/executor.h
--- b/src/include/executor/executor.h
***************
*** 147,152 **** extern TupleHashTable BuildTupleHashTable(int numCols, AttrNumber *keyColIdx,
--- 147,158 ----
extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
bool *isnew);
+ extern uint32 TupleHashEntryHash(TupleHashTable hashtable,
+ TupleTableSlot *slot);
+ extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable,
+ TupleTableSlot *slot,
+ uint32 hashvalue,
+ bool *isnew);
extern TupleHashEntry FindTupleHashEntry(TupleHashTable hashtable,
TupleTableSlot *slot,
FmgrInfo *eqfunctions,
*** a/src/include/executor/nodeAgg.h
--- b/src/include/executor/nodeAgg.h
***************
*** 22,27 **** extern void ExecEndAgg(AggState *node);
--- 22,28 ----
extern void ExecReScanAgg(AggState *node);
extern Size hash_agg_entry_size(int numAggs);
+ extern Size hash_group_size(int numAggs, int inputWidth, Size transitionSpace);
extern Datum aggregate_dummy(PG_FUNCTION_ARGS);
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
***************
*** 1759,1769 **** typedef struct AggState
--- 1759,1776 ----
AggStatePerGroup pergroup; /* per-Aggref-per-group working state */
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
/* these fields are used in AGG_HASHED mode: */
+ MemoryContext hashcontext; /* subcontext to use for hash table */
TupleHashTable hashtable; /* hash table with one entry per group */
TupleTableSlot *hashslot; /* slot for loading hash table */
List *hash_needed; /* list of columns needed in hash table */
+ bool hash_init_state; /* in initial state before execution? */
bool table_filled; /* hash table filled yet? */
+ int64 hash_disk; /* bytes of disk space used */
+ uint64 hash_mem_min; /* memory used by empty hash table */
+ uint64 hash_mem_peak; /* memory used at peak of execution */
+ int hash_num_batches; /* total number of batches created */
TupleHashIterator hashiter; /* for iterating through hash table */
+ List *hash_work; /* remaining work to be done */
} AggState;
/* ----------------
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
***************
*** 666,671 **** typedef struct Agg
--- 666,673 ----
AttrNumber *grpColIdx; /* their indexes in the target list */
Oid *grpOperators; /* equality operators to compare with */
long numGroups; /* estimated number of groups in input */
+ Size transitionSpace; /* estimated size of by-ref transition val */
+ int plan_width; /* input plan width */
} Agg;
/* ----------------
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
***************
*** 57,62 **** extern bool enable_bitmapscan;
--- 57,63 ----
extern bool enable_tidscan;
extern bool enable_sort;
extern bool enable_hashagg;
+ extern bool enable_hashagg_disk;
extern bool enable_nestloop;
extern bool enable_material;
extern bool enable_mergejoin;
***************
*** 102,108 **** extern void cost_agg(Path *path, PlannerInfo *root,
AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
int numGroupCols, double numGroups,
Cost input_startup_cost, Cost input_total_cost,
! double input_tuples);
extern void cost_windowagg(Path *path, PlannerInfo *root,
List *windowFuncs, int numPartCols, int numOrderCols,
Cost input_startup_cost, Cost input_total_cost,
--- 103,109 ----
AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
int numGroupCols, double numGroups,
Cost input_startup_cost, Cost input_total_cost,
! int input_width, double input_tuples);
extern void cost_windowagg(Path *path, PlannerInfo *root,
List *windowFuncs, int numPartCols, int numOrderCols,
Cost input_startup_cost, Cost input_total_cost,
*** a/src/include/utils/hsearch.h
--- b/src/include/utils/hsearch.h
***************
*** 93,98 **** typedef struct HASHCTL
--- 93,101 ----
#define HASH_COMPARE 0x400 /* Set user defined comparison function */
#define HASH_KEYCOPY 0x800 /* Set user defined key-copying function */
#define HASH_FIXED_SIZE 0x1000 /* Initial size is a hard limit */
+ #define HASH_NOCHILDCXT 0x2000 /* Don't create a child context. Warning:
+ * hash_destroy will delete the memory context
+ * specified by the caller. */
/* max_dsize value to indicate expansible directory */
*** a/src/test/regress/expected/rangefuncs.out
--- b/src/test/regress/expected/rangefuncs.out
***************
*** 3,8 **** SELECT name, setting FROM pg_settings WHERE name LIKE 'enable%';
--- 3,9 ----
----------------------+---------
enable_bitmapscan | on
enable_hashagg | on
+ enable_hashagg_disk | on
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
***************
*** 12,18 **** SELECT name, setting FROM pg_settings WHERE name LIKE 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
! (11 rows)
CREATE TABLE foo2(fooid int, f2 int);
INSERT INTO foo2 VALUES(1, 11);
--- 13,19 ----
enable_seqscan | on
enable_sort | on
enable_tidscan | on
! (12 rows)
CREATE TABLE foo2(fooid int, f2 int);
INSERT INTO foo2 VALUES(1, 11);
On 11.12.2014 11:46, Jeff Davis wrote:
New patch attached. All open items are complete, though the patch may
have a few rough edges.
Summary of changes:
* rebased on top of latest memory accounting patch
/messages/by-id/1417497257.5584.5.camel@jeff-desktop
* added a flag to hash_create to prevent it from creating an extra
level of memory context
- without this, the memory accounting would have a measurable impact
on performance
* cost model for the disk usage
* intelligently choose the number of partitions for each pass of the
data (see the sketch just after this list)
* explain support
* in build_hash_table(), be more intelligent about the value of
nbuckets to pass to BuildTupleHashTable()
- BuildTupleHashTable tries to choose a value to keep the table in
work_mem, but it isn't very accurate.
* some very rudimentary testing (sanity checks, really) shows good
results
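For reference, the "choose number of partitions intelligently" item above
boils down to roughly the following sketch. The constants and the
my_log2() behaviour match the patch; choose_npartitions() and
my_log2_sketch() are illustrative names only, and the real save_tuple()
additionally caps the bits so that partition_bits + input_bits stays
below 32:

#include <math.h>

#define HASH_DISK_MIN_PARTITIONS 1
#define HASH_DISK_MAX_PARTITIONS 256

/* ceil(log2(num)), like my_log2() in dynahash.c */
static int
my_log2_sketch(long num)
{
    int   i;
    long  limit;

    for (i = 0, limit = 1; limit < num; i++, limit <<= 1)
        ;
    return i;
}

/*
 * Pick a power-of-two number of output partitions such that each
 * partition's spilled groups are expected to fit in work_mem with some
 * slop.  total_size is the estimated size in bytes of all spilled groups
 * (estimated group count times hash_group_size()).
 */
static int
choose_npartitions(double total_size, int work_mem_kb)
{
    int  npartitions;
    int  partition_bits;

    npartitions = (int) ceil((1.5 * total_size) / (work_mem_kb * 1024.0));

    if (npartitions < HASH_DISK_MIN_PARTITIONS)
        npartitions = HASH_DISK_MIN_PARTITIONS;
    if (npartitions > HASH_DISK_MAX_PARTITIONS)
        npartitions = HASH_DISK_MAX_PARTITIONS;

    partition_bits = my_log2_sketch(npartitions);

    return 1 << partition_bits;     /* round up to a power of two */
}

For example, with work_mem = 4MB and an estimated 100MB of spilled groups
this gives ceil(150/4) = 38, rounded up to 64 partitions.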
I plan to look into this over the holidays, hopefully.
Summary of previous discussion (my summary; I may have missed some
points):
Tom Lane requested that the patch also handle the case where transition
values grow (e.g. array_agg) beyond work_mem. I feel this patch provides
a lot of benefit as it is, and trying to handle that case would be a lot
more work (we need a way to write the transition values out to disk at a
minimum, and perhaps combine them with other transition values). I also
don't think my patch would interfere with a fix there in the future.
Tomas Vondra suggested an alternative design that more closely resembles
HashJoin: instead of filling up the hash table and then spilling any new
groups, the idea would be to split the current data into two partitions,
keep one in the hash table, and spill the other (see
ExecHashIncreaseNumBatches()). This has the advantage that it's very
fast to identify whether the tuple is part of the in-memory batch or
not; and we can avoid even looking in the memory hashtable if not.
The batch-splitting approach has a major downside, however: you are
likely to evict a skew value from the in-memory batch, which will result
in all subsequent tuples with that skew value going to disk. My approach
never evicts from the in-memory table until we actually finalize the
groups, so the skew values are likely to be completely processed in the
first pass.
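To make that control flow concrete, here is a small self-contained toy
(counting occurrences of integer keys) that mirrors what
agg_fill_hash_table(), lookup_hash_entry() and save_tuple() do. None of
the names below come from the patch, and plain arrays stand in for the
BufFile partitions:

#include <stdio.h>
#include <stdlib.h>

#define MAX_GROUPS   4                  /* stand-in for work_mem */
#define NPARTITIONS  2
#define TABLE_SIZE   (MAX_GROUPS * 4)

typedef struct { int used; int key; long count; } Group;

static Group groups[TABLE_SIZE];
static int   ngroups;

/* find key's group; create it only if allow_new, else return NULL */
static Group *
lookup(int key, int allow_new)
{
    unsigned h = (unsigned) key * 2654435761u;
    int      i;

    for (i = 0; i < TABLE_SIZE; i++)
    {
        Group *g = &groups[(h + i) % TABLE_SIZE];

        if (g->used && g->key == key)
            return g;
        if (!g->used)
        {
            if (!allow_new)
                return NULL;            /* caller will spill the tuple */
            g->used = 1;
            g->key = key;
            g->count = 0;
            ngroups++;
            return g;
        }
    }
    return NULL;
}

/* one work item: aggregate what fits, spill the rest, then recurse */
static void
run_pass(const int *keys, int nkeys, int depth)
{
    int *spill[NPARTITIONS];
    int  nspill[NPARTITIONS] = {0};
    int  i, p;

    for (p = 0; p < NPARTITIONS; p++)
        spill[p] = malloc(nkeys * sizeof(int));

    for (i = 0; i < nkeys; i++)
    {
        Group *g = lookup(keys[i], ngroups < MAX_GROUPS);

        if (g != NULL)
            g->count++;                         /* advance the aggregate */
        else
        {
            p = (unsigned) keys[i] % NPARTITIONS;
            spill[p][nspill[p]++] = keys[i];    /* save for a later pass */
        }
    }

    for (i = 0; i < TABLE_SIZE; i++)            /* emit, then reset */
        if (groups[i].used)
            printf("pass %d: key=%d count=%ld\n",
                   depth, groups[i].key, groups[i].count);
    for (i = 0; i < TABLE_SIZE; i++)
        groups[i].used = 0;
    ngroups = 0;

    for (p = 0; p < NPARTITIONS; p++)   /* each spill file becomes a new work item */
    {
        if (nspill[p] > 0)
            run_pass(spill[p], nspill[p], depth + 1);
        free(spill[p]);
    }
}

int
main(void)
{
    int keys[] = {1, 2, 1, 3, 4, 5, 1, 6, 7, 1, 8, 2, 9, 1};

    run_pass(keys, (int) (sizeof(keys) / sizeof(keys[0])), 0);
    return 0;
}

Running it, the frequent key 1 is finished entirely in the first pass and
never written out; only the rare keys go to a partition and are picked up
in a second pass, which is the skew behaviour described above.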
I don't think that's the main issue - there are probably ways to work
around that (e.g. by keeping a "skew hash table" for those frequent
values, similarly to what hash join does).
The main problem IMHO is that it requires writing the transition values
to disk, which we don't know how to do in many cases (esp. in the
interesting ones, where the transition values grow).
So, the attached patch implements my original approach, which I still
feel is the best solution.
I think this is a reasonable approach - it's true it does not handle the
case with growing aggregate state (e.g. array_agg), so it really fixes
"just" the case when we underestimate the number of groups.
But I believe we need this approach anyway, because we'll never know how
to write all the various transition values (e.g. think of custom
aggregates), and this is an improvement.
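As a concrete illustration: the transition value of a custom aggregate is
typically just a pointer to private in-memory state (a Datum of type
internal), something like the hypothetical struct below; array_agg's
ArrayBuildState has a similar shape, and nothing in the catalogs tells
the executor how to flatten such a value into a spill file and rebuild
it in a later pass:

/* Hypothetical transition state for a custom "median" aggregate. */
typedef struct MedianAggState
{
    int     nvalues;            /* number of inputs seen so far */
    int     maxvalues;          /* allocated length of values[] */
    double *values;             /* grows without bound as rows arrive */
} MedianAggState;

That is why the patch only stops creating new groups once work_mem is
exceeded, rather than trying to evict or serialize existing transition
values.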
We can build on this and add the more elaborate hashjoin-like approach
in the future.
regards
Tomas
On Thu, 2014-12-11 at 02:46 -0800, Jeff Davis wrote:
On Sun, 2014-08-10 at 14:26 -0700, Jeff Davis wrote:
This patch is requires the Memory Accounting patch, or something similar
to track memory usage.
The attached patch enables hashagg to spill to disk, which means that
hashagg will contain itself to work_mem even if the planner makes a
bad misestimate of the cardinality.
New patch attached. All open items are complete, though the patch may
have a few rough edges.
This thread got moved over here:
/messages/by-id/1419326161.24895.13.camel@jeff-desktop
Regards,
Jeff Davis