Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

Started by Lawrence, Ramon · about 17 years ago · 71 messages
#1 Lawrence, Ramon
ramon.lawrence@ubc.ca
1 attachment(s)

We propose a patch that improves hybrid hash join's performance for
large multi-batch joins where the probe relation has skew.

Project name: Histojoin

Patch file: histojoin_v1.patch

This patch implements the Histojoin join algorithm as an optional
feature added to the standard Hybrid Hash Join (HHJ). A flag is used to
enable or disable the Histojoin features; when Histojoin is disabled,
HHJ behaves as before. The Histojoin features allow HHJ to use
PostgreSQL's statistics to do skew-aware partitioning. The basic idea
is to keep build-relation tuples whose join values occur frequently in
the probe relation in a small in-memory hash table. For skewed data
sets this improves HHJ performance by 10% to 50% when multiple batches
are used. The performance improvements of this patch can be seen in the
paper (pages 25-30) at:

http://people.ok.ubc.ca/rlawrenc/histojoin2.pdf

All generators and materials needed to verify these results can be
provided.
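
For illustration only (this is a standalone sketch, not code from the
patch, and all names, sizes, and values are made up): the core decision
mimics an open-addressing lookup of the probe-side MCV hash values.
Build tuples whose join-key hash is found in that small table stay in an
in-memory MCV partition; everything else goes through the usual
partitioning and batching.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define MCV_TABLE_SIZE 16          /* power of two, larger than the number of MCVs */
#define MCV_INVALID_PARTITION (-1)

static uint32_t mcv_hash[MCV_TABLE_SIZE];   /* zero-initialized; 0 marks an empty slot */

/* Remember the hash value of one probe-side MCV (done once, before the build phase). */
static void add_mcv(uint32_t hashvalue)
{
	int bucket = hashvalue & (MCV_TABLE_SIZE - 1);

	while (mcv_hash[bucket] != 0 && mcv_hash[bucket] != hashvalue)
		bucket = (bucket + 1) & (MCV_TABLE_SIZE - 1);	/* linear probing */
	mcv_hash[bucket] = hashvalue;
}

/* Same shape as the patch's isAMostCommonValue(): open addressing, 0 = empty. */
static bool is_most_common_value(uint32_t hashvalue, int *partition)
{
	int bucket = hashvalue & (MCV_TABLE_SIZE - 1);

	while (mcv_hash[bucket] != 0 && mcv_hash[bucket] != hashvalue)
		bucket = (bucket + 1) & (MCV_TABLE_SIZE - 1);

	if (mcv_hash[bucket] == hashvalue)
	{
		*partition = bucket;
		return true;
	}
	*partition = MCV_INVALID_PARTITION;
	return false;
}

int main(void)
{
	/* pretend these are hash values of build-relation join keys */
	uint32_t build_hashes[] = {101, 7, 3333, 101, 42};
	int i;

	add_mcv(101);	/* pretend 101 and 42 are hashes of probe-side MCVs */
	add_mcv(42);

	for (i = 0; i < 5; i++)
	{
		int partition;

		if (is_most_common_value(build_hashes[i], &partition))
			printf("hash %u -> kept in in-memory MCV partition %d\n",
				   (unsigned) build_hashes[i], partition);
		else
			printf("hash %u -> normal hash join partitioning/batching\n",
				   (unsigned) build_hashes[i]);
	}
	return 0;
}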

This is a patch against the HEAD of the repository.

This patch contains no platform-specific code. It compiles and has
been tested on our machines under both Windows (MSVC++) and Linux (GCC).

Currently the Histojoin feature is enabled by default and is used
whenever HHJ is used and Most Common Value (MCV) statistics are
available on the probe-side base relation of the join. To disable this
feature, set the enable_hashjoin_usestatmcvs flag to off in the
database configuration file or at run time with the SET command.

One potential improvement not included in the patch: Most Common Value
(MCV) statistics are currently only retrieved when the probe relation is
produced by a scan operator. There would be a benefit to using MCVs even
when the probe input is not a base-relation scan, but we were unable to
determine how to locate a base relation's statistics once other
operators have been applied.

This patch was created by Bryce Cutt as part of his work on his M.Sc.
thesis.

--

Dr. Ramon Lawrence

Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan

E-mail: ramon.lawrence@ubc.ca

Attachments:

histojoin_v1.patch (application/octet-stream)
Index: src/backend/executor/nodeHash.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHash.c,v
retrieving revision 1.116
diff -c -r1.116 nodeHash.c
*** src/backend/executor/nodeHash.c	1 Jan 2008 19:45:49 -0000	1.116
--- src/backend/executor/nodeHash.c	17 Oct 2008 23:47:20 -0000
***************
*** 54,59 ****
--- 54,86 ----
  }
  
  /* ----------------------------------------------------------------
+ *		isAMostCommonValue
+ *
+ *		is the value one of the most common key values?
+ *  ----------------------------------------------------------------
+ */
+ bool isAMostCommonValue(HashJoinTable hashtable, uint32 hashvalue, int *partitionNumber)
+ {
+ 	int bucket = hashvalue & (hashtable->nMostCommonTuplePartitionHashBuckets - 1);
+ 
+ 	while (hashtable->mostCommonTuplePartition[bucket].hashvalue != 0
+ 		&& hashtable->mostCommonTuplePartition[bucket].hashvalue != hashvalue)
+ 	{
+ 		bucket = (bucket + 1) & (hashtable->nMostCommonTuplePartitionHashBuckets - 1);
+ 	}
+ 
+ 	if (hashtable->mostCommonTuplePartition[bucket].hashvalue == hashvalue)
+ 	{
+ 		*partitionNumber = bucket;
+ 		return true;
+ 	}
+ 
+ 	//must have run into an empty slot which means this is not an MCV
+ 	*partitionNumber = MCV_INVALID_PARTITION;
+ 	return false;
+ }
+ 
+ /* ----------------------------------------------------------------
   *		MultiExecHash
   *
   *		build hash table for hashjoin, doing partitioning if more
***************
*** 69,74 ****
--- 96,103 ----
  	TupleTableSlot *slot;
  	ExprContext *econtext;
  	uint32		hashvalue;
+ 	MinimalTuple mintuple;
+ 	int partitionNumber;
  
  	/* must provide our own instrumentation support */
  	if (node->ps.instrument)
***************
*** 99,106 ****
  		if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
  								 &hashvalue))
  		{
! 			ExecHashTableInsert(hashtable, slot, hashvalue);
! 			hashtable->totalTuples += 1;
  		}
  	}
  
--- 128,163 ----
  		if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
  								 &hashvalue))
  		{
! 			partitionNumber = MCV_INVALID_PARTITION;
! 
! 			if (hashtable->usingMostCommonValues && isAMostCommonValue(hashtable, hashvalue, &partitionNumber))
! 			{
! 				HashJoinTuple hashTuple;
! 				int			hashTupleSize;
! 				
! 				mintuple = ExecFetchSlotMinimalTuple(slot);
! 				hashTupleSize = HJTUPLE_OVERHEAD + mintuple->t_len;
! 				hashTuple = (HashJoinTuple) palloc(hashTupleSize);
! 				hashTuple->hashvalue = hashvalue;
! 				memcpy(HJTUPLE_MINTUPLE(hashTuple), mintuple, mintuple->t_len);
! 
! 				hashTuple->next = hashtable->mostCommonTuplePartition[partitionNumber].tuples;
! 				hashtable->mostCommonTuplePartition[partitionNumber].tuples = hashTuple;
! 				
! 				hashtable->spaceUsed += hashTupleSize;
! 				
! 				if (hashtable->spaceUsed > hashtable->spaceAllowed) {
! 					ExecHashIncreaseNumBatches(hashtable);
! 				}
! 				
! 				hashtable->mostCommonTuplesStored++;
! 			}
! 
! 			if (partitionNumber == MCV_INVALID_PARTITION)
! 			{
! 				ExecHashTableInsert(hashtable, slot, hashvalue);
! 				hashtable->totalTuples += 1;
! 			}
  		}
  	}
  
***************
*** 798,803 ****
--- 855,921 ----
  }
  
  /*
+  * ExecScanHashMostCommonTuples
+  *		scan a hash bucket for matches to the current outer tuple
+  *
+  * The current outer tuple must be stored in econtext->ecxt_outertuple.
+  */
+ HashJoinTuple
+ ExecScanHashMostCommonTuples(HashJoinState *hjstate,
+ 				   ExprContext *econtext)
+ {
+ 	List	   *hjclauses = hjstate->hashclauses;
+ 	HashJoinTable hashtable = hjstate->hj_HashTable;
+ 	HashJoinTuple hashTuple = hjstate->hj_CurTuple;
+ 	uint32		hashvalue = hjstate->hj_CurHashValue;
+ 
+ 	/*
+ 	 * hj_CurTuple is NULL to start scanning a new bucket, or the address of
+ 	 * the last tuple returned from the current bucket.
+ 	 */
+ 	if (hashTuple == NULL)
+ 	{
+ 		//painstakingly make sure this is a valid partition index
+ 		Assert(hjstate->hj_OuterTupleMostCommonValuePartition > MCV_INVALID_PARTITION);
+ 		Assert(hjstate->hj_OuterTupleMostCommonValuePartition < hashtable->nMostCommonTuplePartitions);
+ 
+ 		hashTuple = hashtable->mostCommonTuplePartition[hjstate->hj_OuterTupleMostCommonValuePartition].tuples;
+ 	}
+ 	else
+ 		hashTuple = hashTuple->next;
+ 
+ 	while (hashTuple != NULL)
+ 	{
+ 		if (hashTuple->hashvalue == hashvalue)
+ 		{
+ 			TupleTableSlot *inntuple;
+ 
+ 			/* insert hashtable's tuple into exec slot so ExecQual sees it */
+ 			inntuple = ExecStoreMinimalTuple(HJTUPLE_MINTUPLE(hashTuple),
+ 											 hjstate->hj_HashTupleSlot,
+ 											 false);	/* do not pfree */
+ 			econtext->ecxt_innertuple = inntuple;
+ 
+ 			/* reset temp memory each time to avoid leaks from qual expr */
+ 			ResetExprContext(econtext);
+ 
+ 			if (ExecQual(hjclauses, econtext, false))
+ 			{
+ 				hjstate->hj_CurTuple = hashTuple;
+ 				return hashTuple;
+ 			}
+ 		}
+ 
+ 		hashTuple = hashTuple->next;
+ 	}
+ 
+ 	/*
+ 	 * no match
+ 	 */
+ 	return NULL;
+ }
+ 
+ /*
   * ExecScanHashBucket
   *		scan a hash bucket for matches to the current outer tuple
   *
Index: src/backend/executor/nodeHashjoin.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHashjoin.c,v
retrieving revision 1.95
diff -c -r1.95 nodeHashjoin.c
*** src/backend/executor/nodeHashjoin.c	15 Aug 2008 19:20:42 -0000	1.95
--- src/backend/executor/nodeHashjoin.c	18 Oct 2008 01:47:57 -0000
***************
*** 20,25 ****
--- 20,30 ----
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
  #include "utils/memutils.h"
+ #include "optimizer/cost.h"
+ #include "utils/syscache.h"
+ #include "utils/lsyscache.h"
+ #include "parser/parsetree.h"
+ #include "catalog/pg_statistic.h"
  
  
  /* Returns true for JOIN_LEFT and JOIN_ANTI jointypes */
***************
*** 34,39 ****
--- 39,146 ----
  						  TupleTableSlot *tupleSlot);
  static int	ExecHashJoinNewBatch(HashJoinState *hjstate);
  
+ /*
+ *          getMostCommonValues
+ *
+ *          
+ */
+ void getMostCommonValues(EState *estate, HashJoinState *hjstate)
+ {
+ 	HeapTupleData *statsTuple;
+ 	FuncExprState *clause;
+ 	ExprState *argstate;
+ 	Var *variable;
+ 
+ 	Datum	   *values;
+ 	int			nvalues;
+ 	float4	   *numbers;
+ 	int			nnumbers;
+ 
+ 	Oid relid;
+ 	AttrNumber relattnum;
+ 	Oid atttype;
+ 	int32 atttypmod;
+ 
+ 	int i;
+ 
+ 	//is it a join on more than one key?
+ 	if (hjstate->hashclauses->length != 1)
+ 		return; //histojoin is not defined for more than one join key so run away
+ 
+ 	//make sure the outer node is a seq scan on a base relation otherwise we cant get MCVs at the moment and should not bother trying
+ 	if (outerPlanState(hjstate)->type != T_SeqScanState)
+ 		return;
+ 	
+ 	//grab the relation object id of the outer relation
+ 	relid = getrelid(((SeqScan *) ((SeqScanState *) outerPlanState(hjstate))->ps.plan)->scanrelid, estate->es_range_table);
+ 	clause = (FuncExprState *) lfirst(list_head(hjstate->hashclauses));
+ 	argstate = (ExprState *) lfirst(list_head(clause->args));
+ 	variable = (Var *) argstate->expr;
+ 
+ 	//grab the necessary properties of the join variable
+ 	relattnum = variable->varattno;
+ 	atttype = variable->vartype;
+ 	atttypmod = variable->vartypmod;
+ 
+ 	statsTuple = SearchSysCache(STATRELATT,
+ 		ObjectIdGetDatum(relid),
+ 		Int16GetDatum(relattnum),
+ 		0, 0);
+ 
+ 	if (HeapTupleIsValid(statsTuple))
+ 	{
+ 		if (get_attstatsslot(statsTuple,
+ 			atttype, atttypmod,
+ 			STATISTIC_KIND_MCV, InvalidOid,
+ 			&values, &nvalues,
+ 			&numbers, &nnumbers))
+ 		{
+ 			HashJoinTable hashtable;
+ 			FmgrInfo   *hashfunctions;
+ 			//MCV Partitions is an open addressing hashtable with a power of 2 size greater than the number of MCV values
+ 			int nbuckets = 2;
+ 			uint32 collisionsWhileHashing = 0;
+ 			while (nbuckets <= nvalues)
+ 			{
+ 				nbuckets <<= 1;
+ 			}
+ 			//use two more bit just to help avoid collisions
+ 			nbuckets <<= 2;
+ 
+ 			hashtable = hjstate->hj_HashTable;
+ 			hashtable->usingMostCommonValues = true;
+ 			hashtable->nMostCommonTuplePartitionHashBuckets = nbuckets;
+ 			hashtable->mostCommonTuplePartition = palloc0(nbuckets * sizeof(HashJoinMostCommonValueTuplePartition));
+ 			hashfunctions = hashtable->outer_hashfunctions;
+ 
+ 			//create the partitions
+ 			for (i = 0; i < nvalues; i++)
+ 			{
+ 				uint32 hashvalue = DatumGetUInt32(FunctionCall1(&hashfunctions[0], values[i]));
+ 				int bucket = hashvalue & (nbuckets - 1);
+ 
+ 				while (hashtable->mostCommonTuplePartition[bucket].hashvalue != 0
+ 					&& hashtable->mostCommonTuplePartition[bucket].hashvalue != hashvalue)
+ 				{
+ 					bucket = (bucket + 1) & (nbuckets - 1);
+ 					collisionsWhileHashing++;
+ 				}
+ 
+ 				//leave partition alone if it has the same hashvalue as current MCV.  we only want one partition per hashvalue
+ 				if (hashtable->mostCommonTuplePartition[bucket].hashvalue != hashvalue)
+ 				{
+ 					hashtable->mostCommonTuplePartition[bucket].tuples = NULL;
+ 					hashtable->mostCommonTuplePartition[bucket].hashvalue = hashvalue;
+ 					hashtable->nMostCommonTuplePartitions++;
+ 				}
+ 			}
+ 
+ 			free_attstatsslot(atttype, values, nvalues, numbers, nnumbers);
+ 		}
+ 
+ 		ReleaseSysCache(statsTuple);
+ 	}
+ }
  
  /* ----------------------------------------------------------------
   *		ExecHashJoin
***************
*** 146,151 ****
--- 253,267 ----
  		hashtable = ExecHashTableCreate((Hash *) hashNode->ps.plan,
  										node->hj_HashOperators);
  		node->hj_HashTable = hashtable;
+ 		
+ 		hashtable->usingMostCommonValues = false;
+ 		hashtable->nMostCommonTuplePartitions = 0;
+ 		hashtable->nMostCommonTuplePartitionHashBuckets = 0;
+ 		hashtable->mostCommonTuplesStored = 0;
+ 		hashtable->mostCommonTuplePartition = NULL;
+ 
+ 		if (enable_hashjoin_usestatmcvs)
+ 			getMostCommonValues(estate, node);
  
  		/*
  		 * execute the Hash node, to build the hash table
***************
*** 157,163 ****
  		 * If the inner relation is completely empty, and we're not doing an
  		 * outer join, we can quit without scanning the outer relation.
  		 */
! 		if (hashtable->totalTuples == 0 && !HASHJOIN_IS_OUTER(node))
  			return NULL;
  
  		/*
--- 273,279 ----
  		 * If the inner relation is completely empty, and we're not doing an
  		 * outer join, we can quit without scanning the outer relation.
  		 */
! 		if (hashtable->totalTuples == 0 && hashtable->mostCommonTuplesStored == 0 && !HASHJOIN_IS_OUTER(node))
  			return NULL;
  
  		/*
***************
*** 206,228 ****
  			ExecHashGetBucketAndBatch(hashtable, hashvalue,
  									  &node->hj_CurBucketNo, &batchno);
  			node->hj_CurTuple = NULL;
! 
! 			/*
! 			 * Now we've got an outer tuple and the corresponding hash bucket,
! 			 * but this tuple may not belong to the current batch.
! 			 */
! 			if (batchno != hashtable->curbatch)
  			{
  				/*
! 				 * Need to postpone this outer tuple to a later batch. Save it
! 				 * in the corresponding outer-batch file.
  				 */
! 				Assert(batchno > hashtable->curbatch);
! 				ExecHashJoinSaveTuple(ExecFetchSlotMinimalTuple(outerTupleSlot),
! 									  hashvalue,
! 									  &hashtable->outerBatchFile[batchno]);
! 				node->hj_NeedNewOuter = true;
! 				continue;		/* loop around for a new outer tuple */
  			}
  		}
  
--- 322,350 ----
  			ExecHashGetBucketAndBatch(hashtable, hashvalue,
  									  &node->hj_CurBucketNo, &batchno);
  			node->hj_CurTuple = NULL;
! 			
! 			node->hj_OuterTupleMostCommonValuePartition = MCV_INVALID_PARTITION;
! 			
! 			
! 			if (!(hashtable->usingMostCommonValues && isAMostCommonValue(hashtable, hashvalue, &node->hj_OuterTupleMostCommonValuePartition)))
  			{
  				/*
! 				 * Now we've got an outer tuple and the corresponding hash bucket,
! 				 * but this tuple may not belong to the current batch.
  				 */
! 				if (batchno != hashtable->curbatch)
! 				{
! 					/*
! 					 * Need to postpone this outer tuple to a later batch. Save it
! 					 * in the corresponding outer-batch file.
! 					 */
! 					Assert(batchno > hashtable->curbatch);
! 					ExecHashJoinSaveTuple(ExecFetchSlotMinimalTuple(outerTupleSlot),
! 										  hashvalue,
! 										  &hashtable->outerBatchFile[batchno]);
! 					node->hj_NeedNewOuter = true;
! 					continue;		/* loop around for a new outer tuple */
! 				}
  			}
  		}
  
***************
*** 231,237 ****
  		 */
  		for (;;)
  		{
! 			curtuple = ExecScanHashBucket(node, econtext);
  			if (curtuple == NULL)
  				break;			/* out of matches */
  
--- 353,366 ----
  		 */
  		for (;;)
  		{
! 			if (node->hj_OuterTupleMostCommonValuePartition != MCV_INVALID_PARTITION)
! 			{
! 				curtuple = ExecScanHashMostCommonTuples(node, econtext);
! 			}
! 			else
! 			{
! 				curtuple = ExecScanHashBucket(node, econtext);
! 			}
  			if (curtuple == NULL)
  				break;			/* out of matches */
  
Index: src/backend/optimizer/path/costsize.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/optimizer/path/costsize.c,v
retrieving revision 1.199
diff -c -r1.199 costsize.c
*** src/backend/optimizer/path/costsize.c	17 Oct 2008 20:27:24 -0000	1.199
--- src/backend/optimizer/path/costsize.c	17 Oct 2008 23:07:05 -0000
***************
*** 108,113 ****
--- 108,115 ----
  bool		enable_mergejoin = true;
  bool		enable_hashjoin = true;
  
+ bool		enable_hashjoin_usestatmcvs = true;
+ 
  typedef struct
  {
  	PlannerInfo *root;
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.475
diff -c -r1.475 guc.c
*** src/backend/utils/misc/guc.c	6 Oct 2008 13:05:36 -0000	1.475
--- src/backend/utils/misc/guc.c	9 Oct 2008 19:56:17 -0000
***************
*** 625,630 ****
--- 625,638 ----
  		true, NULL, NULL
  	},
  	{
+ 		{"enable_hashjoin_usestatmcvs", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("Enables the hash join's use of the MCVs stored in pg_statistic."),
+ 			NULL
+ 		},
+ 		&enable_hashjoin_usestatmcvs,
+ 		true, NULL, NULL
+ 	},
+ 	{
  		{"constraint_exclusion", PGC_USERSET, QUERY_TUNING_OTHER,
  			gettext_noop("Enables the planner to use constraints to optimize queries."),
  			gettext_noop("Child table scans will be skipped if their "
Index: src/include/executor/hashjoin.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/hashjoin.h,v
retrieving revision 1.48
diff -c -r1.48 hashjoin.h
*** src/include/executor/hashjoin.h	1 Jan 2008 19:45:57 -0000	1.48
--- src/include/executor/hashjoin.h	17 Oct 2008 23:48:46 -0000
***************
*** 72,77 ****
--- 72,84 ----
  #define HJTUPLE_MINTUPLE(hjtup)  \
  	((MinimalTuple) ((char *) (hjtup) + HJTUPLE_OVERHEAD))
  
+ typedef struct HashJoinMostCommonValueTuplePartition
+ {
+ 	uint32 hashvalue;
+ 	HashJoinTuple tuples;
+ } HashJoinMostCommonValueTuplePartition;
+ 
+ #define MCV_INVALID_PARTITION -1
  
  typedef struct HashJoinTableData
  {
***************
*** 116,121 ****
--- 123,134 ----
  
  	MemoryContext hashCxt;		/* context for whole-hash-join storage */
  	MemoryContext batchCxt;		/* context for this-batch-only storage */
+ 	
+ 	bool usingMostCommonValues;
+ 	HashJoinMostCommonValueTuplePartition *mostCommonTuplePartition;
+ 	int nMostCommonTuplePartitionHashBuckets;
+ 	int nMostCommonTuplePartitions;
+ 	uint32 mostCommonTuplesStored;
  } HashJoinTableData;
  
  #endif   /* HASHJOIN_H */
Index: src/include/executor/nodeHash.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/nodeHash.h,v
retrieving revision 1.45
diff -c -r1.45 nodeHash.h
*** src/include/executor/nodeHash.h	1 Jan 2008 19:45:57 -0000	1.45
--- src/include/executor/nodeHash.h	30 Sep 2008 20:31:35 -0000
***************
*** 45,48 ****
--- 45,51 ----
  						int *numbuckets,
  						int *numbatches);
  
+ extern HashJoinTuple ExecScanHashMostCommonTuples(HashJoinState *hjstate, ExprContext *econtext);
+ extern bool isAMostCommonValue(HashJoinTable hashtable, uint32 hashvalue, int *partitionNumber);
+ 
  #endif   /* NODEHASH_H */
Index: src/include/executor/nodeHashjoin.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/nodeHashjoin.h,v
retrieving revision 1.37
diff -c -r1.37 nodeHashjoin.h
*** src/include/executor/nodeHashjoin.h	1 Jan 2008 19:45:57 -0000	1.37
--- src/include/executor/nodeHashjoin.h	30 Sep 2008 20:32:05 -0000
***************
*** 26,29 ****
--- 26,31 ----
  extern void ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
  					  BufFile **fileptr);
  
+ extern void getMostCommonValues(EState *estate, HashJoinState *hjstate);
+ 
  #endif   /* NODEHASHJOIN_H */
Index: src/include/nodes/execnodes.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/nodes/execnodes.h,v
retrieving revision 1.190
diff -c -r1.190 execnodes.h
*** src/include/nodes/execnodes.h	7 Oct 2008 19:27:04 -0000	1.190
--- src/include/nodes/execnodes.h	17 Oct 2008 23:07:14 -0000
***************
*** 1365,1370 ****
--- 1365,1371 ----
  	bool		hj_NeedNewOuter;
  	bool		hj_MatchedOuter;
  	bool		hj_OuterNotEmpty;
+ 	int		hj_OuterTupleMostCommonValuePartition;
  } HashJoinState;
  
  
Index: src/include/optimizer/cost.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/optimizer/cost.h,v
retrieving revision 1.93
diff -c -r1.93 cost.h
*** src/include/optimizer/cost.h	4 Oct 2008 21:56:55 -0000	1.93
--- src/include/optimizer/cost.h	7 Oct 2008 18:31:42 -0000
***************
*** 52,57 ****
--- 52,58 ----
  extern bool enable_nestloop;
  extern bool enable_mergejoin;
  extern bool enable_hashjoin;
+ extern bool enable_hashjoin_usestatmcvs;
  extern bool constraint_exclusion;
  
  extern double clamp_row_est(double nrows);
#2 Joshua Tolley
eggyknap@gmail.com
In reply to: Lawrence, Ramon (#1)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Mon, Oct 20, 2008 at 4:42 PM, Lawrence, Ramon <ramon.lawrence@ubc.ca> wrote:

We propose a patch that improves hybrid hash join's performance for large
multi-batch joins where the probe relation has skew.

Project name: Histojoin
Patch file: histojoin_v1.patch

This patch implements the Histojoin join algorithm as an optional feature
added to the standard Hybrid Hash Join (HHJ). A flag is used to enable or
disable the Histojoin features. When Histojoin is disabled, HHJ acts as
normal. The Histojoin features allow HHJ to use PostgreSQL's statistics to
do skew aware partitioning. The basic idea is to keep build relation tuples
in a small in-memory hash table that have join values that are frequently
occurring in the probe relation. This improves performance of HHJ when
multiple batches are used by 10% to 50% for skewed data sets. The
performance improvements of this patch can be seen in the paper (pages
25-30) at:

http://people.ok.ubc.ca/rlawrenc/histojoin2.pdf

All generators and materials needed to verify these results can be provided.

This is a patch against the HEAD of the repository.

This patch does not contain platform specific code. It compiles and has
been tested on our machines in both Windows (MSVC++) and Linux (GCC).

Currently the Histojoin feature is enabled by default and is used whenever
HHJ is used and there are Most Common Value (MCV) statistics available on
the probe side base relation of the join. To disable this feature simply
set the enable_hashjoin_usestatmcvs flag to off in the database
configuration file or at run time with the 'set' command.

One potential improvement not included in the patch is that Most Common
Value (MCV) statistics are only determined when the probe relation is
produced by a scan operator. There is a benefit to using MCVs even when the
probe relation is not a base scan, but we were unable to determine how to
find statistics from a base relation after other operators are performed.

This patch was created by Bryce Cutt as part of his work on his M.Sc.
thesis.

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of British
Columbia Okanagan
E-mail: ramon.lawrence@ubc.ca

I'm interested in trying to review this patch. Having not done patch
review before, I can't exactly promise grand results, but could you
provide me with the data needed to check your results? In the meantime
I'll go read the paper.

- Josh / eggyknap

#3 Lawrence, Ramon
ramon.lawrence@ubc.ca
In reply to: Joshua Tolley (#2)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

Joshua,

Thank you for offering to review the patch.

The easiest way to test would be to generate your own TPC-H data and
load it into a database for testing. I have posted the TPC-H generator
at:

http://people.ok.ubc.ca/rlawrenc/TPCHSkew.zip

The generator can produce skewed data sets. It was produced by
Microsoft Research.

After unzipping, on a Windows machine, you can just run the command:

dbgen -s 1 -z 1

This will produce a TPC-H database of scale 1 GB with a Zipfian skew of
z=1. More information on the generator is in the document README-S.DOC.
Source is provided for the generator, so you should be able to run it on
other operating systems as well.

The schema DDL is at:

http://people.ok.ubc.ca/rlawrenc/tpch_pg_ddl.txt

Note that the load time for 1G data is 1-2 hours and for 10G data is
about 24 hours. I recommend you do not add the foreign keys until after
the data is loaded.

The other alternative is to do a pg_dump of our data sets. However, the
download size would be quite large, and it would take a couple of days
for us to get you the data in that form.

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon.lawrence@ubc.ca


#4 Joshua Tolley
eggyknap@gmail.com
In reply to: Lawrence, Ramon (#3)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Sun, Nov 2, 2008 at 4:48 PM, Lawrence, Ramon <ramon.lawrence@ubc.ca> wrote:

Joshua,

Thank you for offering to review the patch.

The easiest way to test would be to generate your own TPC-H data and
load it into a database for testing. I have posted the TPC-H generator
at:

http://people.ok.ubc.ca/rlawrenc/TPCHSkew.zip

The generator can produce skewed data sets. It was produced by
Microsoft Research.

After unzipping, on a Windows machine, you can just run the command:

dbgen -s 1 -z 1

This will produce a TPC-H database of scale 1 GB with a Zipfian skew of
z=1. More information on the generator is in the document README-S.DOC.
Source is provided for the generator, so you should be able to run it on
other operating systems as well.

The schema DDL is at:

http://people.ok.ubc.ca/rlawrenc/tpch_pg_ddl.txt

Note that the load time for 1G data is 1-2 hours and for 10G data is
about 24 hours. I recommend you do not add the foreign keys until after
the data is loaded.

The other alternative is to do a pgdump on our data sets. However, the
download size would be quite large, and it will take a couple of days
for us to get you the data in that form.

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon.lawrence@ubc.ca

I'll try out the TPC-H generator first :) Thanks.

- Josh

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Lawrence, Ramon (#3)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

"Lawrence, Ramon" <ramon.lawrence@ubc.ca> writes:

The easiest way to test would be to generate your own TPC-H data and
load it into a database for testing. I have posted the TPC-H generator
at:
http://people.ok.ubc.ca/rlawrenc/TPCHSkew.zip
The generator can produce skewed data sets. It was produced by
Microsoft Research.

What alternatives are there for people who do not run Windows?

regards, tom lane

#6 Lawrence, Ramon
ramon.lawrence@ubc.ca
In reply to: Tom Lane (#5)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
What alternatives are there for people who do not run Windows?

regards, tom lane

The TPC-H generator is a standard code base provided at
http://www.tpc.org/tpch/. We have been able to compile this code on
Linux.

However, we were unable to get the Microsoft modifications to this code
to compile on Linux (although they are supposed to be portable). So, we
just used the Windows version with wine on our test Debian machine.

I have also posted the text files for the TPC-H 1G 1Z data set at:

http://people.ok.ubc.ca/rlawrenc/tpch1g1z.zip

Note that you need to trim the extra characters at the end of the lines
for PostgreSQL to read them properly.

Since the data takes a while to generate and load, we can also provide a
compressed version of the PostgreSQL data directory of the databases
with the data already loaded.

--
Ramon Lawrence

#7 Joshua Tolley
eggyknap@gmail.com
In reply to: Lawrence, Ramon (#1)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Mon, Oct 20, 2008 at 03:42:49PM -0700, Lawrence, Ramon wrote:

We propose a patch that improves hybrid hash join's performance for large
multi-batch joins where the probe relation has skew.

I'm running into problems with this patch. It applies cleanly, and the
technique you provided for generating sample data works just fine
(though I admit I haven't verified that the expected skew exists in the
data). But the server crashes when I try to load the data. The backtrace
is below, labeled "Backtrace 1"; since it happens in
ExecScanHashMostCommonTuples, I figure it's because of the patch and not
something else odd (unless perhaps my hardware is flakey -- I'll try it
on other hardware as soon as I can, to verify). Note that I'm running
this on Ubuntu 8.10, 32-bit x86, running a kernel Ubuntu labels as
"2.6.27-7-generic #1 SMP". The statement in execution at the time was
"ALTER TABLE SUPPLIER ADD CONSTRAINT SUPPLIER_FK1 FOREIGN KEY
(S_NATIONKEY) references NATION (N_NATIONKEY);"

Further, when I go back into the database in psql, simply issuing a "\d"
command crashes the backend with a similar backtrace, labeled Backtrace
2, below. The query underlying \d and its EXPLAIN output are also
included, just for kicks.

- Josh

*****************************************
BACKTRACE 1
****************************************
Core was generated by `postgres: jtolley jtolley [local] ALTE'.
Program terminated with signal 6, Aborted.
[New process 20407]
#0 0xb80b0430 in __kernel_vsyscall ()
(gdb) bt
#0 0xb80b0430 in __kernel_vsyscall ()
#1 0xb7f22880 in raise () from /lib/tls/i686/cmov/libc.so.6
#2 0xb7f24248 in abort () from /lib/tls/i686/cmov/libc.so.6
#3 0x0831540e in ExceptionalCondition (
conditionName=0x8433274
"!(hjstate->hj_OuterTupleMostCommonValuePartition <
hashtable->nMostCommonTuplePartitions)",
errorType=0x834b66d "FailedAssertion", fileName=0x84331d9
"nodeHash.c", lineNumber=880) at assert.c:57
#4 0x081b457b in ExecScanHashMostCommonTuples (hjstate=0x8720a6c,
econtext=0x8720af8) at nodeHash.c:880
#5 0x081b60de in ExecHashJoin (node=0x8720a6c) at nodeHashjoin.c:357
#6 0x081a4748 in ExecProcNode (node=0x8720a6c) at execProcnode.c:406
#7 0x081a242b in standard_ExecutorRun (queryDesc=0x870957c,
direction=ForwardScanDirection, count=1) at execMain.c:1343
#8 0x081c2036 in _SPI_execute_plan (plan=0x87181bc, paramLI=0x0,
snapshot=0x8485300, crosscheck_snapshot=0x0, read_only=1 '\001',
fire_triggers=0 '\0', tcount=1) at spi.c:1976
#9 0x081c2350 in SPI_execute_snapshot (plan=0x87181bc, Values=0x0,
Nulls=0x0, snapshot=0x8485300, crosscheck_snapshot=0x0,
read_only=<value optimized out>, fire_triggers=<value optimized
out>, tcount=1) at spi.c:408
#10 0x082e1921 in RI_Initial_Check (trigger=0xbfeb0afc,
fk_rel=0xb5a21938, pk_rel=0xb5a20754) at ri_triggers.c:2763
#11 0x08178613 in ATRewriteTables (wqueue=0xbfeb0d88) at
tablecmds.c:5026
#12 0x0817ef36 in ATController (rel=0xb5a21938, cmds=<value optimized
out>, recurse=<value optimized out>) at tablecmds.c:2294
#13 0x08261dd5 in ProcessUtility (parsetree=0x86ca17c,
queryString=0x86c96ec "ALTER TABLE SUPPLIER\nADD CONSTRAINT
SUPPLIER_FK1 FOREIGN KEY (S_NATIONKEY) references NATION
(N_NATIONKEY);",
params=0x0, isTopLevel=1 '\001', dest=0x86ca2b4,
completionTag=0xbfeb0fc8 "") at utility.c:569
#14 0x0825e2ae in PortalRunUtility (portal=0x86fadfc,
utilityStmt=0x86ca17c, isTopLevel=<value optimized out>, dest=0x86ca2b4,
completionTag=0xbfeb0fc8 "") at pquery.c:1176
#15 0x0825f2c0 in PortalRunMulti (portal=0x86fadfc, isTopLevel=<value
optimized out>, dest=0x86ca2b4, altdest=0x86ca2b4,
completionTag=0xbfeb0fc8 "") at pquery.c:1281
#16 0x0825fb54 in PortalRun (portal=0x86fadfc, count=2147483647,
isTopLevel=6 '\006', dest=0x86ca2b4, altdest=0x86ca2b4,
completionTag=0xbfeb0fc8 "") at pquery.c:812
#17 0x0825a757 in exec_simple_query (
query_string=0x86c96ec "ALTER TABLE SUPPLIER\nADD CONSTRAINT
SUPPLIER_FK1 FOREIGN KEY (S_NATIONKEY) references NATION
(N_NATIONKEY);")
at postgres.c:992
#18 0x0825bfff in PostgresMain (argc=4, argv=0x8667b08,
username=0x8667ae0 "jtolley") at postgres.c:3569
#19 0x082261cf in ServerLoop () at postmaster.c:3258
#20 0x08227190 in PostmasterMain (argc=1, argv=0x8664250) at
postmaster.c:1031
#21 0x081cc126 in main (argc=1, argv=0x8664250) at main.c:188
(gdb)

*****************************************
BACKTRACE 2
****************************************
Core was generated by `postgres: jtolley jtolley [local] SELE'.
Program terminated with signal 6, Aborted.
[New process 20967]
#0 0xb80b0430 in __kernel_vsyscall ()
(gdb) bt
#0 0xb80b0430 in __kernel_vsyscall ()
#1 0xb7f22880 in raise () from /lib/tls/i686/cmov/libc.so.6
#2 0xb7f24248 in abort () from /lib/tls/i686/cmov/libc.so.6
#3 0x0831540e in ExceptionalCondition (
conditionName=0x8433274
"!(hjstate->hj_OuterTupleMostCommonValuePartition <
hashtable->nMostCommonTuplePartitions)",
errorType=0x834b66d "FailedAssertion", fileName=0x84331d9
"nodeHash.c", lineNumber=880) at assert.c:57
#4 0x081b457b in ExecScanHashMostCommonTuples (hjstate=0x86fb320,
econtext=0x86fb3ac) at nodeHash.c:880
#5 0x081b60de in ExecHashJoin (node=0x86fb320) at nodeHashjoin.c:357
#6 0x081a4748 in ExecProcNode (node=0x86fb320) at execProcnode.c:406
#7 0x081bb2a1 in ExecSort (node=0x86fb294) at nodeSort.c:102
#8 0x081a4718 in ExecProcNode (node=0x86fb294) at execProcnode.c:417
#9 0x081a242b in standard_ExecutorRun (queryDesc=0x8706e1c,
direction=ForwardScanDirection, count=0) at execMain.c:1343
#10 0x0825e64c in PortalRunSelect (portal=0x8700e0c, forward=1 '\001',
count=0, dest=0x871db14) at pquery.c:942
#11 0x0825f9ae in PortalRun (portal=0x8700e0c, count=2147483647,
isTopLevel=1 '\001', dest=0x871db14, altdest=0x871db14,
completionTag=0xbfeb0fc8 "") at pquery.c:796
#12 0x0825a757 in exec_simple_query (
query_string=0x86cb6f4 "SELECT n.nspname as \"Schema\",\n c.relname
as \"Name\",\n CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN
'view' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 's' THEN
'special' END as \"Type\",\n "...) at postgres.c:992
#13 0x0825bfff in PostgresMain (argc=4, argv=0x8667f58,
username=0x8667f30 "jtolley") at postgres.c:3569
#14 0x082261cf in ServerLoop () at postmaster.c:3258
#15 0x08227190 in PostmasterMain (argc=1, argv=0x8664250) at
postmaster.c:1031
#16 0x081cc126 in main (argc=1, argv=0x8664250) at main.c:188

*****************************************
\d EXPLAIN output
****************************************
jtolley=# explain SELECT n.nspname as "Schema",
jtolley-# c.relname as "Name",
jtolley-# CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view'
WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 's' THEN 'special'
END as "Type",
jtolley-# pg_catalog.pg_get_userbyid(c.relowner) as "Owner"
jtolley-# FROM pg_catalog.pg_class c
jtolley-# LEFT JOIN pg_catalog.pg_namespace n ON n.oid =
c.relnamespace
jtolley-# WHERE c.relkind IN ('r','v','S','')
jtolley-# AND n.nspname <> 'pg_catalog'
jtolley-# AND n.nspname !~ '^pg_toast'
jtolley-# AND pg_catalog.pg_table_is_visible(c.oid)
jtolley-# ORDER BY 1,2;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Sort (cost=13.02..13.10 rows=35 width=133)
Sort Key: n.nspname, c.relname
-> Hash Join (cost=1.14..12.12 rows=35 width=133)
Hash Cond: (c.relnamespace = n.oid)
-> Seq Scan on pg_class c (cost=0.00..9.97 rows=35 width=73)
Filter: (pg_table_is_visible(oid) AND (relkind = ANY
('{r,v,S,""}'::"char"[])))
-> Hash (cost=1.09..1.09 rows=4 width=68)
-> Seq Scan on pg_namespace n (cost=0.00..1.09 rows=4
width=68)
Filter: ((nspname <> 'pg_catalog'::name) AND
(nspname !~ '^pg_toast'::text))
(9 rows)

#8 Joshua Tolley
eggyknap@gmail.com
In reply to: Lawrence, Ramon (#1)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Mon, Oct 20, 2008 at 03:42:49PM -0700, Lawrence, Ramon wrote:

We propose a patch that improves hybrid hash join's performance for large
multi-batch joins where the probe relation has skew.

I also recommend modifying docs/src/sgml/config.sgml to include the
enable_hashjoin_usestatmcvs option.

- Josh / eggyknap

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joshua Tolley (#8)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

Joshua Tolley <eggyknap@gmail.com> writes:

On Mon, Oct 20, 2008 at 03:42:49PM -0700, Lawrence, Ramon wrote:

We propose a patch that improves hybrid hash join's performance for large
multi-batch joins where the probe relation has skew.

I also recommend modifying docs/src/sgml/config.sgml to include the
enable_hashjoin_usestatmcvs option.

If the patch is actually a win, why would we bother with such a GUC
at all?

regards, tom lane

#10 Joshua Tolley
eggyknap@gmail.com
In reply to: Tom Lane (#9)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets


On Wed, Nov 5, 2008 at 8:20 AM, Tom Lane wrote:

Joshua Tolley writes:

On Mon, Oct 20, 2008 at 03:42:49PM -0700, Lawrence, Ramon wrote:

We propose a patch that improves hybrid hash join's performance for large
multi-batch joins where the probe relation has skew.

I also recommend modifying docs/src/sgml/config.sgml to include the
enable_hashjoin_usestatmcvs option.

If the patch is actually a win, why would we bother with such a GUC
at all?

regards, tom lane

Good point. Leaving it in place for patch review purposes is useful,
but we can probably lose it in the end.

- Josh / eggyknap

#11 Bryce Cutt
pandasuit@gmail.com
In reply to: Joshua Tolley (#10)
1 attachment(s)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

The error is caused by me asserting against the wrong variable. I
never noticed this because I apparently did not have assertions turned
on on my development machine. That is fixed now, and with the new patch
version I have attached, all assertions pass with your query and
my test queries. I added another assertion to that section of the
code so that it is a bit more rigorous in confirming the hash table
partition is correct. It does not change the operation of the code.

There are two partition counts. One holds the maximum number of
buckets in the hash table and the other counts the number of actual
buckets created for hash values. I was incorrectly testing against
the second one because that was valid before I started using a hash
table to store the buckets.
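
For clarity, the two fields involved and the corrected checks look like
this (paraphrased from the attached patch, with comments added here; the
authoritative code is in histojoin_v2.patch):

	int nMostCommonTuplePartitionHashBuckets;	/* allocated size of the open-addressing
												 * array; every valid partition index is
												 * strictly less than this */
	int nMostCommonTuplePartitions;				/* number of occupied slots (distinct MCV
												 * hash values); slot indexes are not
												 * bounded by this, so it is the wrong
												 * limit to assert against */

	/* v2 bounds the index by the allocated size and checks the slot is in use: */
	Assert(hjstate->hj_OuterTupleMostCommonValuePartition > MCV_INVALID_PARTITION);
	Assert(hjstate->hj_OuterTupleMostCommonValuePartition < hashtable->nMostCommonTuplePartitionHashBuckets);
	Assert(hashtable->mostCommonTuplePartition[hjstate->hj_OuterTupleMostCommonValuePartition].hashvalue != 0);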

The enable_hashjoin_usestatmcvs flag was valuable for my own research
and tests and likely useful for your review but Tom is correct that it
can be removed in the final version.

- Bryce Cutt


On Wed, Nov 5, 2008 at 7:22 AM, Joshua Tolley <eggyknap@gmail.com> wrote:


On Wed, Nov 5, 2008 at 8:20 AM, Tom Lane wrote:

Joshua Tolley writes:

On Mon, Oct 20, 2008 at 03:42:49PM -0700, Lawrence, Ramon wrote:

We propose a patch that improves hybrid hash join's performance for large
multi-batch joins where the probe relation has skew.

I also recommend modifying docs/src/sgml/config.sgml to include the
enable_hashjoin_usestatmcvs option.

If the patch is actually a win, why would we bother with such a GUC
at all?

regards, tom lane

Good point. Leaving it in place for patch review purposes is useful,
but we can probably lose it in the end.

- Josh / eggyknap

Attachments:

histojoin_v2.patch (application/octet-stream)
Index: src/backend/executor/nodeHash.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHash.c,v
retrieving revision 1.116
diff -c -r1.116 nodeHash.c
*** src/backend/executor/nodeHash.c	1 Jan 2008 19:45:49 -0000	1.116
--- src/backend/executor/nodeHash.c	5 Nov 2008 22:26:53 -0000
***************
*** 54,59 ****
--- 54,86 ----
  }
  
  /* ----------------------------------------------------------------
+ *		isAMostCommonValue
+ *
+ *		is the value one of the most common key values?
+ *  ----------------------------------------------------------------
+ */
+ bool isAMostCommonValue(HashJoinTable hashtable, uint32 hashvalue, int *partitionNumber)
+ {
+ 	int bucket = hashvalue & (hashtable->nMostCommonTuplePartitionHashBuckets - 1);
+ 
+ 	while (hashtable->mostCommonTuplePartition[bucket].hashvalue != 0
+ 		&& hashtable->mostCommonTuplePartition[bucket].hashvalue != hashvalue)
+ 	{
+ 		bucket = (bucket + 1) & (hashtable->nMostCommonTuplePartitionHashBuckets - 1);
+ 	}
+ 
+ 	if (hashtable->mostCommonTuplePartition[bucket].hashvalue == hashvalue)
+ 	{
+ 		*partitionNumber = bucket;
+ 		return true;
+ 	}
+ 
+ 	//must have run into an empty slot which means this is not an MCV
+ 	*partitionNumber = MCV_INVALID_PARTITION;
+ 	return false;
+ }
+ 
+ /* ----------------------------------------------------------------
   *		MultiExecHash
   *
   *		build hash table for hashjoin, doing partitioning if more
***************
*** 69,74 ****
--- 96,103 ----
  	TupleTableSlot *slot;
  	ExprContext *econtext;
  	uint32		hashvalue;
+ 	MinimalTuple mintuple;
+ 	int partitionNumber;
  
  	/* must provide our own instrumentation support */
  	if (node->ps.instrument)
***************
*** 99,106 ****
  		if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
  								 &hashvalue))
  		{
! 			ExecHashTableInsert(hashtable, slot, hashvalue);
! 			hashtable->totalTuples += 1;
  		}
  	}
  
--- 128,163 ----
  		if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
  								 &hashvalue))
  		{
! 			partitionNumber = MCV_INVALID_PARTITION;
! 
! 			if (hashtable->usingMostCommonValues && isAMostCommonValue(hashtable, hashvalue, &partitionNumber))
! 			{
! 				HashJoinTuple hashTuple;
! 				int			hashTupleSize;
! 				
! 				mintuple = ExecFetchSlotMinimalTuple(slot);
! 				hashTupleSize = HJTUPLE_OVERHEAD + mintuple->t_len;
! 				hashTuple = (HashJoinTuple) palloc(hashTupleSize);
! 				hashTuple->hashvalue = hashvalue;
! 				memcpy(HJTUPLE_MINTUPLE(hashTuple), mintuple, mintuple->t_len);
! 
! 				hashTuple->next = hashtable->mostCommonTuplePartition[partitionNumber].tuples;
! 				hashtable->mostCommonTuplePartition[partitionNumber].tuples = hashTuple;
! 				
! 				hashtable->spaceUsed += hashTupleSize;
! 				
! 				if (hashtable->spaceUsed > hashtable->spaceAllowed) {
! 					ExecHashIncreaseNumBatches(hashtable);
! 				}
! 				
! 				hashtable->mostCommonTuplesStored++;
! 			}
! 
! 			if (partitionNumber == MCV_INVALID_PARTITION)
! 			{
! 				ExecHashTableInsert(hashtable, slot, hashvalue);
! 				hashtable->totalTuples += 1;
! 			}
  		}
  	}
  
***************
*** 798,803 ****
--- 855,922 ----
  }
  
  /*
+  * ExecScanHashMostCommonTuples
+  *		scan a hash bucket for matches to the current outer tuple
+  *
+  * The current outer tuple must be stored in econtext->ecxt_outertuple.
+  */
+ HashJoinTuple
+ ExecScanHashMostCommonTuples(HashJoinState *hjstate,
+ 				   ExprContext *econtext)
+ {
+ 	List	   *hjclauses = hjstate->hashclauses;
+ 	HashJoinTable hashtable = hjstate->hj_HashTable;
+ 	HashJoinTuple hashTuple = hjstate->hj_CurTuple;
+ 	uint32		hashvalue = hjstate->hj_CurHashValue;
+ 
+ 	/*
+ 	 * hj_CurTuple is NULL to start scanning a new bucket, or the address of
+ 	 * the last tuple returned from the current bucket.
+ 	 */
+ 	if (hashTuple == NULL)
+ 	{
+ 		//painstakingly make sure this is a valid partition index
+ 		Assert(hjstate->hj_OuterTupleMostCommonValuePartition > MCV_INVALID_PARTITION);
+ 		Assert(hjstate->hj_OuterTupleMostCommonValuePartition < hashtable->nMostCommonTuplePartitionHashBuckets);
+ 		Assert(hashtable->mostCommonTuplePartition[hjstate->hj_OuterTupleMostCommonValuePartition].hashvalue != 0);
+ 
+ 		hashTuple = hashtable->mostCommonTuplePartition[hjstate->hj_OuterTupleMostCommonValuePartition].tuples;
+ 	}
+ 	else
+ 		hashTuple = hashTuple->next;
+ 
+ 	while (hashTuple != NULL)
+ 	{
+ 		if (hashTuple->hashvalue == hashvalue)
+ 		{
+ 			TupleTableSlot *inntuple;
+ 
+ 			/* insert hashtable's tuple into exec slot so ExecQual sees it */
+ 			inntuple = ExecStoreMinimalTuple(HJTUPLE_MINTUPLE(hashTuple),
+ 											 hjstate->hj_HashTupleSlot,
+ 											 false);	/* do not pfree */
+ 			econtext->ecxt_innertuple = inntuple;
+ 
+ 			/* reset temp memory each time to avoid leaks from qual expr */
+ 			ResetExprContext(econtext);
+ 
+ 			if (ExecQual(hjclauses, econtext, false))
+ 			{
+ 				hjstate->hj_CurTuple = hashTuple;
+ 				return hashTuple;
+ 			}
+ 		}
+ 
+ 		hashTuple = hashTuple->next;
+ 	}
+ 
+ 	/*
+ 	 * no match
+ 	 */
+ 	return NULL;
+ }
+ 
+ /*
   * ExecScanHashBucket
   *		scan a hash bucket for matches to the current outer tuple
   *
Index: src/backend/executor/nodeHashjoin.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHashjoin.c,v
retrieving revision 1.96
diff -c -r1.96 nodeHashjoin.c
*** src/backend/executor/nodeHashjoin.c	23 Oct 2008 14:34:34 -0000	1.96
--- src/backend/executor/nodeHashjoin.c	5 Nov 2008 22:56:59 -0000
***************
*** 20,25 ****
--- 20,30 ----
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
  #include "utils/memutils.h"
+ #include "optimizer/cost.h"
+ #include "utils/syscache.h"
+ #include "utils/lsyscache.h"
+ #include "parser/parsetree.h"
+ #include "catalog/pg_statistic.h"
  
  
  /* Returns true for JOIN_LEFT and JOIN_ANTI jointypes */
***************
*** 34,39 ****
--- 39,146 ----
  						  TupleTableSlot *tupleSlot);
  static int	ExecHashJoinNewBatch(HashJoinState *hjstate);
  
+ /*
+ *          getMostCommonValues
+ *
+ *          
+ */
+ void getMostCommonValues(EState *estate, HashJoinState *hjstate)
+ {
+ 	HeapTupleData *statsTuple;
+ 	FuncExprState *clause;
+ 	ExprState *argstate;
+ 	Var *variable;
+ 
+ 	Datum	   *values;
+ 	int			nvalues;
+ 	float4	   *numbers;
+ 	int			nnumbers;
+ 
+ 	Oid relid;
+ 	AttrNumber relattnum;
+ 	Oid atttype;
+ 	int32 atttypmod;
+ 
+ 	int i;
+ 
+ 	//is it a join on more than one key?
+ 	if (hjstate->hashclauses->length != 1)
+ 		return; //histojoin is not defined for more than one join key so run away
+ 
+ 	//make sure the outer node is a seq scan on a base relation otherwise we cant get MCVs at the moment and should not bother trying
+ 	if (outerPlanState(hjstate)->type != T_SeqScanState)
+ 		return;
+ 	
+ 	//grab the relation object id of the outer relation
+ 	relid = getrelid(((SeqScan *) ((SeqScanState *) outerPlanState(hjstate))->ps.plan)->scanrelid, estate->es_range_table);
+ 	clause = (FuncExprState *) lfirst(list_head(hjstate->hashclauses));
+ 	argstate = (ExprState *) lfirst(list_head(clause->args));
+ 	variable = (Var *) argstate->expr;
+ 
+ 	//grab the necessary properties of the join variable
+ 	relattnum = variable->varattno;
+ 	atttype = variable->vartype;
+ 	atttypmod = variable->vartypmod;
+ 
+ 	statsTuple = SearchSysCache(STATRELATT,
+ 		ObjectIdGetDatum(relid),
+ 		Int16GetDatum(relattnum),
+ 		0, 0);
+ 
+ 	if (HeapTupleIsValid(statsTuple))
+ 	{
+ 		if (get_attstatsslot(statsTuple,
+ 			atttype, atttypmod,
+ 			STATISTIC_KIND_MCV, InvalidOid,
+ 			&values, &nvalues,
+ 			&numbers, &nnumbers))
+ 		{
+ 			HashJoinTable hashtable;
+ 			FmgrInfo   *hashfunctions;
+ 			//MCV Partitions is an open addressing hashtable with a power of 2 size greater than the number of MCV values
+ 			int nbuckets = 2;
+ 			uint32 collisionsWhileHashing = 0;
+ 			while (nbuckets <= nvalues)
+ 			{
+ 				nbuckets <<= 1;
+ 			}
+ 			//use two more bit just to help avoid collisions
+ 			nbuckets <<= 2;
+ 
+ 			hashtable = hjstate->hj_HashTable;
+ 			hashtable->usingMostCommonValues = true;
+ 			hashtable->nMostCommonTuplePartitionHashBuckets = nbuckets;
+ 			hashtable->mostCommonTuplePartition = palloc0(nbuckets * sizeof(HashJoinMostCommonValueTuplePartition));
+ 			hashfunctions = hashtable->outer_hashfunctions;
+ 
+ 			//create the partitions
+ 			for (i = 0; i < nvalues; i++)
+ 			{
+ 				uint32 hashvalue = DatumGetUInt32(FunctionCall1(&hashfunctions[0], values[i]));
+ 				int bucket = hashvalue & (nbuckets - 1);
+ 
+ 				while (hashtable->mostCommonTuplePartition[bucket].hashvalue != 0
+ 					&& hashtable->mostCommonTuplePartition[bucket].hashvalue != hashvalue)
+ 				{
+ 					bucket = (bucket + 1) & (nbuckets - 1);
+ 					collisionsWhileHashing++;
+ 				}
+ 
+ 				//leave partition alone if it has the same hashvalue as current MCV.  we only want one partition per hashvalue
+ 				if (hashtable->mostCommonTuplePartition[bucket].hashvalue != hashvalue)
+ 				{
+ 					hashtable->mostCommonTuplePartition[bucket].tuples = NULL;
+ 					hashtable->mostCommonTuplePartition[bucket].hashvalue = hashvalue;
+ 					hashtable->nMostCommonTuplePartitions++;
+ 				}
+ 			}
+ 
+ 			free_attstatsslot(atttype, values, nvalues, numbers, nnumbers);
+ 		}
+ 
+ 		ReleaseSysCache(statsTuple);
+ 	}
+ }
  
  /* ----------------------------------------------------------------
   *		ExecHashJoin
***************
*** 146,151 ****
--- 253,267 ----
  		hashtable = ExecHashTableCreate((Hash *) hashNode->ps.plan,
  										node->hj_HashOperators);
  		node->hj_HashTable = hashtable;
+ 		
+ 		hashtable->usingMostCommonValues = false;
+ 		hashtable->nMostCommonTuplePartitions = 0;
+ 		hashtable->nMostCommonTuplePartitionHashBuckets = 0;
+ 		hashtable->mostCommonTuplesStored = 0;
+ 		hashtable->mostCommonTuplePartition = NULL;
+ 
+ 		if (enable_hashjoin_usestatmcvs)
+ 			getMostCommonValues(estate, node);
  
  		/*
  		 * execute the Hash node, to build the hash table
***************
*** 157,163 ****
  		 * If the inner relation is completely empty, and we're not doing an
  		 * outer join, we can quit without scanning the outer relation.
  		 */
! 		if (hashtable->totalTuples == 0 && !HASHJOIN_IS_OUTER(node))
  			return NULL;
  
  		/*
--- 273,279 ----
  		 * If the inner relation is completely empty, and we're not doing an
  		 * outer join, we can quit without scanning the outer relation.
  		 */
! 		if (hashtable->totalTuples == 0 && hashtable->mostCommonTuplesStored == 0 && !HASHJOIN_IS_OUTER(node))
  			return NULL;
  
  		/*
***************
*** 205,227 ****
  			ExecHashGetBucketAndBatch(hashtable, hashvalue,
  									  &node->hj_CurBucketNo, &batchno);
  			node->hj_CurTuple = NULL;
! 
! 			/*
! 			 * Now we've got an outer tuple and the corresponding hash bucket,
! 			 * but this tuple may not belong to the current batch.
! 			 */
! 			if (batchno != hashtable->curbatch)
  			{
  				/*
! 				 * Need to postpone this outer tuple to a later batch. Save it
! 				 * in the corresponding outer-batch file.
  				 */
! 				Assert(batchno > hashtable->curbatch);
! 				ExecHashJoinSaveTuple(ExecFetchSlotMinimalTuple(outerTupleSlot),
! 									  hashvalue,
! 									  &hashtable->outerBatchFile[batchno]);
! 				node->hj_NeedNewOuter = true;
! 				continue;		/* loop around for a new outer tuple */
  			}
  		}
  
--- 321,349 ----
  			ExecHashGetBucketAndBatch(hashtable, hashvalue,
  									  &node->hj_CurBucketNo, &batchno);
  			node->hj_CurTuple = NULL;
! 			
! 			node->hj_OuterTupleMostCommonValuePartition = MCV_INVALID_PARTITION;
! 			
! 			
! 			if (!(hashtable->usingMostCommonValues && isAMostCommonValue(hashtable, hashvalue, &node->hj_OuterTupleMostCommonValuePartition)))
  			{
  				/*
! 				 * Now we've got an outer tuple and the corresponding hash bucket,
! 				 * but this tuple may not belong to the current batch.
  				 */
! 				if (batchno != hashtable->curbatch)
! 				{
! 					/*
! 					 * Need to postpone this outer tuple to a later batch. Save it
! 					 * in the corresponding outer-batch file.
! 					 */
! 					Assert(batchno > hashtable->curbatch);
! 					ExecHashJoinSaveTuple(ExecFetchSlotMinimalTuple(outerTupleSlot),
! 										  hashvalue,
! 										  &hashtable->outerBatchFile[batchno]);
! 					node->hj_NeedNewOuter = true;
! 					continue;		/* loop around for a new outer tuple */
! 				}
  			}
  		}
  
***************
*** 230,236 ****
  		 */
  		for (;;)
  		{
! 			curtuple = ExecScanHashBucket(node, econtext);
  			if (curtuple == NULL)
  				break;			/* out of matches */
  
--- 352,365 ----
  		 */
  		for (;;)
  		{
! 			if (node->hj_OuterTupleMostCommonValuePartition != MCV_INVALID_PARTITION)
! 			{
! 				curtuple = ExecScanHashMostCommonTuples(node, econtext);
! 			}
! 			else
! 			{
! 				curtuple = ExecScanHashBucket(node, econtext);
! 			}
  			if (curtuple == NULL)
  				break;			/* out of matches */
  
Index: src/backend/optimizer/path/costsize.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/optimizer/path/costsize.c,v
retrieving revision 1.200
diff -c -r1.200 costsize.c
*** src/backend/optimizer/path/costsize.c	21 Oct 2008 20:42:52 -0000	1.200
--- src/backend/optimizer/path/costsize.c	5 Nov 2008 22:57:01 -0000
***************
*** 109,114 ****
--- 109,116 ----
  bool		enable_mergejoin = true;
  bool		enable_hashjoin = true;
  
+ bool		enable_hashjoin_usestatmcvs = true;
+ 
  typedef struct
  {
  	PlannerInfo *root;
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.475
diff -c -r1.475 guc.c
*** src/backend/utils/misc/guc.c	6 Oct 2008 13:05:36 -0000	1.475
--- src/backend/utils/misc/guc.c	9 Oct 2008 19:56:17 -0000
***************
*** 625,630 ****
--- 625,638 ----
  		true, NULL, NULL
  	},
  	{
+ 		{"enable_hashjoin_usestatmcvs", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("Enables the hash join's use of the MCVs stored in pg_statistic."),
+ 			NULL
+ 		},
+ 		&enable_hashjoin_usestatmcvs,
+ 		true, NULL, NULL
+ 	},
+ 	{
  		{"constraint_exclusion", PGC_USERSET, QUERY_TUNING_OTHER,
  			gettext_noop("Enables the planner to use constraints to optimize queries."),
  			gettext_noop("Child table scans will be skipped if their "
Index: src/include/executor/hashjoin.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/hashjoin.h,v
retrieving revision 1.48
diff -c -r1.48 hashjoin.h
*** src/include/executor/hashjoin.h	1 Jan 2008 19:45:57 -0000	1.48
--- src/include/executor/hashjoin.h	17 Oct 2008 23:48:46 -0000
***************
*** 72,77 ****
--- 72,84 ----
  #define HJTUPLE_MINTUPLE(hjtup)  \
  	((MinimalTuple) ((char *) (hjtup) + HJTUPLE_OVERHEAD))
  
+ typedef struct HashJoinMostCommonValueTuplePartition
+ {
+ 	uint32 hashvalue;
+ 	HashJoinTuple tuples;
+ } HashJoinMostCommonValueTuplePartition;
+ 
+ #define MCV_INVALID_PARTITION -1
  
  typedef struct HashJoinTableData
  {
***************
*** 116,121 ****
--- 123,134 ----
  
  	MemoryContext hashCxt;		/* context for whole-hash-join storage */
  	MemoryContext batchCxt;		/* context for this-batch-only storage */
+ 	
+ 	bool usingMostCommonValues;
+ 	HashJoinMostCommonValueTuplePartition *mostCommonTuplePartition;
+ 	int nMostCommonTuplePartitionHashBuckets;
+ 	int nMostCommonTuplePartitions;
+ 	uint32 mostCommonTuplesStored;
  } HashJoinTableData;
  
  #endif   /* HASHJOIN_H */
Index: src/include/executor/nodeHash.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/nodeHash.h,v
retrieving revision 1.45
diff -c -r1.45 nodeHash.h
*** src/include/executor/nodeHash.h	1 Jan 2008 19:45:57 -0000	1.45
--- src/include/executor/nodeHash.h	30 Sep 2008 20:31:35 -0000
***************
*** 45,48 ****
--- 45,51 ----
  						int *numbuckets,
  						int *numbatches);
  
+ extern HashJoinTuple ExecScanHashMostCommonTuples(HashJoinState *hjstate, ExprContext *econtext);
+ extern bool isAMostCommonValue(HashJoinTable hashtable, uint32 hashvalue, int *partitionNumber);
+ 
  #endif   /* NODEHASH_H */
Index: src/include/executor/nodeHashjoin.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/nodeHashjoin.h,v
retrieving revision 1.37
diff -c -r1.37 nodeHashjoin.h
*** src/include/executor/nodeHashjoin.h	1 Jan 2008 19:45:57 -0000	1.37
--- src/include/executor/nodeHashjoin.h	30 Sep 2008 20:32:05 -0000
***************
*** 26,29 ****
--- 26,31 ----
  extern void ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
  					  BufFile **fileptr);
  
+ extern void getMostCommonValues(EState *estate, HashJoinState *hjstate);
+ 
  #endif   /* NODEHASHJOIN_H */
Index: src/include/nodes/execnodes.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/nodes/execnodes.h,v
retrieving revision 1.194
diff -c -r1.194 execnodes.h
*** src/include/nodes/execnodes.h	31 Oct 2008 19:37:56 -0000	1.194
--- src/include/nodes/execnodes.h	5 Nov 2008 22:57:08 -0000
***************
*** 1386,1391 ****
--- 1386,1392 ----
  	bool		hj_NeedNewOuter;
  	bool		hj_MatchedOuter;
  	bool		hj_OuterNotEmpty;
+ 	int		hj_OuterTupleMostCommonValuePartition;
  } HashJoinState;
  
  
Index: src/include/optimizer/cost.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/optimizer/cost.h,v
retrieving revision 1.93
diff -c -r1.93 cost.h
*** src/include/optimizer/cost.h	4 Oct 2008 21:56:55 -0000	1.93
--- src/include/optimizer/cost.h	7 Oct 2008 18:31:42 -0000
***************
*** 52,57 ****
--- 52,58 ----
  extern bool enable_nestloop;
  extern bool enable_mergejoin;
  extern bool enable_hashjoin;
+ extern bool enable_hashjoin_usestatmcvs;
  extern bool constraint_exclusion;
  
  extern double clamp_row_est(double nrows);
#12Joshua Tolley
eggyknap@gmail.com
In reply to: Bryce Cutt (#11)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Wed, Nov 05, 2008 at 04:06:11PM -0800, Bryce Cutt wrote:

The error is caused by my asserting against the wrong variable. I
never noticed this as I apparently did not have assertions enabled
on my development machine. That is fixed now, and with the new patch
version I have attached, all assertions pass with your query and
my test queries. I added another assertion to that section of the
code so that it is a bit more vigorous in confirming the hash table
partition is correct. It does not change the operation of the code.

There are two partition counts. One holds the maximum number of
buckets in the hash table and the other counts the number of actual
buckets created for hash values. I was incorrectly testing against
the second one because that was valid before I started using a hash
table to store the buckets.

The enable_hashjoin_usestatmcvs flag was valuable for my own research
and tests and likely useful for your review but Tom is correct that it
can be removed in the final version.

- Bryce Cutt

Thanks for the new patch; I'll take a look as soon as I can (prolly
tomorrow).

- Josh

#13Joshua Tolley
eggyknap@gmail.com
In reply to: Bryce Cutt (#11)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Wed, Nov 5, 2008 at 5:06 PM, Bryce Cutt <pandasuit@gmail.com> wrote:

The error is caused by my asserting against the wrong variable. I
never noticed this as I apparently did not have assertions enabled
on my development machine. That is fixed now, and with the new patch
version I have attached, all assertions pass with your query and
my test queries. I added another assertion to that section of the
code so that it is a bit more vigorous in confirming the hash table
partition is correct. It does not change the operation of the code.

There are two partition counts. One holds the maximum number of
buckets in the hash table and the other counts the number of actual
buckets created for hash values. I was incorrectly testing against
the second one because that was valid before I started using a hash
table to store the buckets.

The enable_hashjoin_usestatmcvs flag was valuable for my own research
and tests and likely useful for your review but Tom is correct that it
can be removed in the final version.

- Bryce Cutt

Well, that builds nicely, lets me import the data, and I've seen a
performance improvement with enable_hashjoin_usestatmcvs on vs. off. I
plan to test that more formally (though probably not fully to the
extent you did in your paper; just enough to feel comfortable that I'm
getting similar results). Then I'll spend some time poking in the
code, for the relatively little good I feel I can do in that capacity,
and I'll also investigate scenarios with particularly inaccurate
statistics. Stay tuned.
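
For what it's worth, the sort of setup I have in mind for the
bad-statistics case is roughly this (a hypothetical sketch, assuming the
TPC-H lineitem table):

-- Shrink the statistics target so the MCV list badly under-describes
-- the skew, or simply skip re-analyzing after loading more skewed rows.
alter table lineitem alter column l_partkey set statistics 1;
analyze lineitem;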

- Josh

#14Simon Riggs
simon@2ndQuadrant.com
In reply to: Joshua Tolley (#13)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Thu, 2008-11-06 at 15:33 -0700, Joshua Tolley wrote:

Stay tuned.

Minor question on this patch. AFAICS there is another patch that seems
to be aiming at exactly the same use case. Jonah's Bloom filter patch.

Shouldn't we have a dust off to see which one is best? Or at least a
discussion to test whether they overlap? Perhaps you already did that
and I missed it because I'm not very tuned in on this thread.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#15Joshua Tolley
eggyknap@gmail.com
In reply to: Simon Riggs (#14)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Thu, Nov 6, 2008 at 3:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Thu, 2008-11-06 at 15:33 -0700, Joshua Tolley wrote:

Stay tuned.

Minor question on this patch. AFAICS there is another patch that seems
to be aiming at exactly the same use case. Jonah's Bloom filter patch.

Shouldn't we have a dust off to see which one is best? Or at least a
discussion to test whether they overlap? Perhaps you already did that
and I missed it because I'm not very tuned in on this thread.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

We haven't had that discussion AFAIK, and definitely should. First
glance suggests they could coexist peacefully, with proper coaxing. If
I understand things properly, Jonah's patch filters tuples early in
the join process, and this patch tries to ensure that hash join
batches are kept in RAM when they're most likely to be used. So
they're orthogonal in purpose, and the patches actually apply *almost*
cleanly together. Jonah, any comments? If I continue to have some time
to devote, and get through all I think I can do to review this patch,
I'll gladly look at Jonah's too, FWIW.

- Josh

#16Lawrence, Ramon
ramon.lawrence@ubc.ca
In reply to: Joshua Tolley (#15)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

-----Original Message-----

Minor question on this patch. AFAICS there is another patch that seems
to be aiming at exactly the same use case. Jonah's Bloom filter patch.

Shouldn't we have a dust off to see which one is best? Or at least a
discussion to test whether they overlap? Perhaps you already did that
and I missed it because I'm not very tuned in on this thread.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

We haven't had that discussion AFAIK, and definitely should. First
glance suggests they could coexist peacefully, with proper coaxing. If
I understand things properly, Jonah's patch filters tuples early in
the join process, and this patch tries to ensure that hash join
batches are kept in RAM when they're most likely to be used. So
they're orthogonal in purpose, and the patches actually apply *almost*
cleanly together. Jonah, any comments? If I continue to have some time
to devote, and get through all I think I can do to review this patch,
I'll gladly look at Jonah's too, FWIW.

- Josh

The skew patch and bloom filter patch are orthogonal and can both be
applied. The bloom filter patch is a great idea, and it is used in many
other database systems. You can use the TPC-H data set to demonstrate
that the bloom filter patch will significantly improve performance of
multi-batch joins (with or without data skew).

Any query that filters a build table before joining on the probe table
will show improvements with a bloom filter. For example,

select * from customer, orders where customer.c_nationkey = 10 and
customer.c_custkey = orders.o_custkey

The bloom filter on customer would allow us to avoid probing with orders
tuples that cannot possibly find a match due to the selection criteria.
This is especially beneficial for multi-batch joins where an orders
tuple must be written to disk if its corresponding customer batch is not
the in-memory batch.

I have no experience reviewing patches, but I would be happy to help
contribute/review the bloom filter patch as best I can.

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon.lawrence@ubc.ca

#17Joshua Tolley
eggyknap@gmail.com
In reply to: Lawrence, Ramon (#16)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Thu, Nov 6, 2008 at 5:31 PM, Lawrence, Ramon <ramon.lawrence@ubc.ca> wrote:

-----Original Message-----

Minor question on this patch. AFAICS there is another patch that seems
to be aiming at exactly the same use case. Jonah's Bloom filter patch.

Shouldn't we have a dust off to see which one is best? Or at least a
discussion to test whether they overlap? Perhaps you already did that
and I missed it because I'm not very tuned in on this thread.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

We haven't had that discussion AFAIK, and definitely should. First
glance suggests they could coexist peacefully, with proper coaxing. If
I understand things properly, Jonah's patch filters tuples early in
the join process, and this patch tries to ensure that hash join
batches are kept in RAM when they're most likely to be used. So
they're orthogonal in purpose, and the patches actually apply *almost*
cleanly together. Jonah, any comments? If I continue to have some time
to devote, and get through all I think I can do to review this patch,
I'll gladly look at Jonah's too, FWIW.

- Josh

The skew patch and bloom filter patch are orthogonal and can both be
applied. The bloom filter patch is a great idea, and it is used in many
other database systems. You can use the TPC-H data set to demonstrate
that the bloom filter patch will significantly improve performance of
multi-batch joins (with or without data skew).

Any query that filters a build table before joining on the probe table
will show improvements with a bloom filter. For example,

select * from customer, orders where customer.c_nationkey = 10 and
customer.c_custkey = orders.o_custkey

The bloom filter on customer would allow us to avoid probing with orders
tuples that cannot possibly find a match due to the selection criteria.
This is especially beneficial for multi-batch joins where an orders
tuple must be written to disk if its corresponding customer batch is not
the in-memory batch.

I have no experience reviewing patches, but I would be happy to help
contribute/review the bloom filter patch as best I can.

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon.lawrence@ubc.ca

I've no patch review experience, either -- this is my first one. See
http://wiki.postgresql.org/wiki/Reviewing_a_Patch for details on what
a reviewer ought to do in general; various patch review discussions on
the -hackers list have also proven helpful. As regards this patch
specifically, it seems we could merge the two patches into one and
consider them together. However, the bloom filter patch is listed as a
"Work in Progress" on
http://wiki.postgresql.org/wiki/CommitFest_2008-11. Perhaps it needs
more work before being considered seriously? Jonah, what do you think
would be most helpful?

- Josh / eggyknap

#18Joshua Tolley
eggyknap@gmail.com
In reply to: Bryce Cutt (#11)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Wed, Nov 05, 2008 at 04:06:11PM -0800, Bryce Cutt wrote:

The error is caused by my asserting against the wrong variable. I
never noticed this as I apparently did not have assertions enabled
on my development machine. That is fixed now, and with the new patch
version I have attached, all assertions pass with your query and
my test queries. I added another assertion to that section of the
code so that it is a bit more vigorous in confirming the hash table
partition is correct. It does not change the operation of the code.

There are two partition counts. One holds the maximum number of
buckets in the hash table and the other counts the number of actual
buckets created for hash values. I was incorrectly testing against
the second one because that was valid before I started using a hash
table to store the buckets.

The enable_hashjoin_usestatmcvs flag was valuable for my own research
and tests and likely useful for your review but Tom is correct that it
can be removed in the final version.

- Bryce Cutt

Well, this version seems to work as advertised. Skewed data sets tend to
hash join more quickly with this turned on, and data sets with
deliberately bad statistics don't perform much differently than with the
feature turned off. The patch applies cleanly to CVS HEAD.

I don't consider myself qualified to do a decent code review. However I
noticed that the comments are all done with // instead of /* ... */.
That should probably be changed.

To those familiar with code review: is there more I should do to review
this?

- Josh / eggyknap

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Lawrence, Ramon (#1)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

"Lawrence, Ramon" <ramon.lawrence@ubc.ca> writes:

We propose a patch that improves hybrid hash join's performance for
large multi-batch joins where the probe relation has skew.
...
The basic idea
is to keep build relation tuples in a small in-memory hash table that
have join values that are frequently occurring in the probe relation.

I looked at this patch a little.

I'm a tad worried about what happens when the values that are frequently
occurring in the outer relation are also frequently occurring in the
inner (which hardly seems an improbable case). Don't you stand a severe
risk of blowing out the in-memory hash table? It doesn't appear to me
that the code has any way to back off once it's decided that a certain
set of join key values are to be treated in-memory. Splitting the main
join into more batches certainly doesn't help with that.

Also, AFAICS the benefit of this patch comes entirely from avoiding dump
and reload of tuples bearing the most common values, which means it's a
significant waste of cycles when there's only one batch. It'd be better
to avoid doing any of the extra work in the single-batch case.

One thought that might address that point as well as the difficulty of
getting stats in nontrivial cases is to wait until we've overrun memory
and are forced to start batching, and at that point determine on-the-fly
which are the most common hash values from inspection of the hash table
as we dump it out. This would amount to optimizing on the basis of
frequency in the *inner* relation not the outer, but offhand I don't see
any strong theoretical basis why that wouldn't be just as good. It
could lose if the first work_mem worth of inner tuples isn't
representative of what follows; but this hardly seems more dangerous
than depending on MCV stats that are for the whole outer relation rather
than the portion of it being selected.

regards, tom lane

#20Lawrence, Ramon
ramon.lawrence@ubc.ca
In reply to: Tom Lane (#19)
1 attachment(s)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
I'm a tad worried about what happens when the values that are frequently
occurring in the outer relation are also frequently occurring in the
inner (which hardly seems an improbable case). Don't you stand a severe
risk of blowing out the in-memory hash table? It doesn't appear to me
that the code has any way to back off once it's decided that a certain
set of join key values are to be treated in-memory. Splitting the main
join into more batches certainly doesn't help with that.

Also, AFAICS the benefit of this patch comes entirely from avoiding dump
and reload of tuples bearing the most common values, which means it's a
significant waste of cycles when there's only one batch. It'd be better
to avoid doing any of the extra work in the single-batch case.

One thought that might address that point as well as the difficulty of
getting stats in nontrivial cases is to wait until we've overrun memory
and are forced to start batching, and at that point determine on-the-fly
which are the most common hash values from inspection of the hash table
as we dump it out. This would amount to optimizing on the basis of
frequency in the *inner* relation not the outer, but offhand I don't see
any strong theoretical basis why that wouldn't be just as good. It
could lose if the first work_mem worth of inner tuples isn't
representative of what follows; but this hardly seems more dangerous
than depending on MCV stats that are for the whole outer relation rather
than the portion of it being selected.

regards, tom lane

You are correct with both observations. The patch only has a benefit
when there is more than one batch. Also, there is a potential issue
with MCV hash table overflows if the number of tuples that match the
MCVs in the build relation is very large.

Bryce has created a patch (attached) that disables the code for
one-batch joins. This patch also checks for MCV hash table overflows and
handles them by "flushing" from the MCV hash table back to the main hash
table. The main hash table will then resolve overflows as usual. Note
that this will cause the worst case of a build table with all the same
values to be handled the same as the current hash code, i.e., it will
attempt to re-partition until it eventually gives up and then allocates
the entire partition in memory. There may be a better way to handle
this case, but the new patch will remain consistent with the current
hash join implementation.

The issue with determining and using the MCV stats is more challenging
than it appears. First, knowing the MCVs of the build table will not
help us. What we need are the MCVs of the probe table because by
knowing those values we will keep the tuples with those values in the
build relation in memory. For example, consider a join between tables
Part and LineItem. Assume 1 popular part accounts for 10% of all
LineItems. If Part is the build relation and LineItem is the probe
relation, then by keeping that 1 part record in memory, we will
guarantee that we do not need to write out 10% of LineItem. If a
selection occurs on LineItem before the join, it may change the
distribution of LineItem (the MCVs) but it is probable that they are
still a good estimate of the MCVs in the derived LineItem relation. (We
did experiments on trying to sample the first few thousand tuples of the
probe relation to dynamically determine the MCVs but generally found
this was inaccurate due to non-random samples.) In essence, the goal is
to smartly pick the tuples that remain in the in-memory batch before
probing begins. Since the number of MCVs is small, incorrectly
selecting build tuples to remain in memory has negligible cost.
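
For reference, the probe-side MCV list that drives this decision can be
inspected directly; a minimal sketch, assuming the Part/LineItem example
above with statistics already gathered:

-- The MCVs and frequencies on the probe column determine which build
-- tuples Histojoin keeps in the in-memory partition.
select most_common_vals, most_common_freqs
from pg_stats
where tablename = 'lineitem' and attname = 'l_partkey';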

If we assume that LineItem has been filtered so much that it is now
smaller than Part and is the build relation then the MCV approach does
not apply. There is no skew in Part on partkey (since it is the PK) and
knowing the MCV partkeys in LineItem does not help us because they each
only join with a single tuple in Part. In this case, the MCV approach
should not be used because no benefit is possible, and it will not be
used because there will be no MCVs for Part.partkey.

The bad case with MCV hash table overflow requires a many-to-many join
between the two relations which would not occur on the more typical
PK-FK joins.
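
A contrived sketch of that bad case, for illustration only, is a
self-join on the skewed key:

-- Many-to-many join: every popular l_partkey value matches many tuples
-- on both sides, so the in-memory MCV partition can overflow.
select count(*)
from lineitem l1, lineitem l2
where l1.l_partkey = l2.l_partkey;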

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon.lawrence@ubc.ca

Attachments:

histojoin_v3.patchapplication/octet-stream; name=histojoin_v3.patchDownload
Index: src/backend/executor/nodeHash.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHash.c,v
retrieving revision 1.116
diff -c -r1.116 nodeHash.c
*** src/backend/executor/nodeHash.c	1 Jan 2008 19:45:49 -0000	1.116
--- src/backend/executor/nodeHash.c	24 Nov 2008 12:32:13 -0000
***************
*** 54,59 ****
--- 54,165 ----
  }
  
  /* ----------------------------------------------------------------
+ *		isAMostCommonValue
+ *
+ *		is the value one of the most common key values?
+ *  ----------------------------------------------------------------
+ */
+ bool isAMostCommonValue(HashJoinTable hashtable, uint32 hashvalue, int *partitionNumber)
+ {
+ 	int bucket = hashvalue & (hashtable->nMostCommonTuplePartitionHashBuckets - 1);
+ 
+ 	while (hashtable->mostCommonTuplePartition[bucket].hashvalue != 0
+ 		&& hashtable->mostCommonTuplePartition[bucket].hashvalue != hashvalue)
+ 	{
+ 		bucket = (bucket + 1) & (hashtable->nMostCommonTuplePartitionHashBuckets - 1);
+ 	}
+ 
+ 	if (!hashtable->mostCommonTuplePartition[bucket].frozen && hashtable->mostCommonTuplePartition[bucket].hashvalue == hashvalue)
+ 	{
+ 		*partitionNumber = bucket;
+ 		return true;
+ 	}
+ 
+ 	/* must have run into an empty slot which means this is not an MCV*/
+ 	*partitionNumber = MCV_INVALID_PARTITION;
+ 	return false;
+ }
+ 
+ /*
+ *	freezeNextMCVPartiton
+ *
+ *	flush the tuples of the next MCV partition by pushing them into the main hashtable
+ */
+ bool freezeNextMCVPartiton(HashJoinTable hashtable) {
+ 	int partitionToFlush = hashtable->nMostCommonTuplePartitions - 1 - hashtable->nMostCommonTuplePartitionsFlushed;
+ 	if (partitionToFlush < 0)
+ 		return false;
+ 	else
+ 	{
+ 
+ 		int		bucketno;
+ 		int		batchno;
+ 		uint32		hashvalue;
+ 		HashJoinTuple hashTuple;
+ 		HashJoinTuple nextHashTuple;
+ 		HashJoinMostCommonValueTuplePartition *partition;
+ 		MinimalTuple mintuple;
+ 
+ 		partition = hashtable->flushOrderedMostCommonTuplePartition[partitionToFlush];
+ 		hashvalue = partition->hashvalue;
+ 
+ 		Assert(hashvalue != 0);
+ 
+ 		hashTuple = partition->tuples;
+ 
+ 		ExecHashGetBucketAndBatch(hashtable, hashvalue,
+ 								  &bucketno, &batchno);
+ 
+ 		while (hashTuple != NULL)
+ 		{
+ 			/* decide whether to put the tuples in the hash table or a temp file */
+ 			if (batchno == hashtable->curbatch)
+ 			{
+ 				/* put the tuples in hash table */
+ 				nextHashTuple = hashTuple->next;
+ 
+ 				hashTuple->next = hashtable->buckets[bucketno];
+ 				hashtable->buckets[bucketno] = hashTuple;
+ 				
+ 				hashTuple = nextHashTuple;
+ 				hashtable->totalTuples++;
+ 				hashtable->mostCommonTuplesStored--;
+ 				
+ 				if (hashtable->spaceUsed > hashtable->spaceAllowed)
+ 				{
+ 					ExecHashIncreaseNumBatches(hashtable);
+ 					/* likely changed due to increase in batches */
+ 					ExecHashGetBucketAndBatch(hashtable, hashvalue,
+ 						&bucketno, &batchno);
+ 				}
+ 			}
+ 			else
+ 			{
+ 				/* put the tuples into a temp file for later batches */
+ 				Assert(batchno > hashtable->curbatch);
+ 				mintuple = HJTUPLE_MINTUPLE(hashTuple);
+ 				ExecHashJoinSaveTuple(mintuple,
+ 									  hashvalue,
+ 									  &hashtable->innerBatchFile[batchno]);
+ 				hashtable->spaceUsed -= HJTUPLE_OVERHEAD + mintuple->t_len;
+ 				nextHashTuple = hashTuple->next;
+ 				pfree(hashTuple);
+ 				hashTuple = nextHashTuple;
+ 				hashtable->inTupIOs++;
+ 				hashtable->totalTuples++;
+ 				hashtable->mostCommonTuplesStored--;
+ 			}
+ 		}
+ 
+ 		partition->frozen = true;
+ 		partition->tuples = NULL;
+ 		hashtable->nMostCommonTuplePartitionsFlushed++;
+ 
+ 		return true;
+ 	}
+ }
+ 
+ /* ----------------------------------------------------------------
   *		MultiExecHash
   *
   *		build hash table for hashjoin, doing partitioning if more
***************
*** 69,74 ****
--- 175,182 ----
  	TupleTableSlot *slot;
  	ExprContext *econtext;
  	uint32		hashvalue;
+ 	MinimalTuple mintuple;
+ 	int partitionNumber;
  
  	/* must provide our own instrumentation support */
  	if (node->ps.instrument)
***************
*** 99,106 ****
  		if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
  								 &hashvalue))
  		{
! 			ExecHashTableInsert(hashtable, slot, hashvalue);
! 			hashtable->totalTuples += 1;
  		}
  	}
  
--- 207,240 ----
  		if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
  								 &hashvalue))
  		{
! 			partitionNumber = MCV_INVALID_PARTITION;
! 
! 			if (hashtable->usingMostCommonValues && isAMostCommonValue(hashtable, hashvalue, &partitionNumber))
! 			{
! 				HashJoinTuple hashTuple;
! 				int			hashTupleSize;
! 				
! 				mintuple = ExecFetchSlotMinimalTuple(slot);
! 				hashTupleSize = HJTUPLE_OVERHEAD + mintuple->t_len;
! 				hashTuple = (HashJoinTuple) palloc(hashTupleSize);
! 				hashTuple->hashvalue = hashvalue;
! 				memcpy(HJTUPLE_MINTUPLE(hashTuple), mintuple, mintuple->t_len);
! 
! 				hashTuple->next = hashtable->mostCommonTuplePartition[partitionNumber].tuples;
! 				hashtable->mostCommonTuplePartition[partitionNumber].tuples = hashTuple;
! 				
! 				hashtable->spaceUsed += hashTupleSize;
! 				
! 				hashtable->mostCommonTuplesStored++;
! 				
! 				while (hashtable->spaceUsed > hashtable->spaceAllowed && freezeNextMCVPartiton(hashtable)) {}
! 			}
! 
! 			if (partitionNumber == MCV_INVALID_PARTITION)
! 			{
! 				ExecHashTableInsert(hashtable, slot, hashvalue);
! 				hashtable->totalTuples += 1;
! 			}
  		}
  	}
  
***************
*** 461,466 ****
--- 595,606 ----
  			BufFileClose(hashtable->outerBatchFile[i]);
  	}
  
+ 	if (hashtable->usingMostCommonValues)
+ 	{
+ 		pfree(hashtable->mostCommonTuplePartition);
+ 		pfree(hashtable->flushOrderedMostCommonTuplePartition);
+ 	}
+ 
  	/* Release working memory (batchCxt is a child, so it goes away too) */
  	MemoryContextDelete(hashtable->hashCxt);
  
***************
*** 798,803 ****
--- 938,1005 ----
  }
  
  /*
+  * ExecScanHashMostCommonTuples
+  *		scan a hash bucket for matches to the current outer tuple
+  *
+  * The current outer tuple must be stored in econtext->ecxt_outertuple.
+  */
+ HashJoinTuple
+ ExecScanHashMostCommonTuples(HashJoinState *hjstate,
+ 				   ExprContext *econtext)
+ {
+ 	List	   *hjclauses = hjstate->hashclauses;
+ 	HashJoinTable hashtable = hjstate->hj_HashTable;
+ 	HashJoinTuple hashTuple = hjstate->hj_CurTuple;
+ 	uint32		hashvalue = hjstate->hj_CurHashValue;
+ 
+ 	/*
+ 	 * hj_CurTuple is NULL to start scanning a new bucket, or the address of
+ 	 * the last tuple returned from the current bucket.
+ 	 */
+ 	if (hashTuple == NULL)
+ 	{
+ 		//painstakingly make sure this is a valid partition index
+ 		Assert(hjstate->hj_OuterTupleMostCommonValuePartition > MCV_INVALID_PARTITION);
+ 		Assert(hjstate->hj_OuterTupleMostCommonValuePartition < hashtable->nMostCommonTuplePartitionHashBuckets);
+ 		Assert(hashtable->mostCommonTuplePartition[hjstate->hj_OuterTupleMostCommonValuePartition].hashvalue != 0);
+ 
+ 		hashTuple = hashtable->mostCommonTuplePartition[hjstate->hj_OuterTupleMostCommonValuePartition].tuples;
+ 	}
+ 	else
+ 		hashTuple = hashTuple->next;
+ 
+ 	while (hashTuple != NULL)
+ 	{
+ 		if (hashTuple->hashvalue == hashvalue)
+ 		{
+ 			TupleTableSlot *inntuple;
+ 
+ 			/* insert hashtable's tuple into exec slot so ExecQual sees it */
+ 			inntuple = ExecStoreMinimalTuple(HJTUPLE_MINTUPLE(hashTuple),
+ 											 hjstate->hj_HashTupleSlot,
+ 											 false);	/* do not pfree */
+ 			econtext->ecxt_innertuple = inntuple;
+ 
+ 			/* reset temp memory each time to avoid leaks from qual expr */
+ 			ResetExprContext(econtext);
+ 
+ 			if (ExecQual(hjclauses, econtext, false))
+ 			{
+ 				hjstate->hj_CurTuple = hashTuple;
+ 				return hashTuple;
+ 			}
+ 		}
+ 
+ 		hashTuple = hashTuple->next;
+ 	}
+ 
+ 	/*
+ 	 * no match
+ 	 */
+ 	return NULL;
+ }
+ 
+ /*
   * ExecScanHashBucket
   *		scan a hash bucket for matches to the current outer tuple
   *
Index: src/backend/executor/nodeHashjoin.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHashjoin.c,v
retrieving revision 1.96
diff -c -r1.96 nodeHashjoin.c
*** src/backend/executor/nodeHashjoin.c	23 Oct 2008 14:34:34 -0000	1.96
--- src/backend/executor/nodeHashjoin.c	24 Nov 2008 12:35:29 -0000
***************
*** 20,25 ****
--- 20,30 ----
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
  #include "utils/memutils.h"
+ #include "optimizer/cost.h"
+ #include "utils/syscache.h"
+ #include "utils/lsyscache.h"
+ #include "parser/parsetree.h"
+ #include "catalog/pg_statistic.h"
  
  
  /* Returns true for JOIN_LEFT and JOIN_ANTI jointypes */
***************
*** 34,39 ****
--- 39,149 ----
  						  TupleTableSlot *tupleSlot);
  static int	ExecHashJoinNewBatch(HashJoinState *hjstate);
  
+ /*
+ *          getMostCommonValues
+ *
+ *          
+ */
+ void getMostCommonValues(EState *estate, HashJoinState *hjstate)
+ {
+ 	HeapTupleData *statsTuple;
+ 	FuncExprState *clause;
+ 	ExprState *argstate;
+ 	Var *variable;
+ 
+ 	Datum	   *values;
+ 	int			nvalues;
+ 	float4	   *numbers;
+ 	int			nnumbers;
+ 
+ 	Oid relid;
+ 	AttrNumber relattnum;
+ 	Oid atttype;
+ 	int32 atttypmod;
+ 
+ 	int i;
+ 
+ 	//is it a join on more than one key?
+ 	if (hjstate->hashclauses->length != 1)
+ 		return; //histojoin is not defined for more than one join key so run away
+ 
+ 	//make sure the outer node is a seq scan on a base relation otherwise we cant get MCVs at the moment and should not bother trying
+ 	if (outerPlanState(hjstate)->type != T_SeqScanState)
+ 		return;
+ 	
+ 	//grab the relation object id of the outer relation
+ 	relid = getrelid(((SeqScan *) ((SeqScanState *) outerPlanState(hjstate))->ps.plan)->scanrelid, estate->es_range_table);
+ 	clause = (FuncExprState *) lfirst(list_head(hjstate->hashclauses));
+ 	argstate = (ExprState *) lfirst(list_head(clause->args));
+ 	variable = (Var *) argstate->expr;
+ 
+ 	//grab the necessary properties of the join variable
+ 	relattnum = variable->varattno;
+ 	atttype = variable->vartype;
+ 	atttypmod = variable->vartypmod;
+ 
+ 	statsTuple = SearchSysCache(STATRELATT,
+ 		ObjectIdGetDatum(relid),
+ 		Int16GetDatum(relattnum),
+ 		0, 0);
+ 
+ 	if (HeapTupleIsValid(statsTuple))
+ 	{
+ 		if (get_attstatsslot(statsTuple,
+ 			atttype, atttypmod,
+ 			STATISTIC_KIND_MCV, InvalidOid,
+ 			&values, &nvalues,
+ 			&numbers, &nnumbers))
+ 		{
+ 			HashJoinTable hashtable;
+ 			FmgrInfo   *hashfunctions;
+ 			//MCV Partitions is an open addressing hashtable with a power of 2 size greater than the number of MCV values
+ 			int nbuckets = 2;
+ 			uint32 collisionsWhileHashing = 0;
+ 			while (nbuckets <= nvalues)
+ 			{
+ 				nbuckets <<= 1;
+ 			}
+ 			//use two more bit just to help avoid collisions
+ 			nbuckets <<= 2;
+ 
+ 			hashtable = hjstate->hj_HashTable;
+ 			hashtable->usingMostCommonValues = true;
+ 			hashtable->nMostCommonTuplePartitionHashBuckets = nbuckets;
+ 			hashtable->mostCommonTuplePartition = palloc0(nbuckets * sizeof(HashJoinMostCommonValueTuplePartition));
+ 			hashtable->flushOrderedMostCommonTuplePartition = palloc0(nvalues * sizeof(HashJoinMostCommonValueTuplePartition*));
+ 			hashfunctions = hashtable->outer_hashfunctions;
+ 
+ 			//create the partitions
+ 			for (i = 0; i < nvalues; i++)
+ 			{
+ 				uint32 hashvalue = DatumGetUInt32(FunctionCall1(&hashfunctions[0], values[i]));
+ 				int bucket = hashvalue & (nbuckets - 1);
+ 
+ 				while (hashtable->mostCommonTuplePartition[bucket].hashvalue != 0
+ 					&& hashtable->mostCommonTuplePartition[bucket].hashvalue != hashvalue)
+ 				{
+ 					bucket = (bucket + 1) & (nbuckets - 1);
+ 					collisionsWhileHashing++;
+ 				}
+ 
+ 				//leave partition alone if it has the same hashvalue as current MCV.  we only want one partition per hashvalue
+ 				if (hashtable->mostCommonTuplePartition[bucket].hashvalue != hashvalue)
+ 				{
+ 					hashtable->mostCommonTuplePartition[bucket].tuples = NULL;
+ 					hashtable->mostCommonTuplePartition[bucket].hashvalue = hashvalue;
+ 					hashtable->mostCommonTuplePartition[bucket].frozen = false;
+ +					hashtable->flushOrderedMostCommonTuplePartition[hashtable->nMostCommonTuplePartitions] = &hashtable->mostCommonTuplePartition[bucket];
+ 					hashtable->nMostCommonTuplePartitions++;
+ 				}
+ 			}
+ 
+ 			free_attstatsslot(atttype, values, nvalues, numbers, nnumbers);
+ 		}
+ 
+ 		ReleaseSysCache(statsTuple);
+ 	}
+ }
  
  /* ----------------------------------------------------------------
   *		ExecHashJoin
***************
*** 146,151 ****
--- 256,271 ----
  		hashtable = ExecHashTableCreate((Hash *) hashNode->ps.plan,
  										node->hj_HashOperators);
  		node->hj_HashTable = hashtable;
+ 		
+ 		hashtable->usingMostCommonValues = false;
+ 		hashtable->nMostCommonTuplePartitions = 0;
+ 		hashtable->nMostCommonTuplePartitionHashBuckets = 0;
+ 		hashtable->mostCommonTuplesStored = 0;
+ 		hashtable->mostCommonTuplePartition = NULL;
+ 		hashtable->nMostCommonTuplePartitionsFlushed = 0;
+ 
+ 		if (hashtable->nbatch > 1 && enable_hashjoin_usestatmcvs)
+ 			getMostCommonValues(estate, node);
  
  		/*
  		 * execute the Hash node, to build the hash table
***************
*** 157,163 ****
  		 * If the inner relation is completely empty, and we're not doing an
  		 * outer join, we can quit without scanning the outer relation.
  		 */
! 		if (hashtable->totalTuples == 0 && !HASHJOIN_IS_OUTER(node))
  			return NULL;
  
  		/*
--- 277,283 ----
  		 * If the inner relation is completely empty, and we're not doing an
  		 * outer join, we can quit without scanning the outer relation.
  		 */
! 		if (hashtable->totalTuples == 0 && hashtable->mostCommonTuplesStored == 0 && !HASHJOIN_IS_OUTER(node))
  			return NULL;
  
  		/*
***************
*** 205,227 ****
  			ExecHashGetBucketAndBatch(hashtable, hashvalue,
  									  &node->hj_CurBucketNo, &batchno);
  			node->hj_CurTuple = NULL;
! 
! 			/*
! 			 * Now we've got an outer tuple and the corresponding hash bucket,
! 			 * but this tuple may not belong to the current batch.
! 			 */
! 			if (batchno != hashtable->curbatch)
  			{
  				/*
! 				 * Need to postpone this outer tuple to a later batch. Save it
! 				 * in the corresponding outer-batch file.
  				 */
! 				Assert(batchno > hashtable->curbatch);
! 				ExecHashJoinSaveTuple(ExecFetchSlotMinimalTuple(outerTupleSlot),
! 									  hashvalue,
! 									  &hashtable->outerBatchFile[batchno]);
! 				node->hj_NeedNewOuter = true;
! 				continue;		/* loop around for a new outer tuple */
  			}
  		}
  
--- 325,353 ----
  			ExecHashGetBucketAndBatch(hashtable, hashvalue,
  									  &node->hj_CurBucketNo, &batchno);
  			node->hj_CurTuple = NULL;
! 			
! 			node->hj_OuterTupleMostCommonValuePartition = MCV_INVALID_PARTITION;
! 			
! 			
! 			if (!(hashtable->usingMostCommonValues && isAMostCommonValue(hashtable, hashvalue, &node->hj_OuterTupleMostCommonValuePartition)))
  			{
  				/*
! 				 * Now we've got an outer tuple and the corresponding hash bucket,
! 				 * but this tuple may not belong to the current batch.
  				 */
! 				if (batchno != hashtable->curbatch)
! 				{
! 					/*
! 					 * Need to postpone this outer tuple to a later batch. Save it
! 					 * in the corresponding outer-batch file.
! 					 */
! 					Assert(batchno > hashtable->curbatch);
! 					ExecHashJoinSaveTuple(ExecFetchSlotMinimalTuple(outerTupleSlot),
! 										  hashvalue,
! 										  &hashtable->outerBatchFile[batchno]);
! 					node->hj_NeedNewOuter = true;
! 					continue;		/* loop around for a new outer tuple */
! 				}
  			}
  		}
  
***************
*** 230,236 ****
  		 */
  		for (;;)
  		{
! 			curtuple = ExecScanHashBucket(node, econtext);
  			if (curtuple == NULL)
  				break;			/* out of matches */
  
--- 356,369 ----
  		 */
  		for (;;)
  		{
! 			if (node->hj_OuterTupleMostCommonValuePartition != MCV_INVALID_PARTITION)
! 			{
! 				curtuple = ExecScanHashMostCommonTuples(node, econtext);
! 			}
! 			else
! 			{
! 				curtuple = ExecScanHashBucket(node, econtext);
! 			}
  			if (curtuple == NULL)
  				break;			/* out of matches */
  
Index: src/backend/optimizer/path/costsize.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/optimizer/path/costsize.c,v
retrieving revision 1.201
diff -c -r1.201 costsize.c
*** src/backend/optimizer/path/costsize.c	22 Nov 2008 22:47:05 -0000	1.201
--- src/backend/optimizer/path/costsize.c	24 Nov 2008 12:15:00 -0000
***************
*** 109,114 ****
--- 109,116 ----
  bool		enable_mergejoin = true;
  bool		enable_hashjoin = true;
  
+ bool		enable_hashjoin_usestatmcvs = true;
+ 
  typedef struct
  {
  	PlannerInfo *root;
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.481
diff -c -r1.481 guc.c
*** src/backend/utils/misc/guc.c	21 Nov 2008 20:14:27 -0000	1.481
--- src/backend/utils/misc/guc.c	24 Nov 2008 12:15:05 -0000
***************
*** 636,641 ****
--- 636,649 ----
  		true, NULL, NULL
  	},
  	{
+ 		{"enable_hashjoin_usestatmcvs", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("Enables the hash join's use of the MCVs stored in pg_statistic."),
+ 			NULL
+ 		},
+ 		&enable_hashjoin_usestatmcvs,
+ 		true, NULL, NULL
+ 	},
+ 	{
  		{"constraint_exclusion", PGC_USERSET, QUERY_TUNING_OTHER,
  			gettext_noop("Enables the planner to use constraints to optimize queries."),
  			gettext_noop("Child table scans will be skipped if their "
Index: src/include/executor/hashjoin.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/hashjoin.h,v
retrieving revision 1.48
diff -c -r1.48 hashjoin.h
*** src/include/executor/hashjoin.h	1 Jan 2008 19:45:57 -0000	1.48
--- src/include/executor/hashjoin.h	24 Nov 2008 12:40:18 -0000
***************
*** 72,77 ****
--- 72,85 ----
  #define HJTUPLE_MINTUPLE(hjtup)  \
  	((MinimalTuple) ((char *) (hjtup) + HJTUPLE_OVERHEAD))
  
+ typedef struct HashJoinMostCommonValueTuplePartition
+ {
+ 	uint32 hashvalue;
+ 	bool frozen;
+ 	HashJoinTuple tuples;
+ } HashJoinMostCommonValueTuplePartition;
+ 
+ #define MCV_INVALID_PARTITION -1
  
  typedef struct HashJoinTableData
  {
***************
*** 116,121 ****
--- 124,137 ----
  
  	MemoryContext hashCxt;		/* context for whole-hash-join storage */
  	MemoryContext batchCxt;		/* context for this-batch-only storage */
+ 	
+ 	bool usingMostCommonValues;
+ 	HashJoinMostCommonValueTuplePartition *mostCommonTuplePartition;
+ 	HashJoinMostCommonValueTuplePartition **flushOrderedMostCommonTuplePartition;
+ 	int nMostCommonTuplePartitionHashBuckets;
+ 	int nMostCommonTuplePartitions;
+ 	int nMostCommonTuplePartitionsFlushed;
+ 	uint32 mostCommonTuplesStored;
  } HashJoinTableData;
  
  #endif   /* HASHJOIN_H */
Index: src/include/executor/nodeHash.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/nodeHash.h,v
retrieving revision 1.45
diff -c -r1.45 nodeHash.h
*** src/include/executor/nodeHash.h	1 Jan 2008 19:45:57 -0000	1.45
--- src/include/executor/nodeHash.h	30 Sep 2008 20:31:35 -0000
***************
*** 45,48 ****
--- 45,51 ----
  						int *numbuckets,
  						int *numbatches);
  
+ extern HashJoinTuple ExecScanHashMostCommonTuples(HashJoinState *hjstate, ExprContext *econtext);
+ extern bool isAMostCommonValue(HashJoinTable hashtable, uint32 hashvalue, int *partitionNumber);
+ 
  #endif   /* NODEHASH_H */
Index: src/include/executor/nodeHashjoin.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/nodeHashjoin.h,v
retrieving revision 1.37
diff -c -r1.37 nodeHashjoin.h
*** src/include/executor/nodeHashjoin.h	1 Jan 2008 19:45:57 -0000	1.37
--- src/include/executor/nodeHashjoin.h	30 Sep 2008 20:32:05 -0000
***************
*** 26,29 ****
--- 26,31 ----
  extern void ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
  					  BufFile **fileptr);
  
+ extern void getMostCommonValues(EState *estate, HashJoinState *hjstate);
+ 
  #endif   /* NODEHASHJOIN_H */
Index: src/include/nodes/execnodes.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/nodes/execnodes.h,v
retrieving revision 1.196
diff -c -r1.196 execnodes.h
*** src/include/nodes/execnodes.h	16 Nov 2008 17:34:28 -0000	1.196
--- src/include/nodes/execnodes.h	17 Nov 2008 20:05:27 -0000
***************
*** 1392,1397 ****
--- 1392,1398 ----
  	bool		hj_NeedNewOuter;
  	bool		hj_MatchedOuter;
  	bool		hj_OuterNotEmpty;
+ 	int		hj_OuterTupleMostCommonValuePartition;
  } HashJoinState;
  
  
Index: src/include/optimizer/cost.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/optimizer/cost.h,v
retrieving revision 1.93
diff -c -r1.93 cost.h
*** src/include/optimizer/cost.h	4 Oct 2008 21:56:55 -0000	1.93
--- src/include/optimizer/cost.h	7 Oct 2008 18:31:42 -0000
***************
*** 52,57 ****
--- 52,58 ----
  extern bool enable_nestloop;
  extern bool enable_mergejoin;
  extern bool enable_hashjoin;
+ extern bool enable_hashjoin_usestatmcvs;
  extern bool constraint_exclusion;
  
  extern double clamp_row_est(double nrows);
#21Robert Haas
robertmhaas@gmail.com
In reply to: Lawrence, Ramon (#20)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

I have to admit that I haven't fully grokked what this patch is about
just yet, so what follows is mostly a coding style review at this
point. It would help a lot if you could add some comments to the new
functions that are being added to explain the purpose of each at a
very high level. There's clearly been a lot of thought put into some
parts of this logic, so it would be worth explaining the reasoning
behind that logic.

This patch applies cleanly against CVS HEAD, but does not compile
(please fix the warning, too).

nodeHash.c:88: warning: no previous prototype for 'freezeNextMCVPartiton'
nodeHash.c: In function 'freezeNextMCVPartiton':
nodeHash.c:148: error: 'struct HashJoinTableData' has no member named 'inTupIOs'

I commented out the offending line. It errored out again here:

nodeHashjoin.c: In function 'getMostCommonValues':
nodeHashjoin.c:136: error: wrong type argument to unary plus

After removing the stray + sign, it compiled, but failed the
"rangefuncs" regression test. If you're going to keep the
enable_hashjoin_usestatmcvs GUC around, you need to patch
rangefuncs.out so that the regression tests pass. I think, however,
that there was some discussion of removing that before the patch is
committed; if so, please do that instead. Keeping the GUC would also
require patching the documentation, which the current patch does not
do.

getMostCommonValues() isn't a good name for a non-static function
because there's nothing to tip the reader off to the fact that it has
something to do with hash joins; compare with the other function names
defined in the same header file. On the flip side, that function has
only one call site, so it should probably be made static and not
declared in the header file at all. Some of the other new functions
need similar treatment. I am also a little suspicious of this bit of
code:

relid = getrelid(((SeqScan *) ((SeqScanState *)
outerPlanState(hjstate))->ps.plan)->scanrelid,
estate->es_range_table);
clause = (FuncExprState *) lfirst(list_head(hjstate->hashclauses));
argstate = (ExprState *) lfirst(list_head(clause->args));
variable = (Var *) argstate->expr;

I'm not very familiar with the hash join code, but it seems like there
are a lot of assumptions being made there about what things are
pointing to what other things. Is this actually safe? And if it
is, perhaps a comment explaining why?

getMostCommonValues() also appears to be creating and maintaining a
counter called collisionsWhileHashing, but nothing is ever done with
the counter. On a similar note, the variables relattnum, atttype, and
atttypmod don't appear to be necessary; 2 out of 3 of them are only
used once, so maybe inlining the reference and dropping the variable
would make more sense. Also, the if (HeapTupleIsValid(statsTuple))
block encompasses the whole rest of the function, maybe if
(!HeapTupleIsValid(statsTuple)) return?

I don't understand why
hashtable->mostCommonTuplePartition[bucket].tuples and .frozen need to
be initialized to 0. It looks to me like those are in a zero-filled
array that was just allocated, so it shouldn't be necessary to re-zero
them, unless I'm missing something.

freezeNextMCVPartiton is misspelled consistently throughout (it should
end in "Partition"). I also don't think it makes sense to
enclose everything but the first two lines of that function in an
else-block.

There is some initialization code in ExecHashJoin() that looks like it
belongs in ExecHashTableCreate.

It appears to me that the interface to isAMostCommonValue() could be
simplified by just making it return the partition number. It could
perhaps be renamed something like ExecHashGetMCVPartition().

Does ExecHashTableDestroy() need to explicitly pfree
hashtable->mostCommonTuplePartition and
hashtable->flushOrderedMostCommonTuplePartition? I would think those
would be allocated out of hashCxt - if they aren't, they probably
should be.

Department of minor nitpicks: (1) C++-style comments are not
permitted, (2) function names need to be capitalized like_this() or
LikeThis() but not likeThis(), (3) when defining a function, the
return type should be placed on the line preceding the actual function
name, so that the function name is at the beginning of the line, (4)
curly braces should be avoided around a block containing only one
statement, (5) excessive blank lines should be avoided (for example,
the one in costsize.c is clearly unnecessary, and there's at least one
place where you add two consecutive blank lines), and (6) I believe
the accepted way to write an empty loop is an indented semi-colon on
the next line, rather than {} on the same line as the while.

I will try to do some more substantive testing of this as well.

...Robert

#22Robert Haas
robertmhaas@gmail.com
In reply to: Lawrence, Ramon (#20)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

Dr. Lawrence:

I'm still working on reviewing this patch. I've managed to load the
sample TPCH data from tpch1g1z.zip after changing the line endings to
UNIX-style and chopping off the trailing vertical bars. (If anyone is
interested, I have the results of pg_dump | bzip2 -9 on the resulting
database, which I would be happy to upload if someone has server
space. It is about 250MB.)

But, I'm not sure quite what to do in terms of generating queries.
TPCHSkew contains QGEN.EXE, but that seems to require that you provide
template queries as input, and I'm not sure where to get the
templates.

Any suggestions?

Thanks,

...Robert

#23Lawrence, Ramon
ramon.lawrence@ubc.ca
In reply to: Lawrence, Ramon (#1)
2 attachment(s)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

Robert,

You do not need to use qgen.exe to generate queries as you are not
running the TPC-H benchmark test. Attached is an example of the 22
sample TPC-H queries according to the benchmark.

We have not tested using the TPC-H queries for this particular patch and
only use the TPC-H database as a large, skewed data set. The simpler
queries we test involve joins of Part-Lineitem or Supplier-Lineitem such
as:

Select * from part, lineitem where p_partkey = l_partkey

OR

Select count(*) from part, lineitem where p_partkey = l_partkey

The count(*) version is usually more useful for comparisons as the
generation of output tuples on the client side (say with pgadmin)
dominates the actual time to complete the query.
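
If you just want quick numbers before setting up the function, a hedged
shortcut is to let the server do the timing:

-- The result stays server-side, so client-side tuple handling does not
-- distort the measurement (assumes the TPC-H tables above).
explain analyze
select count(*) from part, lineitem where p_partkey = l_partkey;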

To isolate query costs, we also test using a simple server-side
function. The setup description I have also attached.

I would be happy to help in any way I can.

Bryce is currently working on an updated patch according to your
suggestions.

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon.lawrence@ubc.ca

Show quoted text


Attachments:

test_queries.txttext/plain; name=test_queries.txtDownload
setup.txttext/plain; name=setup.txtDownload
#24Bryce Cutt
pandasuit@gmail.com
In reply to: Robert Haas (#21)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

Robert,

I thoroughly appreciate the constructive criticism.

The compile errors are due to my development process being convoluted.
I will endeavor to not waste your time in the future with errors
caused by my development process.

I have updated the code to follow the conventions and suggestions
given. I am now working on adding the requested documentation. I
will not submit the next patch until that is done. The functionality
has not changed so you can performance test with the patch you have.

As for that particularly ugly piece of code: I figured that out while
digging through the selfuncs code. Basically I needed a way to get
the stats tuple for the join column of the outer relation, and to
do that I needed to figure out how to get the actual relation id and
attribute number being joined.

I have not yet figured out a better way to do this but I am sure there
is someone on the mailing list with far more knowledge of this than I
have.

I could possibly be more vigorous in testing to make sure the things I
am casting are exactly what I expect. My tests have always been
consistent so far.

I am essentially doing what is done in selfuncs. I believe I could
use the examine_variable() function in selfuncs.c except I would first
need a PlannerInfo and I don't think I can get that from inside the
join initialization code.

- Bryce Cutt

Show quoted text

On Mon, Dec 15, 2008 at 8:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I have to admit that I haven't fully grokked what this patch is about
just yet, so what follows is mostly a coding style review at this
point. It would help a lot if you could add some comments to the new
functions that are being added to explain the purpose of each at a
very high level. There's clearly been a lot of thought put into some
parts of this logic, so it would be worth explaining the reasoning
behind that logic.

This patch applies clearly against CVS HEAD, but does not compile
(please fix the warning, too).

nodeHash.c:88: warning: no previous prototype for 'freezeNextMCVPartiton'
nodeHash.c: In function 'freezeNextMCVPartiton':
nodeHash.c:148: error: 'struct HashJoinTableData' has no member named 'inTupIOs'

I commented out the offending line. It errored out again here:

nodeHashjoin.c: In function 'getMostCommonValues':
nodeHashjoin.c:136: error: wrong type argument to unary plus

After removing the stray + sign, it compiled, but failed the
"rangefuncs" regression test. If you're going to keep the
enable_hashjoin_usestatmvcs() GUC around, you need to patch
rangefuncs.out so that the regression tests pass. I think, however,
that there was some discussion of removing that before the patch is
committed; if so, please do that instead. Keeping the GUC would also
require patching the documentation, which the current patch does not
do.

getMostCommonValues() isn't a good name for a non-static function
because there's nothing to tip the reader off to the fact that it has
something to do with hash joins; compare with the other function names
defined in the same header file. On the flip side, that function has
only one call site, so it should probably be made static and not
declared in the header file at all. Some of the other new functions
need similar treatment. I am also a little suspicious of this bit of
code:

relid = getrelid(((SeqScan *) ((SeqScanState *)
outerPlanState(hjstate))->ps.plan)->scanrelid,
estate->es_range_table);
clause = (FuncExprState *) lfirst(list_head(hjstate->hashclauses));
argstate = (ExprState *) lfirst(list_head(clause->args));
variable = (Var *) argstate->expr;

I'm not very familiar with the hash join code, but it seems like there
are a lot of assumptions being made there about what things are
pointing to what other things. Is this this actually safe? And if it
is, perhaps a comment explaining why?

getMostCommonValues() also appears to be creating and maintaining a
counter called collisionsWhileHashing, but nothing is ever done with
the counter. On a similar note, the variables relattnum, atttype, and
atttypmod don't appear to be necessary; 2 out of 3 of them are only
used once, so maybe inlining the reference and dropping the variable
would make more sense. Also, the if (HeapTupleIsValid(statsTuple))
block encompasses the whole rest of the function, maybe if
(!HeapTupleIsValid(statsTuple)) return?

I don't understand why
hashtable->mostCommonTuplePartition[bucket].tuples and .frozen need to
be initialized to 0. It looks to me like those are in a zero-filled
array that was just allocated, so it shouldn't be necessary to re-zero
them, unless I'm missing something.
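(For illustration only: if that array comes from palloc0(), as it appears to, the fields start out zeroed. The type name below is from the patch; the array size is made up.)

/* palloc0() returns zero-filled memory, so no per-element re-zeroing is needed */
int nslots = 1024;		/* illustrative size only */
HashJoinMostCommonValueTuplePartition *parts =
	(HashJoinMostCommonValueTuplePartition *)
	palloc0(nslots * sizeof(HashJoinMostCommonValueTuplePartition));
/* parts[i].hashvalue == 0, parts[i].tuples == NULL and parts[i].frozen == false already */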

freezeNextMCVPartiton is mis-spelled consistently throughout (the last
three letters should be "ion"). I also don't think it makes sense to
enclose everything but the first two lines of that function in an
else-block.

There is some initialization code in ExecHashJoin() that looks like it
belongs in ExecHashTableCreate.

It appears to me that the interface to isAMostCommonValue() could be
simplified by just making it return the partition number. It could
perhaps be renamed something like ExecHashGetMCVPartition().

Does ExecHashTableDestroy() need to explicitly pfree
hashtable->mostCommonTuplePartition and
hashtable->flushOrderedMostCommonTuplePartition? I would think those
would be allocated out of hashCxt - if they aren't, they probably
should be.

Department of minor nitpicks: (1) C++-style comments are not
permitted, (2) function names need to be capitalized like_this() or
LikeThis() but not likeThis(), (3) when defining a function, the
return type should be placed on the line preceding the actual function
name, so that the function name is at the beginning of the line, (4)
curly braces should be avoided around a block containing only one
statement, (5) excessive blank lines should be avoided (for example,
the one in costsize.c is clearly unnecessary, and there's at least one
place where you add two consecutive blank lines), and (6) I believe
the accepted way to write an empty loop is an indented semi-colon on
the next line, rather than {} on the same line as the while.
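(A toy example of conventions (2), (3), (4), and (6) above; ExecHashExampleScan and SomeConditionHolds are made-up names used only to show the layout.)

/* example only: PostgreSQL-style layout for a small function */
static bool
ExecHashExampleScan(HashJoinTable hashtable)	/* return type on the preceding line */
{
	int			i = 0;

	/* empty loop body: an indented semicolon on the next line, not "{}" */
	while (SomeConditionHolds(hashtable, i++))
		;

	if (i > 0)
		return true;	/* single statement, so no curly braces */

	return false;
}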

I will try to do some more substantive testing of this as well.

...Robert

#25Robert Haas
robertmhaas@gmail.com
In reply to: Bryce Cutt (#24)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

[Some performance testing.]

I ran this query 10x with this patch applied, and then 10x again with
enable_hashjoin_usestatmcvs set to false to disable the optimization:

select sum(1) from (select * from part, lineitem where p_partkey = l_partkey) x;

With the optimization enabled, the query took between 26.6 and 38.3
seconds with an average of 31.6. With the optimization disabled, the
query took between 48.3 and 69.0 seconds with an average of 60.0
seconds.

It appears that the 100 entries in pg_statistic cover about 32% of l_partkey:

tpch=# WITH x AS (
SELECT stanumbers1, array_length(stanumbers1, 1) AS len
FROM pg_statistic WHERE starelid='lineitem'::regclass
AND staattnum = (SELECT attnum FROM pg_attribute
WHERE attrelid='lineitem'::regclass AND
attname='l_partkey')
)
SELECT sum(x.stanumbers1[y.g]) FROM x,
(select generate_series(1, x.len) g from x) y;
sum
--------
0.3276
(1 row)

(there's probably a better way to write that query...)

stadistinct for l_partkey is 23,050; the actual number of distinct
values is 199,919. IOW, roughly 0.05% of the distinct values account for
32.76% of the table. That's a lot of skew, but not unrealistic - I've
seen tables where more than half of the rows were covered by a single
value.

...Robert

#26Joshua Tolley
eggyknap@gmail.com
In reply to: Robert Haas (#25)
6 attachment(s)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Sun, Dec 21, 2008 at 10:25:59PM -0500, Robert Haas wrote:

[Some performance testing.]

I (finally!) have a chance to post my performance testing results... my
apologies for the really long delay. <Excuses omitted>

Unfortunately I'm not seeing wonderful speedups with the particular
queries I did in this case. I generated three 1GB datasets, with skews
set at 1, 2, and 3. The test script I wrote turns on enable_usestatmcvs
and runs EXPLAIN ANALYZE on the same query five times. Then it turns
enable_usestatmcvs off, and runs the same query five more times. It does
this with each of the three datasets in turn, and then starts over at
the beginning until I tell it to quit. My results showed a statistically
significant improvement in speed only on the skew == 3 dataset.

I did the same tests twice, once with default_statistics_target set to
10, and once with it set to 100. I've attached boxplots of the total
query times as reported by EXPLAIN ANALYZE ("dst10" in the filename
indicates default_statistics_target was 10, and so on), my results
parsed out of the EXPLAIN ANALYZE output (test.filtered.10 and
test.filtered.100), the results of one-tailed Student's T tests of the
result set (ttests), and the R code to run the tests if anyone's really
interested (t.test.R).

The results data includes six columns: the skew value, whether
enable_usestatmcvs was on or not (represented by a 1 or 0), total times
for each of the three joins that made up the query, and total time for
the query itself. The results above pay attention only to the total
query time.

Finally, the query involved:

SELECT * FROM lineitem l LEFT JOIN part p ON (p.p_partkey = l.l_partkey)
LEFT JOIN orders o ON (o.o_orderkey = l.l_orderkey) LEFT JOIN customer c
ON (c.c_custkey = o.o_custkey);

- Josh / eggyknap

Attachments:

boxplot-dst10.pngimage/pngDownload
boxplot-dst100.pngimage/pngDownload
test.filtered.10text/plain; charset=us-asciiDownload
test.filtered.100text/plain; charset=us-asciiDownload
tteststext/plain; charset=us-asciiDownload
t.test.Rtext/plain; charset=us-asciiDownload
#27Bryce Cutt
pandasuit@gmail.com
In reply to: Joshua Tolley (#26)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

Because there is no nice way in PostgreSQL (that I know of) to derive
a histogram after a join (on an intermediate result),
usingMostCommonValues is currently only enabled on a join when the
outer (probe) side is a table scan (seq scan only, actually). See
getMostCommonValues (soon to be called
ExecHashJoinGetMostCommonValues) for the logic that determines this.

Here is the result of EXPLAIN (on a 100MB version of the TPC-H database):
"Hash Left Join (cost=16232.00..91035.00 rows=600000 width=526)"
" Hash Cond: (l.l_partkey = p.p_partkey)"
" -> Hash Left Join (cost=15368.00..75171.00 rows=600000 width=395)"
" Hash Cond: (l.l_orderkey = o.o_orderkey)"
" -> Seq Scan on lineitem l (cost=0.00..17867.00 rows=600000
width=125)"
" -> Hash (cost=8073.00..8073.00 rows=150000 width=270)"
" -> Hash Left Join (cost=700.50..8073.00 rows=150000 width=270)"
" Hash Cond: (o.o_custkey = c.c_custkey)"
" -> Seq Scan on orders o (cost=0.00..4185.00
rows=150000 width=109)"
" -> Hash (cost=513.00..513.00 rows=15000 width=161)"
" -> Seq Scan on customer c
(cost=0.00..513.00 rows=15000 width=161)"
" -> Hash (cost=614.00..614.00 rows=20000 width=131)"
" -> Seq Scan on part p (cost=0.00..614.00 rows=20000 width=131)"

If you take a look at the EXPLAIN output for that query, you will see
that the first relations joined are orders and customer on custkey.
There is almost no skew in the o_custkey attribute of orders, even in
the Z2 dataset, so the difference between hash join with and without
usingMostCommonValues enabled is quite small.

The second join performed joins the result of the first join with
lineitem on orderkey. There is no skew at all in the l_orderkey
attribute of lineitem, so usingMostCommonValues cannot help at all.

The third join performed joins the result of the second join with part
on partkey. There is a lot of skew in the l_partkey attribute of
lineitem, but because the probe side of the third join is an
intermediate result from the second join and not a seq scan, the
algorithm cannot determine the MCVs of the probe side.

So, on the query presented, almost no skew can be exploited in the
first join, and the other joins cannot have their skew exploited at all
because of the order in which PostgreSQL performs the joins. Basically
yes, you would not see any real benefit from using the most common
values on this query.

We experimented with sampling (mentioned in the paper) to make an
educated guess at the MCVs of intermediary results, but found that,
because a random sample could not be obtained, the results were always
very inaccurate. I basically just read a percentage of tuples from the
probe relation before partitioning the build relation, derived the
MCVs in a single pass, wrote the tuples back out to a temp file
(because reading back from here is less expensive than resetting the
probe side tree), then did the join as usual while remembering to read
back from my temp file before reading the rest of the probe side
tuples. Our tests indicate that sampling is not likely a good
solution for deriving MCVs from intermediary results.

In the Java implementation of histojoin we experimented with
exploiting skew in multiple joins of a star join with some success
(detailed in paper). I am not sure how this would be accomplished
nicely in PostgreSQL.

If the cost functions knew how to order the joins to make the best use
of skew in the relations, PostgreSQL could benefit from histojoin more
often; doing a join with skew first may have speed benefits over doing
the smaller join first. This change could be a future addition to
PostgreSQL if this patch is accepted. It relies on getting the stats
tuple for the join during the planning phase (in the cost function) and
estimating the benefit the skew would have on the join cost.

- Bryce Cutt

#28Robert Haas
robertmhaas@gmail.com
In reply to: Bryce Cutt (#27)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Tue, Dec 23, 2008 at 2:21 AM, Bryce Cutt <pandasuit@gmail.com> wrote:

Because there is no nice way in PostgreSQL (that I know of) to derive
a histogram after a join (on an intermediate result) currently
usingMostCommonValues is only enabled on a join when the outer (probe)
side is a table scan (seq scan only actually). See
getMostCommonValues (soon to be called
ExecHashJoinGetMostCommonValues) for the logic that determines this.

It's starting to seem to me that the case where this patch provides a
benefit is so narrow that I'm not sure it's worth the extra code.
Admittedly, when it works, it is pretty dramatic, as in the numbers
that I posted previously. I'm OK with the fact that it is restricted
to hash joins on a single variable where the probe relation is a
sequential scan, because that actually happens pretty frequently, at
least in my queries. But, if there's no way to consistently get any
benefit out of this when joining more than two tables, then I'm not
sure it's worth it.

Is it realistic to think that the MCVs of the base relation might
still be applicable to the joinrel? It's certainly easy to think of
counterexamples, but it might be a good approximation more often than
not.

...Robert

#29Joshua Tolley
eggyknap@gmail.com
In reply to: Robert Haas (#28)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Tue, Dec 23, 2008 at 09:22:27AM -0500, Robert Haas wrote:

On Tue, Dec 23, 2008 at 2:21 AM, Bryce Cutt <pandasuit@gmail.com> wrote:

Because there is no nice way in PostgreSQL (that I know of) to derive
a histogram after a join (on an intermediate result) currently
usingMostCommonValues is only enabled on a join when the outer (probe)
side is a table scan (seq scan only actually). See
getMostCommonValues (soon to be called
ExecHashJoinGetMostCommonValues) for the logic that determines this.

So my test case of "do a whole bunch of hash joins in a test query"
isn't really valid. Makes sense. I did another, more haphazard test on a
query with fewer joins, and saw noticeable speedups.

It's starting to seem to me that the case where this patch provides a
benefit is so narrow that I'm not sure it's worth the extra code.

Not that anyone asked, but I don't consider myself qualified to render
judgement on that point. Code size is, I guess, a maintainability issue,
and I'm not terribly experienced maintaining PostgreSQL :)

Is it realistic to think that the MCVs of the base relation might
still be applicable to the joinrel? It's certainly easy to think of
counterexamples, but it might be a good approximation more often than
not.

It's equivalent to our assumption that distributions of values in
columns in the same table are independent. Making that assumption in
this case would probably result in occasional dramatic speed
improvements similar to the ones we've seen in less complex joins,
offset by just-as-occasional dramatic slowdowns of similar magnitude. In
other words, it will increase the variance of our results.

- Josh

#30Robert Haas
robertmhaas@gmail.com
In reply to: Joshua Tolley (#29)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

It's equivalent to our assumption that distributions of values in
columns in the same table are independent. Making that assumption in
this case would probably result in occasional dramatic speed
improvements similar to the ones we've seen in less complex joins,
offset by just-as-occasional dramatic slowdowns of similar magnitude. In
other words, it will increase the variance of our results.

Under what circumstances do you think that it would produce a dramatic
slowdown? I'm confused. I thought the penalty for picking a bad set
of values for the in-memory hash table was pretty small.

...Robert

#31Lawrence, Ramon
ramon.lawrence@ubc.ca
In reply to: Lawrence, Ramon (#1)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

Because there is no nice way in PostgreSQL (that I know of) to derive
a histogram after a join (on an intermediate result) currently
usingMostCommonValues is only enabled on a join when the outer (probe)
side is a table scan (seq scan only actually). See
getMostCommonValues (soon to be called
ExecHashJoinGetMostCommonValues) for the logic that determines this.

So my test case of "do a whole bunch of hash joins in a test query"
isn't really valid. Makes sense. I did another, more haphazard test on
a query with fewer joins, and saw noticeable speedups.

It's starting to seem to me that the case where this patch provides a
benefit is so narrow that I'm not sure it's worth the extra code.

Not that anyone asked, but I don't consider myself qualified to render
judgement on that point. Code size is, I guess, a maintainability
issue, and I'm not terribly experienced maintaining PostgreSQL :)

Is it realistic to think that the MCVs of the base relation might
still be applicable to the joinrel? It's certainly easy to think of
counterexamples, but it might be a good approximation more often than
not.

It's equivalent to our assumption that distributions of values in
columns in the same table are independent. Making that assumption in
this case would probably result in occasional dramatic speed
improvements similar to the ones we've seen in less complex joins,
offset by just-as-occasional dramatic slowdowns of similar magnitude.
In other words, it will increase the variance of our results.

- Josh

There is almost zero penalty for selecting incorrect MCV tuples to
buffer in memory. Since the number of MCVs is approximately 100, the
"overhead" is keeping these 100 tuples in memory where they *might* not
be MCVs. The cost is the little extra memory and the checking of the
MCVs, which is very fast.

On the other hand, the benefit is potentially tremendous if the MCV is
very common in the probe relation. Every probe tuple that matches the
MCV tuple in memory does not have to be written to disk. The potential
speedup is directly proportional to the skew. The more skew the more
benefit.

An analogy is with a page buffering system where one goal is to keep
frequently used pages in the buffer. Essentially the goal of this patch
is to "pin in memory" the tuples that the join believes will match with
the most tuples on the probe side. This reduces I/Os by making more
probe relation tuples match during the first read of the probe relation.
Regular hash join has no way to guarantee frequently matched build
tuples remain memory-resident.

The particular join with Customer, Orders, LineItem, and Part is a
reasonable test case. There may be two explanations for the results.
(I am running tests for this query currently.) First, the time to
generate the tuples (select *) may be dominating the query time.
Second, as mentioned by Bryce, I expect the issue is that only the join
with Customer and Orders exploited the patch. Customer has some skew
(but not dramatic) so there would be some speedup.

However, the join with Part and LineItem *should* show a benefit but may
not because of a limitation of the patch implementation (not the idea).
The MCV optimization is only enabled currently when the probe side is a
sequential scan. This limitation is due to our current inability to
determine a stats tuple of the join attribute on the probe side for
other operators. (This should be possible - help please?).

Even if this stats tuple is on the base relation and may not exactly
reflect the distribution of the intermediate relation on the probe side,
it still could be very good. Even if it is not, once again the cost is
negligible.

In summary, the patch will improve performance of any multi-batch hash
join with skew. It is useful right now when the probe relation has skew
and is accessed using a sequential scan. It would be useful in even
more situations if the code was modified to determine the stats for the
join attribute of the probe relation in all cases (even when the probe
relation is produced by another operator).

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon.lawrence@ubc.ca

#32Joshua Tolley
eggyknap@gmail.com
In reply to: Robert Haas (#30)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Tue, Dec 23, 2008 at 10:14:29AM -0500, Robert Haas wrote:

It's equivalent to our assumption that distributions of values in
columns in the same table are independent. Making that assumption in
this case would probably result in occasional dramatic speed
improvements similar to the ones we've seen in less complex joins,
offset by just-as-occasional dramatic slowdowns of similar magnitude. In
other words, it will increase the variance of our results.

Under what circumstances do you think that it would produce a dramatic
slowdown? I'm confused. I thought the penalty for picking a bad set
of values for the in-memory hash table was pretty small.

...Robert

I take that back :) I agree with what others have already said, that it
shouldn't cause dramatic slowdowns when we get it wrong.

- Josh

#33Robert Haas
robertmhaas@gmail.com
In reply to: Lawrence, Ramon (#31)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

There is almost zero penalty for selecting incorrect MCV tuples to
buffer in memory. Since the number of MCVs is approximately 100, the
"overhead" is keeping these 100 tuples in memory where they *might* not
be MCVs. The cost is the little extra memory and the checking of the
MCVs which is very fast.

I looked at this some more. I'm a little concerned about the way
we're maintaining the in-memory hash table. Since the highest legal
statistics target is now 10,000, it's possible that we could have two
orders of magnitude more MCVs than what you're expecting. As I read
the code, that could lead to construction of an in-memory hash table
with 64K slots. On a 32-bit machine, I believe that works out to 16
bytes per partition (12 and 4), which is a 1MB hash table. That's not
necessarily problematic, except that I don't think you're considering
the size of the hash table itself when evaluating whether you are
blowing out work_mem, and the default size of work_mem is 1MB.

I also don't really understand why we're trying to control the size of
the hash table by flushing tuples after the fact. Right now, when the
in-memory table fills up, we just keep adding tuples to it, which in
turn forces us to flush out other tuples to keep the size down. This
seems quite inefficient - not only are we doing a lot of unnecessary
allocating and freeing, but those flushed slots in the hash table
degrade performance (because they don't stop the scan for an empty
slot). It seems like we could simplify things considerably by adding
tuples to the in-memory hash table only to the point where the next
tuple would blow it out. Once we get to that point, we can skip the
isAMostCommonValue() test and send any future tuples straight to temp
files. (This would also reduce the memory consumption of the
in-memory table by a factor of two.)
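(For concreteness, a rough sketch of the control flow being suggested here; mcvTableFull and mcvSpaceAllowed are hypothetical names, and this is not the code in the submitted patch.)

/* sketch only: stop using the MCV path once the next tuple would blow the budget */
partitionNumber = mcvTableFull ? MCV_INVALID_PARTITION
	: ExecHashGetMCVPartition(hashtable, hashvalue);

if (partitionNumber != MCV_INVALID_PARTITION)
{
	int		hashTupleSize = HJTUPLE_OVERHEAD +
							ExecFetchSlotMinimalTuple(slot)->t_len;

	if (hashtable->spaceUsed + hashTupleSize > mcvSpaceAllowed)
	{
		mcvTableFull = true;					/* later tuples skip the MCV test */
		partitionNumber = MCV_INVALID_PARTITION;
	}
}

if (partitionNumber != MCV_INVALID_PARTITION)
{
	/* insert into the in-memory MCV partition, as the patch already does */
}
else
{
	ExecHashTableInsert(hashtable, slot, hashvalue);
	hashtable->totalTuples += 1;
}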

We could potentially improve on this even further if we can estimate
in advance how many MCVs we can fit into the in-memory hash table
before it gets blown out. If, for example, we have only 1MB of
work_mem but there are 10,000 MCVs, getMostCommonValues() might decide to
only hash the first 1,000 MCVs. Even if we still blow out the
in-memory hash table, the earlier MCVs are more frequent than the
later MCVs, so the ones that actually make it into the table are
likely to be more beneficial. I'm not sure exactly how to do this
tuning though, since we'd need to approximate the size of the
tuples... I guess the query planner makes some effort to estimate that
but I'm not sure how to get at it.

However, the join with Part and LineItem *should* show a benefit but may
not because of a limitation of the patch implementation (not the idea).
The MCV optimization is only enabled currently when the probe side is a
sequential scan. This limitation is due to our current inability to
determine a stats tuple of the join attribute on the probe side for
other operators. (This should be possible - help please?).

Not sure how to get at this either, but I'll take a look and see if I
can figure it out.

Merry Christmas,

...Robert

#34Lawrence, Ramon
ramon.lawrence@ubc.ca
In reply to: Robert Haas (#33)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

-----Original Message-----
From: Robert Haas [mailto:robertmhaas@gmail.com]
I looked at this some more. I'm a little concerned about the way
we're maintaining the in-memory hash table. Since the highest legal
statistics target is now 10,000, it's possible that we could have two
orders of magnitude more MCVs than what you're expecting. As I read
the code, that could lead to construction of an in-memory hash table
with 64K slots. On a 32-bit machine, I believe that works out to 16
bytes per partition (12 and 4), which is a 1MB hash table. That's not
necessarily problematic, except that I don't think you're considering
the size of the hash table itself when evaluating whether you are
blowing out work_mem, and the default size of work_mem is 1MB.

I totally agree that 10,000 MCVs changes things. Ideally, these 10,000
MCVs should be kept in memory because they will join with the most
tuples. However, the size of the MCV hash table (as you point out) can
be bigger than work_mem *by itself* not even considering the tuples in
the table or in the in-memory batch. Supporting that many MCVs would
require more modifications to the hash join algorithm.

100 MCVs should be able to fit in memory though. Since the number of
batches is rounded to a power of 2, there is often some hash_table_bytes
that are not used by the in-memory batch that can be "used" to store the
MCV table. The absolute size of the memory used should also be
reasonable (depending on the tuple size in bytes).

So, basically, we have a decision to make: whether to try to support a
larger number of MCVs or cap it at a reasonable number like 100. You
can come up with situations where using all 10,000 MCVs is good (for
instance if all MCVs have frequency 1/10000), but I expect 100 MCVs will
capture the majority of the cases as usually the top 100 MCVs are
significantly more frequent than later MCVs.

I now also see that the code should be changed to keep track of the MCV
bytes separately from hashtable->spaceUsed as this is used to determine
when to dynamically increase the number of batches.

I also don't really understand why we're trying to control the size of
the hash table by flushing tuples after the fact. Right now, when the
in-memory table fills up, we just keep adding tuples to it, which in
turn forces us to flush out other tuples to keep the size down. This
seems quite inefficient - not only are we doing a lot of unnecessary
allocating and freeing, but those flushed slots in the hash table
degrade performance (because they don't stop the scan for an empty
slot). It seems like we could simplify things considerably by adding
tuples to the in-memory hash table only to the point where the next
tuple would blow it out. Once we get to that point, we can skip the
isAMostCommonValue() test and send any future tuples straight to temp
files. (This would also reduce the memory consumption of the
in-memory table by a factor of two.)

In the ideal case, we select a number of MCVs to support that we know
will always fit in memory. The flushing is used to deal with the case
where we are doing a many-to-many join and there may be multiple tuples
with the given MCV value in the build relation.

The issue with building the MCV table is that the hash operator will not
be receiving tuples in MCV frequency order. It is possible that the MCV
table is filled up with tuples of less frequent MCVs when a more
frequent MCV tuple arrives. In that case, we would like to keep the
more frequent MCV and bump one of the less frequent MCVs.

We could potentially improve on this even further if we can estimate
in advance how many MCVs we can fit into the in-memory hash table
before it gets blown out. If, for example, we have only 1MB of
work_mem but there are 10,000 MCVs, getMostCommonValues() might decide to
only hash the first 1,000 MCVs. Even if we still blow out the
in-memory hash table, the earlier MCVs are more frequent than the
later MCVs, so the ones that actually make it into the table are
likely to be more beneficial. I'm not sure exactly how to do this
tuning though, since we'd need to approximate the size of the
tuples... I guess the query planner makes some effort to estimate that
but I'm not sure how to get at it.

The number of batches (nbatch), inner_rel_bytes, and hash_table_bytes
are calculated in ExecChooseHashTableSize in nodeHash.c.

The number of bytes "free" not allocated to the in-memory batch is then:

hash_table_bytes - inner_rel_bytes/nbatch

Depending on the power of 2 rounding of nbatch, this may be almost 0 or
quite large. You could change the calculation of nbatch or try to
resize the in-memory batch, but that opens up a can of worms. It may be
best to assume a small number of MCVs, say 10 or 100.
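(Spelled out, purely for illustration, with the quantities that ExecChooseHashTableSize already computes:)

/* illustration only, using the names from ExecChooseHashTableSize */
long	hash_table_bytes = work_mem * 1024L;			/* total memory budget */
double	batch_bytes      = inner_rel_bytes / nbatch;	/* in-memory batch; nbatch is a power of 2 */
double	leftover_bytes   = hash_table_bytes - batch_bytes;	/* room that could hold the MCV table */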

However, the join with Part and LineItem *should* show a benefit but
may not because of a limitation of the patch implementation (not the
idea). The MCV optimization is only enabled currently when the probe
side is a sequential scan. This limitation is due to our current
inability to determine a stats tuple of the join attribute on the
probe side for other operators. (This should be possible - help
please?).

Not sure how to get at this either, but I'll take a look and see if I
can figure it out.

After more digging, we can extract the original relation id and
attribute id of the join attribute using the instance variables varnoold
and varoattno of Var. It is documented that these variables are just
kept around for debugging, but they are definitely useful here.

New code would be:
relid = getrelid(variable->varnoold, estate->es_range_table);
relattnum = variable->varoattno;
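(A rough sketch of how that relid/relattnum pair, together with the Var pointer "variable" from the earlier snippet, could feed a selfuncs-style MCV lookup. This assumes it lives inside a void helper such as the patch's ExecHashJoinGetMostCommonValues and uses the 8.3-era syscache/lsyscache interfaces; it is an illustration, not the exact patch code.)

HeapTuple	statsTuple;
Datum	   *values;
int			nvalues;
float4	   *numbers;
int			nnumbers;

statsTuple = SearchSysCache(STATRELATT,
							ObjectIdGetDatum(relid),
							Int16GetDatum(relattnum),
							0, 0);
if (!HeapTupleIsValid(statsTuple))
	return;				/* no statistics: behave as plain hybrid hash join */

if (get_attstatsslot(statsTuple,
					 variable->vartype, variable->vartypmod,
					 STATISTIC_KIND_MCV, InvalidOid,
					 &values, &nvalues, &numbers, &nnumbers))
{
	/*
	 * values[0..nvalues-1] are the most common values of the probe-side
	 * attribute and numbers[i] their frequencies, most frequent first;
	 * hash each value here to seed the in-memory MCV partitions.
	 */
	free_attstatsslot(variable->vartype, values, nvalues, numbers, nnumbers);
}
ReleaseSysCache(statsTuple);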

Thanks for working with us on the patch.

Happy Holidays Everyone,

Ramon Lawrence

#35Robert Haas
robertmhaas@gmail.com
In reply to: Lawrence, Ramon (#34)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

I totally agree that 10,000 MCVs changes things. Ideally, these 10,000
MCVs should be kept in memory because they will join with the most
tuples. However, the size of the MCV hash table (as you point out) can
be bigger than work_mem *by itself* not even considering the tuples in
the table or in the in-memory batch.

So, basically, we have a decision to make whether to try support a
larger number of MCVs or cap it at a reasonable number like a 100. You
can come up with situations where using all 10,000 MCVs is good (for
instance if all MCVs have frequency 1/10000), but I expect 100 MCVs will
capture the majority of the cases as usually the top 100 MCVs are
significantly more frequent than later MCVs.

I thought about this, but upon due reflection I think it's the wrong
approach. Raising work_mem is a pretty common tuning step - it's 4MB
even on my small OLTP systems, and in a data-warehousing environment
where this optimization will bring the most benefit, it could easily
be higher. Furthermore, if someone DOES change the statistics target
for that column to 10,000, there's a pretty good chance that they had
a reason for doing so (or at the very least it's not for us to assume
that they were doing something stupid). I think we need some kind of
code to try to tune this based on the actual situation.

We might try to size the in-memory hash table to be the largest value
that won't increase the total number of batches, but if the number of
batches is large then this won't be the right decision. Maybe we
should insist on setting aside some minimum percentage of work_mem for
the in-memory hash table, and fill it with however many MCVs we think
will fit.

The issue with building the MCV table is that the hash operator will not
be receiving tuples in MCV frequency order. It is possible that the MCV
table is filled up with tuples of less frequent MCVs when a more
frequent MCV tuple arrives. In that case, we would like to keep the
more frequent MCV and bump one of the less frequent MCVs.

I agree. However, there's no reason at all to assume that the tuples
we flush out of the table are any better or worse than the new ones we
add back in later. In fact, although it's far from a guarantee, if
the order of the tuples in the table is random, then we're more likely
to encounter the most common values first. We might as well just keep
the ones we had rather than dumping them out and adding in different
ones. Err, except, maybe we can't guarantee correctness that way, in
the case of a many-to-many join?

I don't think there's any way to get around the possibility of a
hash-table overflow completely. Besides many-to-many joins, there's
also the possibility of hash collisions. The code assumes that
anything that hashes to the same 32-bit value as an MCV is in fact an
MCV, which is obviously false, but doesn't seem worth worrying about
since the chances of a collision are very small and the equality test
might be expensive. But clearly we want to minimize overflows as much
as we can.

...Robert

#36Lawrence, Ramon
ramon.lawrence@ubc.ca
In reply to: Lawrence, Ramon (#1)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

I thought about this, but upon due reflection I think it's the wrong
approach. Raising work_mem is a pretty common tuning step - it's 4MB
even on my small OLTP systems, and in a data-warehousing environment
where this optimization will bring the most benefit, it could easily
be higher. Furthermore, if someone DOES change the statistics target
for that column to 10,000, there's a pretty good chance that they had
a reason for doing so (or at the very least it's not for us to assume
that they were doing something stupid). I think we need some kind of
code to try to tune this based on the actual situation.

We might try to size the in-memory hash table to be the largest value
that won't increase the total number of batches, but if the number of
batches is large then this won't be the right decision. Maybe we
should insist on setting aside some minimum percentage of work_mem for
the in-memory hash table, and fill it with however many MCVs we think
will fit.

I think that setting aside a minimum percentage of work_mem may be a
reasonable approach. For instance, setting aside 1% of even 1 MB of
work_mem would be 10 KB, which is enough to store about 40 MCV tuples
of the TPC-H database. Such a small percentage would be very unlikely
(but still possible) to change the number of batches used. Then, given
the memory allocation and the known tuple size + overhead, only that
number of MCVs is selected for the MCV table, regardless of how many
there are. The MCV table size would then increase as work_mem is
changed, up to a maximum given by the number of MCVs.
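(A back-of-the-envelope sketch of that sizing rule; mcvBudgetBytes, estTupleBytes, avgBuildTupleWidth, and nMCVsToUse are hypothetical names, and nnumbers is the number of MCVs found in the statistics slot.)

/* sketch only: cap the MCV table at a fixed fraction of work_mem */
long	mcvBudgetBytes = (work_mem * 1024L) / 100;					/* ~1% of work_mem */
int		estTupleBytes  = HJTUPLE_OVERHEAD + avgBuildTupleWidth;		/* estimated build tuple size */
int		nMCVsToUse     = Min(nnumbers, (int) (mcvBudgetBytes / estTupleBytes));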

I agree. However, there's no reason at all to assume that the tuples
we flush out of the table are any better or worse than the new ones we
add back in later. In fact, although it's far from a guarantee, if
the order of the tuples in the table is random, then we're more likely
to encounter the most common values first. We might as well just keep
the ones we had rather than dumping them out and adding in different
ones. Err, except, maybe we can't guarantee correctness that way, in
the case of a many-to-many join?

The code, when building the MCV hash table, keeps track of the
insertion order of the best MCVs. It then flushes the MCV partitions
starting with the least frequent MCVs. Thus, by the end of the build
partitioning phase, the MCV hash table should only store the most
frequent MCV tuples. Even with many-to-many joins, as long as we keep
all build tuples that have a given MCV in memory, everything is fine.
You would get into problems if you flushed only some of the tuples of
a certain MCV, but that will not happen.

--
Ramon Lawrence

#37Robert Haas
robertmhaas@gmail.com
In reply to: Lawrence, Ramon (#36)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

I think that setting aside a minimum percentage of work_mem may be a
reasonable approach. For instance, setting aside 1% at even 1 MB
work_mem would be 10 KB which is enough to store about 40 MCV tuples of
the TPC-H database. Such a small percentage would be very unlikely (but
still possible) to change the number of batches used. Then, given the
memory allocation and the known tuple size + overhead, only that number
of MCVs are selected for the MCV table regardless how many there are.
The MCV table size would then increase as work_mem is changed up to a
maximum given by the number of MCVs.

Sounds fine. Maybe 2-3% would be better.

The code, when building the MCV hash table, keeps track of the
insertion order of the best MCVs. It then flushes the MCV partitions
starting with the least frequent MCVs. Thus, by the end of the build
partitioning phase, the MCV hash table should only store the most
frequent MCV tuples. Even with many-to-many joins, as long as we keep
all build tuples that have a given MCV in memory, everything is fine.
You would get into problems if you flushed only some of the tuples of
a certain MCV, but that will not happen.

OK, I'll read it again - I must not have understood.

It would be good to post an updated patch soon, even if not everything
has been addressed.

...Robert

#38Bryce Cutt
pandasuit@gmail.com
In reply to: Robert Haas (#37)
1 attachment(s)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

Here is the next patch version.

The naming and style concerns have been addressed. The patch now only
touches 5 files: 4 of those files are hashjoin specific, and 1 adds a
couple of lines to a hashjoin-specific struct in another file.

The code can now find the MCVs in more cases. Even if the probe side
is an operator other than a seq scan (such as another hashjoin), the
code can now find the stats tuple for the underlying relation.

The new idea of limiting the number of MCVs to a percentage of memory
has not been added yet.

- Bryce Cutt

Attachments:

histojoin_v4.patchapplication/octet-stream; name=histojoin_v4.patchDownload
Index: src/backend/executor/nodeHash.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHash.c,v
retrieving revision 1.116
diff -c -r1.116 nodeHash.c
*** src/backend/executor/nodeHash.c	1 Jan 2008 19:45:49 -0000	1.116
--- src/backend/executor/nodeHash.c	29 Dec 2008 07:38:52 -0000
***************
*** 53,58 ****
--- 53,201 ----
  	return NULL;
  }
  
+ /*
+ *	ExecHashGetMCVPartition
+ *
+ *	returns MCV_INVALID_PARTITION if the hashvalue does not correspond
+ *   to any MCV partition or it corresponds to a partition that has been frozen
+ *	or MCVs are not being used
+ *
+ *   otherwise it returns the index of the MCV partition for this hashvalue
+ *
+ *	it is possible for a non-MCV tuple to hash to an MCV partition due to
+ *	the limited number of hash values but it is unlikely and everything
+ *	continues to work even if it does happen. we would accidentally prioritize
+ *	some less optimal tuples in memory but the result would be accurate
+ *
+ *	hashtable->mostCommonTuplePartition is an open addressing hashtable of
+ *	MCV partitions (HashJoinMostCommonValueTuplePartition)
+ */
+ int 
+ ExecHashGetMCVPartition(HashJoinTable hashtable, uint32 hashvalue)
+ {
+ 	int bucket;
+ 
+ 	if (!hashtable->usingMostCommonValues)
+ 		return MCV_INVALID_PARTITION;
+ 	
+ 	/* modulo the hashvalue (using bitmask) to find the MCV partition */
+ 	bucket = hashvalue & (hashtable->nMostCommonTuplePartitionHashBuckets - 1);
+ 
+ 	/*
+ 	 * while we have not hit a hole in the hashtable and have not hit the actual partition
+ 	 * we have collided in the hashtable so try the next partition slot
+ 	 */
+ 	while (hashtable->mostCommonTuplePartition[bucket].hashvalue != 0
+ 		&& hashtable->mostCommonTuplePartition[bucket].hashvalue != hashvalue)
+ 		bucket = (bucket + 1) & (hashtable->nMostCommonTuplePartitionHashBuckets - 1);
+ 
+ 	/* if the partition is not frozen and has been correctly determined return the partition index */
+ 	if (!hashtable->mostCommonTuplePartition[bucket].frozen && hashtable->mostCommonTuplePartition[bucket].hashvalue == hashvalue)
+ 		return bucket;
+ 
+ 	/* must have run into an empty slot which means this is not an MCV*/
+ 	return MCV_INVALID_PARTITION;
+ }
+ 
+ /*
+ *	ExecHashFreezeNextMCVPartition
+ *
+ *	flush the tuples of the next MCV partition by pushing them into the main hashtable
+ */
+ static bool 
+ ExecHashFreezeNextMCVPartition(HashJoinTable hashtable) {
+ 	/*
+ 	 * calculate the flushOrderedMostCommonTuplePartition index of
+ 	 * the partition to flush. not to be confused with the index of
+ 	 * the partition in the MCV partitions hashtable
+ 	 */
+ 	int partitionToFlush = hashtable->nMostCommonTuplePartitions
+ 		- 1 - hashtable->nMostCommonTuplePartitionsFlushed;
+ 	int			bucketno;
+ 	int			batchno;
+ 	uint32		hashvalue;
+ 	HashJoinTuple hashTuple;
+ 	HashJoinTuple nextHashTuple;
+ 	HashJoinMostCommonValueTuplePartition *partition;
+ 	MinimalTuple mintuple;
+ 
+ 	/* if all MCV partitions have already been flushed */
+ 	if (partitionToFlush < 0)
+ 		return false;
+ 
+ 	/* grab a pointer to the actual MCV partition */
+ 	partition = hashtable->flushOrderedMostCommonTuplePartition[partitionToFlush];
+ 	hashvalue = partition->hashvalue;
+ 
+ 	Assert(hashvalue != 0);
+ 
+ 	/* grab a pointer to the first tuple in the soon to be frozen MCV partition */
+ 	hashTuple = partition->tuples;
+ 
+ 	/*
+ 	 * calculate which bucket and batch the tuples belong to in the main
+ 	 * non-MCV hashtable
+ 	 */
+ 	ExecHashGetBucketAndBatch(hashtable, hashvalue,
+ 							  &bucketno, &batchno);
+ 
+ 	/* until we have read all tuples from this partition */
+ 	while (hashTuple != NULL)
+ 	{
+ 		/* decide whether to put the tuple in the hash table or a temp file */
+ 		if (batchno == hashtable->curbatch)
+ 		{
+ 			/* put the tuple in hash table */
+ 			nextHashTuple = hashTuple->next;
+ 			hashTuple->next = hashtable->buckets[bucketno];
+ 			hashtable->buckets[bucketno] = hashTuple;
+ 			
+ 			hashTuple = nextHashTuple;
+ 			hashtable->totalTuples++;
+ 			hashtable->nMostCommonTuplesStored--;
+ 			
+ 			/*
+ 			 * since the tuple is not being removed from memory we may still
+ 			 * be using too much memory.  if this is the first tuple flushed
+ 			 * then we definitely must try to free some memory but we must test
+ 			 * again for subsequent tuples in case more memory must be freed.
+ 			 */
+ 			if (hashtable->spaceUsed > hashtable->spaceAllowed)
+ 			{
+ 				ExecHashIncreaseNumBatches(hashtable);
+ 				/* batchno may have changed due to increase in batches */
+ 				ExecHashGetBucketAndBatch(hashtable, hashvalue,
+ 					&bucketno, &batchno);
+ 			}
+ 		}
+ 		else
+ 		{
+ 			/* put the tuples into a temp file for later batches */
+ 			Assert(batchno > hashtable->curbatch);
+ 			mintuple = HJTUPLE_MINTUPLE(hashTuple);
+ 			ExecHashJoinSaveTuple(mintuple,
+ 								  hashvalue,
+ 								  &hashtable->innerBatchFile[batchno]);
+ 			/*
+ 			 * some memory has been freed up. this must be done before we
+ 			 * pfree the hashTuple or we lose access to the tuple size
+ 			 */
+ 			hashtable->spaceUsed -= HJTUPLE_OVERHEAD + mintuple->t_len;
+ 			nextHashTuple = hashTuple->next;
+ 			pfree(hashTuple);
+ 			hashTuple = nextHashTuple;
+ 			hashtable->totalTuples++;
+ 			hashtable->nMostCommonTuplesStored--;
+ 		}
+ 	}
+ 
+ 	partition->frozen = true;
+ 	partition->tuples = NULL;
+ 	hashtable->nMostCommonTuplePartitionsFlushed++;
+ 
+ 	return true;
+ }
+ 
  /* ----------------------------------------------------------------
   *		MultiExecHash
   *
***************
*** 69,74 ****
--- 212,219 ----
  	TupleTableSlot *slot;
  	ExprContext *econtext;
  	uint32		hashvalue;
+ 	MinimalTuple mintuple;
+ 	int partitionNumber;
  
  	/* must provide our own instrumentation support */
  	if (node->ps.instrument)
***************
*** 99,106 ****
  		if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
  								 &hashvalue))
  		{
! 			ExecHashTableInsert(hashtable, slot, hashvalue);
! 			hashtable->totalTuples += 1;
  		}
  	}
  
--- 244,279 ----
  		if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
  								 &hashvalue))
  		{
! 			partitionNumber = ExecHashGetMCVPartition(hashtable, hashvalue);
! 
! 			/* if this tuple belongs in an MCV partition */
! 			if (partitionNumber != MCV_INVALID_PARTITION)
! 			{
! 				HashJoinTuple hashTuple;
! 				int			hashTupleSize;
! 				
! 				/* get the HashJoinTuple */
! 				mintuple = ExecFetchSlotMinimalTuple(slot);
! 				hashTupleSize = HJTUPLE_OVERHEAD + mintuple->t_len;
! 				hashTuple = (HashJoinTuple) palloc(hashTupleSize);
! 				hashTuple->hashvalue = hashvalue;
! 				memcpy(HJTUPLE_MINTUPLE(hashTuple), mintuple, mintuple->t_len);
! 
! 				/* push the HashJoinTuple onto the front of the MCV partition tuple list */
! 				hashTuple->next = hashtable->mostCommonTuplePartition[partitionNumber].tuples;
! 				hashtable->mostCommonTuplePartition[partitionNumber].tuples = hashTuple;
! 				
! 				/* move memory is now in use so make sure we are not over memory */
! 				hashtable->spaceUsed += hashTupleSize;
! 				hashtable->nMostCommonTuplesStored++;
! 				while (hashtable->spaceUsed > hashtable->spaceAllowed && ExecHashFreezeNextMCVPartition(hashtable))
! 					;
! 			}
! 			else
! 			{
! 				ExecHashTableInsert(hashtable, slot, hashvalue);
! 				hashtable->totalTuples += 1;
! 			}
  		}
  	}
  
***************
*** 269,274 ****
--- 442,454 ----
  	hashtable->outerBatchFile = NULL;
  	hashtable->spaceUsed = 0;
  	hashtable->spaceAllowed = work_mem * 1024L;
+ 	/* initialize MCV related hashtable variables */
+ 	hashtable->usingMostCommonValues = false;
+ 	hashtable->nMostCommonTuplePartitions = 0;
+ 	hashtable->nMostCommonTuplePartitionHashBuckets = 0;
+ 	hashtable->nMostCommonTuplesStored = 0;
+ 	hashtable->mostCommonTuplePartition = NULL;
+ 	hashtable->nMostCommonTuplePartitionsFlushed = 0;
  
  	/*
  	 * Get info about the hash functions to be used for each hash key. Also
***************
*** 566,571 ****
--- 746,752 ----
  				ExecHashJoinSaveTuple(HJTUPLE_MINTUPLE(tuple),
  									  tuple->hashvalue,
  									  &hashtable->innerBatchFile[batchno]);
+ 
  				/* and remove from hash table */
  				if (prevtuple)
  					prevtuple->next = nexttuple;
***************
*** 798,803 ****
--- 979,1046 ----
  }
  
  /*
+  * ExecScanHashMostCommonTuples
+  *		scan a MCV partition for matches to the current outer tuple
+  *
+  * The current outer tuple must be stored in econtext->ecxt_outertuple.
+  */
+ HashJoinTuple
+ ExecScanHashMostCommonTuples(HashJoinState *hjstate,
+ 				   ExprContext *econtext)
+ {
+ 	List	   *hjclauses = hjstate->hashclauses;
+ 	HashJoinTable hashtable = hjstate->hj_HashTable;
+ 	HashJoinTuple hashTuple = hjstate->hj_CurTuple;
+ 	uint32		hashvalue = hjstate->hj_CurHashValue;
+ 
+ 	/*
+ 	 * hj_CurTuple is NULL to start scanning a new partition, or the address of
+ 	 * the last tuple returned from the current partition.
+ 	 */
+ 	if (hashTuple == NULL)
+ 	{
+ 		/* painstakingly make sure this is a valid partition index */
+ 		Assert(hjstate->hj_OuterTupleMostCommonValuePartition > MCV_INVALID_PARTITION);
+ 		Assert(hjstate->hj_OuterTupleMostCommonValuePartition < hashtable->nMostCommonTuplePartitionHashBuckets);
+ 		Assert(hashtable->mostCommonTuplePartition[hjstate->hj_OuterTupleMostCommonValuePartition].hashvalue != 0);
+ 
+ 		hashTuple = hashtable->mostCommonTuplePartition[hjstate->hj_OuterTupleMostCommonValuePartition].tuples;
+ 	}
+ 	else
+ 		hashTuple = hashTuple->next;
+ 
+ 	while (hashTuple != NULL)
+ 	{
+ 		if (hashTuple->hashvalue == hashvalue)
+ 		{
+ 			TupleTableSlot *inntuple;
+ 
+ 			/* insert hashtable's tuple into exec slot so ExecQual sees it */
+ 			inntuple = ExecStoreMinimalTuple(HJTUPLE_MINTUPLE(hashTuple),
+ 											 hjstate->hj_HashTupleSlot,
+ 											 false);	/* do not pfree */
+ 			econtext->ecxt_innertuple = inntuple;
+ 
+ 			/* reset temp memory each time to avoid leaks from qual expr */
+ 			ResetExprContext(econtext);
+ 
+ 			if (ExecQual(hjclauses, econtext, false))
+ 			{
+ 				hjstate->hj_CurTuple = hashTuple;
+ 				return hashTuple;
+ 			}
+ 		}
+ 
+ 		hashTuple = hashTuple->next;
+ 	}
+ 
+ 	/*
+ 	 * no match
+ 	 */
+ 	return NULL;
+ }
+ 
+ /*
   * ExecScanHashBucket
   *		scan a hash bucket for matches to the current outer tuple
   *
Index: src/backend/executor/nodeHashjoin.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHashjoin.c,v
retrieving revision 1.96
diff -c -r1.96 nodeHashjoin.c
*** src/backend/executor/nodeHashjoin.c	23 Oct 2008 14:34:34 -0000	1.96
--- src/backend/executor/nodeHashjoin.c	29 Dec 2008 04:34:46 -0000
***************
*** 20,25 ****
--- 20,30 ----
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
  #include "utils/memutils.h"
+ #include "optimizer/cost.h"
+ #include "utils/syscache.h"
+ #include "utils/lsyscache.h"
+ #include "parser/parsetree.h"
+ #include "catalog/pg_statistic.h"
  
  
  /* Returns true for JOIN_LEFT and JOIN_ANTI jointypes */
***************
*** 34,39 ****
--- 39,159 ----
  						  TupleTableSlot *tupleSlot);
  static int	ExecHashJoinNewBatch(HashJoinState *hjstate);
  
+ /*
+  *	ExecHashJoinGetMostCommonValues
+  *
+  *	If MCV statistics can be found for the probe-side join attribute of
+  *	this hashjoin, set up the in-memory MCV partitions that will hold the
+  *	matching build relation tuples.
+  *
+  */
+ static void 
+ ExecHashJoinGetMostCommonValues(EState *estate, HashJoinState *hjstate)
+ {
+ 	HeapTupleData *statsTuple;
+ 	FuncExprState *clause;
+ 	ExprState *argstate;
+ 	Var *variable;
+ 
+ 	Datum	   *values;
+ 	int			nvalues;
+ 	float4	   *numbers;
+ 	int			nnumbers;
+ 
+ 	Oid relid;
+ 	Oid atttype;
+ 
+ 	int i;
+ 
+ 	/* Only use statistics if there is a single join attribute. */
+ 	if (hjstate->hashclauses->length != 1)
+ 		return; /* histojoin is not defined for more than one join key so run away */
+ 
+ 	/* Determine the relation id and attribute id of the single join attribute of the probe relation. */
+ 	
+ 	clause = (FuncExprState *) lfirst(list_head(hjstate->hashclauses));
+ 	argstate = (ExprState *) lfirst(list_head(clause->args));
+ 
+ 	/* Do not try to exploit stats if the join attribute is an expression instead of just a simple attribute. */		
+ 	if (argstate->expr->type != T_Var)
+ 		return;
+ 
+ 	variable = (Var *) argstate->expr;
+ 	relid = getrelid(variable->varnoold, estate->es_range_table);
+ 
+ 	/* grab the necessary properties of the join variable */
+ 	atttype = variable->vartype;
+ 
+ 	statsTuple = SearchSysCache(STATRELATT, ObjectIdGetDatum(relid),
+ 		Int16GetDatum(variable->varoattno), 0, 0);
+ 
+ 	if (!HeapTupleIsValid(statsTuple))
+ 		return;
+ 
+ 	/* if there are MCV statistics for the attribute */
+ 	if (get_attstatsslot(statsTuple,
+ 		atttype, variable->vartypmod,
+ 		STATISTIC_KIND_MCV, InvalidOid,
+ 		&values, &nvalues,
+ 		&numbers, &nnumbers))
+ 	{
+ 		MemoryContext oldcxt;
+ 		HashJoinTable hashtable;
+ 		FmgrInfo   *hashfunctions;
+ 		/* MCV Partitions is an open addressing hashtable with a 
+ 		power of 2 size greater than the number of MCV values. */
+ 		int nbuckets = 2;
+ 
+ 		while (nbuckets <= nvalues)
+ 			nbuckets <<= 1;
+ 		/* use two more bits just to help avoid collisions */
+ 		nbuckets <<= 2;
+ 
+ 		hashtable = hjstate->hj_HashTable;
+ 		hashtable->usingMostCommonValues = true;
+ 		hashtable->nMostCommonTuplePartitionHashBuckets = nbuckets;
+ 
+ 		/* allocate the partition memory in the hashtable's memory context */
+ 		oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
+ 		hashtable->mostCommonTuplePartition = palloc0(nbuckets * sizeof(HashJoinMostCommonValueTuplePartition));
+ 		hashtable->flushOrderedMostCommonTuplePartition = palloc0(nvalues * sizeof(HashJoinMostCommonValueTuplePartition*));
+ 		MemoryContextSwitchTo(oldcxt);
+ 
+ 		/* grab the hash functions as we will be generating the hashvalues here */
+ 		hashfunctions = hashtable->outer_hashfunctions;
+ 
+ 		/* create the partitions */
+ 		for (i = 0; i < nvalues; i++)
+ 		{
+ 			uint32 hashvalue = DatumGetUInt32(FunctionCall1(&hashfunctions[0], values[i]));
+ 			int bucket = hashvalue & (nbuckets - 1);
+ 
+ 			/*
+ 			 * while we have not hit a hole in the hashtable and have not hit a partition
+ 			 * with the same hashvalue we have collided in the hashtable so try the next
+ 			 * partition slot (remember it is an open addressing hashtable)
+ 			 */
+ 			while (hashtable->mostCommonTuplePartition[bucket].hashvalue != 0
+ 				&& hashtable->mostCommonTuplePartition[bucket].hashvalue != hashvalue)
+ 				bucket = (bucket + 1) & (nbuckets - 1);
+ 
+ 			/*
+ 			 * leave partition alone if it has the same hashvalue as current MCV.  
+ 			 * we only want one partition per hashvalue. even if two MCV values
+ 			 * hash to the same partition we are fine
+ 			 */
+ 			if (hashtable->mostCommonTuplePartition[bucket].hashvalue != hashvalue)
+ 			{
+ 				hashtable->mostCommonTuplePartition[bucket].hashvalue = hashvalue;
+ 				hashtable->flushOrderedMostCommonTuplePartition[hashtable->nMostCommonTuplePartitions] = &hashtable->mostCommonTuplePartition[bucket];
+ 				hashtable->nMostCommonTuplePartitions++;
+ 			}
+ 		}
+ 
+ 		free_attstatsslot(atttype, values, nvalues, numbers, nnumbers);
+ 	}
+ 
+ 	ReleaseSysCache(statsTuple);
+ }
  
  /* ----------------------------------------------------------------
   *		ExecHashJoin
***************
*** 147,152 ****
--- 267,276 ----
  										node->hj_HashOperators);
  		node->hj_HashTable = hashtable;
  
+ 		/* we don't bother exploiting MCVs if we can do the entire join in memory */
+ 		if (hashtable->nbatch > 1)
+ 			ExecHashJoinGetMostCommonValues(estate, node);
+ 
  		/*
  		 * execute the Hash node, to build the hash table
  		 */
***************
*** 157,163 ****
  		 * If the inner relation is completely empty, and we're not doing an
  		 * outer join, we can quit without scanning the outer relation.
  		 */
! 		if (hashtable->totalTuples == 0 && !HASHJOIN_IS_OUTER(node))
  			return NULL;
  
  		/*
--- 281,287 ----
  		 * If the inner relation is completely empty, and we're not doing an
  		 * outer join, we can quit without scanning the outer relation.
  		 */
! 		if (hashtable->totalTuples == 0 && hashtable->nMostCommonTuplesStored == 0 && !HASHJOIN_IS_OUTER(node))
  			return NULL;
  
  		/*
***************
*** 205,227 ****
  			ExecHashGetBucketAndBatch(hashtable, hashvalue,
  									  &node->hj_CurBucketNo, &batchno);
  			node->hj_CurTuple = NULL;
  
! 			/*
! 			 * Now we've got an outer tuple and the corresponding hash bucket,
! 			 * but this tuple may not belong to the current batch.
! 			 */
! 			if (batchno != hashtable->curbatch)
  			{
  				/*
! 				 * Need to postpone this outer tuple to a later batch. Save it
! 				 * in the corresponding outer-batch file.
  				 */
! 				Assert(batchno > hashtable->curbatch);
! 				ExecHashJoinSaveTuple(ExecFetchSlotMinimalTuple(outerTupleSlot),
! 									  hashvalue,
! 									  &hashtable->outerBatchFile[batchno]);
! 				node->hj_NeedNewOuter = true;
! 				continue;		/* loop around for a new outer tuple */
  			}
  		}
  
--- 329,356 ----
  			ExecHashGetBucketAndBatch(hashtable, hashvalue,
  									  &node->hj_CurBucketNo, &batchno);
  			node->hj_CurTuple = NULL;
+ 			
+ 			node->hj_OuterTupleMostCommonValuePartition = ExecHashGetMCVPartition(hashtable, hashvalue);
  
! 			if (node->hj_OuterTupleMostCommonValuePartition == MCV_INVALID_PARTITION)
  			{
  				/*
! 				 * Now we've got an outer tuple and the corresponding hash bucket,
! 				 * but this tuple may not belong to the current batch.
  				 */
! 				if (batchno != hashtable->curbatch)
! 				{
! 					/*
! 					 * Need to postpone this outer tuple to a later batch. Save it
! 					 * in the corresponding outer-batch file.
! 					 */
! 					Assert(batchno > hashtable->curbatch);
! 					ExecHashJoinSaveTuple(ExecFetchSlotMinimalTuple(outerTupleSlot),
! 										  hashvalue,
! 										  &hashtable->outerBatchFile[batchno]);
! 					node->hj_NeedNewOuter = true;
! 					continue;		/* loop around for a new outer tuple */
! 				}
  			}
  		}
  
***************
*** 230,236 ****
  		 */
  		for (;;)
  		{
! 			curtuple = ExecScanHashBucket(node, econtext);
  			if (curtuple == NULL)
  				break;			/* out of matches */
  
--- 359,373 ----
  		 */
  		for (;;)
  		{
! 			/* if the tuple hashed to an MCV partition then scan the MCV tuples */
! 			if (node->hj_OuterTupleMostCommonValuePartition != MCV_INVALID_PARTITION)
! 			{
! 				curtuple = ExecScanHashMostCommonTuples(node, econtext);
! 			}
! 			else /* otherwise scan the standard hashtable buckets */
! 			{
! 				curtuple = ExecScanHashBucket(node, econtext);
! 			}
  			if (curtuple == NULL)
  				break;			/* out of matches */
  
Index: src/include/executor/hashjoin.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/hashjoin.h,v
retrieving revision 1.48
diff -c -r1.48 hashjoin.h
*** src/include/executor/hashjoin.h	1 Jan 2008 19:45:57 -0000	1.48
--- src/include/executor/hashjoin.h	29 Dec 2008 01:39:50 -0000
***************
*** 72,77 ****
--- 72,85 ----
  #define HJTUPLE_MINTUPLE(hjtup)  \
  	((MinimalTuple) ((char *) (hjtup) + HJTUPLE_OVERHEAD))
  
+ typedef struct HashJoinMostCommonValueTuplePartition
+ {
+ 	uint32 hashvalue;
+ 	bool frozen;
+ 	HashJoinTuple tuples;
+ } HashJoinMostCommonValueTuplePartition;
+ 
+ #define MCV_INVALID_PARTITION -1
  
  typedef struct HashJoinTableData
  {
***************
*** 116,121 ****
--- 124,141 ----
  
  	MemoryContext hashCxt;		/* context for whole-hash-join storage */
  	MemoryContext batchCxt;		/* context for this-batch-only storage */
+ 	
+ 	bool usingMostCommonValues;	/* will the join use MCV partitions */
+ 	HashJoinMostCommonValueTuplePartition *mostCommonTuplePartition; /* hashtable of MCV partitions */
+ 	/*
+ 	 * array of pointers to the MCV partitions hashtable buckets in the opposite order that
+ 	 * they would be flushed to disk
+ 	 */
+ 	HashJoinMostCommonValueTuplePartition **flushOrderedMostCommonTuplePartition;
+ 	int nMostCommonTuplePartitionHashBuckets; /* # of buckets in the MCV partitions hashtable */
+ 	int nMostCommonTuplePartitions; /* # of actual partitions in the MCV partitions hashtable */
+ 	int nMostCommonTuplePartitionsFlushed; /* # of MCV partitions that have already been flushed to disk */
+ 	uint32 nMostCommonTuplesStored; /* total # of build tuples currently stored in the MCV partitions */
  } HashJoinTableData;
  
  #endif   /* HASHJOIN_H */
Index: src/include/executor/nodeHash.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/nodeHash.h,v
retrieving revision 1.45
diff -c -r1.45 nodeHash.h
*** src/include/executor/nodeHash.h	1 Jan 2008 19:45:57 -0000	1.45
--- src/include/executor/nodeHash.h	23 Dec 2008 07:54:03 -0000
***************
*** 44,48 ****
--- 44,51 ----
  extern void ExecChooseHashTableSize(double ntuples, int tupwidth,
  						int *numbuckets,
  						int *numbatches);
+ 						
+ extern HashJoinTuple ExecScanHashMostCommonTuples(HashJoinState *hjstate, ExprContext *econtext);
+ extern int ExecHashGetMCVPartition(HashJoinTable hashtable, uint32 hashvalue);
  
  #endif   /* NODEHASH_H */
Index: src/include/nodes/execnodes.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/nodes/execnodes.h,v
retrieving revision 1.197
diff -c -r1.197 execnodes.h
*** src/include/nodes/execnodes.h	28 Dec 2008 18:54:00 -0000	1.197
--- src/include/nodes/execnodes.h	29 Dec 2008 04:40:02 -0000
***************
*** 1381,1386 ****
--- 1381,1387 ----
   *		hj_NeedNewOuter			true if need new outer tuple on next call
   *		hj_MatchedOuter			true if found a join match for current outer
   *		hj_OuterNotEmpty		true if outer relation known not empty
+  *		hj_OuterTupleMostCommonValuePartition	partition# for the current outer tuple
   * ----------------
   */
  
***************
*** 1406,1411 ****
--- 1407,1413 ----
  	bool		hj_NeedNewOuter;
  	bool		hj_MatchedOuter;
  	bool		hj_OuterNotEmpty;
+ 	int		hj_OuterTupleMostCommonValuePartition;
  } HashJoinState;
  
  
#39Robert Haas
robertmhaas@gmail.com
In reply to: Bryce Cutt (#38)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Tue, Dec 30, 2008 at 12:29 AM, Bryce Cutt <pandasuit@gmail.com> wrote:

Here is the next patch version.

Thanks for posting this update. This is definitely getting better,
but I still see some style issues. We can work on fixing those once
the rest of the details have been finalized.

However, one question in this area - isn't
ExecHashFreezeNextMCVPartition actually a most common TUPLE partition,
rather than a most common VALUE partition (and similarly for
ExecHashGetMCVPartition)? I'm not quite sure what to do about this as
the names are already quite long - is there some better name for the
functions and structure members than MostCommonTuplePartition? Maybe
we could call it the in-memory partition and abbreviate it IMPartition
throughout. I think that might make things more clear.

The code can now find the MCVs in more cases. Even if the probe
side is an operator other than a seq scan (such as another hashjoin)
the code can now find the stats tuple for the underlying relation.

You're using varnoold in a way that directly contradicts the comment
in primnodes.h (essentially, that it's not used for anything other
than debugging). I don't think this is a bad thing, but you have to
patch the comment.

Have you done any performance testing on the impact of this change?

The new idea of limiting the number of MCVs to a percentage of memory
has not been added yet.

That's a pretty important change, I think, though it would be nice to
have one of the committers chime in here. For those who may not have
been following the thread closely, the current implementation's memory
usage can go quite a bit higher than work_mem: the in-memory open
hash table itself can be up to 1MB or so (if statistics_target = 10K), it
can hold up to work_mem worth of tuples, and each batch can contain
another work_mem worth of tuples. The proposal is to carve out 1-3% of
work_mem for the in-memory hash table and leave the rest for the
batches, thus hopefully not affecting the # of batches very much. If
it doesn't look like the whole MCV list will fit, we'll take a shot at
guessing what length prefix of it will.
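
To make the arithmetic concrete, here is a rough, self-contained sketch
of that budgeting idea. It is not code from the patch; the 2% share and
the 64-byte per-entry overhead are illustrative assumptions only.

#include <stdio.h>

#define IM_SHARE_PERCENT 2			/* assumed share of work_mem for the MCV table */
#define PER_MCV_BYTES(tupwidth) ((tupwidth) + 64)	/* tuple plus assumed bookkeeping */

/* How long a prefix of the MCV list can we afford to keep in memory? */
static int
usable_mcv_prefix(long work_mem_bytes, int tuple_width, int n_mcvs)
{
	long		budget = work_mem_bytes * IM_SHARE_PERCENT / 100;
	int			fit = (int) (budget / PER_MCV_BYTES(tuple_width));

	return (fit < n_mcvs) ? fit : n_mcvs;
}

int
main(void)
{
	/* 1 MB work_mem, 100-byte tuples, statistics_target = 10000 MCVs */
	printf("MCVs kept in memory: %d\n",
		   usable_mcv_prefix(1024L * 1024, 100, 10000));
	return 0;
}

With numbers like these only a short prefix of a 10K-entry MCV list
survives, which is exactly the "guess what length prefix will fit" step.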

...Robert

#40Bryce Cutt
pandasuit@gmail.com
In reply to: Lawrence, Ramon (#1)
1 attachment(s)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

The latest version of the patch is attached. The revision considerably
cleans up the code, especially variable naming consistency. We have
adopted the use of IM (in-memory) in variable names for the hash table
structures as suggested.

Two other implementation changes:

1) The overhead of the hash table has been reduced by allocating an
array of pointers instead of an array of structs and only allocating the
structs as they are needed to store MCVs. IM buckets are now frozen by
first removing all of their tuples and then deleting the struct from
memory. This frees more memory and also allows the removal of the frozen
field from the IM bucket struct, which makes that struct only 8 bytes on
a 32-bit machine. If for some reason all IM buckets are frozen, all IM
struct overhead is removed from memory to further reduce the memory
footprint.

2) This patch supports using a set percentage of work_mem (currently 2%)
to store the build tuples that join frequently with probe relation
tuples. The code only allocates MCVs up to that maximum and will flush
buckets from the in-memory hash table if the limit is ever exceeded. The
code also ensures that the overall join memory used (the MCV hash table
plus batch 0 in memory) does not exceed spaceAllowed as usual. If this 2%
of memory is not used by the MCV hash table then it can be used by batch
0. (A rough sketch of both changes follows below.)
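
A minimal, self-contained sketch of the two changes above. These are
simplified stand-ins, not the patch code: the IM* type names, the lazy
calloc of buckets, and the single demo insertion in main are illustrative
assumptions only.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define IM_SHARE_PERCENT 2

typedef struct IMTuple
{
	struct IMTuple *next;
	uint32_t	hashvalue;
} IMTuple;

typedef struct IMBucket			/* no frozen flag: 8 bytes on a 32-bit build */
{
	uint32_t	hashvalue;
	IMTuple    *tuples;
} IMBucket;

typedef struct IMState
{
	IMBucket  **buckets;		/* array of pointers; NULL = never used or frozen */
	int			nbuckets;
	size_t		spaceUsedIM;	/* bytes held by IM buckets and their tuples */
	size_t		spaceAllowedIM; /* the small slice of work_mem reserved for them */
} IMState;

static IMState *
im_create(int nbuckets, size_t work_mem_bytes)
{
	IMState    *state = calloc(1, sizeof(IMState));

	state->buckets = calloc(nbuckets, sizeof(IMBucket *));
	state->nbuckets = nbuckets;
	state->spaceAllowedIM = work_mem_bytes * IM_SHARE_PERCENT / 100;
	return state;
}

/* Lazily allocate a bucket the first time an MCV hash value claims a slot. */
static IMBucket *
im_bucket_for(IMState *state, int slot, uint32_t hashvalue)
{
	if (state->buckets[slot] == NULL)
	{
		state->buckets[slot] = calloc(1, sizeof(IMBucket));
		state->buckets[slot]->hashvalue = hashvalue;
		state->spaceUsedIM += sizeof(IMBucket);
	}
	return state->buckets[slot];
}

/* Charge a stored build tuple against the in-memory budget. */
static void
im_charge_tuple(IMState *state, size_t tuple_size)
{
	state->spaceUsedIM += tuple_size;
}

/*
 * Freeze one bucket: the real code first spills its tuples to the proper
 * batch file; here we only model the struct being freed so the slot reads
 * as if the bucket never existed.
 */
static void
im_freeze(IMState *state, int slot, size_t bytes_spilled)
{
	free(state->buckets[slot]);
	state->buckets[slot] = NULL;
	state->spaceUsedIM -= sizeof(IMBucket) + bytes_spilled;
}

int
main(void)
{
	IMState    *state = im_create(8, 1024L * 1024);	/* 1 MB work_mem */

	im_bucket_for(state, 3, 0xdeadbeef);
	im_charge_tuple(state, 120);
	if (state->spaceUsedIM > state->spaceAllowedIM)	/* over the 2% slice? */
		im_freeze(state, 3, 120);
	printf("IM bytes used: %zu of %zu allowed\n",
		   state->spaceUsedIM, state->spaceAllowedIM);
	return 0;
}

The point of the NULL-pointer convention is that a frozen bucket and a
never-created bucket look identical to the lookup path, which is what
lets the bucket struct drop its frozen flag.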

These changes mostly relate to style, although some of the cleanup
has made the code slightly faster.

We would really appreciate help on finalizing this patch, especially in
regard to style issues. Thank you for all the help.

- Dr. Ramon Lawrence and Bryce Cutt


On Sun, Jan 4, 2009 at 6:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:

1) Isn't ExecHashFreezeNextMCVPartition actually a most common TUPLE
partition, rather than a most common VALUE partition (and similarly for
ExecHashGetMCVPartition)?

A partition stores all tuples that correspond to that MCV value. It is
usually one for foreign key joins but may be more than one. (Plus, it
may store other tuples that have the same hash value for the join
attribute as the MCV value.)

I guess my point is - check that your variable/function/structure
member naming is consistent between different parts of the code. The
ExecHashGetMCVPartition function accesses structure members called
nMostCommonTuplePartitionHashBuckets, nMostCommonTuplePartition, and
mostCommonTuplePartition. It seems inconsistent that the function
name uses MCVPartition and
the structure members use mostCommonTuplePartition - aren't we talking
about the same thing in both cases?

And, more to the point, the terminology just seems wrong to me, the
more I think about it. I mean, ExecHashGetMCVPartition is not
finding a partition of the MCVs. It's finding a partition of an
in-memory hash table which we plan to populate with MCVs. That's why
I'm wondering if we should make it ExecHashGetIMPartition,
nIMPartitionHashBuckets, etc.

2) Have you done any performance testing on the impact of this change?

Yes, the ability to use MCVs for more than sequential scans
significantly improves performance in multi-join cases. Allocating only
1% of memory will not affect any performance results, as all our testing
was done with an MCV count of 10 or 100, which is significantly below a
1% allocation of work_mem. If anything,
performance would be improved when using more MCVs.

That is a very good thing.

Finally, any help you can provide on style concerns to make this easier
to commit would be appreciated. We will put all the effort required
over the next few days to get this into 8.4.

If I have time, I might be willing to make a style run over the next
version of the patch after you post it to the list, and just correct
anything I see and repost. This might be faster than sending comments
back and forth, if you are OK with it. I have a day job so this would
probably need to be Tuesday or Wednesday night. My main advice is
"read the diff before you post it". Sometimes things will just pop
out at you that are less obvious when you are head-down in the code.

Random stuff I notice in v4 patch: make sure all lines fit in 80
columns (except for long error messages if any), missing space before
closing comment delimiter in ExecHashGetMCVPartition, extraneous blank
line added to nodeHash.c just before the comment that says "and remove
from hash table", comment in ExecHashJoinGetMostCommonValues just
after the get_attstatsslot call is formatted strangely, still extra
curly braces around the calls to
ExecScanHashMostCommonValuePartition/ExecScanHashBucket.

...Robert

Attachments:

histojoin_v5.patchapplication/octet-stream; name=histojoin_v5.patchDownload
Index: src/backend/executor/nodeHash.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHash.c,v
retrieving revision 1.117
diff -c -r1.117 nodeHash.c
*** src/backend/executor/nodeHash.c	1 Jan 2009 17:23:41 -0000	1.117
--- src/backend/executor/nodeHash.c	6 Jan 2009 23:18:16 -0000
***************
*** 53,58 ****
--- 53,220 ----
  	return NULL;
  }
  
+ /*
+ *	ExecHashGetIMBucket
+ *
+ *	Returns IM_INVALID_BUCKET if the hashvalue does not correspond
+ *	to any IM bucket or it corresponds to a bucket that has been frozen
+ *	or skew optimization is not being used.
+ *
+ *	Otherwise it returns the index of the IM bucket for this hashvalue.
+ *
+ *	It is possible for a tuple whose join attribute value is not a MCV to
+ *	hash to an IM bucket due to the limited number of hash values but it is
+ *	unlikely and everything continues to work even if it does happen. We would
+ *	accidentally prioritize some less optimal tuples in memory but the join
+ *	result would still be accurate.
+ *
+ *	hashtable->imBucket is an open addressing hashtable of
+ *	IM buckets (HashJoinIMBucket).
+ */
+ int 
+ ExecHashGetIMBucket(HashJoinTable hashtable, uint32 hashvalue)
+ {
+ 	int bucket;
+ 
+ 	if (!hashtable->enableSkewOptimization)
+ 		return IM_INVALID_BUCKET;
+ 	
+ 	/* Modulo the hashvalue (using bitmask) to find the IM bucket. */
+ 	bucket = hashvalue & (hashtable->nIMBuckets - 1);
+ 
+ 	/*
+ 	 * While we have not hit a hole in the hashtable and have not hit the 
+ 	 * actual bucket we have collided in the hashtable so try the next
+ 	 * bucket location.
+ 	 */
+ 	while (hashtable->imBucket[bucket] != NULL
+ 		&& hashtable->imBucket[bucket]->hashvalue != hashvalue)
+ 		bucket = (bucket + 1) & (hashtable->nIMBuckets - 1);
+ 
+ 	/*
+ 	 * If the bucket exists and has been correctly determined return
+ 	 * the bucket index.
+ 	 */
+ 	if (hashtable->imBucket[bucket] != NULL
+ 		&& hashtable->imBucket[bucket]->hashvalue == hashvalue)
+ 		return bucket;
+ 
+ 	/*
+ 	 * Must have run into an empty location or a frozen bucket which means the
+ 	 * tuple with this hashvalue is not to be handled as if it matches with an
+ 	 * IM bucket.
+ 	 */
+ 	return IM_INVALID_BUCKET;
+ }
+ 
+ /*
+ *	ExecHashFreezeNextIMBucket
+ *
+ *	Freeze the tuples of the next IM bucket by pushing them into the main
+ *	hashtable.  Buckets are frozen in order so that the best tuples are kept
+ *	in memory the longest.
+ */
+ static bool 
+ ExecHashFreezeNextIMBucket(HashJoinTable hashtable) {
+ 	int						bucketToFreeze;
+ 	int						bucketno;
+ 	int						batchno;
+ 	uint32					hashvalue;
+ 	HashJoinTuple			hashTuple;
+ 	HashJoinTuple			nextHashTuple;
+ 	HashJoinIMBucket		*bucket;
+ 	MinimalTuple			mintuple;
+ 
+ 	/* Calculate the imBucket index of the bucket to freeze. */
+ 	bucketToFreeze = hashtable->imBucketFreezeOrder
+ 		[hashtable->nUsedIMBuckets - 1 - hashtable->nIMBucketsFrozen];
+ 
+ 	/* Grab a pointer to the actual IM bucket. */
+ 	bucket = hashtable->imBucket[bucketToFreeze];
+ 	hashvalue = bucket->hashvalue;
+ 
+ 	/*
+ 	 * Grab a pointer to the first tuple in the soon to be frozen IM bucket.
+ 	 */
+ 	hashTuple = bucket->tuples;
+ 
+ 	/*
+ 	 * Calculate which bucket and batch the tuples belong to in the main
+ 	 * non-IM hashtable.
+ 	 */
+ 	ExecHashGetBucketAndBatch(hashtable, hashvalue,
+ 							  &bucketno, &batchno);
+ 
+ 	/* until we have read all tuples from this bucket */
+ 	while (hashTuple != NULL)
+ 	{
+ 		/*
+ 		 * Some of this code is very similar to that of ExecHashTableInsert.
+ 		 * We do not call ExecHashTableInsert directly as
+ 		 * ExecHashTableInsert expects a TupleTableSlot and we already have
+ 		 * HashJoinTuples.
+ 		 */
+ 		
+ 		mintuple = HJTUPLE_MINTUPLE(hashTuple);
+ 
+ 		/* Decide whether to put the tuple in the hash table or a temp file. */
+ 		if (batchno == hashtable->curbatch)
+ 		{
+ 			/* Put the tuple in hash table. */
+ 			nextHashTuple = hashTuple->next;
+ 			hashTuple->next = hashtable->buckets[bucketno];
+ 			hashtable->buckets[bucketno] = hashTuple;
+ 			
+ 			hashTuple = nextHashTuple;
+ 			hashtable->spaceUsedIM -= HJTUPLE_OVERHEAD + mintuple->t_len;
+ 		}
+ 		else
+ 		{
+ 			/* Put the tuples into a temp file for later batches. */
+ 			Assert(batchno > hashtable->curbatch);
+ 			ExecHashJoinSaveTuple(mintuple, hashvalue,
+ 								  &hashtable->innerBatchFile[batchno]);
+ 			/*
+ 			 * Some memory has been freed up. This must be done before we
+ 			 * pfree the hashTuple or we lose access to the tuple size.
+ 			 */
+ 			hashtable->spaceUsed -= HJTUPLE_OVERHEAD + mintuple->t_len;
+ 			hashtable->spaceUsedIM -= HJTUPLE_OVERHEAD + mintuple->t_len;
+ 			nextHashTuple = hashTuple->next;
+ 			pfree(hashTuple);
+ 			hashTuple = nextHashTuple;
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Free the memory the bucket struct was using as it is not necessary
+ 	 * any more.  All code treats a frozen IM bucket the same as one that
+ 	 * did not exist so by setting the pointer to null the rest of the code
+ 	 * will function as if we had not created this IM bucket at all.
+ 	 */
+ 	pfree(bucket);
+ 	hashtable->imBucket[bucketToFreeze] = NULL;
+ 	hashtable->spaceUsed -= IM_BUCKET_OVERHEAD;
+ 	hashtable->spaceUsedIM -= IM_BUCKET_OVERHEAD;
+ 	hashtable->nIMBucketsFrozen++;
+ 
+ 	/*
+ 	 * All IM buckets have been frozen and deleted from memory so turn off
+ 	 * skew aware partitioning and remove the structs from memory as they are
+ 	 * just wasting space from now on.
+ 	 */
+ 	if (hashtable->nUsedIMBuckets == hashtable->nIMBucketsFrozen)
+ 	{
+ 		hashtable->enableSkewOptimization = false;
+ 		pfree(hashtable->imBucket);
+ 		pfree(hashtable->imBucketFreezeOrder);
+ 		hashtable->spaceUsed -= hashtable->spaceUsedIM;
+ 		hashtable->spaceUsedIM = 0;
+ 	}
+ 
+ 	return true;
+ }
+ 
  /* ----------------------------------------------------------------
   *		MultiExecHash
   *
***************
*** 69,74 ****
--- 231,238 ----
  	TupleTableSlot *slot;
  	ExprContext *econtext;
  	uint32		hashvalue;
+ 	MinimalTuple mintuple;
+ 	int bucketNumber;
  
  	/* must provide our own instrumentation support */
  	if (node->ps.instrument)
***************
*** 99,105 ****
  		if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
  								 &hashvalue))
  		{
! 			ExecHashTableInsert(hashtable, slot, hashvalue);
  			hashtable->totalTuples += 1;
  		}
  	}
--- 263,307 ----
  		if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
  								 &hashvalue))
  		{
! 			bucketNumber = ExecHashGetIMBucket(hashtable, hashvalue);
! 
! 			/* if this tuple does not belong in an IM bucket */
! 			if (bucketNumber == IM_INVALID_BUCKET)
! 				ExecHashTableInsert(hashtable, slot, hashvalue);
! 			else
! 			{
! 				HashJoinTuple hashTuple;
! 				int			hashTupleSize;
! 				
! 				/* get the HashJoinTuple */
! 				mintuple = ExecFetchSlotMinimalTuple(slot);
! 				hashTupleSize = HJTUPLE_OVERHEAD + mintuple->t_len;
! 				hashTuple
! 					= (HashJoinTuple) MemoryContextAlloc(hashtable->batchCxt,
! 														hashTupleSize);
! 				hashTuple->hashvalue = hashvalue;
! 				memcpy(HJTUPLE_MINTUPLE(hashTuple), mintuple
! 					, mintuple->t_len);
! 
! 				/* Push the HashJoinTuple onto the front of the IM bucket. */
! 				hashTuple->next 
! 					= hashtable->imBucket[bucketNumber]->tuples;
! 				hashtable->imBucket[bucketNumber]->tuples
! 					= hashTuple;
! 				
! 				/*
! 				 * More memory is now in use so make sure we are not over
! 				 * spaceAllowedIM.
! 				 */
! 				hashtable->spaceUsed += hashTupleSize;
! 				hashtable->spaceUsedIM += hashTupleSize;
! 				while (hashtable->spaceUsedIM > hashtable->spaceAllowedIM
! 					&& ExecHashFreezeNextIMBucket(hashtable))
! 					;
! 				/* Guarantee we are not over the spaceAllowed. */
! 				if (hashtable->spaceUsed > hashtable->spaceAllowed)
! 					ExecHashIncreaseNumBatches(hashtable);
! 			}
  			hashtable->totalTuples += 1;
  		}
  	}
***************
*** 269,274 ****
--- 471,485 ----
  	hashtable->outerBatchFile = NULL;
  	hashtable->spaceUsed = 0;
  	hashtable->spaceAllowed = work_mem * 1024L;
+ 	/* Initialize skew optimization related hashtable variables. */
+ 	hashtable->spaceUsedIM = 0;
+ 	hashtable->spaceAllowedIM
+ 		= hashtable->spaceAllowed * IM_WORK_MEM_PERCENT / 100;
+ 	hashtable->enableSkewOptimization = false;
+ 	hashtable->nUsedIMBuckets = 0;
+ 	hashtable->nIMBuckets = 0;
+ 	hashtable->imBucket = NULL;
+ 	hashtable->nIMBucketsFrozen = 0;
  
  	/*
  	 * Get info about the hash functions to be used for each hash key. Also
***************
*** 815,823 ****
  	/*
  	 * hj_CurTuple is NULL to start scanning a new bucket, or the address of
  	 * the last tuple returned from the current bucket.
  	 */
  	if (hashTuple == NULL)
! 		hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];
  	else
  		hashTuple = hashTuple->next;
  
--- 1026,1041 ----
  	/*
  	 * hj_CurTuple is NULL to start scanning a new bucket, or the address of
  	 * the last tuple returned from the current bucket.
+ 	 *
+ 	 * If the tuple hashed to an IM bucket then scan the IM bucket
+ 	 * otherwise scan the standard hashtable bucket.
  	 */
  	if (hashTuple == NULL)
! 		if (hjstate->hj_OuterTupleIMBucketNo != IM_INVALID_BUCKET)
! 			hashTuple = hashtable->imBucket[hjstate->hj_OuterTupleIMBucketNo]
! 											->tuples;
! 		else
! 			hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];
  	else
  		hashTuple = hashTuple->next;
  
Index: src/backend/executor/nodeHashjoin.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHashjoin.c,v
retrieving revision 1.97
diff -c -r1.97 nodeHashjoin.c
*** src/backend/executor/nodeHashjoin.c	1 Jan 2009 17:23:41 -0000	1.97
--- src/backend/executor/nodeHashjoin.c	6 Jan 2009 23:51:01 -0000
***************
*** 20,25 ****
--- 20,29 ----
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
  #include "utils/memutils.h"
+ #include "utils/syscache.h"
+ #include "utils/lsyscache.h"
+ #include "parser/parsetree.h"
+ #include "catalog/pg_statistic.h"
  
  
  /* Returns true for JOIN_LEFT and JOIN_ANTI jointypes */
***************
*** 34,39 ****
--- 38,234 ----
  						  TupleTableSlot *tupleSlot);
  static int	ExecHashJoinNewBatch(HashJoinState *hjstate);
  
+ /*
+  *	ExecHashJoinDetectSkew
+  *
+  *	If MCV statistics can be found for the join attribute of this hashjoin
+  *	then create a hash table of buckets. Each bucket will correspond to
+  *	a MCV hashvalue and will be filled with inner relation tuples whose join
+  *	attribute hashes to the same value as that MCV.  If a join attribute
+  *	value is a MCV for the join attribute in the outer (probe) relation,
+  *	tuples with this value in the inner (build) relation are more likely to
+  *	join with outer relation tuples and a benefit can be gained by keeping
+  *	them in memory while joining the first batch of tuples.
+  */
+ static void 
+ ExecHashJoinDetectSkew(EState *estate, HashJoinState *hjstate, int tupwidth)
+ {
+ 	HeapTupleData	*statsTuple;
+ 	FuncExprState	*clause;
+ 	ExprState		*argstate;
+ 	Var				*variable;
+ 	HashJoinTable	hashtable;
+ 	Datum			*values;
+ 	int				nvalues;
+ 	float4			*numbers;
+ 	int				nnumbers;
+ 	Oid				relid;
+ 	Oid				atttype;
+ 	int				i;
+ 	int				mcvsToUse;
+ 
+ 	/* Only use statistics if there is a single join attribute. */
+ 	if (hjstate->hashclauses->length != 1)
+ 		return; /* Histojoin is not defined for more than one join key */
+ 	
+ 	hashtable = hjstate->hj_HashTable;
+ 	
+ 	/*
+ 	 * Estimate the number of IM buckets that will fit in
+ 	 * the memory allowed for IM buckets.
+ 	 *
+ 	 * hashtable->imBucket will have up to 8 times as many HashJoinIMBucket
+ 	 * pointers as the number of MCV hashvalues. A uint16 index in
+ 	 * hashtable->imBucketFreezeOrder will be created for each IM bucket. One
+ 	 * actual HashJoinIMBucket struct will be created for each
+ 	 * unique MCV hashvalue so up to one struct per MCV.
+ 	 *
+ 	 * It is also estimated that each IM bucket will have a single build
+ 	 * tuple stored in it after partitioning the build relation input.  This
+ 	 * estimate could be high if tuples are filtered out before this join but
+ 	 * in that case the extra memory is used by the regular hashjoin batch.
+ 	 * This estimate could be low if it is a many to many join but in that
+ 	 * case IM buckets will be frozen to free up memory as needed
+ 	 * during the inner relation partitioning phase.
+ 	 */
+ 	mcvsToUse = hashtable->spaceAllowedIM / (
+ 		/* size of a hash tuple */
+ 		HJTUPLE_OVERHEAD + MAXALIGN(sizeof(MinimalTupleData))
+ 			+ MAXALIGN(tupwidth)
+ 		/* max size of hashtable pointers per MCV */
+ 		+ (8 * sizeof(HashJoinIMBucket*))
+ 		+ sizeof(uint16) /* size of imBucketFreezeOrder entry */
+ 		+ IM_BUCKET_OVERHEAD /* size of IM bucket struct */
+ 		);
+ 
+ 	/*
+ 	 * If we cannot fit any MCV tuples in memory then it is not necessary to
+ 	 * even look at the MCVs.
+ 	 */
+ 	if (mcvsToUse == 0)
+ 		return;
+ 
+ 	/*
+ 	 * Determine the relation id and attribute id of the single join
+ 	 * attribute of the probe relation.
+ 	 */
+ 
+ 	clause = (FuncExprState *) lfirst(list_head(hjstate->hashclauses));
+ 	argstate = (ExprState *) lfirst(list_head(clause->args));
+ 
+ 	/*
+ 	 * Do not try to exploit stats if the join attribute is an expression
+ 	 * instead of just a simple attribute.
+ 	 */		
+ 	if (argstate->expr->type != T_Var)
+ 		return;
+ 
+ 	variable = (Var *) argstate->expr;
+ 	relid = getrelid(variable->varnoold, estate->es_range_table);
+ 	atttype = variable->vartype;
+ 
+ 	statsTuple = SearchSysCache(STATRELATT, ObjectIdGetDatum(relid),
+ 		Int16GetDatum(variable->varoattno), 0, 0);
+ 
+ 	if (!HeapTupleIsValid(statsTuple))
+ 		return;
+ 
+ 	/* if there are MCV statistics for the attribute */
+ 	if (get_attstatsslot(statsTuple,
+ 		atttype, variable->vartypmod,
+ 		STATISTIC_KIND_MCV, InvalidOid,
+ 		&values, &nvalues,
+ 		&numbers, &nnumbers))
+ 	{
+ 		FmgrInfo   *hashfunctions;
+ 		/*
+ 		 * IM buckets (imBucket) is an open addressing hashtable with a 
+ 		 * power of 2 size that is greater than the number of MCV values.
+ 		 */
+ 		int nbuckets = 2;
+ 
+ 		if (mcvsToUse > nvalues)
+ 			mcvsToUse = nvalues;
+ 
+ 		while (nbuckets <= mcvsToUse)
+ 			nbuckets <<= 1;
+ 		/* use two more bits just to help avoid collisions */
+ 		nbuckets <<= 2;
+ 
+ 		hashtable->enableSkewOptimization = true;
+ 		hashtable->nIMBuckets = nbuckets;
+ 
+ 		/*
+ 		 * Allocate the bucket memory in the hashtable's batch context
+ 		 * because it is only relevant and necessary during the first batch
+ 		 * and will be nicely removed once the first batch is done.
+ 		 */
+ 		hashtable->imBucket = 
+ 			MemoryContextAllocZero(hashtable->batchCxt,
+ 				nbuckets * sizeof(HashJoinIMBucket*));
+ 		hashtable->imBucketFreezeOrder = 
+ 			MemoryContextAllocZero(hashtable->batchCxt,
+ 				mcvsToUse * sizeof(uint16));
+ 		/* Count the overhead of the IM pointers immediately. */
+ 		hashtable->spaceUsed += nbuckets * sizeof(HashJoinIMBucket*)
+ 			+ mcvsToUse * sizeof(uint16);
+ 		hashtable->spaceUsedIM +=  nbuckets * sizeof(HashJoinIMBucket*)
+ 			+ mcvsToUse * sizeof(uint16);
+ 
+ 		/*
+ 		 * Grab the hash functions as we will be generating the hashvalues
+ 		 * in this section.
+ 		 */
+ 		hashfunctions = hashtable->outer_hashfunctions;
+ 
+ 		/* Create the buckets */
+ 		for (i = 0; i < mcvsToUse; i++)
+ 		{
+ 			uint32 hashvalue = DatumGetUInt32(
+ 				FunctionCall1(&hashfunctions[0], values[i]));
+ 			int bucket = hashvalue & (nbuckets - 1);
+ 
+ 			/*
+ 			 * While we have not hit a hole in the hashtable and have not hit
+ 			 * a bucket with the same hashvalue we have collided in the
+ 			 * hashtable so try the next bucket location (remember it is an
+ 			 * open addressing hashtable).
+ 			 */
+ 			while (hashtable->imBucket[bucket] != NULL
+ 				&& hashtable->imBucket[bucket]->hashvalue != hashvalue)
+ 				bucket = (bucket + 1) & (nbuckets - 1);
+ 
+ 			/*
+ 			 * Leave bucket alone if it has the same hashvalue as current
+ 			 * MCV. We only want one bucket per hashvalue. Even if two MCV
+ 			 * values hash to the same bucket we are fine.
+ 			 */
+ 			if (hashtable->imBucket[bucket] == NULL)
+ 			{
+ 				/*
+ 				 * Allocate the actual bucket structure in the hashtable's batch
+ 				 * context because it is only relevant and necessary during
+ 				 * the first batch and will be nicely removed once the first
+ 				 * batch is done.
+ 				 */
+ 				hashtable->imBucket[bucket]
+ 					= MemoryContextAllocZero(hashtable->batchCxt,
+ 						sizeof(HashJoinIMBucket));
+ 				hashtable->imBucket[bucket]->hashvalue = hashvalue;
+ 				hashtable->imBucketFreezeOrder[hashtable->nUsedIMBuckets]
+ 					= bucket;
+ 				hashtable->nUsedIMBuckets++;
+ 				/* Count the overhead of the IM bucket struct */
+ 				hashtable->spaceUsed += IM_BUCKET_OVERHEAD;
+ 				hashtable->spaceUsedIM += IM_BUCKET_OVERHEAD;
+ 			}
+ 		}
+ 
+ 		free_attstatsslot(atttype, values, nvalues, numbers, nnumbers);
+ 	}
+ 
+ 	ReleaseSysCache(statsTuple);
+ }
  
  /* ----------------------------------------------------------------
   *		ExecHashJoin
***************
*** 147,152 ****
--- 342,352 ----
  										node->hj_HashOperators);
  		node->hj_HashTable = hashtable;
  
+ 		/* Use skew optimization only when there is more than one batch. */
+ 		if (hashtable->nbatch > 1)
+ 			ExecHashJoinDetectSkew(estate, node,
+ 				(outerPlan((Hash *) hashNode->ps.plan))->plan_width );
+ 
  		/*
  		 * execute the Hash node, to build the hash table
  		 */
***************
*** 205,216 ****
  			ExecHashGetBucketAndBatch(hashtable, hashvalue,
  									  &node->hj_CurBucketNo, &batchno);
  			node->hj_CurTuple = NULL;
  
  			/*
! 			 * Now we've got an outer tuple and the corresponding hash bucket,
! 			 * but this tuple may not belong to the current batch.
  			 */
! 			if (batchno != hashtable->curbatch)
  			{
  				/*
  				 * Need to postpone this outer tuple to a later batch. Save it
--- 405,423 ----
  			ExecHashGetBucketAndBatch(hashtable, hashvalue,
  									  &node->hj_CurBucketNo, &batchno);
  			node->hj_CurTuple = NULL;
+ 			
+ 			/* Does the outer tuple match an IM bucket? */
+ 			node->hj_OuterTupleIMBucketNo = 
+ 				ExecHashGetIMBucket(hashtable, hashvalue);
  
  			/*
! 			 * Now we've got an outer tuple and the corresponding hash bucket.
! 			 *
! 			 * If the outer tuple does not match an IM bucket and it does not
! 			 * belong to the current batch.
  			 */
! 			if (node->hj_OuterTupleIMBucketNo == IM_INVALID_BUCKET
! 				&& batchno != hashtable->curbatch)
  			{
  				/*
  				 * Need to postpone this outer tuple to a later batch. Save it
***************
*** 641,647 ****
  	nbatch = hashtable->nbatch;
  	curbatch = hashtable->curbatch;
  
! 	if (curbatch > 0)
  	{
  		/*
  		 * We no longer need the previous outer batch file; close it right
--- 848,874 ----
  	nbatch = hashtable->nbatch;
  	curbatch = hashtable->curbatch;
  
! 	/* if we just finished the first batch */
! 	if (curbatch == 0)
! 	{
! 		/*
! 		 * Reset some of the skew optimization state variables. IM buckets are
! 		 * no longer being used as of this point because they are only
! 		 * necessary while joining the first batch (before the cleanup phase).
! 		 *
! 		 * Especially need to make sure ExecHashGetIMBucket returns
! 		 * IM_INVALID_BUCKET quickly for all subsequent calls.
! 		 *
! 		 * IM buckets are only taking up memory if this is a multi-batch join
! 		 * and in that case ExecHashTableReset is about to be called which
! 		 * will free all memory currently used by IM buckets and tuples when
! 		 * it deletes hashtable->batchCxt.  If this is a single batch join
! 		 * then imBucket and imBucketFreezeOrder are already NULL and empty.
! 		 */
! 		hashtable->enableSkewOptimization = false;
! 		hashtable->spaceUsedIM = 0;
! 	}
! 	else if (curbatch > 0)
  	{
  		/*
  		 * We no longer need the previous outer batch file; close it right
Index: src/include/executor/hashjoin.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/hashjoin.h,v
retrieving revision 1.49
diff -c -r1.49 hashjoin.h
*** src/include/executor/hashjoin.h	1 Jan 2009 17:23:59 -0000	1.49
--- src/include/executor/hashjoin.h	6 Jan 2009 23:13:19 -0000
***************
*** 72,77 ****
--- 72,96 ----
  #define HJTUPLE_MINTUPLE(hjtup)  \
  	((MinimalTuple) ((char *) (hjtup) + HJTUPLE_OVERHEAD))
  
+ /*
+  * Stores a hashvalue and linked list of tuples that share that hashvalue.
+  *
+  * When processing MCVs to detect skew in the probe relation of a hash join
+  * the hashvalue is generated and stored before any tuples have been read 
+  * (see ExecHashJoinDetectSkew).
+  *
+  * Build tuples that hash to the same hashvalue are placed in the bucket while
+  * reading the build relation.
+  */
+ typedef struct HashJoinIMBucket
+ {
+ 	uint32 hashvalue;
+ 	HashJoinTuple tuples;
+ } HashJoinIMBucket;
+ 
+ #define IM_INVALID_BUCKET -1
+ #define IM_WORK_MEM_PERCENT 2
+ #define IM_BUCKET_OVERHEAD MAXALIGN(sizeof(HashJoinIMBucket))
  
  typedef struct HashJoinTableData
  {
***************
*** 113,121 ****
--- 132,161 ----
  
  	Size		spaceUsed;		/* memory space currently used by tuples */
  	Size		spaceAllowed;	/* upper limit for space used */
+ 	/* memory space currently used by IM buckets and tuples */
+ 	Size		spaceUsedIM;
+ 	/* upper limit for space used by IM buckets and tuples */
+ 	Size		spaceAllowedIM;
  
  	MemoryContext hashCxt;		/* context for whole-hash-join storage */
  	MemoryContext batchCxt;		/* context for this-batch-only storage */
+ 	
+ 	/* will the join optimize memory usage when probe relation is skewed */
+ 	bool enableSkewOptimization;
+ 	HashJoinIMBucket **imBucket; /* hashtable of IM buckets */
+ 	/*
+ 	 * array of imBucket indexes to the created IM buckets sorted
+ 	 * in the opposite order that they would be frozen to disk
+ 	 */
+ 	uint16 *imBucketFreezeOrder;
+ 	int nIMBuckets; /* # of buckets in the IM buckets hashtable */
+ 	/*
+ 	 * # of used buckets in the IM buckets hashtable and length of
+ 	 * imBucketFreezeOrder array
+ 	 */
+ 	int nUsedIMBuckets;
+ 	/* # of IM buckets that have already been frozen to disk */
+ 	int nIMBucketsFrozen;
  } HashJoinTableData;
  
  #endif   /* HASHJOIN_H */
Index: src/include/executor/nodeHash.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/nodeHash.h,v
retrieving revision 1.46
diff -c -r1.46 nodeHash.h
*** src/include/executor/nodeHash.h	1 Jan 2009 17:23:59 -0000	1.46
--- src/include/executor/nodeHash.h	6 Jan 2009 23:29:18 -0000
***************
*** 45,48 ****
--- 45,50 ----
  						int *numbuckets,
  						int *numbatches);
  
+ extern int ExecHashGetIMBucket(HashJoinTable hashtable, uint32 hashvalue);
+ 
  #endif   /* NODEHASH_H */
Index: src/include/nodes/execnodes.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/nodes/execnodes.h,v
retrieving revision 1.199
diff -c -r1.199 execnodes.h
*** src/include/nodes/execnodes.h	1 Jan 2009 17:23:59 -0000	1.199
--- src/include/nodes/execnodes.h	6 Jan 2009 23:11:33 -0000
***************
*** 1381,1386 ****
--- 1381,1387 ----
   *		hj_NeedNewOuter			true if need new outer tuple on next call
   *		hj_MatchedOuter			true if found a join match for current outer
   *		hj_OuterNotEmpty		true if outer relation known not empty
+  *		hj_OuterTupleIMBucketNo	IM bucket# for the current outer tuple
   * ----------------
   */
  
***************
*** 1406,1411 ****
--- 1407,1413 ----
  	bool		hj_NeedNewOuter;
  	bool		hj_MatchedOuter;
  	bool		hj_OuterNotEmpty;
+ 	int			hj_OuterTupleIMBucketNo;
  } HashJoinState;
  
  
Index: src/include/nodes/primnodes.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/nodes/primnodes.h,v
retrieving revision 1.145
diff -c -r1.145 primnodes.h
*** src/include/nodes/primnodes.h	1 Jan 2009 17:24:00 -0000	1.145
--- src/include/nodes/primnodes.h	5 Jan 2009 12:57:25 -0000
***************
*** 142,148 ****
  	Index		varlevelsup;	/* for subquery variables referencing outer
  								 * relations; 0 in a normal var, >0 means N
  								 * levels up */
! 	Index		varnoold;		/* original value of varno, for debugging */
  	AttrNumber	varoattno;		/* original value of varattno */
  	int			location;		/* token location, or -1 if unknown */
  } Var;
--- 142,148 ----
  	Index		varlevelsup;	/* for subquery variables referencing outer
  								 * relations; 0 in a normal var, >0 means N
  								 * levels up */
! 	Index		varnoold;		/* original value of varno */
  	AttrNumber	varoattno;		/* original value of varattno */
  	int			location;		/* token location, or -1 if unknown */
  } Var;
#41Robert Haas
robertmhaas@gmail.com
In reply to: Bryce Cutt (#40)
1 attachment(s)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

We would really appreciate help on finalizing this patch, especially in
regard to style issues. Thank you for all the help.

Here is a cleaned-up version. I fixed a number of whitespace issues,
improved a few comments, and rearranged one set of nested if-else
statements (hopefully without breaking anything in the process).

Josh / eggyknap -

Can you rerun your performance tests with this version of the patch?

...Robert

Attachments:

histojoin_v5_rh1.patchtext/x-diff; name=histojoin_v5_rh1.patchDownload
*** a/src/backend/executor/nodeHash.c
--- b/src/backend/executor/nodeHash.c
***************
*** 53,58 **** ExecHash(HashState *node)
--- 53,222 ----
  	return NULL;
  }
  
+ /*
+  * ----------------------------------------------------------------
+  *		ExecHashGetIMBucket
+  *
+  *  	Returns the index of the in-memory bucket for this
+  *		hashvalue, or IM_INVALID_BUCKET if the hashvalue is not
+  *		associated with any unfrozen bucket (or if skew
+  *		optimization is not being used).
+  *  
+  *		It is possible for a tuple whose join attribute value is
+  *		not a MCV to hash to an in-memory bucket due to the limited
+  * 		number of hash values but it is unlikely and everything
+  *		continues to work even if it does happen. We would
+  *		accidentally cache some less optimal tuples in memory
+  *		but the join result would still be accurate.
+  *
+  *		hashtable->imBucket is an open addressing hashtable of
+  *		in-memory buckets (HashJoinIMBucket).
+  * ----------------------------------------------------------------
+  */
+ int 
+ ExecHashGetIMBucket(HashJoinTable hashtable, uint32 hashvalue)
+ {
+ 	int bucket;
+ 
+ 	if (!hashtable->enableSkewOptimization)
+ 		return IM_INVALID_BUCKET;
+ 	
+ 	/* Modulo the hashvalue (using bitmask) to find the IM bucket. */
+ 	bucket = hashvalue & (hashtable->nIMBuckets - 1);
+ 
+ 	/*
+ 	 * While we have not hit a hole in the hashtable and have not hit the 
+ 	 * actual bucket we have collided in the hashtable so try the next
+ 	 * bucket location.
+ 	 */
+ 	while (hashtable->imBucket[bucket] != NULL
+ 		&& hashtable->imBucket[bucket]->hashvalue != hashvalue)
+ 		bucket = (bucket + 1) & (hashtable->nIMBuckets - 1);
+ 
+ 	/*
+ 	 * If the bucket exists and has been correctly determined return
+ 	 * the bucket index.
+ 	 */
+ 	if (hashtable->imBucket[bucket] != NULL
+ 		&& hashtable->imBucket[bucket]->hashvalue == hashvalue)
+ 		return bucket;
+ 
+ 	/*
+ 	 * Must have run into an empty location or a frozen bucket which means the
+ 	 * tuple with this hashvalue is not to be handled as if it matches with an
+ 	 * in-memory bucket.
+ 	 */
+ 	return IM_INVALID_BUCKET;
+ }
+ 
+ /*
+  * ----------------------------------------------------------------
+  *		ExecHashFreezeNextIMBucket
+  *
+  *		Freeze the tuples of the next in-memory bucket by pushing
+  *		them into the main hashtable.  Buckets are frozen in order
+  *		so that the best tuples are kept in memory the longest.
+  * ----------------------------------------------------------------
+  */
+ static bool 
+ ExecHashFreezeNextIMBucket(HashJoinTable hashtable)
+ {
+ 	int bucketToFreeze;
+ 	int bucketno;
+ 	int batchno;
+ 	uint32 hashvalue;
+ 	HashJoinTuple hashTuple;
+ 	HashJoinTuple nextHashTuple;
+ 	HashJoinIMBucket *bucket;
+ 	MinimalTuple mintuple;
+ 
+ 	/* Calculate the imBucket index of the bucket to freeze. */
+ 	bucketToFreeze = hashtable->imBucketFreezeOrder
+ 		[hashtable->nUsedIMBuckets - 1 - hashtable->nIMBucketsFrozen];
+ 
+ 	/* Grab a pointer to the actual IM bucket. */
+ 	bucket = hashtable->imBucket[bucketToFreeze];
+ 	hashvalue = bucket->hashvalue;
+ 
+ 	/*
+ 	 * Grab a pointer to the first tuple in the soon to be frozen IM bucket.
+ 	 */
+ 	hashTuple = bucket->tuples;
+ 
+ 	/*
+ 	 * Calculate which bucket and batch the tuples belong to in the main
+ 	 * non-IM hashtable.
+ 	 */
+ 	ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno, &batchno);
+ 
+ 	/* until we have read all tuples from this bucket */
+ 	while (hashTuple != NULL)
+ 	{
+ 		/*
+ 		 * Some of this code is very similar to that of ExecHashTableInsert.
+ 		 * We do not call ExecHashTableInsert directly as
+ 		 * ExecHashTableInsert expects a TupleTableSlot and we already have
+ 		 * HashJoinTuples.
+ 		 */
+ 		mintuple = HJTUPLE_MINTUPLE(hashTuple);
+ 
+ 		/* Decide whether to put the tuple in the hash table or a temp file. */
+ 		if (batchno == hashtable->curbatch)
+ 		{
+ 			/* Put the tuple in hash table. */
+ 			nextHashTuple = hashTuple->next;
+ 			hashTuple->next = hashtable->buckets[bucketno];
+ 			hashtable->buckets[bucketno] = hashTuple;
+ 			hashTuple = nextHashTuple;
+ 			hashtable->spaceUsedIM -= HJTUPLE_OVERHEAD + mintuple->t_len;
+ 		}
+ 		else
+ 		{
+ 			/* Put the tuples into a temp file for later batches. */
+ 			Assert(batchno > hashtable->curbatch);
+ 			ExecHashJoinSaveTuple(mintuple, hashvalue,
+ 								  &hashtable->innerBatchFile[batchno]);
+ 			/*
+ 			 * Some memory has been freed up. This must be done before we
+ 			 * pfree the hashTuple or we lose access to the tuple size.
+ 			 */
+ 			hashtable->spaceUsed -= HJTUPLE_OVERHEAD + mintuple->t_len;
+ 			hashtable->spaceUsedIM -= HJTUPLE_OVERHEAD + mintuple->t_len;
+ 			nextHashTuple = hashTuple->next;
+ 			pfree(hashTuple);
+ 			hashTuple = nextHashTuple;
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Free the memory the bucket struct was using as it is not necessary
+ 	 * any more.  All code treats a frozen in-memory bucket the same as one
+ 	 * that did not exist; by setting the pointer to null the rest of the code
+ 	 * will function as if we had not created this bucket at all.
+ 	 */
+ 	pfree(bucket);
+ 	hashtable->imBucket[bucketToFreeze] = NULL;
+ 	hashtable->spaceUsed -= IM_BUCKET_OVERHEAD;
+ 	hashtable->spaceUsedIM -= IM_BUCKET_OVERHEAD;
+ 	hashtable->nIMBucketsFrozen++;
+ 
+ 	/*
+ 	 * All buckets have been frozen and deleted from memory so turn off
+ 	 * skew aware partitioning and remove the structs from memory as they are
+ 	 * just wasting space from now on.
+ 	 */
+ 	if (hashtable->nUsedIMBuckets == hashtable->nIMBucketsFrozen)
+ 	{
+ 		hashtable->enableSkewOptimization = false;
+ 		pfree(hashtable->imBucket);
+ 		pfree(hashtable->imBucketFreezeOrder);
+ 		hashtable->spaceUsed -= hashtable->spaceUsedIM;
+ 		hashtable->spaceUsedIM = 0;
+ 	}
+ 
+ 	return true;
+ }
+ 
  /* ----------------------------------------------------------------
   *		MultiExecHash
   *
***************
*** 69,74 **** MultiExecHash(HashState *node)
--- 233,240 ----
  	TupleTableSlot *slot;
  	ExprContext *econtext;
  	uint32		hashvalue;
+ 	MinimalTuple mintuple;
+ 	int bucketNumber;
  
  	/* must provide our own instrumentation support */
  	if (node->ps.instrument)
***************
*** 99,106 **** MultiExecHash(HashState *node)
  		if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
  								 &hashvalue))
  		{
! 			ExecHashTableInsert(hashtable, slot, hashvalue);
! 			hashtable->totalTuples += 1;
  		}
  	}
  
--- 265,306 ----
  		if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
  								 &hashvalue))
  		{
! 			bucketNumber = ExecHashGetIMBucket(hashtable, hashvalue);
! 
! 			/* handle tuples not destined for an in-memory bucket normally */
! 			if (bucketNumber == IM_INVALID_BUCKET)
! 				ExecHashTableInsert(hashtable, slot, hashvalue);
! 			else
! 			{
! 				HashJoinTuple hashTuple;
! 				int			hashTupleSize;
! 				
! 				/* get the HashJoinTuple */
! 				mintuple = ExecFetchSlotMinimalTuple(slot);
! 				hashTupleSize = HJTUPLE_OVERHEAD + mintuple->t_len;
! 				hashTuple = (HashJoinTuple)
! 					MemoryContextAlloc(hashtable->batchCxt, hashTupleSize);
! 				hashTuple->hashvalue = hashvalue;
! 				memcpy(HJTUPLE_MINTUPLE(hashTuple), mintuple, mintuple->t_len);
! 
! 				/* Push the HashJoinTuple onto the front of the IM bucket. */
! 				hashTuple->next = hashtable->imBucket[bucketNumber]->tuples;
! 				hashtable->imBucket[bucketNumber]->tuples = hashTuple;
! 
! 				/*
! 				 * More memory is now in use so make sure we are not over
! 				 * spaceAllowedIM.
! 				 */
! 				hashtable->spaceUsed += hashTupleSize;
! 				hashtable->spaceUsedIM += hashTupleSize;
! 				while (hashtable->spaceUsedIM > hashtable->spaceAllowedIM
! 					&& ExecHashFreezeNextIMBucket(hashtable))
! 					;
! 				/* Guarantee we are not over the spaceAllowed. */
! 				if (hashtable->spaceUsed > hashtable->spaceAllowed)
! 					ExecHashIncreaseNumBatches(hashtable);
! 			}
! 			hashtable->totalTuples++;
  		}
  	}
  
***************
*** 269,274 **** ExecHashTableCreate(Hash *node, List *hashOperators)
--- 469,483 ----
  	hashtable->outerBatchFile = NULL;
  	hashtable->spaceUsed = 0;
  	hashtable->spaceAllowed = work_mem * 1024L;
+ 	/* Initialize skew optimization related hashtable variables. */
+ 	hashtable->spaceUsedIM = 0;
+ 	hashtable->spaceAllowedIM
+ 		= hashtable->spaceAllowed * IM_WORK_MEM_PERCENT / 100;
+ 	hashtable->enableSkewOptimization = false;
+ 	hashtable->nUsedIMBuckets = 0;
+ 	hashtable->nIMBuckets = 0;
+ 	hashtable->imBucket = NULL;
+ 	hashtable->nIMBucketsFrozen = 0;
  
  	/*
  	 * Get info about the hash functions to be used for each hash key. Also
***************
*** 815,825 **** ExecScanHashBucket(HashJoinState *hjstate,
  	/*
  	 * hj_CurTuple is NULL to start scanning a new bucket, or the address of
  	 * the last tuple returned from the current bucket.
  	 */
! 	if (hashTuple == NULL)
! 		hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];
! 	else
  		hashTuple = hashTuple->next;
  
  	while (hashTuple != NULL)
  	{
--- 1024,1040 ----
  	/*
  	 * hj_CurTuple is NULL to start scanning a new bucket, or the address of
  	 * the last tuple returned from the current bucket.
+ 	 *
+ 	 * If the tuple hashed to an IM bucket then scan the IM bucket
+ 	 * otherwise scan the standard hashtable bucket.
  	 */
! 	if (hashTuple != NULL)
  		hashTuple = hashTuple->next;
+ 	else if (hjstate->hj_OuterTupleIMBucketNo != IM_INVALID_BUCKET)
+ 		hashTuple = hashtable->imBucket[hjstate->hj_OuterTupleIMBucketNo]
+ 			->tuples;
+ 	else
+ 		hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];
  
  	while (hashTuple != NULL)
  	{
*** a/src/backend/executor/nodeHashjoin.c
--- b/src/backend/executor/nodeHashjoin.c
***************
*** 20,25 ****
--- 20,29 ----
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
  #include "utils/memutils.h"
+ #include "utils/syscache.h"
+ #include "utils/lsyscache.h"
+ #include "parser/parsetree.h"
+ #include "catalog/pg_statistic.h"
  
  
  /* Returns true for JOIN_LEFT and JOIN_ANTI jointypes */
***************
*** 34,39 **** static TupleTableSlot *ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
--- 38,227 ----
  						  TupleTableSlot *tupleSlot);
  static int	ExecHashJoinNewBatch(HashJoinState *hjstate);
  
+ /*
+  * ----------------------------------------------------------------
+  *		ExecHashJoinDetectSkew
+  *
+  *		If MCV statistics can be found for the join attribute of
+  *		this hashjoin then create a hash table of buckets. Each
+  *		bucket will correspond to an MCV hashvalue and will be
+  *		filled with inner relation tuples whose join attribute
+  *		hashes to the same value as that MCV.  If a join
+  *		attribute value is a MCV for the join attribute in the
+  *		outer (probe) relation, tuples with this value in the
+  *		inner (build) relation are more likely to join with outer
+  *		relation tuples and a benefit can be gained by keeping
+  *		them in memory while joining the first batch of tuples.
+  * ----------------------------------------------------------------
+  */
+ static void 
+ ExecHashJoinDetectSkew(EState *estate, HashJoinState *hjstate, int tupwidth)
+ {
+ 	HeapTupleData	*statsTuple;
+ 	FuncExprState	*clause;
+ 	ExprState		*argstate;
+ 	Var				*variable;
+ 	HashJoinTable	hashtable;
+ 	Datum			*values;
+ 	int				nvalues;
+ 	float4			*numbers;
+ 	int				nnumbers;
+ 	Oid				relid;
+ 	Oid				atttype;
+ 	int				i;
+ 	int				mcvsToUse;
+ 
+ 	/* Only use statistics if there is a single join attribute. */
+ 	if (hjstate->hashclauses->length != 1)
+ 		return; /* Histojoin is not defined for more than one join key */
+ 	
+ 	hashtable = hjstate->hj_HashTable;
+ 	
+ 	/*
+ 	 * Estimate the number of IM buckets that will fit in
+ 	 * the memory allowed for IM buckets.
+ 	 *
+ 	 * hashtable->imBucket will have up to 8 times as many HashJoinIMBucket
+ 	 * pointers as the number of MCV hashvalues. A uint16 index in
+ 	 * hashtable->imBucketFreezeOrder will be created for each IM bucket. One
+ 	 * actual HashJoinIMBucket struct will be created for each
+ 	 * unique MCV hashvalue so up to one struct per MCV.
+ 	 *
+ 	 * It is also estimated that each IM bucket will have a single build
+ 	 * tuple stored in it after partitioning the build relation input.  This
+ 	 * estimate could be high if tuples are filtered out before this join but
+ 	 * in that case the extra memory is used by the regular hashjoin batch.
+ 	 * This estimate could be low if it is a many to many join but in that
+ 	 * case IM buckets will be frozen to free up memory as needed
+ 	 * during the inner relation partitioning phase.
+ 	 */
+ 	mcvsToUse = hashtable->spaceAllowedIM / (
+ 		/* size of a hash tuple */
+ 		HJTUPLE_OVERHEAD + MAXALIGN(sizeof(MinimalTupleData))
+ 			+ MAXALIGN(tupwidth)
+ 		/* max size of hashtable pointers per MCV */
+ 		+ (8 * sizeof(HashJoinIMBucket*))
+ 		+ sizeof(uint16) /* size of imBucketFreezeOrder entry */
+ 		+ IM_BUCKET_OVERHEAD /* size of IM bucket struct */
+ 		);
+ 	if (mcvsToUse == 0)
+ 		return;	/* No point in considering this any further. */
+ 
+ 	/*
+ 	 * Determine the relation id and attribute id of the single join
+ 	 * attribute of the probe relation.
+ 	 */
+ 	clause = (FuncExprState *) lfirst(list_head(hjstate->hashclauses));
+ 	argstate = (ExprState *) lfirst(list_head(clause->args));
+ 
+ 	/*
+ 	 * Do not try to exploit stats if the join attribute is an expression
+ 	 * instead of just a simple attribute.
+ 	 */		
+ 	if (argstate->expr->type != T_Var)
+ 		return;
+ 
+ 	variable = (Var *) argstate->expr;
+ 	relid = getrelid(variable->varnoold, estate->es_range_table);
+ 	atttype = variable->vartype;
+ 
+ 	statsTuple = SearchSysCache(STATRELATT, ObjectIdGetDatum(relid),
+ 		Int16GetDatum(variable->varoattno), 0, 0);
+ 	if (!HeapTupleIsValid(statsTuple))
+ 		return;
+ 
+ 	/* Look for MCV statistics for the attribute. */
+ 	if (get_attstatsslot(statsTuple, atttype, variable->vartypmod,
+ 		STATISTIC_KIND_MCV, InvalidOid, &values, &nvalues,
+ 		&numbers, &nnumbers))
+ 	{
+ 		FmgrInfo   *hashfunctions;
+ 		int nbuckets = 2;
+ 
+ 		/*
+ 		 * IM buckets (imBucket) is an open addressing hashtable with a 
+ 		 * power of 2 size that is greater than the number of MCV values.
+ 		 */
+ 		if (mcvsToUse > nvalues)
+ 			mcvsToUse = nvalues;
+ 		while (nbuckets <= mcvsToUse)
+ 			nbuckets <<= 1;
+ 		/* use two more bits just to help avoid collisions */
+ 		nbuckets <<= 2;
+ 		hashtable->nIMBuckets = nbuckets;
+ 		hashtable->enableSkewOptimization = true;
+ 
+ 		/*
+ 		 * Allocate the bucket memory in the hashtable's batch context
+ 		 * because it is only relevant and necessary during the first batch
+ 		 * and will be nicely removed once the first batch is done.
+ 		 */
+ 		hashtable->imBucket = 
+ 			MemoryContextAllocZero(hashtable->batchCxt,
+ 				nbuckets * sizeof(HashJoinIMBucket*));
+ 		hashtable->imBucketFreezeOrder = 
+ 			MemoryContextAllocZero(hashtable->batchCxt,
+ 				mcvsToUse * sizeof(uint16));
+ 		/* Count the overhead of the IM pointers immediately. */
+ 		hashtable->spaceUsed += nbuckets * sizeof(HashJoinIMBucket*)
+ 			+ mcvsToUse * sizeof(uint16);
+ 		hashtable->spaceUsedIM +=  nbuckets * sizeof(HashJoinIMBucket*)
+ 			+ mcvsToUse * sizeof(uint16);
+ 
+ 		/*
+ 		 * Grab the hash functions as we will be generating the hashvalues
+ 		 * in this section.
+ 		 */
+ 		hashfunctions = hashtable->outer_hashfunctions;
+ 
+ 		/* Create the buckets */
+ 		for (i = 0; i < mcvsToUse; i++)
+ 		{
+ 			uint32 hashvalue = DatumGetUInt32(
+ 				FunctionCall1(&hashfunctions[0], values[i]));
+ 			int bucket = hashvalue & (nbuckets - 1);
+ 
+ 			/*
+ 			 * While we have not hit a hole in the hashtable and have not hit
+ 			 * a bucket with the same hashvalue we have collided in the
+ 			 * hashtable so try the next bucket location (remember it is an
+ 			 * open addressing hashtable).
+ 			 */
+ 			while (hashtable->imBucket[bucket] != NULL
+ 				&& hashtable->imBucket[bucket]->hashvalue != hashvalue)
+ 				bucket = (bucket + 1) & (nbuckets - 1);
+ 
+ 			/*
+ 			 * Leave bucket alone if it has the same hashvalue as current
+ 			 * MCV. We only want one bucket per hashvalue. Even if two MCV
+ 			 * values hash to the same bucket we are fine.
+ 			 */
+ 			if (hashtable->imBucket[bucket] == NULL)
+ 			{
+ 				/*
+ 				 * Allocate the actual bucket structure in the hashtable's batch
+ 				 * context because it is only relevant and necessary during
+ 				 * the first batch and will be nicely removed once the first
+ 				 * batch is done.
+ 				 */
+ 				hashtable->imBucket[bucket]
+ 					= MemoryContextAllocZero(hashtable->batchCxt,
+ 						sizeof(HashJoinIMBucket));
+ 				hashtable->imBucket[bucket]->hashvalue = hashvalue;
+ 				hashtable->imBucketFreezeOrder[hashtable->nUsedIMBuckets]
+ 					= bucket;
+ 				hashtable->nUsedIMBuckets++;
+ 				/* Count the overhead of the IM bucket struct */
+ 				hashtable->spaceUsed += IM_BUCKET_OVERHEAD;
+ 				hashtable->spaceUsedIM += IM_BUCKET_OVERHEAD;
+ 			}
+ 		}
+ 
+ 		free_attstatsslot(atttype, values, nvalues, numbers, nnumbers);
+ 	}
+ 
+ 	ReleaseSysCache(statsTuple);
+ }
  
  /* ----------------------------------------------------------------
   *		ExecHashJoin
***************
*** 147,152 **** ExecHashJoin(HashJoinState *node)
--- 335,345 ----
  										node->hj_HashOperators);
  		node->hj_HashTable = hashtable;
  
+ 		/* Use skew optimization only when there is more than one batch. */
+ 		if (hashtable->nbatch > 1)
+ 			ExecHashJoinDetectSkew(estate, node,
+ 				(outerPlan((Hash *) hashNode->ps.plan))->plan_width);
+ 
  		/*
  		 * execute the Hash node, to build the hash table
  		 */
***************
*** 205,216 **** ExecHashJoin(HashJoinState *node)
  			ExecHashGetBucketAndBatch(hashtable, hashvalue,
  									  &node->hj_CurBucketNo, &batchno);
  			node->hj_CurTuple = NULL;
  
  			/*
  			 * Now we've got an outer tuple and the corresponding hash bucket,
! 			 * but this tuple may not belong to the current batch.
  			 */
! 			if (batchno != hashtable->curbatch)
  			{
  				/*
  				 * Need to postpone this outer tuple to a later batch. Save it
--- 398,415 ----
  			ExecHashGetBucketAndBatch(hashtable, hashvalue,
  									  &node->hj_CurBucketNo, &batchno);
  			node->hj_CurTuple = NULL;
+ 			
+ 			/* Does the outer tuple match an IM bucket? */
+ 			node->hj_OuterTupleIMBucketNo = 
+ 				ExecHashGetIMBucket(hashtable, hashvalue);
  
  			/*
  			 * Now we've got an outer tuple and the corresponding hash bucket,
! 			 * but it might not belong to the current batch, or it might need
! 			 * to go into an in-memory bucket.
  			 */
! 			if (node->hj_OuterTupleIMBucketNo == IM_INVALID_BUCKET
! 				&& batchno != hashtable->curbatch)
  			{
  				/*
  				 * Need to postpone this outer tuple to a later batch. Save it
***************
*** 641,647 **** start_over:
  	nbatch = hashtable->nbatch;
  	curbatch = hashtable->curbatch;
  
! 	if (curbatch > 0)
  	{
  		/*
  		 * We no longer need the previous outer batch file; close it right
--- 840,866 ----
  	nbatch = hashtable->nbatch;
  	curbatch = hashtable->curbatch;
  
! 	/* if we just finished the first batch */
! 	if (curbatch == 0)
! 	{
! 		/*
! 		 * Reset some of the skew optimization state variables. IM buckets are
! 		 * no longer being used as of this point because they are only
! 		 * necessary while joining the first batch (before the cleanup phase).
! 		 *
! 		 * Especially need to make sure ExecHashGetIMBucket returns
! 		 * IM_INVALID_BUCKET quickly for all subsequent calls.
! 		 *
! 		 * IM buckets are only taking up memory if this is a multi-batch join
! 		 * and in that case ExecHashTableReset is about to be called which
! 		 * will free all memory currently used by IM buckets and tuples when
! 		 * it deletes hashtable->batchCxt.  If this is a single batch join
! 		 * then imBucket and imBucketFreezeOrder are already NULL and empty.
! 		 */
! 		hashtable->enableSkewOptimization = false;
! 		hashtable->spaceUsedIM = 0;
! 	}
! 	else if (curbatch > 0)
  	{
  		/*
  		 * We no longer need the previous outer batch file; close it right
*** a/src/include/executor/hashjoin.h
--- b/src/include/executor/hashjoin.h
***************
*** 72,77 **** typedef struct HashJoinTupleData
--- 72,96 ----
  #define HJTUPLE_MINTUPLE(hjtup)  \
  	((MinimalTuple) ((char *) (hjtup) + HJTUPLE_OVERHEAD))
  
+ /*
+  * Stores a hashvalue and linked list of tuples that share that hashvalue.
+  *
+  * When processing MCVs to detect skew in the probe relation of a hash join
+  * the hashvalue is generated and stored before any tuples have been read 
+  * (see ExecHashJoinDetectSkew).
+  *
+  * Build tuples that hash to the same hashvalue are placed in the bucket while
+  * reading the build relation.
+  */
+ typedef struct HashJoinIMBucket
+ {
+ 	uint32 hashvalue;
+ 	HashJoinTuple tuples;
+ } HashJoinIMBucket;
+ 
+ #define IM_INVALID_BUCKET -1
+ #define IM_WORK_MEM_PERCENT 2
+ #define IM_BUCKET_OVERHEAD MAXALIGN(sizeof(HashJoinIMBucket))
  
  typedef struct HashJoinTableData
  {
***************
*** 113,121 **** typedef struct HashJoinTableData
--- 132,161 ----
  
  	Size		spaceUsed;		/* memory space currently used by tuples */
  	Size		spaceAllowed;	/* upper limit for space used */
+ 	/* memory space currently used by IM buckets and tuples */
+ 	Size		spaceUsedIM;
+ 	/* upper limit for space used by IM buckets and tuples */
+ 	Size		spaceAllowedIM;
  
  	MemoryContext hashCxt;		/* context for whole-hash-join storage */
  	MemoryContext batchCxt;		/* context for this-batch-only storage */
+ 	
+ 	/* will the join optimize memory usage when probe relation is skewed */
+ 	bool enableSkewOptimization;
+ 	HashJoinIMBucket **imBucket; /* hashtable of IM buckets */
+ 	/*
+ 	 * array of imBucket indexes to the created IM buckets sorted
+ 	 * in the opposite order that they would be frozen to disk
+ 	 */
+ 	uint16 *imBucketFreezeOrder;
+ 	int nIMBuckets; /* # of buckets in the IM buckets hashtable */
+ 	/*
+ 	 * # of used buckets in the IM buckets hashtable and length of
+ 	 * imBucketFreezeOrder array
+ 	 */
+ 	int nUsedIMBuckets;
+ 	/* # of IM buckets that have already been frozen to disk */
+ 	int nIMBucketsFrozen;
  } HashJoinTableData;
  
  #endif   /* HASHJOIN_H */
*** a/src/include/executor/nodeHash.h
--- b/src/include/executor/nodeHash.h
***************
*** 45,48 **** extern void ExecChooseHashTableSize(double ntuples, int tupwidth,
--- 45,50 ----
  						int *numbuckets,
  						int *numbatches);
  
+ extern int ExecHashGetIMBucket(HashJoinTable hashtable, uint32 hashvalue);
+ 
  #endif   /* NODEHASH_H */
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
***************
*** 1381,1386 **** typedef struct MergeJoinState
--- 1381,1387 ----
   *		hj_NeedNewOuter			true if need new outer tuple on next call
   *		hj_MatchedOuter			true if found a join match for current outer
   *		hj_OuterNotEmpty		true if outer relation known not empty
+  *		hj_OuterTupleIMBucketNo	IM bucket# for the current outer tuple
   * ----------------
   */
  
***************
*** 1406,1411 **** typedef struct HashJoinState
--- 1407,1413 ----
  	bool		hj_NeedNewOuter;
  	bool		hj_MatchedOuter;
  	bool		hj_OuterNotEmpty;
+ 	int			hj_OuterTupleIMBucketNo;
  } HashJoinState;
  
  
*** a/src/include/nodes/primnodes.h
--- b/src/include/nodes/primnodes.h
***************
*** 121,128 **** typedef struct Expr
   * subplans; for example, in a join node varno becomes INNER or OUTER and
   * varattno becomes the index of the proper element of that subplan's target
   * list.  But varnoold/varoattno continue to hold the original values.
!  * The code doesn't really need varnoold/varoattno, but they are very useful
!  * for debugging and interpreting completed plans, so we keep them around.
   */
  #define    INNER		65000
  #define    OUTER		65001
--- 121,132 ----
   * subplans; for example, in a join node varno becomes INNER or OUTER and
   * varattno becomes the index of the proper element of that subplan's target
   * list.  But varnoold/varoattno continue to hold the original values.
!  *
!  * For the most part, the code doesn't really need varnoold/varoattno, but
!  * they are very useful for debugging and interpreting completed plans, so we
!  * keep them around.  As of PostgreSQL 8.4, these values are also used by
!  * ExecHashJoinDetectSkew to fetch MCV statistics when performing multi-batch
!  * hash joins.
   */
  #define    INNER		65000
  #define    OUTER		65001
***************
*** 142,148 **** typedef struct Var
  	Index		varlevelsup;	/* for subquery variables referencing outer
  								 * relations; 0 in a normal var, >0 means N
  								 * levels up */
! 	Index		varnoold;		/* original value of varno, for debugging */
  	AttrNumber	varoattno;		/* original value of varattno */
  	int			location;		/* token location, or -1 if unknown */
  } Var;
--- 146,152 ----
  	Index		varlevelsup;	/* for subquery variables referencing outer
  								 * relations; 0 in a normal var, >0 means N
  								 * levels up */
! 	Index		varnoold;		/* original value of varno */
  	AttrNumber	varoattno;		/* original value of varattno */
  	int			location;		/* token location, or -1 if unknown */
  } Var;
#42Joshua Tolley
eggyknap@gmail.com
In reply to: Robert Haas (#41)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Tue, Jan 06, 2009 at 11:49:57PM -0500, Robert Haas wrote:

Josh / eggyknap -

Can you rerun your performance tests with this version of the patch?

...Robert

Will do, as soon as I can.

#43Lawrence, Ramon
ramon.lawrence@ubc.ca
In reply to: Bryce Cutt (#40)
3 attachment(s)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

Here is a cleaned-up version. I fixed a number of whitespace issues,
improved a few comments, and rearranged one set of nested if-else
statements (hopefully without breaking anything in the process).

Josh / eggyknap -

Can you rerun your performance tests with this version of the patch?

To help with testing, we have constructed a separate testing patch. It
is the same as Robert's version except that it tracks and prints
statistics during the join on how many tuples are affected, and it
defines the enable_hashjoin_usestatmcvs variable so that skew handling
is easy to turn on and off. This is useful because, although the patch
reduces the number of I/Os performed, the improvement may not be
visible in queries dominated by other cost factors (non-skew joins,
CPU time, time to scan the input relations, etc.).

The sample output looks like this:

LI-P
Values: 100 Skew: 0.27  Est. tuples: 59986052.00 Batches: 512  Est. Save: 16114709.99
Total Inner Tuples: 2000000
IM Inner Tuples: 83
Batch Zero Inner Tuples: 3941
Batch Zero Potential Inner Tuples: 3941
Total Outer Tuples: 59986052
IM Outer Tuples: 16074146
Batch Zero Outer Tuples: 98778
Batch Zero Potential Outer Tuples: 98778
Total Output Tuples: 59986052
IM Output Tuples: 16074146
Batch Zero Output Tuples: 98778
Batch Zero Potential Output Tuples: 98778
Percentage less tuple IOs than HHJ: 25.98
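
For anyone checking the numbers: the "Percentage less tuple IOs than
HHJ" line is derived purely from the tuple counters printed above it.
Here is a minimal standalone C sketch (not part of either patch; the
counter values are simply copied from the LI-P sample output) that
reproduces that calculation:

#include <stdio.h>

int
main(void)
{
    /* Counters copied from the LI-P sample output above. */
    double nInnerTup = 2000000, nInnerIMTup = 83;
    double nInnerBatchZeroTup = 3941, nInnerPotentialBatchZeroTup = 3941;
    double nOuterTup = 59986052, nOuterIMTup = 16074146;
    double nOuterBatchZeroTup = 98778, nOuterPotentialBatchZeroTup = 98778;

    /* Tuples still written to batch files with the skew optimization on. */
    double spilledIM = (nInnerTup - nInnerIMTup - nInnerBatchZeroTup)
        + (nOuterTup - nOuterIMTup - nOuterBatchZeroTup);

    /* Tuples plain hybrid hash join would write to batch files. */
    double spilledHHJ = (nInnerTup - nInnerPotentialBatchZeroTup)
        + (nOuterTup - nOuterPotentialBatchZeroTup);

    printf("Percentage less tuple IOs than HHJ: %4.2f\n",
           (1 - spilledIM / spilledHHJ) * 100);
    return 0;
}

This prints 25.98, matching the sample run.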

The other change is that the system calculates the skew and will not use
the in-memory skew partition if the skew is less than 1%.

Finally, we have attached performance results for the TPC-H 10G data
set (skew factors z=1 and z=2). For the Customer-Orders-Lineitem-Part
query that Josh was testing, we see no overall time difference beyond
experimental error (although there is an I/O benefit for the
Lineitem-Part join). That query's cost is dominated by the non-skew
joins of Customer-Orders and Orders-Lineitem and by output tuple
construction.

The joins with skew, Lineitem-Supplier and Lineitem-Part, show
significantly improved performance. Note how the statistics show that
the percentage I/O savings is directly proportional to the skew.
However, the overall query time savings is always smaller than that,
because other costs (reading the relations, performing the hash
comparisons, building the output tuples, etc.) are unaffected by the
optimization.
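
To make that proportionality concrete: the "Est. Save" figure in the
NOTICE line is just the summed MCV frequency (the unrounded "Skew"
value, the same quantity the 1% cutoff is tested against) multiplied by
the probe-side row estimate and by the fraction of tuples that would
otherwise land outside batch zero. A minimal sketch, using an
illustrative frac of 0.269 since the exact value is rounded away in the
printed output (so the result lands close to, but not exactly at, the
16114709.99 shown above):

#include <stdio.h>

int
main(void)
{
    double frac = 0.269;            /* summed MCV frequencies ("Skew: 0.27" after rounding) */
    double outer_rows = 59986052.0; /* planner's probe-side row estimate */
    int    nbatch = 512;

    /* Estimated probe tuples kept out of the batch files. */
    printf("Est. Save: %4.2f\n", frac * (1 - 1.0 / nbatch) * outer_rows);
    return 0;
}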

At this point, we await further feedback on what is necessary to get
this patch accepted. We would also like to thank Josh and Robert again
for their review time.

Sincerely,

Ramon Lawrence and Bryce Cutt

Attachments:

histojoin_testing.patchapplication/octet-stream; name=histojoin_testing.patchDownload
Index: src/backend/executor/nodeHash.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHash.c,v
retrieving revision 1.117
diff -c -r1.117 nodeHash.c
*** src/backend/executor/nodeHash.c	1 Jan 2009 17:23:41 -0000	1.117
--- src/backend/executor/nodeHash.c	14 Jan 2009 06:36:51 -0000
***************
*** 53,58 ****
--- 53,224 ----
  	return NULL;
  }
  
+ /*
+  * ----------------------------------------------------------------
+  *		ExecHashGetIMBucket
+  *
+  *  	Returns the index of the in-memory bucket for this
+  *		hashvalue, or IM_INVALID_BUCKET if the hashvalue is not
+  *		associated with any unfrozen bucket (or if skew
+  *		optimization is not being used).
+  *  
+  *		It is possible for a tuple whose join attribute value is
+  *		not a MCV to hash to an in-memory bucket due to the limited
+  * 		number of hash values but it is unlikely and everything
+  *		continues to work even if it does happen. We would
+  *		accidentally cache some less optimal tuples in memory
+  *		but the join result would still be accurate.
+  *
+  *		hashtable->imBucket is an open addressing hashtable of
+  *		in-memory buckets (HashJoinIMBucket).
+  * ----------------------------------------------------------------
+  */
+ int 
+ ExecHashGetIMBucket(HashJoinTable hashtable, uint32 hashvalue)
+ {
+ 	int bucket;
+ 
+ 	if (!hashtable->enableSkewOptimization)
+ 		return IM_INVALID_BUCKET;
+ 	
+ 	/* Modulo the hashvalue (using bitmask) to find the IM bucket. */
+ 	bucket = hashvalue & (hashtable->nIMBuckets - 1);
+ 
+ 	/*
+ 	 * While we have not hit a hole in the hashtable and have not hit the 
+ 	 * actual bucket we have collided in the hashtable so try the next
+ 	 * bucket location.
+ 	 */
+ 	while (hashtable->imBucket[bucket] != NULL
+ 		&& hashtable->imBucket[bucket]->hashvalue != hashvalue)
+ 		bucket = (bucket + 1) & (hashtable->nIMBuckets - 1);
+ 
+ 	/*
+ 	 * If the bucket exists and has been correctly determined return
+ 	 * the bucket index.
+ 	 */
+ 	if (hashtable->imBucket[bucket] != NULL
+ 		&& hashtable->imBucket[bucket]->hashvalue == hashvalue)
+ 		return bucket;
+ 
+ 	/*
+ 	 * Must have run into an empty location or a frozen bucket which means the
+ 	 * tuple with this hashvalue is not to be handled as if it matches with an
+ 	 * in-memory bucket.
+ 	 */
+ 	return IM_INVALID_BUCKET;
+ }
+ 
+ /*
+  * ----------------------------------------------------------------
+  *		ExecHashFreezeNextIMBucket
+  *
+  *		Freeze the tuples of the next in-memory bucket by pushing
+  *		them into the main hashtable.  Buckets are frozen in order
+  *		so that the best tuples are kept in memory the longest.
+  * ----------------------------------------------------------------
+  */
+ static bool 
+ ExecHashFreezeNextIMBucket(HashJoinTable hashtable)
+ {
+ 	int bucketToFreeze;
+ 	int bucketno;
+ 	int batchno;
+ 	uint32 hashvalue;
+ 	HashJoinTuple hashTuple;
+ 	HashJoinTuple nextHashTuple;
+ 	HashJoinIMBucket *bucket;
+ 	MinimalTuple mintuple;
+ 
+ 	/* Calculate the imBucket index of the bucket to freeze. */
+ 	bucketToFreeze = hashtable->imBucketFreezeOrder
+ 		[hashtable->nUsedIMBuckets - 1 - hashtable->nIMBucketsFrozen];
+ 
+ 	/* Grab a pointer to the actual IM bucket. */
+ 	bucket = hashtable->imBucket[bucketToFreeze];
+ 	hashvalue = bucket->hashvalue;
+ 
+ 	/*
+ 	 * Grab a pointer to the first tuple in the soon to be frozen IM bucket.
+ 	 */
+ 	hashTuple = bucket->tuples;
+ 
+ 	/*
+ 	 * Calculate which bucket and batch the tuples belong to in the main
+ 	 * non-IM hashtable.
+ 	 */
+ 	ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno, &batchno);
+ 
+ 	/* until we have read all tuples from this bucket */
+ 	while (hashTuple != NULL)
+ 	{
+ 		/*
+ 		 * Some of this code is very similar to that of ExecHashTableInsert.
+ 		 * We do not call ExecHashTableInsert directly as
+ 		 * ExecHashTableInsert expects a TupleTableSlot and we already have
+ 		 * HashJoinTuples.
+ 		 */
+ 		mintuple = HJTUPLE_MINTUPLE(hashTuple);
+ 
+ 		hashtable->nInnerIMTupFrozen++;
+ 
+ 		/* Decide whether to put the tuple in the hash table or a temp file. */
+ 		if (batchno == hashtable->curbatch)
+ 		{
+ 			/* Put the tuple in hash table. */
+ 			nextHashTuple = hashTuple->next;
+ 			hashTuple->next = hashtable->buckets[bucketno];
+ 			hashtable->buckets[bucketno] = hashTuple;
+ 			hashTuple = nextHashTuple;
+ 			hashtable->spaceUsedIM -= HJTUPLE_OVERHEAD + mintuple->t_len;
+ 		}
+ 		else
+ 		{
+ 			/* Put the tuples into a temp file for later batches. */
+ 			Assert(batchno > hashtable->curbatch);
+ 			ExecHashJoinSaveTuple(mintuple, hashvalue,
+ 								  &hashtable->innerBatchFile[batchno]);
+ 			/*
+ 			 * Some memory has been freed up. This must be done before we
+ 			 * pfree the hashTuple or we lose access to the tuple size.
+ 			 */
+ 			hashtable->spaceUsed -= HJTUPLE_OVERHEAD + mintuple->t_len;
+ 			hashtable->spaceUsedIM -= HJTUPLE_OVERHEAD + mintuple->t_len;
+ 			nextHashTuple = hashTuple->next;
+ 			pfree(hashTuple);
+ 			hashTuple = nextHashTuple;
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Free the memory the bucket struct was using as it is not necessary
+ 	 * any more.  All code treats a frozen in-memory bucket the same as one
+ 	 * that did not exist; by setting the pointer to null the rest of the code
+ 	 * will function as if we had not created this bucket at all.
+ 	 */
+ 	pfree(bucket);
+ 	hashtable->imBucket[bucketToFreeze] = NULL;
+ 	hashtable->spaceUsed -= IM_BUCKET_OVERHEAD;
+ 	hashtable->spaceUsedIM -= IM_BUCKET_OVERHEAD;
+ 	hashtable->nIMBucketsFrozen++;
+ 
+ 	/*
+ 	 * All buckets have been frozen and deleted from memory so turn off
+ 	 * skew aware partitioning and remove the structs from memory as they are
+ 	 * just wasting space from now on.
+ 	 */
+ 	if (hashtable->nUsedIMBuckets == hashtable->nIMBucketsFrozen)
+ 	{
+ 		hashtable->enableSkewOptimization = false;
+ 		pfree(hashtable->imBucket);
+ 		pfree(hashtable->imBucketFreezeOrder);
+ 		hashtable->spaceUsed -= hashtable->spaceUsedIM;
+ 		hashtable->spaceUsedIM = 0;
+ 	}
+ 
+ 	return true;
+ }
+ 
  /* ----------------------------------------------------------------
   *		MultiExecHash
   *
***************
*** 69,74 ****
--- 235,242 ----
  	TupleTableSlot *slot;
  	ExprContext *econtext;
  	uint32		hashvalue;
+ 	MinimalTuple mintuple;
+ 	int bucketNumber;
  
  	/* must provide our own instrumentation support */
  	if (node->ps.instrument)
***************
*** 99,106 ****
  		if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
  								 &hashvalue))
  		{
! 			ExecHashTableInsert(hashtable, slot, hashvalue);
! 			hashtable->totalTuples += 1;
  		}
  	}
  
--- 267,322 ----
  		if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
  								 &hashvalue))
  		{
! 			int bucketno;
! 			int batchno;
! 
! 			bucketNumber = ExecHashGetIMBucket(hashtable, hashvalue);
! 
! 			ExecHashGetBucketAndBatch(hashtable, hashvalue,
! 									  &bucketno, &batchno);
! 			if (batchno == 0)
! 			{
! 				hashtable->nInnerPotentialBatchZeroTup++;
! 				if (bucketNumber == IM_INVALID_BUCKET)
! 					hashtable->nInnerBatchZeroTup++;
! 			}
! 
! 			/* handle tuples not destined for an in-memory bucket normally */
! 			if (bucketNumber == IM_INVALID_BUCKET)
! 				ExecHashTableInsert(hashtable, slot, hashvalue);
! 			else
! 			{
! 				HashJoinTuple hashTuple;
! 				int			hashTupleSize;
! 				
! 				/* get the HashJoinTuple */
! 				mintuple = ExecFetchSlotMinimalTuple(slot);
! 				hashTupleSize = HJTUPLE_OVERHEAD + mintuple->t_len;
! 				hashTuple = (HashJoinTuple)
! 					MemoryContextAlloc(hashtable->batchCxt, hashTupleSize);
! 				hashTuple->hashvalue = hashvalue;
! 				memcpy(HJTUPLE_MINTUPLE(hashTuple), mintuple, mintuple->t_len);
! 
! 				/* Push the HashJoinTuple onto the front of the IM bucket. */
! 				hashTuple->next = hashtable->imBucket[bucketNumber]->tuples;
! 				hashtable->imBucket[bucketNumber]->tuples = hashTuple;
! 				
! 				/*
! 				 * More memory is now in use so make sure we are not over
! 				 * spaceAllowedIM.
! 				 */
! 				hashtable->spaceUsed += hashTupleSize;
! 				hashtable->spaceUsedIM += hashTupleSize;
! 				hashtable->nInnerIMTup++;
! 				while (hashtable->spaceUsedIM > hashtable->spaceAllowedIM
! 					&& ExecHashFreezeNextIMBucket(hashtable))
! 					;
! 				/* Guarantee we are not over the spaceAllowed. */
! 				if (hashtable->spaceUsed > hashtable->spaceAllowed)
! 					ExecHashIncreaseNumBatches(hashtable);
! 			}
! 			hashtable->totalTuples++;
! 			hashtable->nInnerTup++;
  		}
  	}
  
***************
*** 269,274 ****
--- 485,512 ----
  	hashtable->outerBatchFile = NULL;
  	hashtable->spaceUsed = 0;
  	hashtable->spaceAllowed = work_mem * 1024L;
+ 	/* Initialize skew optimization related hashtable variables. */
+ 	hashtable->spaceUsedIM = 0;
+ 	hashtable->spaceAllowedIM
+ 		= hashtable->spaceAllowed * IM_WORK_MEM_PERCENT / 100;
+ 	hashtable->enableSkewOptimization = false;
+ 	hashtable->nUsedIMBuckets = 0;
+ 	hashtable->nIMBuckets = 0;
+ 	hashtable->imBucket = NULL;
+ 	hashtable->nIMBucketsFrozen = 0;
+ 	hashtable->nInnerTup = 0;
+ 	hashtable->nOuterTup = 0;
+ 	hashtable->nInnerIMTup = 0;
+ 	hashtable->nOuterIMTup = 0;
+ 	hashtable->nOutputTup = 0;
+ 	hashtable->nOutputIMTup = 0;
+ 	hashtable->nInnerIMTupFrozen = 0;
+ 	hashtable->nOuterBatchZeroTup = 0;
+ 	hashtable->nOuterPotentialBatchZeroTup = 0;
+ 	hashtable->nOutputBatchZeroTup = 0;
+ 	hashtable->nOutputPotentialBatchZeroTup = 0;
+ 	hashtable->nInnerBatchZeroTup = 0;
+ 	hashtable->nInnerPotentialBatchZeroTup = 0;
  
  	/*
  	 * Get info about the hash functions to be used for each hash key. Also
***************
*** 448,453 ****
--- 686,754 ----
  {
  	int			i;
  
+ 	/* the total # of inner tuples received by join */
+ 	ereport(NOTICE, (errmsg("Total Inner Tuples: %d", hashtable->nInnerTup)));
+ 	/* # inner tuples in the IM buckets */
+ 	ereport(NOTICE, (errmsg("IM Inner Tuples: %d", hashtable->nInnerIMTup)));
+ 	/* # inner tuples that fell in batch 0 */
+ 	ereport(NOTICE, (errmsg("Batch Zero Inner Tuples: %d", 
+ 		hashtable->nInnerBatchZeroTup)));
+ 	/*
+ 	 * # inner tuples that would have fallen in batch 0 if IM buckets were
+ 	 * not in use at all
+ 	 */
+ 	ereport(NOTICE, (errmsg("Batch Zero Potential Inner Tuples: %d", 
+ 		hashtable->nInnerPotentialBatchZeroTup)));
+ 	/* total # of outer tuples received by join */
+ 	ereport(NOTICE, (errmsg("Total Outer Tuples: %d", hashtable->nOuterTup)));
+ 	/* # outer tuples that matched with the IM buckets */
+ 	ereport(NOTICE, (errmsg("IM Outer Tuples: %d", hashtable->nOuterIMTup)));
+ 	/* # outer tuples that matched with batch 0 */
+ 	ereport(NOTICE, (errmsg("Batch Zero Outer Tuples: %d", 
+ 		hashtable->nOuterBatchZeroTup)));
+ 	/*
+ 	 * # outer tuples that would have fallen in batch 0 if IM buckets were
+ 	 * not in use at all
+ 	 */
+ 	ereport(NOTICE, (errmsg("Batch Zero Potential Outer Tuples: %d", 
+ 		hashtable->nOuterPotentialBatchZeroTup)));
+ 	/* total # output tuples produced by join */
+ 	ereport(NOTICE, (errmsg("Total Output Tuples: %d", 
+ 		hashtable->nOutputTup)));
+ 	/* # output tuples that came from matches with IM bucket inner tuples */
+ 	ereport(NOTICE, (errmsg("IM Output Tuples: %d", 
+ 		hashtable->nOutputIMTup)));
+ 	/* # output tuples that came from matches with batch 0 inner tuples */
+ 	ereport(NOTICE, (errmsg("Batch Zero Output Tuples: %d", 
+ 		hashtable->nOutputBatchZeroTup)));
+ 	/*
+ 	 * # output tuples that would have come from matches with batch 0 if IM
+ 	 * buckets were not in use at all
+ 	 */
+ 	ereport(NOTICE, (errmsg("Batch Zero Potential Output Tuples: %d", 
+ 		hashtable->nOutputPotentialBatchZeroTup)));
+ 	/*
+ 	 * # of inner IM tuples that were frozen back to the main hashtable when
+ 	 * an IM bucket was frozen
+ 	 */
+ 	ereport(NOTICE, (errmsg("IM Tuples Frozen: %d", 
+ 		hashtable->nInnerIMTupFrozen)));
+ 	/* percentage less tuple IOs compared to HHJ due to skew optimization */
+ 	if (hashtable->nInnerTup - hashtable->nInnerPotentialBatchZeroTup
+ 			+ hashtable->nOuterTup - hashtable->nOuterPotentialBatchZeroTup
+ 			!= 0)
+ 		ereport(NOTICE, (errmsg("Percentage less tuple IOs than HHJ: %4.2f", 
+ 			(1 - (
+ 			(float)(hashtable->nInnerTup - hashtable->nInnerIMTup
+ 				- hashtable->nInnerBatchZeroTup 
+ 				+ hashtable->nOuterTup - hashtable->nOuterIMTup
+ 				- hashtable->nOuterBatchZeroTup)
+ 			/
+ 			(hashtable->nInnerTup - hashtable->nInnerPotentialBatchZeroTup
+ 				+ hashtable->nOuterTup
+ 				- hashtable->nOuterPotentialBatchZeroTup))
+ 			) * 100)));
+ 
  	/*
  	 * Make sure all the temp files are closed.  We skip batch 0, since it
  	 * can't have any temp files (and the arrays might not even exist if
***************
*** 815,825 ****
  	/*
  	 * hj_CurTuple is NULL to start scanning a new bucket, or the address of
  	 * the last tuple returned from the current bucket.
  	 */
! 	if (hashTuple == NULL)
! 		hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];
! 	else
  		hashTuple = hashTuple->next;
  
  	while (hashTuple != NULL)
  	{
--- 1116,1132 ----
  	/*
  	 * hj_CurTuple is NULL to start scanning a new bucket, or the address of
  	 * the last tuple returned from the current bucket.
+ 	 *
+ 	 * If the tuple hashed to an IM bucket then scan the IM bucket
+ 	 * otherwise scan the standard hashtable bucket.
  	 */
! 	if (hashTuple != NULL)
  		hashTuple = hashTuple->next;
+ 	else if (hjstate->hj_OuterTupleIMBucketNo != IM_INVALID_BUCKET)
+ 		hashTuple = hashtable->imBucket[hjstate->hj_OuterTupleIMBucketNo]
+ 			->tuples;
+ 	else
+ 		hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];
  
  	while (hashTuple != NULL)
  	{
Index: src/backend/executor/nodeHashjoin.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHashjoin.c,v
retrieving revision 1.97
diff -c -r1.97 nodeHashjoin.c
*** src/backend/executor/nodeHashjoin.c	1 Jan 2009 17:23:41 -0000	1.97
--- src/backend/executor/nodeHashjoin.c	13 Jan 2009 21:28:41 -0000
***************
*** 20,25 ****
--- 20,30 ----
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
  #include "utils/memutils.h"
+ #include "utils/syscache.h"
+ #include "utils/lsyscache.h"
+ #include "parser/parsetree.h"
+ #include "catalog/pg_statistic.h"
+ #include "optimizer/cost.h"
  
  
  /* Returns true for JOIN_LEFT and JOIN_ANTI jointypes */
***************
*** 34,39 ****
--- 39,246 ----
  						  TupleTableSlot *tupleSlot);
  static int	ExecHashJoinNewBatch(HashJoinState *hjstate);
  
+ /*
+  * ----------------------------------------------------------------
+  *		ExecHashJoinDetectSkew
+  *
+  *		If MCV statistics can be found for the join attribute of
+  *		this hashjoin then create a hash table of buckets. Each
+  *		bucket will correspond to an MCV hashvalue and will be
+  *		filled with inner relation tuples whose join attribute
+  *		hashes to the same value as that MCV.  If a join
+  *		attribute value is a MCV for the join attribute in the
+  *		outer (probe) relation, tuples with this value in the
+  *		inner (build) relation are more likely to join with outer
+  *		relation tuples and a benefit can be gained by keeping
+  *		them in memory while joining the first batch of tuples.
+  * ----------------------------------------------------------------
+  */
+ static void 
+ ExecHashJoinDetectSkew(EState *estate, HashJoinState *hjstate, int tupwidth)
+ {
+ 	HeapTupleData	*statsTuple;
+ 	FuncExprState	*clause;
+ 	ExprState		*argstate;
+ 	Var				*variable;
+ 	HashJoinTable	hashtable;
+ 	Datum			*values;
+ 	int				nvalues;
+ 	float4			*numbers;
+ 	int				nnumbers;
+ 	Oid				relid;
+ 	Oid				atttype;
+ 	int				i;
+ 	int				mcvsToUse;
+ 
+ 	PlanState  *outerNode = outerPlanState(hjstate);
+ 
+ 	/* Only use statistics if there is a single join attribute. */
+ 	if (hjstate->hashclauses->length != 1)
+ 		return; /* Histojoin is not defined for more than one join key */
+ 	
+ 	hashtable = hjstate->hj_HashTable;
+ 	
+ 	/*
+ 	 * Estimate the number of IM buckets that will fit in
+ 	 * the memory allowed for IM buckets.
+ 	 *
+ 	 * hashtable->imBucket will have up to 8 times as many HashJoinIMBucket
+ 	 * pointers as the number of MCV hashvalues. A uint16 index in
+ 	 * hashtable->imBucketFreezeOrder will be created for each IM bucket. One
+ 	 * actual HashJoinIMBucket struct will be created for each
+ 	 * unique MCV hashvalue so up to one struct per MCV.
+ 	 *
+ 	 * It is also estimated that each IM bucket will have a single build
+ 	 * tuple stored in it after partitioning the build relation input.  This
+ 	 * estimate could be high if tuples are filtered out before this join but
+ 	 * in that case the extra memory is used by the regular hashjoin batch.
+ 	 * This estimate could be low if it is a many to many join but in that
+ 	 * case IM buckets will be frozen to free up memory as needed
+ 	 * during the inner relation partitioning phase.
+ 	 */
+ 	mcvsToUse = hashtable->spaceAllowedIM / (
+ 		/* size of a hash tuple */
+ 		HJTUPLE_OVERHEAD + MAXALIGN(sizeof(MinimalTupleData))
+ 			+ MAXALIGN(tupwidth)
+ 		/* max size of hashtable pointers per MCV */
+ 		+ (8 * sizeof(HashJoinIMBucket*))
+ 		+ sizeof(uint16) /* size of imBucketFreezeOrder entry */
+ 		+ IM_BUCKET_OVERHEAD /* size of IM bucket struct */
+ 		);
+ 	if (mcvsToUse == 0)
+ 		return;	/* No point in considering this any further. */
+ 
+ 	/*
+ 	 * Determine the relation id and attribute id of the single join
+ 	 * attribute of the probe relation.
+ 	 */
+ 	clause = (FuncExprState *) lfirst(list_head(hjstate->hashclauses));
+ 	argstate = (ExprState *) lfirst(list_head(clause->args));
+ 
+ 	/*
+ 	 * Do not try to exploit stats if the join attribute is an expression
+ 	 * instead of just a simple attribute.
+ 	 */		
+ 	if (argstate->expr->type != T_Var)
+ 		return;
+ 
+ 	variable = (Var *) argstate->expr;
+ 	relid = getrelid(variable->varnoold, estate->es_range_table);
+ 	atttype = variable->vartype;
+ 
+ 	statsTuple = SearchSysCache(STATRELATT, ObjectIdGetDatum(relid),
+ 		Int16GetDatum(variable->varoattno), 0, 0);
+ 	if (!HeapTupleIsValid(statsTuple))
+ 		return;
+ 
+ 	/* Look for MCV statistics for the attribute. */
+ 	if (get_attstatsslot(statsTuple, atttype, variable->vartypmod,
+ 		STATISTIC_KIND_MCV, InvalidOid, &values, &nvalues,
+ 		&numbers, &nnumbers))
+ 	{
+ 		FmgrInfo   *hashfunctions;
+ 		int nbuckets = 2;
+ 		double frac = 0;
+ 		
+ 		/*
+ 		 * IM buckets (imBucket) is an open addressing hashtable with a 
+ 		 * power of 2 size that is greater than the number of MCV values.
+ 		 */
+ 		if (mcvsToUse > nvalues)
+ 			mcvsToUse = nvalues;
+ 			
+ 		for (i = 0; i < mcvsToUse; i++)
+ 			frac += numbers[i];
+ 
+ 		ereport(NOTICE, (errmsg("Values: %d Skew: %4.2f  Est. tuples: %4.2f Batches: %d  Est. Save: %4.2f",
+                         nvalues, frac, outerNode->plan->plan_rows, hashtable->nbatch, 
+ 			(frac*(1-1.0/hashtable->nbatch)*outerNode->plan->plan_rows))));
+ 		
+ 		if (frac < IM_MIN_BENEFIT_PERCENT)
+ 		{
+ 			free_attstatsslot(atttype, values, nvalues, numbers, nnumbers);
+ 			ReleaseSysCache(statsTuple);
+ 			return;
+ 		}
+ 		
+ 		while (nbuckets <= mcvsToUse)
+ 			nbuckets <<= 1;
+ 		/* use two more bits just to help avoid collisions */
+ 		nbuckets <<= 2;
+ 		hashtable->nIMBuckets = nbuckets;
+ 		hashtable->enableSkewOptimization = true;
+ 
+ 		/*
+ 		 * Allocate the bucket memory in the hashtable's batch context
+ 		 * because it is only relevant and necessary during the first batch
+ 		 * and will be nicely removed once the first batch is done.
+ 		 */
+ 		hashtable->imBucket = 
+ 			MemoryContextAllocZero(hashtable->batchCxt,
+ 				nbuckets * sizeof(HashJoinIMBucket*));
+ 		hashtable->imBucketFreezeOrder = 
+ 			MemoryContextAllocZero(hashtable->batchCxt,
+ 				mcvsToUse * sizeof(uint16));
+ 		/* Count the overhead of the IM pointers immediately. */
+ 		hashtable->spaceUsed += nbuckets * sizeof(HashJoinIMBucket*)
+ 			+ mcvsToUse * sizeof(uint16);
+ 		hashtable->spaceUsedIM +=  nbuckets * sizeof(HashJoinIMBucket*)
+ 			+ mcvsToUse * sizeof(uint16);
+ 
+ 		/*
+ 		 * Grab the hash functions as we will be generating the hashvalues
+ 		 * in this section.
+ 		 */
+ 		hashfunctions = hashtable->outer_hashfunctions;
+ 
+ 		/* Create the buckets */
+ 		for (i = 0; i < mcvsToUse; i++)
+ 		{
+ 			uint32 hashvalue = DatumGetUInt32(
+ 				FunctionCall1(&hashfunctions[0], values[i]));
+ 			int bucket = hashvalue & (nbuckets - 1);
+ 
+ 			/*
+ 			 * While we have not hit a hole in the hashtable and have not hit
+ 			 * a bucket with the same hashvalue we have collided in the
+ 			 * hashtable so try the next bucket location (remember it is an
+ 			 * open addressing hashtable).
+ 			 */
+ 			while (hashtable->imBucket[bucket] != NULL
+ 				&& hashtable->imBucket[bucket]->hashvalue != hashvalue)
+ 				bucket = (bucket + 1) & (nbuckets - 1);
+ 
+ 			/*
+ 			 * Leave bucket alone if it has the same hashvalue as current
+ 			 * MCV. We only want one bucket per hashvalue. Even if two MCV
+ 			 * values hash to the same bucket we are fine.
+ 			 */
+ 			if (hashtable->imBucket[bucket] == NULL)
+ 			{
+ 				/*
+ 				 * Allocate the actual bucket structure in the hashtable's batch
+ 				 * context because it is only relevant and necessary during
+ 				 * the first batch and will be nicely removed once the first
+ 				 * batch is done.
+ 				 */
+ 				hashtable->imBucket[bucket]
+ 					= MemoryContextAllocZero(hashtable->batchCxt,
+ 						sizeof(HashJoinIMBucket));
+ 				hashtable->imBucket[bucket]->hashvalue = hashvalue;
+ 				hashtable->imBucketFreezeOrder[hashtable->nUsedIMBuckets]
+ 					= bucket;
+ 				hashtable->nUsedIMBuckets++;
+ 				/* Count the overhead of the IM bucket struct */
+ 				hashtable->spaceUsed += IM_BUCKET_OVERHEAD;
+ 				hashtable->spaceUsedIM += IM_BUCKET_OVERHEAD;
+ 			}
+ 		}
+ 
+ 		free_attstatsslot(atttype, values, nvalues, numbers, nnumbers);
+ 	}
+ 
+ 	ReleaseSysCache(statsTuple);
+ }
  
  /* ----------------------------------------------------------------
   *		ExecHashJoin
***************
*** 147,152 ****
--- 354,364 ----
  										node->hj_HashOperators);
  		node->hj_HashTable = hashtable;
  
+ 		/* Use skew optimization only when there is more than one batch. */
+ 		if (hashtable->nbatch > 1 && enable_hashjoin_usestatmcvs)
+ 			ExecHashJoinDetectSkew(estate, node,
+ 				(outerPlan((Hash *) hashNode->ps.plan))->plan_width);
+ 
  		/*
  		 * execute the Hash node, to build the hash table
  		 */
***************
*** 205,216 ****
  			ExecHashGetBucketAndBatch(hashtable, hashvalue,
  									  &node->hj_CurBucketNo, &batchno);
  			node->hj_CurTuple = NULL;
  
  			/*
  			 * Now we've got an outer tuple and the corresponding hash bucket,
! 			 * but this tuple may not belong to the current batch.
  			 */
! 			if (batchno != hashtable->curbatch)
  			{
  				/*
  				 * Need to postpone this outer tuple to a later batch. Save it
--- 417,440 ----
  			ExecHashGetBucketAndBatch(hashtable, hashvalue,
  									  &node->hj_CurBucketNo, &batchno);
  			node->hj_CurTuple = NULL;
+ 			
+ 			/* Does the outer tuple match an IM bucket? */
+ 			node->hj_OuterTupleIMBucketNo = 
+ 				ExecHashGetIMBucket(hashtable, hashvalue);
+ 			if (node->hj_OuterTupleIMBucketNo != IM_INVALID_BUCKET)
+ 				hashtable->nOuterIMTup++;
+ 			else if (batchno == 0)
+ 				hashtable->nOuterBatchZeroTup++;
+ 			if (batchno == 0)
+ 				hashtable->nOuterPotentialBatchZeroTup++;
  
  			/*
  			 * Now we've got an outer tuple and the corresponding hash bucket,
! 			 * but it might not belong to the current batch, or it might need
! 			 * to go into an in-memory bucket.
  			 */
! 			if (node->hj_OuterTupleIMBucketNo == IM_INVALID_BUCKET
! 				&& batchno != hashtable->curbatch)
  			{
  				/*
  				 * Need to postpone this outer tuple to a later batch. Save it
***************
*** 281,286 ****
--- 505,517 ----
  					{
  						node->js.ps.ps_TupFromTlist =
  							(isDone == ExprMultipleResult);
+ 						hashtable->nOutputTup++;
+ 						if (node->hj_OuterTupleIMBucketNo != IM_INVALID_BUCKET)
+ 							hashtable->nOutputIMTup++;
+ 						else if (batchno == 0)
+ 							hashtable->nOutputBatchZeroTup++;
+ 						if (batchno == 0)
+ 							hashtable->nOutputPotentialBatchZeroTup++;
  						return result;
  					}
  				}
***************
*** 582,587 ****
--- 813,819 ----
  			{
  				/* remember outer relation is not empty for possible rescan */
  				hjstate->hj_OuterNotEmpty = true;
+ 				hashtable->nOuterTup++;
  
  				return slot;
  			}
***************
*** 641,647 ****
  	nbatch = hashtable->nbatch;
  	curbatch = hashtable->curbatch;
  
! 	if (curbatch > 0)
  	{
  		/*
  		 * We no longer need the previous outer batch file; close it right
--- 873,899 ----
  	nbatch = hashtable->nbatch;
  	curbatch = hashtable->curbatch;
  
! 	/* if we just finished the first batch */
! 	if (curbatch == 0)
! 	{
! 		/*
! 		 * Reset some of the skew optimization state variables. IM buckets are
! 		 * no longer being used as of this point because they are only
! 		 * necessary while joining the first batch (before the cleanup phase).
! 		 *
! 		 * Especially need to make sure ExecHashGetIMBucket returns
! 		 * IM_INVALID_BUCKET quickly for all subsequent calls.
! 		 *
! 		 * IM buckets are only taking up memory if this is a multi-batch join
! 		 * and in that case ExecHashTableReset is about to be called which
! 		 * will free all memory currently used by IM buckets and tuples when
! 		 * it deletes hashtable->batchCxt.  If this is a single batch join
! 		 * then imBucket and imBucketFreezeOrder are already NULL and empty.
! 		 */
! 		hashtable->enableSkewOptimization = false;
! 		hashtable->spaceUsedIM = 0;
! 	}
! 	else if (curbatch > 0)
  	{
  		/*
  		 * We no longer need the previous outer batch file; close it right
Index: src/backend/optimizer/path/costsize.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/optimizer/path/costsize.c,v
retrieving revision 1.203
diff -c -r1.203 costsize.c
*** src/backend/optimizer/path/costsize.c	1 Jan 2009 17:23:43 -0000	1.203
--- src/backend/optimizer/path/costsize.c	13 Jan 2009 08:02:42 -0000
***************
*** 108,113 ****
--- 108,114 ----
  bool		enable_nestloop = true;
  bool		enable_mergejoin = true;
  bool		enable_hashjoin = true;
+ bool		enable_hashjoin_usestatmcvs = true;
  
  typedef struct
  {
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.493
diff -c -r1.493 guc.c
*** src/backend/utils/misc/guc.c	12 Jan 2009 05:10:44 -0000	1.493
--- src/backend/utils/misc/guc.c	13 Jan 2009 08:03:33 -0000
***************
*** 656,661 ****
--- 656,669 ----
  		true, NULL, NULL
  	},
  	{
+ 		{"enable_hashjoin_usestatmcvs", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("Enables the hash join's use of the MCVs stored in pg_statistic."),
+ 			NULL
+ 		},
+ 		&enable_hashjoin_usestatmcvs,
+ 		true, NULL, NULL
+ 	},
+ 	{
  		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
  			gettext_noop("Enables genetic query optimization."),
  			gettext_noop("This algorithm attempts to do planning without "
Index: src/include/executor/hashjoin.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/hashjoin.h,v
retrieving revision 1.49
diff -c -r1.49 hashjoin.h
*** src/include/executor/hashjoin.h	1 Jan 2009 17:23:59 -0000	1.49
--- src/include/executor/hashjoin.h	14 Jan 2009 06:42:53 -0000
***************
*** 72,77 ****
--- 72,97 ----
  #define HJTUPLE_MINTUPLE(hjtup)  \
  	((MinimalTuple) ((char *) (hjtup) + HJTUPLE_OVERHEAD))
  
+ /*
+  * Stores a hashvalue and linked list of tuples that share that hashvalue.
+  *
+  * When processing MCVs to detect skew in the probe relation of a hash join
+  * the hashvalue is generated and stored before any tuples have been read 
+  * (see ExecHashJoinDetectSkew).
+  *
+  * Build tuples that hash to the same hashvalue are placed in the bucket while
+  * reading the build relation.
+  */
+ typedef struct HashJoinIMBucket
+ {
+ 	uint32 hashvalue;
+ 	HashJoinTuple tuples;
+ } HashJoinIMBucket;
+ 
+ #define IM_INVALID_BUCKET -1
+ #define IM_WORK_MEM_PERCENT 2
+ #define IM_MIN_BENEFIT_PERCENT .01
+ #define IM_BUCKET_OVERHEAD MAXALIGN(sizeof(HashJoinIMBucket))
  
  typedef struct HashJoinTableData
  {
***************
*** 113,121 ****
--- 133,201 ----
  
  	Size		spaceUsed;		/* memory space currently used by tuples */
  	Size		spaceAllowed;	/* upper limit for space used */
+ 	/* memory space currently used by IM buckets and tuples */
+ 	Size		spaceUsedIM;
+ 	/* upper limit for space used by IM buckets and tuples */
+ 	Size		spaceAllowedIM;
  
  	MemoryContext hashCxt;		/* context for whole-hash-join storage */
  	MemoryContext batchCxt;		/* context for this-batch-only storage */
+ 	
+ 	/* will the join optimize memory usage when probe relation is skewed */
+ 	bool enableSkewOptimization;
+ 	HashJoinIMBucket **imBucket; /* hashtable of IM buckets */
+ 	/*
+ 	 * array of imBucket indexes to the created IM buckets sorted
+ 	 * in the opposite order that they would be frozen to disk
+ 	 */
+ 	uint16 *imBucketFreezeOrder;
+ 	int nIMBuckets; /* # of buckets in the IM buckets hashtable */
+ 	/*
+ 	 * # of used buckets in the IM buckets hashtable and length of
+ 	 * imBucketFreezeOrder array
+ 	 */
+ 	int nUsedIMBuckets;
+ 	/* # of IM buckets that have already been frozen to disk */
+ 	int nIMBucketsFrozen;
+ 
+ 	/* the total # of inner tuples received by join */
+ 	uint32 nInnerTup;
+ 	/* total # of outer tuples received by join */
+ 	uint32 nOuterTup;
+ 	/* # inner tuples in the IM buckets */
+ 	uint32 nInnerIMTup;
+ 	/* # outer tuples that matched with the IM buckets */
+ 	uint32 nOuterIMTup;
+ 	/* total # output tuples produced by join */
+ 	uint32 nOutputTup;
+ 	/* # output tuples that came from matches with IM bucket inner tuples */
+ 	uint32 nOutputIMTup;
+ 	/*
+ 	 * # of inner IM tuples that were frozen back to the main hashtable when
+ 	 * an IM bucket was frozen
+ 	 */
+ 	uint32 nInnerIMTupFrozen;
+ 	/* # outer tuples that matched with batch 0 */
+ 	uint32 nOuterBatchZeroTup;
+ 	/*
+ 	 * # outer tuples that would have fallen in batch 0 if IM buckets were
+ 	 * not in use at all
+ 	 */
+ 	uint32 nOuterPotentialBatchZeroTup;
+ 	/* # output tuples that came from matches with batch 0 inner tuples */
+ 	uint32 nOutputBatchZeroTup;
+ 	/*
+ 	 * # output tuples that would have come from matches with batch 0 if IM
+ 	 * buckets were not in use at all
+ 	 */
+ 	uint32 nOutputPotentialBatchZeroTup;
+ 	/* # inner tuples that fell in batch 0 */
+ 	uint32 nInnerBatchZeroTup;
+ 	/*
+ 	 * # inner tuples that would have fallen in batch 0 if IM buckets were
+ 	 * not in use at all
+ 	 */
+ 	uint32 nInnerPotentialBatchZeroTup;
  } HashJoinTableData;
  
  #endif   /* HASHJOIN_H */
Index: src/include/executor/nodeHash.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/nodeHash.h,v
retrieving revision 1.46
diff -c -r1.46 nodeHash.h
*** src/include/executor/nodeHash.h	1 Jan 2009 17:23:59 -0000	1.46
--- src/include/executor/nodeHash.h	6 Jan 2009 23:29:18 -0000
***************
*** 45,48 ****
--- 45,50 ----
  						int *numbuckets,
  						int *numbatches);
  
+ extern int ExecHashGetIMBucket(HashJoinTable hashtable, uint32 hashvalue);
+ 
  #endif   /* NODEHASH_H */
Index: src/include/nodes/execnodes.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/nodes/execnodes.h,v
retrieving revision 1.201
diff -c -r1.201 execnodes.h
*** src/include/nodes/execnodes.h	12 Jan 2009 05:10:45 -0000	1.201
--- src/include/nodes/execnodes.h	12 Jan 2009 20:29:58 -0000
***************
*** 1389,1394 ****
--- 1389,1395 ----
   *		hj_NeedNewOuter			true if need new outer tuple on next call
   *		hj_MatchedOuter			true if found a join match for current outer
   *		hj_OuterNotEmpty		true if outer relation known not empty
+  *		hj_OuterTupleIMBucketNo	IM bucket# for the current outer tuple
   * ----------------
   */
  
***************
*** 1414,1419 ****
--- 1415,1421 ----
  	bool		hj_NeedNewOuter;
  	bool		hj_MatchedOuter;
  	bool		hj_OuterNotEmpty;
+ 	int			hj_OuterTupleIMBucketNo;
  } HashJoinState;
  
  
Index: src/include/nodes/primnodes.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/nodes/primnodes.h,v
retrieving revision 1.145
diff -c -r1.145 primnodes.h
*** src/include/nodes/primnodes.h	1 Jan 2009 17:24:00 -0000	1.145
--- src/include/nodes/primnodes.h	7 Jan 2009 05:48:16 -0000
***************
*** 121,128 ****
   * subplans; for example, in a join node varno becomes INNER or OUTER and
   * varattno becomes the index of the proper element of that subplan's target
   * list.  But varnoold/varoattno continue to hold the original values.
!  * The code doesn't really need varnoold/varoattno, but they are very useful
!  * for debugging and interpreting completed plans, so we keep them around.
   */
  #define    INNER		65000
  #define    OUTER		65001
--- 121,132 ----
   * subplans; for example, in a join node varno becomes INNER or OUTER and
   * varattno becomes the index of the proper element of that subplan's target
   * list.  But varnoold/varoattno continue to hold the original values.
!  *
!  * For the most part, the code doesn't really need varnoold/varoattno, but
!  * they are very useful for debugging and interpreting completed plans, so we
!  * keep them around.  As of PostgreSQL 8.4, these values are also used by
!  * ExecHashJoinDetectSkew to fetch MCV statistics when performing multi-batch
!  * hash joins.
   */
  #define    INNER		65000
  #define    OUTER		65001
***************
*** 142,148 ****
  	Index		varlevelsup;	/* for subquery variables referencing outer
  								 * relations; 0 in a normal var, >0 means N
  								 * levels up */
! 	Index		varnoold;		/* original value of varno, for debugging */
  	AttrNumber	varoattno;		/* original value of varattno */
  	int			location;		/* token location, or -1 if unknown */
  } Var;
--- 146,152 ----
  	Index		varlevelsup;	/* for subquery variables referencing outer
  								 * relations; 0 in a normal var, >0 means N
  								 * levels up */
! 	Index		varnoold;		/* original value of varno */
  	AttrNumber	varoattno;		/* original value of varattno */
  	int			location;		/* token location, or -1 if unknown */
  } Var;
Index: src/include/optimizer/cost.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/optimizer/cost.h,v
retrieving revision 1.96
diff -c -r1.96 cost.h
*** src/include/optimizer/cost.h	7 Jan 2009 22:40:49 -0000	1.96
--- src/include/optimizer/cost.h	13 Jan 2009 08:02:06 -0000
***************
*** 59,64 ****
--- 59,65 ----
  extern bool enable_nestloop;
  extern bool enable_mergejoin;
  extern bool enable_hashjoin;
+ extern bool enable_hashjoin_usestatmcvs;
  extern int	constraint_exclusion;
  
  extern double clamp_row_est(double nrows);
tpch10g1zResults.txttext/plain; name=tpch10g1zResults.txtDownload
tpch10g2zResults.txttext/plain; name=tpch10g2zResults.txtDownload
#44Robert Haas
robertmhaas@gmail.com
In reply to: Joshua Tolley (#42)
1 attachment(s)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Wed, Jan 7, 2009 at 9:14 AM, Joshua Tolley <eggyknap@gmail.com> wrote:

On Tue, Jan 06, 2009 at 11:49:57PM -0500, Robert Haas wrote:

Josh / eggyknap -

Can you rerun your performance tests with this version of the patch?

...Robert

Will do, as soon as I can.

Josh,

Have you been able to do anything further with this?

I'm attaching a rebased version of this patch with a few further
whitespace cleanups.

...Robert

Attachments:

histojoin_v5_rh2.patchtext/x-patch; charset=US-ASCII; name=histojoin_v5_rh2.patchDownload
*** a/src/backend/executor/nodeHash.c
--- b/src/backend/executor/nodeHash.c
***************
*** 53,58 **** ExecHash(HashState *node)
--- 53,222 ----
  	return NULL;
  }
  
+ /*
+  * ----------------------------------------------------------------
+  *		ExecHashGetIMBucket
+  *
+  *  	Returns the index of the in-memory bucket for this
+  *		hashvalue, or IM_INVALID_BUCKET if the hashvalue is not
+  *		associated with any unfrozen bucket (or if skew
+  *		optimization is not being used).
+  *
+  *		It is possible for a tuple whose join attribute value is
+  *		not a MCV to hash to an in-memory bucket due to the limited
+  * 		number of hash values but it is unlikely and everything
+  *		continues to work even if it does happen. We would
+  *		accidentally cache some less optimal tuples in memory
+  *		but the join result would still be accurate.
+  *
+  *		hashtable->imBucket is an open addressing hashtable of
+  *		in-memory buckets (HashJoinIMBucket).
+  * ----------------------------------------------------------------
+  */
+ int
+ ExecHashGetIMBucket(HashJoinTable hashtable, uint32 hashvalue)
+ {
+ 	int bucket;
+ 
+ 	if (!hashtable->enableSkewOptimization)
+ 		return IM_INVALID_BUCKET;
+ 	
+ 	/* Modulo the hashvalue (using bitmask) to find the IM bucket. */
+ 	bucket = hashvalue & (hashtable->nIMBuckets - 1);
+ 
+ 	/*
+ 	 * While we have not hit a hole in the hashtable and have not hit the
+ 	 * actual bucket we have collided in the hashtable so try the next
+ 	 * bucket location.
+ 	 */
+ 	while (hashtable->imBucket[bucket] != NULL
+ 		&& hashtable->imBucket[bucket]->hashvalue != hashvalue)
+ 		bucket = (bucket + 1) & (hashtable->nIMBuckets - 1);
+ 
+ 	/*
+ 	 * If the bucket exists and has been correctly determined return
+ 	 * the bucket index.
+ 	 */
+ 	if (hashtable->imBucket[bucket] != NULL
+ 		&& hashtable->imBucket[bucket]->hashvalue == hashvalue)
+ 		return bucket;
+ 
+ 	/*
+ 	 * Must have run into an empty location or a frozen bucket which means the
+ 	 * tuple with this hashvalue is not to be handled as if it matches with an
+ 	 * in-memory bucket.
+ 	 */
+ 	return IM_INVALID_BUCKET;
+ }
+ 
+ /*
+  * ----------------------------------------------------------------
+  *		ExecHashFreezeNextIMBucket
+  *
+  *		Freeze the tuples of the next in-memory bucket by pushing
+  *		them into the main hashtable.  Buckets are frozen in order
+  *		so that the best tuples are kept in memory the longest.
+  * ----------------------------------------------------------------
+  */
+ static bool
+ ExecHashFreezeNextIMBucket(HashJoinTable hashtable)
+ {
+ 	int bucketToFreeze;
+ 	int bucketno;
+ 	int batchno;
+ 	uint32 hashvalue;
+ 	HashJoinTuple hashTuple;
+ 	HashJoinTuple nextHashTuple;
+ 	HashJoinIMBucket *bucket;
+ 	MinimalTuple mintuple;
+ 
+ 	/* Calculate the imBucket index of the bucket to freeze. */
+ 	bucketToFreeze = hashtable->imBucketFreezeOrder
+ 		[hashtable->nUsedIMBuckets - 1 - hashtable->nIMBucketsFrozen];
+ 
+ 	/* Grab a pointer to the actual IM bucket. */
+ 	bucket = hashtable->imBucket[bucketToFreeze];
+ 	hashvalue = bucket->hashvalue;
+ 
+ 	/*
+ 	 * Grab a pointer to the first tuple in the soon to be frozen IM bucket.
+ 	 */
+ 	hashTuple = bucket->tuples;
+ 
+ 	/*
+ 	 * Calculate which bucket and batch the tuples belong to in the main
+ 	 * non-IM hashtable.
+ 	 */
+ 	ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno, &batchno);
+ 
+ 	/* until we have read all tuples from this bucket */
+ 	while (hashTuple != NULL)
+ 	{
+ 		/*
+ 		 * Some of this code is very similar to that of ExecHashTableInsert.
+ 		 * We do not call ExecHashTableInsert directly as
+ 		 * ExecHashTableInsert expects a TupleTableSlot and we already have
+ 		 * HashJoinTuples.
+ 		 */
+ 		mintuple = HJTUPLE_MINTUPLE(hashTuple);
+ 
+ 		/* Decide whether to put the tuple in the hash table or a temp file. */
+ 		if (batchno == hashtable->curbatch)
+ 		{
+ 			/* Put the tuple in hash table. */
+ 			nextHashTuple = hashTuple->next;
+ 			hashTuple->next = hashtable->buckets[bucketno];
+ 			hashtable->buckets[bucketno] = hashTuple;
+ 			hashTuple = nextHashTuple;
+ 			hashtable->spaceUsedIM -= HJTUPLE_OVERHEAD + mintuple->t_len;
+ 		}
+ 		else
+ 		{
+ 			/* Put the tuples into a temp file for later batches. */
+ 			Assert(batchno > hashtable->curbatch);
+ 			ExecHashJoinSaveTuple(mintuple, hashvalue,
+ 								  &hashtable->innerBatchFile[batchno]);
+ 			/*
+ 			 * Some memory has been freed up. This must be done before we
+ 			 * pfree the hashTuple or we lose access to the tuple size.
+ 			 */
+ 			hashtable->spaceUsed -= HJTUPLE_OVERHEAD + mintuple->t_len;
+ 			hashtable->spaceUsedIM -= HJTUPLE_OVERHEAD + mintuple->t_len;
+ 			nextHashTuple = hashTuple->next;
+ 			pfree(hashTuple);
+ 			hashTuple = nextHashTuple;
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Free the memory the bucket struct was using as it is not necessary
+ 	 * any more.  All code treats a frozen in-memory bucket the same as one
+ 	 * that did not exist; by setting the pointer to null the rest of the code
+ 	 * will function as if we had not created this bucket at all.
+ 	 */
+ 	pfree(bucket);
+ 	hashtable->imBucket[bucketToFreeze] = NULL;
+ 	hashtable->spaceUsed -= IM_BUCKET_OVERHEAD;
+ 	hashtable->spaceUsedIM -= IM_BUCKET_OVERHEAD;
+ 	hashtable->nIMBucketsFrozen++;
+ 
+ 	/*
+ 	 * All buckets have been frozen and deleted from memory so turn off
+ 	 * skew aware partitioning and remove the structs from memory as they are
+ 	 * just wasting space from now on.
+ 	 */
+ 	if (hashtable->nUsedIMBuckets == hashtable->nIMBucketsFrozen)
+ 	{
+ 		hashtable->enableSkewOptimization = false;
+ 		pfree(hashtable->imBucket);
+ 		pfree(hashtable->imBucketFreezeOrder);
+ 		hashtable->spaceUsed -= hashtable->spaceUsedIM;
+ 		hashtable->spaceUsedIM = 0;
+ 	}
+ 
+ 	return true;
+ }
+ 
  /* ----------------------------------------------------------------
   *		MultiExecHash
   *
***************
*** 69,74 **** MultiExecHash(HashState *node)
--- 233,240 ----
  	TupleTableSlot *slot;
  	ExprContext *econtext;
  	uint32		hashvalue;
+ 	MinimalTuple mintuple;
+ 	int bucketNumber;
  
  	/* must provide our own instrumentation support */
  	if (node->ps.instrument)
***************
*** 99,106 **** MultiExecHash(HashState *node)
  		if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
  								 &hashvalue))
  		{
! 			ExecHashTableInsert(hashtable, slot, hashvalue);
! 			hashtable->totalTuples += 1;
  		}
  	}
  
--- 265,306 ----
  		if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
  								 &hashvalue))
  		{
! 			bucketNumber = ExecHashGetIMBucket(hashtable, hashvalue);
! 
! 			/* handle tuples not destined for an in-memory bucket normally */
! 			if (bucketNumber == IM_INVALID_BUCKET)
! 				ExecHashTableInsert(hashtable, slot, hashvalue);
! 			else
! 			{
! 				HashJoinTuple hashTuple;
! 				int			hashTupleSize;
! 				
! 				/* get the HashJoinTuple */
! 				mintuple = ExecFetchSlotMinimalTuple(slot);
! 				hashTupleSize = HJTUPLE_OVERHEAD + mintuple->t_len;
! 				hashTuple = (HashJoinTuple)
! 					MemoryContextAlloc(hashtable->batchCxt, hashTupleSize);
! 				hashTuple->hashvalue = hashvalue;
! 				memcpy(HJTUPLE_MINTUPLE(hashTuple), mintuple, mintuple->t_len);
! 
! 				/* Push the HashJoinTuple onto the front of the IM bucket. */
! 				hashTuple->next = hashtable->imBucket[bucketNumber]->tuples;
! 				hashtable->imBucket[bucketNumber]->tuples = hashTuple;
! 
! 				/*
! 				 * More memory is now in use so make sure we are not over
! 				 * spaceAllowedIM.
! 				 */
! 				hashtable->spaceUsed += hashTupleSize;
! 				hashtable->spaceUsedIM += hashTupleSize;
! 				while (hashtable->spaceUsedIM > hashtable->spaceAllowedIM
! 					&& ExecHashFreezeNextIMBucket(hashtable))
! 					;
! 				/* Guarantee we are not over the spaceAllowed. */
! 				if (hashtable->spaceUsed > hashtable->spaceAllowed)
! 					ExecHashIncreaseNumBatches(hashtable);
! 			}
! 			hashtable->totalTuples++;
  		}
  	}
  
***************
*** 269,274 **** ExecHashTableCreate(Hash *node, List *hashOperators)
--- 469,483 ----
  	hashtable->outerBatchFile = NULL;
  	hashtable->spaceUsed = 0;
  	hashtable->spaceAllowed = work_mem * 1024L;
+ 	/* Initialize skew optimization related hashtable variables. */
+ 	hashtable->spaceUsedIM = 0;
+ 	hashtable->spaceAllowedIM
+ 		= hashtable->spaceAllowed * IM_WORK_MEM_PERCENT / 100;
+ 	hashtable->enableSkewOptimization = false;
+ 	hashtable->nUsedIMBuckets = 0;
+ 	hashtable->nIMBuckets = 0;
+ 	hashtable->imBucket = NULL;
+ 	hashtable->nIMBucketsFrozen = 0;
  
  	/*
  	 * Get info about the hash functions to be used for each hash key. Also
***************
*** 815,825 **** ExecScanHashBucket(HashJoinState *hjstate,
  	/*
  	 * hj_CurTuple is NULL to start scanning a new bucket, or the address of
  	 * the last tuple returned from the current bucket.
  	 */
! 	if (hashTuple == NULL)
! 		hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];
! 	else
  		hashTuple = hashTuple->next;
  
  	while (hashTuple != NULL)
  	{
--- 1024,1040 ----
  	/*
  	 * hj_CurTuple is NULL to start scanning a new bucket, or the address of
  	 * the last tuple returned from the current bucket.
+ 	 *
+ 	 * If the tuple hashed to an IM bucket then scan the IM bucket
+ 	 * otherwise scan the standard hashtable bucket.
  	 */
! 	if (hashTuple != NULL)
  		hashTuple = hashTuple->next;
+ 	else if (hjstate->hj_OuterTupleIMBucketNo != IM_INVALID_BUCKET)
+ 		hashTuple = hashtable->imBucket[hjstate->hj_OuterTupleIMBucketNo]
+ 			->tuples;
+ 	else
+ 		hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];
  
  	while (hashTuple != NULL)
  	{
*** a/src/backend/executor/nodeHashjoin.c
--- b/src/backend/executor/nodeHashjoin.c
***************
*** 20,25 ****
--- 20,29 ----
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
  #include "utils/memutils.h"
+ #include "utils/syscache.h"
+ #include "utils/lsyscache.h"
+ #include "parser/parsetree.h"
+ #include "catalog/pg_statistic.h"
  
  
  /* Returns true for JOIN_LEFT and JOIN_ANTI jointypes */
***************
*** 34,39 **** static TupleTableSlot *ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
--- 38,227 ----
  						  TupleTableSlot *tupleSlot);
  static int	ExecHashJoinNewBatch(HashJoinState *hjstate);
  
+ /*
+  * ----------------------------------------------------------------
+  *		ExecHashJoinDetectSkew
+  *
+  *		If MCV statistics can be found for the join attribute of
+  *		this hashjoin then create a hash table of buckets. Each
+  *		bucket will correspond to an MCV hashvalue and will be
+  *		filled with inner relation tuples whose join attribute
+  *		hashes to the same value as that MCV.  If a join
+  *		attribute value is a MCV for the join attribute in the
+  *		outer (probe) relation, tuples with this value in the
+  *		inner (build) relation are more likely to join with outer
+  *		relation tuples and a benefit can be gained by keeping
+  *		them in memory while joining the first batch of tuples.
+  * ----------------------------------------------------------------
+  */
+ static void
+ ExecHashJoinDetectSkew(EState *estate, HashJoinState *hjstate, int tupwidth)
+ {
+ 	HeapTupleData	*statsTuple;
+ 	FuncExprState	*clause;
+ 	ExprState		*argstate;
+ 	Var				*variable;
+ 	HashJoinTable	hashtable;
+ 	Datum			*values;
+ 	int				nvalues;
+ 	float4			*numbers;
+ 	int				nnumbers;
+ 	Oid				relid;
+ 	Oid				atttype;
+ 	int				i;
+ 	int				mcvsToUse;
+ 
+ 	/* Only use statistics if there is a single join attribute. */
+ 	if (hjstate->hashclauses->length != 1)
+ 		return; /* Histojoin is not defined for more than one join key */
+ 	
+ 	hashtable = hjstate->hj_HashTable;
+ 	
+ 	/*
+ 	 * Estimate the number of IM buckets that will fit in
+ 	 * the memory allowed for IM buckets.
+ 	 *
+ 	 * hashtable->imBucket will have up to 8 times as many HashJoinIMBucket
+ 	 * pointers as the number of MCV hashvalues. A uint16 index in
+ 	 * hashtable->imBucketFreezeOrder will be created for each IM bucket. One
+ 	 * actual HashJoinIMBucket struct will be created for each
+ 	 * unique MCV hashvalue so up to one struct per MCV.
+ 	 *
+ 	 * It is also estimated that each IM bucket will have a single build
+ 	 * tuple stored in it after partitioning the build relation input.  This
+ 	 * estimate could be high if tuples are filtered out before this join but
+ 	 * in that case the extra memory is used by the regular hashjoin batch.
+ 	 * This estimate could be low if it is a many to many join but in that
+ 	 * case IM buckets will be frozen to free up memory as needed
+ 	 * during the inner relation partitioning phase.
+ 	 */
+ 	mcvsToUse = hashtable->spaceAllowedIM / (
+ 		/* size of a hash tuple */
+ 		HJTUPLE_OVERHEAD + MAXALIGN(sizeof(MinimalTupleData))
+ 			+ MAXALIGN(tupwidth)
+ 		/* max size of hashtable pointers per MCV */
+ 		+ (8 * sizeof(HashJoinIMBucket*))
+ 		+ sizeof(uint16) /* size of imBucketFreezeOrder entry */
+ 		+ IM_BUCKET_OVERHEAD /* size of IM bucket struct */
+ 		);
+ 	if (mcvsToUse == 0)
+ 		return;	/* No point in considering this any further. */
+ 
+ 	/*
+ 	 * Determine the relation id and attribute id of the single join
+ 	 * attribute of the probe relation.
+ 	 */
+ 	clause = (FuncExprState *) lfirst(list_head(hjstate->hashclauses));
+ 	argstate = (ExprState *) lfirst(list_head(clause->args));
+ 
+ 	/*
+ 	 * Do not try to exploit stats if the join attribute is an expression
+ 	 * instead of just a simple attribute.
+ 	 */		
+ 	if (argstate->expr->type != T_Var)
+ 		return;
+ 
+ 	variable = (Var *) argstate->expr;
+ 	relid = getrelid(variable->varnoold, estate->es_range_table);
+ 	atttype = variable->vartype;
+ 
+ 	statsTuple = SearchSysCache(STATRELATT, ObjectIdGetDatum(relid),
+ 		Int16GetDatum(variable->varoattno), 0, 0);
+ 	if (!HeapTupleIsValid(statsTuple))
+ 		return;
+ 
+ 	/* Look for MCV statistics for the attribute. */
+ 	if (get_attstatsslot(statsTuple, atttype, variable->vartypmod,
+ 		STATISTIC_KIND_MCV, InvalidOid, &values, &nvalues,
+ 		&numbers, &nnumbers))
+ 	{
+ 		FmgrInfo   *hashfunctions;
+ 		int nbuckets = 2;
+ 
+ 		/*
+ 		 * IM buckets (imBucket) is an open addressing hashtable with a
+ 		 * power of 2 size that is greater than the number of MCV values.
+ 		 */
+ 		if (mcvsToUse > nvalues)
+ 			mcvsToUse = nvalues;
+ 		while (nbuckets <= mcvsToUse)
+ 			nbuckets <<= 1;
+ 		/* use two more bits just to help avoid collisions */
+ 		nbuckets <<= 2;
+ 		hashtable->nIMBuckets = nbuckets;
+ 		hashtable->enableSkewOptimization = true;
+ 
+ 		/*
+ 		 * Allocate the bucket memory in the hashtable's batch context
+ 		 * because it is only relevant and necessary during the first batch
+ 		 * and will be nicely removed once the first batch is done.
+ 		 */
+ 		hashtable->imBucket =
+ 			MemoryContextAllocZero(hashtable->batchCxt,
+ 				nbuckets * sizeof(HashJoinIMBucket*));
+ 		hashtable->imBucketFreezeOrder =
+ 			MemoryContextAllocZero(hashtable->batchCxt,
+ 				mcvsToUse * sizeof(uint16));
+ 		/* Count the overhead of the IM pointers immediately. */
+ 		hashtable->spaceUsed += nbuckets * sizeof(HashJoinIMBucket*)
+ 			+ mcvsToUse * sizeof(uint16);
+ 		hashtable->spaceUsedIM +=  nbuckets * sizeof(HashJoinIMBucket*)
+ 			+ mcvsToUse * sizeof(uint16);
+ 
+ 		/*
+ 		 * Grab the hash functions as we will be generating the hashvalues
+ 		 * in this section.
+ 		 */
+ 		hashfunctions = hashtable->outer_hashfunctions;
+ 
+ 		/* Create the buckets */
+ 		for (i = 0; i < mcvsToUse; i++)
+ 		{
+ 			uint32 hashvalue = DatumGetUInt32(
+ 				FunctionCall1(&hashfunctions[0], values[i]));
+ 			int bucket = hashvalue & (nbuckets - 1);
+ 
+ 			/*
+ 			 * While we have not hit a hole in the hashtable and have not hit
+ 			 * a bucket with the same hashvalue we have collided in the
+ 			 * hashtable so try the next bucket location (remember it is an
+ 			 * open addressing hashtable).
+ 			 */
+ 			while (hashtable->imBucket[bucket] != NULL
+ 				&& hashtable->imBucket[bucket]->hashvalue != hashvalue)
+ 				bucket = (bucket + 1) & (nbuckets - 1);
+ 
+ 			/*
+ 			 * Leave bucket alone if it has the same hashvalue as current
+ 			 * MCV. We only want one bucket per hashvalue. Even if two MCV
+ 			 * values hash to the same bucket we are fine.
+ 			 */
+ 			if (hashtable->imBucket[bucket] == NULL)
+ 			{
+ 				/*
+ 				 * Allocate the actual bucket structure in the hashtable's batch
+ 				 * context because it is only relevant and necessary during
+ 				 * the first batch and will be nicely removed once the first
+ 				 * batch is done.
+ 				 */
+ 				hashtable->imBucket[bucket]
+ 					= MemoryContextAllocZero(hashtable->batchCxt,
+ 						sizeof(HashJoinIMBucket));
+ 				hashtable->imBucket[bucket]->hashvalue = hashvalue;
+ 				hashtable->imBucketFreezeOrder[hashtable->nUsedIMBuckets]
+ 					= bucket;
+ 				hashtable->nUsedIMBuckets++;
+ 				/* Count the overhead of the IM bucket struct */
+ 				hashtable->spaceUsed += IM_BUCKET_OVERHEAD;
+ 				hashtable->spaceUsedIM += IM_BUCKET_OVERHEAD;
+ 			}
+ 		}
+ 
+ 		free_attstatsslot(atttype, values, nvalues, numbers, nnumbers);
+ 	}
+ 
+ 	ReleaseSysCache(statsTuple);
+ }
  
  /* ----------------------------------------------------------------
   *		ExecHashJoin
***************
*** 147,152 **** ExecHashJoin(HashJoinState *node)
--- 335,345 ----
  										node->hj_HashOperators);
  		node->hj_HashTable = hashtable;
  
+ 		/* Use skew optimization only when there is more than one batch. */
+ 		if (hashtable->nbatch > 1)
+ 			ExecHashJoinDetectSkew(estate, node,
+ 				(outerPlan((Hash *) hashNode->ps.plan))->plan_width);
+ 
  		/*
  		 * execute the Hash node, to build the hash table
  		 */
***************
*** 205,216 **** ExecHashJoin(HashJoinState *node)
  			ExecHashGetBucketAndBatch(hashtable, hashvalue,
  									  &node->hj_CurBucketNo, &batchno);
  			node->hj_CurTuple = NULL;
  
  			/*
  			 * Now we've got an outer tuple and the corresponding hash bucket,
! 			 * but this tuple may not belong to the current batch.
  			 */
! 			if (batchno != hashtable->curbatch)
  			{
  				/*
  				 * Need to postpone this outer tuple to a later batch. Save it
--- 398,415 ----
  			ExecHashGetBucketAndBatch(hashtable, hashvalue,
  									  &node->hj_CurBucketNo, &batchno);
  			node->hj_CurTuple = NULL;
+ 			
+ 			/* Does the outer tuple match an IM bucket? */
+ 			node->hj_OuterTupleIMBucketNo =
+ 				ExecHashGetIMBucket(hashtable, hashvalue);
  
  			/*
  			 * Now we've got an outer tuple and the corresponding hash bucket,
! 			 * but it might not belong to the current batch, or it might need
! 			 * to go into an in-memory bucket.
  			 */
! 			if (node->hj_OuterTupleIMBucketNo == IM_INVALID_BUCKET
! 				&& batchno != hashtable->curbatch)
  			{
  				/*
  				 * Need to postpone this outer tuple to a later batch. Save it
***************
*** 641,647 **** start_over:
  	nbatch = hashtable->nbatch;
  	curbatch = hashtable->curbatch;
  
! 	if (curbatch > 0)
  	{
  		/*
  		 * We no longer need the previous outer batch file; close it right
--- 840,866 ----
  	nbatch = hashtable->nbatch;
  	curbatch = hashtable->curbatch;
  
! 	/* if we just finished the first batch */
! 	if (curbatch == 0)
! 	{
! 		/*
! 		 * Reset some of the skew optimization state variables. IM buckets are
! 		 * no longer being used as of this point because they are only
! 		 * necessary while joining the first batch (before the cleanup phase).
! 		 *
! 		 * Especially need to make sure ExecHashGetIMBucket returns
! 		 * IM_INVALID_BUCKET quickly for all subsequent calls.
! 		 *
! 		 * IM buckets are only taking up memory if this is a multi-batch join
! 		 * and in that case ExecHashTableReset is about to be called which
! 		 * will free all memory currently used by IM buckets and tuples when
! 		 * it deletes hashtable->batchCxt.  If this is a single batch join
! 		 * then imBucket and imBucketFreezeOrder are already NULL and empty.
! 		 */
! 		hashtable->enableSkewOptimization = false;
! 		hashtable->spaceUsedIM = 0;
! 	}
! 	else if (curbatch > 0)
  	{
  		/*
  		 * We no longer need the previous outer batch file; close it right
*** a/src/include/executor/hashjoin.h
--- b/src/include/executor/hashjoin.h
***************
*** 72,77 **** typedef struct HashJoinTupleData
--- 72,96 ----
  #define HJTUPLE_MINTUPLE(hjtup)  \
  	((MinimalTuple) ((char *) (hjtup) + HJTUPLE_OVERHEAD))
  
+ /*
+  * Stores a hashvalue and linked list of tuples that share that hashvalue.
+  *
+  * When processing MCVs to detect skew in the probe relation of a hash join
+  * the hashvalue is generated and stored before any tuples have been read
+  * (see ExecHashJoinDetectSkew).
+  *
+  * Build tuples that hash to the same hashvalue are placed in the bucket while
+  * reading the build relation.
+  */
+ typedef struct HashJoinIMBucket
+ {
+ 	uint32 hashvalue;
+ 	HashJoinTuple tuples;
+ } HashJoinIMBucket;
+ 
+ #define IM_INVALID_BUCKET -1
+ #define IM_WORK_MEM_PERCENT 2
+ #define IM_BUCKET_OVERHEAD MAXALIGN(sizeof(HashJoinIMBucket))
  
  typedef struct HashJoinTableData
  {
***************
*** 113,121 **** typedef struct HashJoinTableData
--- 132,161 ----
  
  	Size		spaceUsed;		/* memory space currently used by tuples */
  	Size		spaceAllowed;	/* upper limit for space used */
+ 	/* memory space currently used by IM buckets and tuples */
+ 	Size		spaceUsedIM;
+ 	/* upper limit for space used by IM buckets and tuples */
+ 	Size		spaceAllowedIM;
  
  	MemoryContext hashCxt;		/* context for whole-hash-join storage */
  	MemoryContext batchCxt;		/* context for this-batch-only storage */
+ 	
+ 	/* will the join optimize memory usage when probe relation is skewed */
+ 	bool enableSkewOptimization;
+ 	HashJoinIMBucket **imBucket; /* hashtable of IM buckets */
+ 	/*
+ 	 * array of imBucket indexes to the created IM buckets sorted
+ 	 * in the opposite order that they would be frozen to disk
+ 	 */
+ 	uint16 *imBucketFreezeOrder;
+ 	int nIMBuckets; /* # of buckets in the IM buckets hashtable */
+ 	/*
+ 	 * # of used buckets in the IM buckets hashtable and length of
+ 	 * imBucketFreezeOrder array
+ 	 */
+ 	int nUsedIMBuckets;
+ 	/* # of IM buckets that have already been frozen to disk */
+ 	int nIMBucketsFrozen;
  } HashJoinTableData;
  
  #endif   /* HASHJOIN_H */
*** a/src/include/executor/nodeHash.h
--- b/src/include/executor/nodeHash.h
***************
*** 45,48 **** extern void ExecChooseHashTableSize(double ntuples, int tupwidth,
--- 45,50 ----
  						int *numbuckets,
  						int *numbatches);
  
+ extern int ExecHashGetIMBucket(HashJoinTable hashtable, uint32 hashvalue);
+ 
  #endif   /* NODEHASH_H */
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
***************
*** 1389,1394 **** typedef struct MergeJoinState
--- 1389,1395 ----
   *		hj_NeedNewOuter			true if need new outer tuple on next call
   *		hj_MatchedOuter			true if found a join match for current outer
   *		hj_OuterNotEmpty		true if outer relation known not empty
+  *		hj_OuterTupleIMBucketNo	IM bucket# for the current outer tuple
   * ----------------
   */
  
***************
*** 1414,1419 **** typedef struct HashJoinState
--- 1415,1421 ----
  	bool		hj_NeedNewOuter;
  	bool		hj_MatchedOuter;
  	bool		hj_OuterNotEmpty;
+ 	int			hj_OuterTupleIMBucketNo;
  } HashJoinState;
  
  
*** a/src/include/nodes/primnodes.h
--- b/src/include/nodes/primnodes.h
***************
*** 121,128 **** typedef struct Expr
   * subplans; for example, in a join node varno becomes INNER or OUTER and
   * varattno becomes the index of the proper element of that subplan's target
   * list.  But varnoold/varoattno continue to hold the original values.
!  * The code doesn't really need varnoold/varoattno, but they are very useful
!  * for debugging and interpreting completed plans, so we keep them around.
   */
  #define    INNER		65000
  #define    OUTER		65001
--- 121,132 ----
   * subplans; for example, in a join node varno becomes INNER or OUTER and
   * varattno becomes the index of the proper element of that subplan's target
   * list.  But varnoold/varoattno continue to hold the original values.
!  *
!  * For the most part, the code doesn't really need varnoold/varoattno, but
!  * they are very useful for debugging and interpreting completed plans, so we
!  * keep them around.  As of PostgreSQL 8.4, these values are also used by
!  * ExecHashJoinDetectSkew to fetch MCV statistics when performing multi-batch
!  * hash joins.
   */
  #define    INNER		65000
  #define    OUTER		65001
***************
*** 142,148 **** typedef struct Var
  	Index		varlevelsup;	/* for subquery variables referencing outer
  								 * relations; 0 in a normal var, >0 means N
  								 * levels up */
! 	Index		varnoold;		/* original value of varno, for debugging */
  	AttrNumber	varoattno;		/* original value of varattno */
  	int			location;		/* token location, or -1 if unknown */
  } Var;
--- 146,152 ----
  	Index		varlevelsup;	/* for subquery variables referencing outer
  								 * relations; 0 in a normal var, >0 means N
  								 * levels up */
! 	Index		varnoold;		/* original value of varno */
  	AttrNumber	varoattno;		/* original value of varattno */
  	int			location;		/* token location, or -1 if unknown */
  } Var;
#45Robert Haas
robertmhaas@gmail.com
In reply to: Lawrence, Ramon (#43)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

At this point, we await further feedback on what is necessary to get
this patch accepted. We would also like to thank Josh and Robert again
for their review time.

I think what we need here is some very simple testing to demonstrate
that this patch delivers a speed-up even when the inner side of
the join is a joinrel rather than a baserel. Can you suggest a single
query against the skewed TPCH dataset that will result in two or more
multi-batch hash joins? If so, it should be a simple matter to run
that query with and without the patch and verify that the former is
faster than the latter.

Thanks,

...Robert

#46Lawrence, Ramon
ramon.lawrence@ubc.ca
In reply to: Bryce Cutt (#40)
1 attachment(s)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

________________________________

From: pgsql-hackers-owner@postgresql.org on behalf of Robert Haas
I think what we need here is some very simple testing to demonstrate
that this patch delivers a speed-up even when the inner side of
the join is a joinrel rather than a baserel. Can you suggest a single
query against the skewed TPCH dataset that will result in two or more
multi-batch hash joins? If so, it should be a simple matter to run
that query with and without the patch and verify that the former is
faster than the latter.

This query will have the outer relation be a joinrel rather than a baserel:

select count(*) from supplier, part, lineitem where l_partkey = p_partkey and s_suppkey = l_suppkey;

The approach collects statistics on the outer (probe) relation, not the inner relation, so the code needs to be able to locate a statistics tuple when the probe side is a joinrel as well as when it is a baserel.
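
Condensed from ExecHashJoinDetectSkew in the patch, the lookup works
because the Var in the hash clause keeps varnoold/varoattno pointing at
the original base relation, so the MCV statistics can still be fetched
from pg_statistic even when the probe input is a joinrel:

    variable = (Var *) argstate->expr;
    relid = getrelid(variable->varnoold, estate->es_range_table);

    statsTuple = SearchSysCache(STATRELATT,
                                ObjectIdGetDatum(relid),
                                Int16GetDatum(variable->varoattno),
                                0, 0);
    if (!HeapTupleIsValid(statsTuple))
        return;     /* no statistics available, so skip the optimization */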

Joshua sent us some preliminary data with this query and others and indicated that we could post it. He wanted time to clean it up and re-run some experiments, but the data is generally good and the algorithm performs as expected. I have attached this data to the post. Note that the last set of data (although labelled as Z7) is actually an almost zero skew database and represents the worst-case for the algorithm (for most queries the optimization is not even used).

--
Ramon Lawrence

Attachments:

JoshuaTolleyData.xls (application/vnd.ms-excel)
#47Joshua Tolley
eggyknap@gmail.com
In reply to: Robert Haas (#44)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Wed, Feb 18, 2009 at 11:20:03PM -0500, Robert Haas wrote:

On Wed, Jan 7, 2009 at 9:14 AM, Joshua Tolley <eggyknap@gmail.com> wrote:

On Tue, Jan 06, 2009 at 11:49:57PM -0500, Robert Haas wrote:

Josh / eggyknap -

Can you rerun your performance tests with this version of the patch?

...Robert

Will do, as soon as I can.

Josh,

Have you been able to do anything further with this?

I'm attaching a rebased version of this patch with a few further
whitespace cleanups.

...Robert

I keep trying to do testing, but not getting too far, though I did
return some test results to the original authors for their review. I'll
try to get a more formal response put together (my new daughter will be
24 hours old in a little bit, though, so it might be a while!)

- Josh

#48David Fetter
david@fetter.org
In reply to: Joshua Tolley (#47)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Thu, Feb 19, 2009 at 01:50:55PM -0700, Josh Tolley wrote:

(my new daughter will be 24 hours old in a little bit, though, so it
might be a while!)

Pics!

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#49Robert Haas
robertmhaas@gmail.com
In reply to: Lawrence, Ramon (#46)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

Joshua sent us some preliminary data with this query and others and indicated that we could post it.  He wanted time to clean it up
and re-run some experiments, but the data is generally good and the algorithm performs as expected.  I have attached this data to the
post.  Note that the last set of data (although labelled as Z7) is actually an almost zero skew database and represents the worst-case
for the algorithm (for most queries the optimization is not even used).

Sadly, there seem to be a number of cases in the Z7 database where the
optimization makes things significantly worse (specifically, queries
2, 3, and 7, but especially query 3). Have you investigated what is
going on there? I had thought that we had sufficient safeguards in
place to prevent this optimization from kicking in in cases where it
doesn't help, but it seems not. There will certainly be real-world
databases that are more like Z7 than Z1.

...Robert

#50Lawrence, Ramon
ramon.lawrence@ubc.ca
In reply to: Bryce Cutt (#40)
1 attachment(s)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

-----Original Message-----
From: Robert Haas
Sadly, there seem to be a number of cases in the Z7 database where the
optimization makes things significantly worse (specifically, queries
2, 3, and 7, but especially query 3). Have you investigated what is
going on there? I had thought that we had sufficient safeguards in
place to prevent this optimization from kicking in in cases where it
doesn't help, but it seems not. There will certainly be real-world
databases that are more like Z7 than Z1.

I agree that there should be no noticeable performance difference when
the optimization is not used (the single-batch case or no skew). I think
the patch achieves this. The optimization is not used in those cases,
but we will review whether the code that bypasses the optimization is
itself causing a difference.

The query #3 timing difference is primarily due to a flaw in the
experimental setup. For some reason, query #3 got executed before #4
with the optimization on, and after #4 with the optimization off. This
skewed the results for all runs (due to buffering issues), but it is
especially noticeable for Z7. Note how query #4 is always faster in the
optimization-on runs even though the optimization is not actually used
for those queries (because they were one batch). I expect that if you
run query #3 on Z7 in isolation the results should be basically
identical.

I have attached the SQL script that Joshua sent me. The raw data I have
posted at: http://people.ok.ubc.ca/rlawrenc/test.output

--
Ramon Lawrence

Attachments:

test.sql (application/octet-stream)
#51Robert Haas
robertmhaas@gmail.com
In reply to: Lawrence, Ramon (#50)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Wed, Feb 25, 2009 at 12:38 AM, Lawrence, Ramon <ramon.lawrence@ubc.ca> wrote:

-----Original Message-----
From: Robert Haas
Sadly, there seem to be a number of cases in the Z7 database where the
optimization makes things significantly worse (specifically, queries
2, 3, and 7, but especially query 3).  Have you investigated what is
going on there?  I had thought that we had sufficient safeguards in
place to prevent this optimization from kicking in in cases where it
doesn't help, but it seems not.  There will certainly be real-world
databases that are more like Z7 than Z1.

I agree that there should be no noticeable performance difference when
the optimization is not used (the single-batch case or no skew). I think
the patch achieves this. The optimization is not used in those cases,
but we will review whether the code that bypasses the optimization is
itself causing a difference.

Yeah we need to understand what's going on there.

The query #3 timing difference is primarily due to a flaw in the
experimental setup. For some reason, query #3 got executed before #4
with the optimization on, and after #4 with the optimization off. This
skewed the results for all runs (due to buffering issues), but it is
especially noticeable for Z7. Note how query #4 is always faster in the
optimization-on runs even though the optimization is not actually used
for those queries (because they were one batch). I expect that if you
run query #3 on Z7 in isolation the results should be basically
identical.

I have attached the SQL script that Joshua sent me.  The raw data I have
posted at: http://people.ok.ubc.ca/rlawrenc/test.output

I don't think we're really doing this the right way. EXPLAIN ANALYZE
has a measurable effect on the results, and we probably ought to stop
the database and drop the VM caches after each query. Are the Z1-Z7
datasets online someplace? I might be able to rig up a script here.

...Robert

#52Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#51)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

I haven't been following this thread closely, so pardon if this has been
discussed already.

The patch doesn't seem to change the cost estimates in the planner at
all. Without that, I'd imagine that the planner rarely chooses a
multi-batch hash join to begin with.

Joshua, in the tests that you've been running, did you have to rig the
planner with "enable_mergjoin=off" or similar, to get the queries to use
hash joins?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#53Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#52)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Thu, Feb 26, 2009 at 4:22 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

I haven't been following this thread closely, so pardon if this has been
discussed already.

The patch doesn't seem to change the cost estimates in the planner at all.
Without that, I'd imagine that the planner rarely chooses a multi-batch hash
join to begin with.

AFAICS, a multi-batch hash join happens when you are joining two big,
unsorted paths. The planner essentially compares the cost of sorting
the two paths and then merge-joining them versus the cost of a hash
join. It doesn't seem to be unusual for the hash join to come out the
winner, although admittedly I haven't played with it a ton. You
certainly could try to model it in the costing algorithm, but I'm not
sure how much benefit you'd get out of it: if you're doing this a lot
you're probably better off creating indices.

Joshua, in the tests that you've been running, did you have to rig the
planner with "enable_mergjoin=off" or similar, to get the queries to use
hash joins?

I didn't have to fiddle anything, but Josh's tests were more exhaustive.

...Robert

#54Joshua Tolley
eggyknap@gmail.com
In reply to: Robert Haas (#51)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Wed, Feb 25, 2009 at 10:24:21PM -0500, Robert Haas wrote:

I don't think we're really doing this the right way. EXPLAIN ANALYZE
has a measurable effect on the results, and we probably ought to stop
the database and drop the VM caches after each query. Are the Z1-Z7
datasets on line someplace? I might be able to rig up a script here.

...Robert

They're automatically generated by the dbgen utility, a link to which
was originally published somewhere in this thread. That tool creates a
few text files suitable (with some tweaking) for a COPY command. I've
got the original files... the .tbz I just made is 1.8 GB :) Anyone have
someplace they'd like me to drop it?

- Josh

#55Joshua Tolley
eggyknap@gmail.com
In reply to: Robert Haas (#53)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Thu, Feb 26, 2009 at 08:22:52AM -0500, Robert Haas wrote:

On Thu, Feb 26, 2009 at 4:22 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Joshua, in the tests that you've been running, did you have to rig the
planner with "enable_mergjoin=off" or similar, to get the queries to use
hash joins?

I didn't have to fiddle anything, but Josh's tests were more exhaustive.

The planner chose hash joins for the queries I was running, regardless
of whether the patch was applied. I didn't have to mess with any
settings to get hash joins.

- Josh

#56Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joshua Tolley (#55)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

Heikki's got a point here: the planner is aware that hashjoin doesn't
like skewed distributions, and it assigns extra cost accordingly if it
can determine that the join key is skewed. (See the "bucketsize" stuff
in cost_hashjoin.) If this patch is accepted we'll want to tweak that
code.

Still, that has little to do with the current gating issue, which is
whether we've convinced ourselves that the patch doesn't cause a
performance decrease for cases in which it's unable to help.

regards, tom lane

#57Lawrence, Ramon
ramon.lawrence@ubc.ca
In reply to: Bryce Cutt (#40)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From: Tom Lane
Heikki's got a point here: the planner is aware that hashjoin doesn't
like skewed distributions, and it assigns extra cost accordingly if it
can determine that the join key is skewed. (See the "bucketsize" stuff
in cost_hashjoin.) If this patch is accepted we'll want to tweak that
code.

Those modifications would make the optimizer more likely to select hash
join, even with skewed distributions. For the TPC-H data set that we
are using, the optimizer always picks hash join over merge join (single
or multi-batch). Since the current patch does not change the cost
function, there is no change in planning cost. It may or may not be
useful to modify the cost function, depending on the effect on planning
cost.
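
As a rough sketch of what such a cost function change might look like
(this is only an assumption, not part of the patch), the planner could
estimate the fraction of probe tuples covered by the MCVs that fit in
the in-memory buckets and scale down the multi-batch I/O charge by that
fraction. With hypothetical names:

    /*
     * Hypothetical helper, not in the patch: estimate the fraction of
     * probe tuples whose join key matches one of the MCVs that will fit
     * in the in-memory buckets.  MCV frequencies come from pg_statistic
     * and are stored most-common first.
     */
    static double
    estimate_im_coverage(const float4 *mcv_freqs, int nmcvs, int mcvs_that_fit)
    {
        double      coverage = 0.0;
        int         i;

        if (mcvs_that_fit > nmcvs)
            mcvs_that_fit = nmcvs;
        for (i = 0; i < mcvs_that_fit; i++)
            coverage += mcv_freqs[i];
        return coverage;
    }

The multi-batch I/O term in cost_hashjoin could then be scaled by
(1.0 - coverage); whether the extra statistics lookups at planning time
are worth it is exactly the open question above.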

Still, that has little to do with the current gating issue, which is
whether we've convinced ourselves that the patch doesn't cause a
performance decrease for cases in which it's unable to help.

Although we have not seen any overhead when the optimization is
bypassed, we are looking at some small code changes that would
guarantee that no extra statements are executed for the single-batch
case. Currently, an "if optimization_on" check is performed on each
probe tuple, which, although minor, can be avoided.

The patch's author, Bryce Cutt, is defending his Master's thesis Friday
morning (on this work), so we will provide some updated code right after
that. Since these code changes are small, they should not affect people
trying to test the performance of the current patch.

--
Ramon Lawrence

#58Lawrence, Ramon
ramon.lawrence@ubc.ca
In reply to: Robert Haas (#51)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

They're automatically generated by the dbgen utility, a link to which
was originally published somewhere in this thread. That tool creates a
few text files suitable (with some tweaking) for a COPY command. I've
got the original files... the .tbz I just made is 1.8 GB :) Anyone have
someplace they'd like me to drop it?

Just a note that the Z7 data set is really a uniform (Z0) data set. The
generator only accepts skew in the range from Z0 to Z4, and the uniform
(Z0) data set is typically used when benchmarking data warehouses.

It turns out the data is not perfectly uniform, as the top 100 suppliers
and products represent 2.3% and 1.5% of LineItem respectively. This is
just enough skew that the optimization will sometimes be triggered in
the multi-batch case (currently 1% skew is the cutoff).

I have posted a pg_dump of the TPCH 1G Z0 data set at:

http://people.ok.ubc.ca/rlawrenc/tpch1g0z.zip

(Note that ownership commands are in the dump and make sure to vacuum
analyze after the load.) I can also post the input text files if that
is easier.

--
Ramon Lawrence

#59Robert Haas
robertmhaas@gmail.com
In reply to: Lawrence, Ramon (#58)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

I have posted a pg_dump of the TPCH 1G Z0 data set at:

http://people.ok.ubc.ca/rlawrenc/tpch1g0z.zip

That seems VERY useful - can you post the other ones (Z1, etc.) so I
can download them all?

Thanks,

...Robert

#60Lawrence, Ramon
ramon.lawrence@ubc.ca
In reply to: Robert Haas (#59)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

That seems VERY useful - can you post the other ones (Z1, etc.) so I
can download them all?

The Z1 data set is posted at:

http://people.ok.ubc.ca/rlawrenc/tpch1g1z.zip

I have not generated Z2, Z3, Z4 for 1G, but I can generate the Z2 and Z3
data sets, and in an hour or two they will be at:

http://people.ok.ubc.ca/rlawrenc/tpch1g2z.zip
http://people.ok.ubc.ca/rlawrenc/tpch1g3z.zip

Note that Z3 and Z4 are not really useful, as the skew is extreme (98% of
the probe relation is covered by the top 100 values). Using the Z2/Z3 data
sets should be enough to show the huge win if you *really* do have a
skewed data set.

BTW, is there any particular form/options of the pg_dump command that I
should use to make the dump?

--
Ramon Lawrence

#61Bryce Cutt
pandasuit@gmail.com
In reply to: Tom Lane (#56)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

The patch originally modified the cost function, but I removed that
part before we submitted it, to be a bit conservative about our
proposed changes. I also didn't like that, for large plans, the
statistics were retrieved and calculated many times while finding the
optimal query plan.

The overhead of the algorithm when the skew optimization is not used
ends up being roughly a function call and an if statement per tuple.
It would be easy to remove the per-tuple function call. Dr. Lawrence
has come up with some changes so that when the optimization is turned
off, the function call does not happen at all and the if test is run
just once per join instead of once per tuple. We have to test this a
bit more, but it should further reduce the overhead.

Hopefully we will have the new patch ready to go this weekend.

- Bryce Cutt


On Thu, Feb 26, 2009 at 7:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Heikki's got a point here: the planner is aware that hashjoin doesn't
like skewed distributions, and it assigns extra cost accordingly if it
can determine that the join key is skewed.  (See the "bucketsize" stuff
in cost_hashjoin.)  If this patch is accepted we'll want to tweak that
code.

Still, that has little to do with the current gating issue, which is
whether we've convinced ourselves that the patch doesn't cause a
performance decrease for cases in which it's unable to help.

                       regards, tom lane

#62Bryce Cutt
pandasuit@gmail.com
In reply to: Bryce Cutt (#61)
1 attachment(s)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

Here is the new patch.

Our experiments show no noticeable performance penalty from the patch
in cases where the optimization is not used, because the number of
extra statements executed when the optimization is disabled is
insignificant.

We have updated the patch to remove a couple of if statements, but
this is really minor. The biggest change was to MultiExecHash, which
now avoids a per-tuple if check by duplicating the hashing loop, as
sketched below.
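
In outline, the duplicated loop has this shape (a simplified sketch;
the complete version is in the attached patch):

    if (hashtable->enableSkewOptimization)
    {
        for (;;)                /* skew-aware build loop */
        {
            slot = ExecProcNode(outerNode);
            if (TupIsNull(slot))
                break;
            /* hash the tuple, then route it to an IM bucket or the
             * main hash table */
        }
    }
    else
    {
        for (;;)                /* fast path: no per-tuple skew test */
        {
            slot = ExecProcNode(outerNode);
            if (TupIsNull(slot))
                break;
            /* hash the tuple and insert it into the main hash table
             * exactly as before */
        }
    }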

To demonstrate the differences, here is an analysis of the code
changes and their impact.

Three cases:
1) One batch hash join - Optimization is disabled. Extra statements
executed are:
- One if (hashtable->nbatch > 1) in ExecHashJoin (line 356 of nodeHashjoin.c)
- One if optimization_on in MultiExecHash (line 259 of nodeHash.c)
- One if optimization_on in ExecHashJoin per probe tuple (line 431
of nodeHashjoin.c)
- One if statement in ExecScanHashBucket per probe tuple (line 1071
of nodeHash.c)

2) Multi-batch hash join with limited skew - Optimization is disabled.
Extra statements executed are:
- One if (hashtable->nbatch > 1) in ExecHashJoin (line 356 of nodeHashjoin.c)
- Executes the ExecHashJoinDetectSkew function (at line 357 of
nodeHashjoin.c) that reads the stats tuple for the probe relation
attribute and determines whether the skew is above the cutoff. In this
case, the skew is not above the cutoff and no extra memory is used.
- One if optimization_on in MultiExecHash (line 259 of nodeHash.c)
- One if optimization_on in ExecHashJoin per probe tuple (line 431
of nodeHashjoin.c)
- One if statement in ExecScanHashBucket per probe tuple (line 1071
of nodeHash.c)

3) Multi-batch hash join with skew - Optimization is enabled. Extra
statements executed are:
- One if (hashtable->nbatch > 1) in ExecHashJoin (line 356 of nodeHashjoin.c)
- Executes the ExecHashJoinDetectSkew function (at line 357 of
nodeHashjoin.c) that reads the stats tuple for the probe relation
attribute and determines that there is skew. Allocates space for the
in-memory (IM) bucket structures, whose memory budget is 2% of
work_mem.
- One if optimization_on in MultiExecHash (line 259 of nodeHash.c)
- In MultiExecHash, after each tuple is hashed, determines whether its
join attribute value matches one of the MCVs. If it does, the tuple is
put in the MCV structure. The cost is the hash and search for each
build tuple.
- If all IM buckets end up frozen in the build phase (MultiExecHash)
because they grow larger than the memory allowed for IM buckets, then
the skew optimization is turned off and the probe phase reverts to
Case 2.
- For each probe tuple, determines whether its value is an MCV by
performing a hash and a quick table lookup. If it is, probes the MCV
bucket; otherwise the regular hash algorithm runs as usual.
- One if statement in ExecScanHashBucket per probe tuple (line 1071
of nodeHash.c)
- The additional cost is determining whether a tuple is a common tuple
(on both the build and probe side). This cost is dramatically
outweighed by the avoided disk I/O (even if it never hits the disk due
to caching).

The if statement on line 440 of nodeHashjoin.c (in ExecHashJoin) has
been rearranged so that in the single-batch case short-circuit
evaluation requires only the first test in the if to be evaluated.

The "limited skew" check mentioned in Case 2 above is a simple check
in the ExecHashJoinDetectSkew function.
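
For reference, the probe-side handling described in Case 3 boils down
to the following (simplified from the ExecHashJoin changes in the
patch):

    /* for each outer (probe) tuple, after computing its hash value */
    ExecHashGetBucketAndBatch(hashtable, hashvalue,
                              &node->hj_CurBucketNo, &batchno);
    node->hj_OuterTupleIMBucketNo = ExecHashGetIMBucket(hashtable, hashvalue);

    if (node->hj_OuterTupleIMBucketNo == IM_INVALID_BUCKET
        && batchno != hashtable->curbatch)
    {
        /* no IM bucket match and not the current batch: save the tuple
         * to its outer batch file as usual */
    }
    else
    {
        /* either the tuple matches an in-memory bucket or it belongs to
         * the current batch: probe immediately, and ExecScanHashBucket
         * picks the IM bucket or the regular bucket accordingly */
    }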

- Bryce Cutt


On Thu, Feb 26, 2009 at 12:16 PM, Bryce Cutt <pandasuit@gmail.com> wrote:

The patch originally modified the cost function, but I removed that
part before we submitted it, to be a bit conservative about our
proposed changes. I also didn't like that, for large plans, the
statistics were retrieved and calculated many times while finding the
optimal query plan.

The overhead of the algorithm when the skew optimization is not used
ends up being roughly a function call and an if statement per tuple.
It would be easy to remove the per-tuple function call. Dr. Lawrence
has come up with some changes so that when the optimization is turned
off, the function call does not happen at all and the if test is run
just once per join instead of once per tuple. We have to test this a
bit more, but it should further reduce the overhead.

Hopefully we will have the new patch ready to go this weekend.

- Bryce Cutt

On Thu, Feb 26, 2009 at 7:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Heikki's got a point here: the planner is aware that hashjoin doesn't
like skewed distributions, and it assigns extra cost accordingly if it
can determine that the join key is skewed.  (See the "bucketsize" stuff
in cost_hashjoin.)  If this patch is accepted we'll want to tweak that
code.

Still, that has little to do with the current gating issue, which is
whether we've convinced ourselves that the patch doesn't cause a
performance decrease for cases in which it's unable to help.

                       regards, tom lane

Attachments:

histojoin_v6.patch (application/octet-stream)
Index: src/backend/executor/nodeHash.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHash.c,v
retrieving revision 1.117
diff -c -r1.117 nodeHash.c
*** src/backend/executor/nodeHash.c	1 Jan 2009 17:23:41 -0000	1.117
--- src/backend/executor/nodeHash.c	2 Mar 2009 07:52:25 -0000
***************
*** 53,58 ****
--- 53,219 ----
  	return NULL;
  }
  
+ /*
+  * ----------------------------------------------------------------
+  *		ExecHashGetIMBucket
+  *
+  *  	Returns the index of the in-memory bucket for this
+  *		hashvalue, or IM_INVALID_BUCKET if the hashvalue is not
+  *		associated with any unfrozen bucket (or if skew
+  *		optimization is not being used).
+  *
+  *		It is possible for a tuple whose join attribute value is
+  *		not a MCV to hash to an in-memory bucket due to the limited
+  * 		number of hash values but it is unlikely and everything
+  *		continues to work even if it does happen. We would
+  *		accidentally cache some less optimal tuples in memory
+  *		but the join result would still be accurate.
+  *
+  *		hashtable->imBucket is an open addressing hashtable of
+  *		in-memory buckets (HashJoinIMBucket).
+  * ----------------------------------------------------------------
+  */
+ int
+ ExecHashGetIMBucket(HashJoinTable hashtable, uint32 hashvalue)
+ {
+ 	int bucket;
+ 	
+ 	/* Modulo the hashvalue (using bitmask) to find the IM bucket. */
+ 	bucket = hashvalue & (hashtable->nIMBuckets - 1);
+ 
+ 	/*
+ 	 * While we have not hit a hole in the hashtable and have not hit the
+ 	 * actual bucket we have collided in the hashtable so try the next
+ 	 * bucket location.
+ 	 */
+ 	while (hashtable->imBucket[bucket] != NULL
+ 		&& hashtable->imBucket[bucket]->hashvalue != hashvalue)
+ 		bucket = (bucket + 1) & (hashtable->nIMBuckets - 1);
+ 
+ 	/*
+ 	 * If the bucket exists and has been correctly determined return
+ 	 * the bucket index.
+ 	 */
+ 	if (hashtable->imBucket[bucket] != NULL
+ 		&& hashtable->imBucket[bucket]->hashvalue == hashvalue)
+ 		return bucket;
+ 
+ 	/*
+ 	 * Must have run into an empty location or a frozen bucket which means the
+ 	 * tuple with this hashvalue is not to be handled as if it matches with an
+ 	 * in-memory bucket.
+ 	 */
+ 	return IM_INVALID_BUCKET;
+ }
+ 
+ /*
+  * ----------------------------------------------------------------
+  *		ExecHashFreezeNextIMBucket
+  *
+  *		Freeze the tuples of the next in-memory bucket by pushing
+  *		them into the main hashtable.  Buckets are frozen in order
+  *		so that the best tuples are kept in memory the longest.
+  * ----------------------------------------------------------------
+  */
+ static bool
+ ExecHashFreezeNextIMBucket(HashJoinTable hashtable)
+ {
+ 	int bucketToFreeze;
+ 	int bucketno;
+ 	int batchno;
+ 	uint32 hashvalue;
+ 	HashJoinTuple hashTuple;
+ 	HashJoinTuple nextHashTuple;
+ 	HashJoinIMBucket *bucket;
+ 	MinimalTuple mintuple;
+ 
+ 	/* Calculate the imBucket index of the bucket to freeze. */
+ 	bucketToFreeze = hashtable->imBucketFreezeOrder
+ 		[hashtable->nUsedIMBuckets - 1 - hashtable->nIMBucketsFrozen];
+ 
+ 	/* Grab a pointer to the actual IM bucket. */
+ 	bucket = hashtable->imBucket[bucketToFreeze];
+ 	hashvalue = bucket->hashvalue;
+ 
+ 	/*
+ 	 * Grab a pointer to the first tuple in the soon to be frozen IM bucket.
+ 	 */
+ 	hashTuple = bucket->tuples;
+ 
+ 	/*
+ 	 * Calculate which bucket and batch the tuples belong to in the main
+ 	 * non-IM hashtable.
+ 	 */
+ 	ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno, &batchno);
+ 
+ 	/* until we have read all tuples from this bucket */
+ 	while (hashTuple != NULL)
+ 	{
+ 		/*
+ 		 * Some of this code is very similar to that of ExecHashTableInsert.
+ 		 * We do not call ExecHashTableInsert directly as
+ 		 * ExecHashTableInsert expects a TupleTableSlot and we already have
+ 		 * HashJoinTuples.
+ 		 */
+ 		mintuple = HJTUPLE_MINTUPLE(hashTuple);
+ 
+ 		/* Decide whether to put the tuple in the hash table or a temp file. */
+ 		if (batchno == hashtable->curbatch)
+ 		{
+ 			/* Put the tuple in hash table. */
+ 			nextHashTuple = hashTuple->next;
+ 			hashTuple->next = hashtable->buckets[bucketno];
+ 			hashtable->buckets[bucketno] = hashTuple;
+ 			hashTuple = nextHashTuple;
+ 			hashtable->spaceUsedIM -= HJTUPLE_OVERHEAD + mintuple->t_len;
+ 		}
+ 		else
+ 		{
+ 			/* Put the tuples into a temp file for later batches. */
+ 			Assert(batchno > hashtable->curbatch);
+ 			ExecHashJoinSaveTuple(mintuple, hashvalue,
+ 								  &hashtable->innerBatchFile[batchno]);
+ 			/*
+ 			 * Some memory has been freed up. This must be done before we
+ 			 * pfree the hashTuple or we lose access to the tuple size.
+ 			 */
+ 			hashtable->spaceUsed -= HJTUPLE_OVERHEAD + mintuple->t_len;
+ 			hashtable->spaceUsedIM -= HJTUPLE_OVERHEAD + mintuple->t_len;
+ 			nextHashTuple = hashTuple->next;
+ 			pfree(hashTuple);
+ 			hashTuple = nextHashTuple;
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Free the memory the bucket struct was using as it is not necessary
+ 	 * any more.  All code treats a frozen in-memory bucket the same as one
+ 	 * that did not exist; by setting the pointer to null the rest of the code
+ 	 * will function as if we had not created this bucket at all.
+ 	 */
+ 	pfree(bucket);
+ 	hashtable->imBucket[bucketToFreeze] = NULL;
+ 	hashtable->spaceUsed -= IM_BUCKET_OVERHEAD;
+ 	hashtable->spaceUsedIM -= IM_BUCKET_OVERHEAD;
+ 	hashtable->nIMBucketsFrozen++;
+ 
+ 	/*
+ 	 * All buckets have been frozen and deleted from memory so turn off
+ 	 * skew aware partitioning and remove the structs from memory as they are
+ 	 * just wasting space from now on.
+ 	 */
+ 	if (hashtable->nUsedIMBuckets == hashtable->nIMBucketsFrozen)
+ 	{
+ 		hashtable->enableSkewOptimization = false;
+ 		pfree(hashtable->imBucket);
+ 		pfree(hashtable->imBucketFreezeOrder);
+ 		hashtable->spaceUsed -= hashtable->spaceUsedIM;
+ 		hashtable->spaceUsedIM = 0;
+ 	}
+ 
+ 	return true;
+ }
+ 
  /* ----------------------------------------------------------------
   *		MultiExecHash
   *
***************
*** 69,74 ****
--- 230,237 ----
  	TupleTableSlot *slot;
  	ExprContext *econtext;
  	uint32		hashvalue;
+ 	MinimalTuple mintuple;
+ 	int bucketNumber;
  
  	/* must provide our own instrumentation support */
  	if (node->ps.instrument)
***************
*** 87,108 ****
  	econtext = node->ps.ps_ExprContext;
  
  	/*
! 	 * get all inner tuples and insert into the hash table (or temp files)
  	 */
! 	for (;;)
! 	{
! 		slot = ExecProcNode(outerNode);
! 		if (TupIsNull(slot))
! 			break;
! 		/* We have to compute the hash value */
! 		econtext->ecxt_innertuple = slot;
! 		if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
! 								 &hashvalue))
  		{
! 			ExecHashTableInsert(hashtable, slot, hashvalue);
! 			hashtable->totalTuples += 1;
  		}
- 	}
  
  	/* must provide our own instrumentation support */
  	if (node->ps.instrument)
--- 250,346 ----
  	econtext = node->ps.ps_ExprContext;
  
  	/*
! 	 * Get all inner tuples and insert into the hash table (or temp
! 	 * files).
! 	 * 
! 	 * Only incur the overhead of the skew tests if skew
! 	 * optimization is actually being used.
  	 */
! 	if (hashtable->enableSkewOptimization)
! 		for (;;)
  		{
! 			slot = ExecProcNode(outerNode);
! 			if (TupIsNull(slot))
! 				break;
! 			/* We have to compute the hash value */
! 			econtext->ecxt_innertuple = slot;
! 			if (ExecHashGetHashValue(hashtable, econtext, hashkeys
! 									, false, false, &hashvalue))
! 			{
! 				/*
! 				 * Skew optimization may have been subsequently
! 				 * turned off if all the IM buckets were frozen so
! 				 * make sure we handle
! 				 * that cleanly.
! 				 */
! 				if (hashtable->enableSkewOptimization)
! 					bucketNumber = 
! 						ExecHashGetIMBucket(hashtable, hashvalue);
! 				else
! 					bucketNumber = IM_INVALID_BUCKET;
! 
! 				/*
! 				 * handle tuples not destined for an in-memory
! 				 * bucket normally
! 				 */
! 				if (bucketNumber == IM_INVALID_BUCKET)
! 					ExecHashTableInsert(hashtable, slot, hashvalue);
! 				else
! 				{
! 					HashJoinTuple hashTuple;
! 					int			hashTupleSize;
! 					
! 					/* get the HashJoinTuple */
! 					mintuple = ExecFetchSlotMinimalTuple(slot);
! 					hashTupleSize = 
! 						HJTUPLE_OVERHEAD + mintuple->t_len;
! 					hashTuple = (HashJoinTuple)
! 						MemoryContextAlloc(hashtable->batchCxt
! 								, hashTupleSize);
! 					hashTuple->hashvalue = hashvalue;
! 					memcpy(HJTUPLE_MINTUPLE(hashTuple), mintuple
! 								, mintuple->t_len);
! 
! 					/*
! 					 * Push the HashJoinTuple onto the front of the
! 					 * IM bucket.
! 					 */
! 					hashTuple->next = 
! 						hashtable->imBucket[bucketNumber]->tuples;
! 					hashtable->imBucket[bucketNumber]->tuples = 
! 						hashTuple;
! 					
! 					/*
! 					 * More memory is now in use so make sure we are not
! 					 * over spaceAllowedIM.
! 					 */
! 					hashtable->spaceUsed += hashTupleSize;
! 					hashtable->spaceUsedIM += hashTupleSize;
! 					while (hashtable->spaceUsedIM >hashtable->spaceAllowedIM
! 						&& ExecHashFreezeNextIMBucket(hashtable))
! 						;
! 					/* Guarantee we are not over the spaceAllowed. */
! 					if (hashtable->spaceUsed > hashtable->spaceAllowed)
! 						ExecHashIncreaseNumBatches(hashtable);
! 				}
! 				hashtable->totalTuples++;
! 			}
! 		}
! 	else
! 		for (;;)
! 		{
! 			slot = ExecProcNode(outerNode);
! 			if (TupIsNull(slot))
! 				break;
! 			/* We have to compute the hash value */
! 			econtext->ecxt_innertuple = slot;
! 			if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false,
! 									 false, &hashvalue))
! 			{
! 				ExecHashTableInsert(hashtable, slot, hashvalue);
! 				hashtable->totalTuples++;
! 			}
  		}
  
  	/* must provide our own instrumentation support */
  	if (node->ps.instrument)
***************
*** 269,274 ****
--- 507,521 ----
  	hashtable->outerBatchFile = NULL;
  	hashtable->spaceUsed = 0;
  	hashtable->spaceAllowed = work_mem * 1024L;
+ 	/* Initialize skew optimization related hashtable variables. */
+ 	hashtable->spaceUsedIM = 0;
+ 	hashtable->spaceAllowedIM
+ 		= hashtable->spaceAllowed * IM_WORK_MEM_PERCENT / 100;
+ 	hashtable->enableSkewOptimization = false;
+ 	hashtable->nUsedIMBuckets = 0;
+ 	hashtable->nIMBuckets = 0;
+ 	hashtable->imBucket = NULL;
+ 	hashtable->nIMBucketsFrozen = 0;
  
  	/*
  	 * Get info about the hash functions to be used for each hash key. Also
***************
*** 815,825 ****
  	/*
  	 * hj_CurTuple is NULL to start scanning a new bucket, or the address of
  	 * the last tuple returned from the current bucket.
  	 */
! 	if (hashTuple == NULL)
! 		hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];
! 	else
  		hashTuple = hashTuple->next;
  
  	while (hashTuple != NULL)
  	{
--- 1062,1078 ----
  	/*
  	 * hj_CurTuple is NULL to start scanning a new bucket, or the address of
  	 * the last tuple returned from the current bucket.
+ 	 *
+ 	 * If the tuple hashed to an IM bucket then scan the IM bucket
+ 	 * otherwise scan the standard hashtable bucket.
  	 */
! 	if (hashTuple != NULL)
  		hashTuple = hashTuple->next;
+ 	else if (hjstate->hj_OuterTupleIMBucketNo != IM_INVALID_BUCKET)
+ 		hashTuple = hashtable->imBucket[hjstate->hj_OuterTupleIMBucketNo]
+ 			->tuples;
+ 	else
+ 		hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];
  
  	while (hashTuple != NULL)
  	{
Index: src/backend/executor/nodeHashjoin.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHashjoin.c,v
retrieving revision 1.97
diff -c -r1.97 nodeHashjoin.c
*** src/backend/executor/nodeHashjoin.c	1 Jan 2009 17:23:41 -0000	1.97
--- src/backend/executor/nodeHashjoin.c	2 Mar 2009 22:26:38 -0000
***************
*** 20,25 ****
--- 20,29 ----
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
  #include "utils/memutils.h"
+ #include "utils/syscache.h"
+ #include "utils/lsyscache.h"
+ #include "parser/parsetree.h"
+ #include "catalog/pg_statistic.h"
  
  
  /* Returns true for JOIN_LEFT and JOIN_ANTI jointypes */
***************
*** 34,39 ****
--- 38,244 ----
  						  TupleTableSlot *tupleSlot);
  static int	ExecHashJoinNewBatch(HashJoinState *hjstate);
  
+ /*
+  * ----------------------------------------------------------------
+  *		ExecHashJoinDetectSkew
+  *
+  *		If MCV statistics can be found for the join attribute of
+  *		this hashjoin then create a hash table of buckets. Each
+  *		bucket will correspond to an MCV hashvalue and will be
+  *		filled with inner relation tuples whose join attribute
+  *		hashes to the same value as that MCV.  If a join
+  *		attribute value is a MCV for the join attribute in the
+  *		outer (probe) relation, tuples with this value in the
+  *		inner (build) relation are more likely to join with outer
+  *		relation tuples and a benefit can be gained by keeping
+  *		them in memory while joining the first batch of tuples.
+  * ----------------------------------------------------------------
+  */
+ static void
+ ExecHashJoinDetectSkew(EState *estate, HashJoinState *hjstate, int tupwidth)
+ {
+ 	HeapTupleData	*statsTuple;
+ 	FuncExprState	*clause;
+ 	ExprState		*argstate;
+ 	Var				*variable;
+ 	HashJoinTable	hashtable;
+ 	Datum			*values;
+ 	int				nvalues;
+ 	float4			*numbers;
+ 	int				nnumbers;
+ 	Oid				relid;
+ 	Oid				atttype;
+ 	int				i;
+ 	int				mcvsToUse;
+ 
+ 	/* Only use statistics if there is a single join attribute. */
+ 	if (hjstate->hashclauses->length != 1)
+ 		return; /* Histojoin is not defined for more than one join key */
+ 	
+ 	hashtable = hjstate->hj_HashTable;
+ 	
+ 	/*
+ 	 * Estimate the number of IM buckets that will fit in
+ 	 * the memory allowed for IM buckets.
+ 	 *
+ 	 * hashtable->imBucket will have up to 8 times as many HashJoinIMBucket
+ 	 * pointers as the number of MCV hashvalues. A uint16 index in
+ 	 * hashtable->imBucketFreezeOrder will be created for each IM bucket. One
+ 	 * actual HashJoinIMBucket struct will be created for each
+ 	 * unique MCV hashvalue so up to one struct per MCV.
+ 	 *
+ 	 * It is also estimated that each IM bucket will have a single build
+ 	 * tuple stored in it after partitioning the build relation input.  This
+ 	 * estimate could be high if tuples are filtered out before this join but
+ 	 * in that case the extra memory is used by the regular hashjoin batch.
+ 	 * This estimate could be low if it is a many to many join but in that
+ 	 * case IM buckets will be frozen to free up memory as needed
+ 	 * during the inner relation partitioning phase.
+ 	 */
+ 	mcvsToUse = hashtable->spaceAllowedIM / (
+ 		/* size of a hash tuple */
+ 		HJTUPLE_OVERHEAD + MAXALIGN(sizeof(MinimalTupleData))
+ 			+ MAXALIGN(tupwidth)
+ 		/* max size of hashtable pointers per MCV */
+ 		+ (8 * sizeof(HashJoinIMBucket*))
+ 		+ sizeof(uint16) /* size of imBucketFreezeOrder entry */
+ 		+ IM_BUCKET_OVERHEAD /* size of IM bucket struct */
+ 		);
+ 	if (mcvsToUse == 0)
+ 		return;	/* No point in considering this any further. */
+ 
+ 	/*
+ 	 * Determine the relation id and attribute id of the single join
+ 	 * attribute of the probe relation.
+ 	 */
+ 	clause = (FuncExprState *) lfirst(list_head(hjstate->hashclauses));
+ 	argstate = (ExprState *) lfirst(list_head(clause->args));
+ 
+ 	/*
+ 	 * Do not try to exploit stats if the join attribute is an expression
+ 	 * instead of just a simple attribute.
+ 	 */		
+ 	if (argstate->expr->type != T_Var)
+ 		return;
+ 
+ 	variable = (Var *) argstate->expr;
+ 	relid = getrelid(variable->varnoold, estate->es_range_table);
+ 	atttype = variable->vartype;
+ 
+ 	statsTuple = SearchSysCache(STATRELATT, ObjectIdGetDatum(relid),
+ 		Int16GetDatum(variable->varoattno), 0, 0);
+ 	if (!HeapTupleIsValid(statsTuple))
+ 		return;
+ 
+ 	/* Look for MCV statistics for the attribute. */
+ 	if (get_attstatsslot(statsTuple, atttype, variable->vartypmod,
+ 		STATISTIC_KIND_MCV, InvalidOid, &values, &nvalues,
+ 		&numbers, &nnumbers))
+ 	{
+ 		FmgrInfo   *hashfunctions;
+ 		int nbuckets = 2;
+ 		double frac = 0;
+ 
+ 		/*
+ 		 * IM buckets (imBucket) is an open addressing hashtable with a
+ 		 * power of 2 size that is greater than the number of MCV values.
+ 		 */
+ 		if (mcvsToUse > nvalues)
+ 			mcvsToUse = nvalues;
+ 		/*
+ 		 * Calculate the expected percent of probe relation to join
+ 		 * with IM buckets
+ 		 */
+ 		for (i = 0; i < mcvsToUse; i++)
+ 			frac += numbers[i];
+ 		/*
+ 		 * Don't enable skew optimization if the benefits are really
+ 		 * small.
+ 		 */
+ 		if (frac < IM_MIN_BENEFIT_PERCENT)
+ 		{
+ 			free_attstatsslot(atttype, values, nvalues, numbers, nnumbers);
+ 			ReleaseSysCache(statsTuple);
+ 			return;
+ 		}
+ 		while (nbuckets <= mcvsToUse)
+ 			nbuckets <<= 1;
+ 		/* use two more bits just to help avoid collisions */
+ 		nbuckets <<= 2;
+ 		hashtable->nIMBuckets = nbuckets;
+ 		hashtable->enableSkewOptimization = true;
+ 
+ 		/*
+ 		 * Allocate the bucket memory in the hashtable's batch context
+ 		 * because it is only relevant and necessary during the first batch
+ 		 * and will be nicely removed once the first batch is done.
+ 		 */
+ 		hashtable->imBucket =
+ 			MemoryContextAllocZero(hashtable->batchCxt,
+ 				nbuckets * sizeof(HashJoinIMBucket*));
+ 		hashtable->imBucketFreezeOrder =
+ 			MemoryContextAllocZero(hashtable->batchCxt,
+ 				mcvsToUse * sizeof(uint16));
+ 		/* Count the overhead of the IM pointers immediately. */
+ 		hashtable->spaceUsed += nbuckets * sizeof(HashJoinIMBucket*)
+ 			+ mcvsToUse * sizeof(uint16);
+ 		hashtable->spaceUsedIM +=  nbuckets * sizeof(HashJoinIMBucket*)
+ 			+ mcvsToUse * sizeof(uint16);
+ 
+ 		/*
+ 		 * Grab the hash functions as we will be generating the hashvalues
+ 		 * in this section.
+ 		 */
+ 		hashfunctions = hashtable->outer_hashfunctions;
+ 
+ 		/* Create the buckets */
+ 		for (i = 0; i < mcvsToUse; i++)
+ 		{
+ 			uint32 hashvalue = DatumGetUInt32(
+ 				FunctionCall1(&hashfunctions[0], values[i]));
+ 			int bucket = hashvalue & (nbuckets - 1);
+ 
+ 			/*
+ 			 * While we have not hit a hole in the hashtable and have not hit
+ 			 * a bucket with the same hashvalue we have collided in the
+ 			 * hashtable so try the next bucket location (remember it is an
+ 			 * open addressing hashtable).
+ 			 */
+ 			while (hashtable->imBucket[bucket] != NULL
+ 				&& hashtable->imBucket[bucket]->hashvalue != hashvalue)
+ 				bucket = (bucket + 1) & (nbuckets - 1);
+ 
+ 			/*
+ 			 * Leave bucket alone if it has the same hashvalue as current
+ 			 * MCV. We only want one bucket per hashvalue. Even if two MCV
+ 			 * values hash to the same bucket we are fine.
+ 			 */
+ 			if (hashtable->imBucket[bucket] == NULL)
+ 			{
+ 				/*
+ 				 * Allocate the actual bucket structure in the hashtable's batch
+ 				 * context because it is only relevant and necessary during
+ 				 * the first batch and will be nicely removed once the first
+ 				 * batch is done.
+ 				 */
+ 				hashtable->imBucket[bucket]
+ 					= MemoryContextAllocZero(hashtable->batchCxt,
+ 						sizeof(HashJoinIMBucket));
+ 				hashtable->imBucket[bucket]->hashvalue = hashvalue;
+ 				hashtable->imBucketFreezeOrder[hashtable->nUsedIMBuckets]
+ 					= bucket;
+ 				hashtable->nUsedIMBuckets++;
+ 				/* Count the overhead of the IM bucket struct */
+ 				hashtable->spaceUsed += IM_BUCKET_OVERHEAD;
+ 				hashtable->spaceUsedIM += IM_BUCKET_OVERHEAD;
+ 			}
+ 		}
+ 
+ 		free_attstatsslot(atttype, values, nvalues, numbers, nnumbers);
+ 	}
+ 
+ 	ReleaseSysCache(statsTuple);
+ }
  
  /* ----------------------------------------------------------------
   *		ExecHashJoin
***************
*** 147,152 ****
--- 352,362 ----
  										node->hj_HashOperators);
  		node->hj_HashTable = hashtable;
  
+ 		/* Use skew optimization only when there is more than one batch. */
+ 		if (hashtable->nbatch > 1)
+ 			ExecHashJoinDetectSkew(estate, node,
+ 				(outerPlan((Hash *) hashNode->ps.plan))->plan_width);
+ 
  		/*
  		 * execute the Hash node, to build the hash table
  		 */
***************
*** 172,177 ****
--- 382,394 ----
  		 * again.)
  		 */
  		node->hj_OuterNotEmpty = false;
+ 
+ 		/*
+ 		 * Initialize OuterTupleIMBucketNo as this is always the value when
+ 		 * skew optimization is turned off and it will be set properly later
+ 		 * if skew optimization is on.
+ 		 */
+ 		node->hj_OuterTupleIMBucketNo = IM_INVALID_BUCKET;
  	}
  
  	/*
***************
*** 205,216 ****
  			ExecHashGetBucketAndBatch(hashtable, hashvalue,
  									  &node->hj_CurBucketNo, &batchno);
  			node->hj_CurTuple = NULL;
  
  			/*
  			 * Now we've got an outer tuple and the corresponding hash bucket,
! 			 * but this tuple may not belong to the current batch.
  			 */
! 			if (batchno != hashtable->curbatch)
  			{
  				/*
  				 * Need to postpone this outer tuple to a later batch. Save it
--- 422,444 ----
  			ExecHashGetBucketAndBatch(hashtable, hashvalue,
  									  &node->hj_CurBucketNo, &batchno);
  			node->hj_CurTuple = NULL;
+ 			
+ 			/*
+ 			 * Does the outer tuple match an IM bucket? Make sure we
+ 			 * don't incur the cost of a function call if skew
+ 			 * optimization is turned off.
+ 			 */
+ 			if (hashtable->enableSkewOptimization)
+ 				node->hj_OuterTupleIMBucketNo = 
+ 					ExecHashGetIMBucket(hashtable, hashvalue);
  
  			/*
  			 * Now we've got an outer tuple and the corresponding hash bucket,
! 			 * but it might not belong to the current batch, or it might need
! 			 * to go into an in-memory bucket.
  			 */
! 			if (batchno != hashtable->curbatch
! 				&& node->hj_OuterTupleIMBucketNo == IM_INVALID_BUCKET)
  			{
  				/*
  				 * Need to postpone this outer tuple to a later batch. Save it
***************
*** 641,647 ****
  	nbatch = hashtable->nbatch;
  	curbatch = hashtable->curbatch;
  
! 	if (curbatch > 0)
  	{
  		/*
  		 * We no longer need the previous outer batch file; close it right
--- 869,895 ----
  	nbatch = hashtable->nbatch;
  	curbatch = hashtable->curbatch;
  
! 	/* if we just finished the first batch */
! 	if (curbatch == 0)
! 	{
! 		/*
! 		 * Reset some of the skew optimization state variables. IM buckets are
! 		 * no longer being used as of this point because they are only
! 		 * necessary while joining the first batch (before the cleanup phase).
! 		 *
! 		 * Especially need to make sure ExecHashGetIMBucket returns
! 		 * IM_INVALID_BUCKET quickly for all subsequent calls.
! 		 *
! 		 * IM buckets are only taking up memory if this is a multi-batch join
! 		 * and in that case ExecHashTableReset is about to be called which
! 		 * will free all memory currently used by IM buckets and tuples when
! 		 * it deletes hashtable->batchCxt.  If this is a single batch join
! 		 * then imBucket and imBucketFreezeOrder are already NULL and empty.
! 		 */
! 		hashtable->enableSkewOptimization = false;
! 		hashtable->spaceUsedIM = 0;
! 	}
! 	else if (curbatch > 0)
  	{
  		/*
  		 * We no longer need the previous outer batch file; close it right
Index: src/include/executor/hashjoin.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/hashjoin.h,v
retrieving revision 1.49
diff -c -r1.49 hashjoin.h
*** src/include/executor/hashjoin.h	1 Jan 2009 17:23:59 -0000	1.49
--- src/include/executor/hashjoin.h	2 Mar 2009 22:27:19 -0000
***************
*** 72,77 ****
--- 72,97 ----
  #define HJTUPLE_MINTUPLE(hjtup)  \
  	((MinimalTuple) ((char *) (hjtup) + HJTUPLE_OVERHEAD))
  
+ /*
+  * Stores a hashvalue and linked list of tuples that share that hashvalue.
+  *
+  * When processing MCVs to detect skew in the probe relation of a hash join
+  * the hashvalue is generated and stored before any tuples have been read
+  * (see ExecHashJoinDetectSkew).
+  *
+  * Build tuples that hash to the same hashvalue are placed in the bucket while
+  * reading the build relation.
+  */
+ typedef struct HashJoinIMBucket
+ {
+ 	uint32 hashvalue;
+ 	HashJoinTuple tuples;
+ } HashJoinIMBucket;
+ 
+ #define IM_INVALID_BUCKET -1
+ #define IM_WORK_MEM_PERCENT 2
+ #define IM_MIN_BENEFIT_PERCENT .01
+ #define IM_BUCKET_OVERHEAD MAXALIGN(sizeof(HashJoinIMBucket))
  
  typedef struct HashJoinTableData
  {
***************
*** 113,121 ****
--- 133,162 ----
  
  	Size		spaceUsed;		/* memory space currently used by tuples */
  	Size		spaceAllowed;	/* upper limit for space used */
+ 	/* memory space currently used by IM buckets and tuples */
+ 	Size		spaceUsedIM;
+ 	/* upper limit for space used by IM buckets and tuples */
+ 	Size		spaceAllowedIM;
  
  	MemoryContext hashCxt;		/* context for whole-hash-join storage */
  	MemoryContext batchCxt;		/* context for this-batch-only storage */
+ 	
+ 	/* will the join optimize memory usage when probe relation is skewed */
+ 	bool enableSkewOptimization;
+ 	HashJoinIMBucket **imBucket; /* hashtable of IM buckets */
+ 	/*
+ 	 * array of imBucket indexes to the created IM buckets sorted
+ 	 * in the opposite order that they would be frozen to disk
+ 	 */
+ 	uint16 *imBucketFreezeOrder;
+ 	int nIMBuckets; /* # of buckets in the IM buckets hashtable */
+ 	/*
+ 	 * # of used buckets in the IM buckets hashtable and length of
+ 	 * imBucketFreezeOrder array
+ 	 */
+ 	int nUsedIMBuckets;
+ 	/* # of IM buckets that have already been frozen to disk */
+ 	int nIMBucketsFrozen;
  } HashJoinTableData;
  
  #endif   /* HASHJOIN_H */
Index: src/include/executor/nodeHash.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/nodeHash.h,v
retrieving revision 1.46
diff -c -r1.46 nodeHash.h
*** src/include/executor/nodeHash.h	1 Jan 2009 17:23:59 -0000	1.46
--- src/include/executor/nodeHash.h	1 Mar 2009 03:17:51 -0000
***************
*** 45,48 ****
--- 45,50 ----
  						int *numbuckets,
  						int *numbatches);
  
+ extern int ExecHashGetIMBucket(HashJoinTable hashtable, uint32 hashvalue);
+ 
  #endif   /* NODEHASH_H */
Index: src/include/nodes/execnodes.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/nodes/execnodes.h,v
retrieving revision 1.201
diff -c -r1.201 execnodes.h
*** src/include/nodes/execnodes.h	12 Jan 2009 05:10:45 -0000	1.201
--- src/include/nodes/execnodes.h	1 Mar 2009 03:17:51 -0000
***************
*** 1389,1394 ****
--- 1389,1395 ----
   *		hj_NeedNewOuter			true if need new outer tuple on next call
   *		hj_MatchedOuter			true if found a join match for current outer
   *		hj_OuterNotEmpty		true if outer relation known not empty
+  *		hj_OuterTupleIMBucketNo	IM bucket# for the current outer tuple
   * ----------------
   */
  
***************
*** 1414,1419 ****
--- 1415,1421 ----
  	bool		hj_NeedNewOuter;
  	bool		hj_MatchedOuter;
  	bool		hj_OuterNotEmpty;
+ 	int			hj_OuterTupleIMBucketNo;
  } HashJoinState;
  
  
Index: src/include/nodes/primnodes.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/nodes/primnodes.h,v
retrieving revision 1.146
diff -c -r1.146 primnodes.h
*** src/include/nodes/primnodes.h	25 Feb 2009 03:30:37 -0000	1.146
--- src/include/nodes/primnodes.h	1 Mar 2009 03:17:51 -0000
***************
*** 121,128 ****
   * subplans; for example, in a join node varno becomes INNER or OUTER and
   * varattno becomes the index of the proper element of that subplan's target
   * list.  But varnoold/varoattno continue to hold the original values.
!  * The code doesn't really need varnoold/varoattno, but they are very useful
!  * for debugging and interpreting completed plans, so we keep them around.
   */
  #define    INNER		65000
  #define    OUTER		65001
--- 121,132 ----
   * subplans; for example, in a join node varno becomes INNER or OUTER and
   * varattno becomes the index of the proper element of that subplan's target
   * list.  But varnoold/varoattno continue to hold the original values.
!  *
!  * For the most part, the code doesn't really need varnoold/varoattno, but
!  * they are very useful for debugging and interpreting completed plans, so we
!  * keep them around.  As of PostgreSQL 8.4, these values are also used by
!  * ExecHashJoinDetectSkew to fetch MCV statistics when performing multi-batch
!  * hash joins.
   */
  #define    INNER		65000
  #define    OUTER		65001
***************
*** 142,148 ****
  	Index		varlevelsup;	/* for subquery variables referencing outer
  								 * relations; 0 in a normal var, >0 means N
  								 * levels up */
! 	Index		varnoold;		/* original value of varno, for debugging */
  	AttrNumber	varoattno;		/* original value of varattno */
  	int			location;		/* token location, or -1 if unknown */
  } Var;
--- 146,152 ----
  	Index		varlevelsup;	/* for subquery variables referencing outer
  								 * relations; 0 in a normal var, >0 means N
  								 * levels up */
! 	Index		varnoold;		/* original value of varno */
  	AttrNumber	varoattno;		/* original value of varattno */
  	int			location;		/* token location, or -1 if unknown */
  } Var;
#63Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bryce Cutt (#62)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

Bryce Cutt <pandasuit@gmail.com> writes:

Here is the new patch.
Our experiments show no noticeable performance issue when using the
patch for cases where the optimization is not used because the number
of extra statements executed when the optimization is disabled is
insignificant.

We have updated the patch to remove a couple of if statements, but
this is really minor. The biggest change was to MultiExecHash that
avoids an if check per tuple by duplicating the hashing loop.

I think you missed the point of the performance questions. It wasn't
about avoiding extra simple if-tests in the per-tuple loops; a few of
those are certainly not going to add measurable cost given how complex
the code is already. (I really don't think you should be duplicating
hunks of code to avoid adding such tests.) Rather, the concern was that
if we are dedicating a fraction of available work_mem to this purpose,
that reduces the overall efficiency of the regular non-IM code path,
principally by forcing the creation of more batches than would otherwise
be needed. It's not clear whether the savings for IM tuples always
exceeds this additional cost.

After looking over the code a bit, there are two points that
particularly concern me in this connection:

* The IM hashtable is only needed during the first-batch processing;
once we've completed the first pass over the outer relation there is
no longer any need for it, unless I'm misunderstanding things
completely. Therefore it really only competes for space with the
regular first batch. However the damage to nbatches will already have
been done; in effect, we can expect that each subsequent batch will
probably only use (100 - IM_WORK_MEM_PERCENT)% of work_mem. The patch
seems to try to deal with this by keeping IM_WORK_MEM_PERCENT negligibly
small, but surely that's mostly equivalent to fighting with one hand
tied behind your back. I wonder if it'd be better to dedicate all of
work_mem to the MCV hash values during the first pass, rather than
allowing them to compete with the first regular batch.

* The IM hashtable creates an additional reason why nbatch might
increase during the initial scan of the inner relation; in fact, since
it's an effect not modeled in the initial choice of nbatch, it's
probably going to be a major reason for that to happen. Increasing
nbatch on the fly isn't good because it results in extra I/O for tuples
that were previously assigned to what is now the wrong batch. Again,
the only answer the patch has for this is to try not to use enough
of work_mem for it to make a difference. Seems like instead the initial
nbatch estimate needs to account for that.

regards, tom lane

#64Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#63)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Fri, Mar 6, 2009 at 1:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Bryce Cutt <pandasuit@gmail.com> writes:

Here is the new patch.
Our experiments show no noticeable performance issue when using the
patch for cases where the optimization is not used because the number
of extra statements executed when the optimization is disabled is
insignificant.

We have updated the patch to remove a couple of if statements, but
this is really minor.  The biggest change was to MultiExecHash that
avoids an if check per tuple by duplicating the hashing loop.

I think you missed the point of the performance questions.  It wasn't
about avoiding extra simple if-tests in the per-tuple loops; a few of
those are certainly not going to add measurable cost given how complex
the code is already.  (I really don't think you should be duplicating
hunks of code to avoid adding such tests.)  Rather, the concern was that

Well, at one point we were still trying to verify that (1) the patch
actually had a benefit and (2) blowing out the IM hashtable wasn't too
horribly nasty. A great deal of improvement has been made in those
areas since this was first reviewed. But your questions are
completely valid, too. (I don't think anyone ever expressed a concern
about the simple if-tests, either.)

if we are dedicating a fraction of available work_mem to this purpose,
that reduces the overall efficiency of the regular non-IM code path,
principally by forcing the creation of more batches than would otherwise
be needed.  It's not clear whether the savings for IM tuples always
exceeds this additional cost.

After looking over the code a bit, there are two points that
particularly concern me in this connection:

* The IM hashtable is only needed during the first-batch processing;
once we've completed the first pass over the outer relation there is
no longer any need for it, unless I'm misunderstanding things
completely.  Therefore it really only competes for space with the
regular first batch.  However the damage to nbatches will already have
been done; in effect, we can expect that each subsequent batch will
probably only use (100 - IM_WORK_MEM_PERCENT)% of work_mem.  The patch
seems to try to deal with this by keeping IM_WORK_MEM_PERCENT negligibly
small, but surely that's mostly equivalent to fighting with one hand
tied behind your back.   I wonder if it'd be better to dedicate all of
work_mem to the MCV hash values during the first pass, rather than
allowing them to compete with the first regular batch.

The IM hash table doesn't need to be very large in order to produce a
substantial benefit, because there are only going to be ~100 MCVs in
the probe table and each of those may well be unique in the build
table. But no matter what size you choose for it, there's some danger
that it will push us over the edge into more batches, and if the skew
doesn't turn out to be enough to make up for that, you lose. I'm not
sure there's any way to completely eliminate that unpleasant
possibility.

* The IM hashtable creates an additional reason why nbatch might
increase during the initial scan of the inner relation; in fact, since
it's an effect not modeled in the initial choice of nbatch, it's
probably going to be a major reason for that to happen.  Increasing
nbatch on the fly isn't good because it results in extra I/O for tuples
that were previously assigned to what is now the wrong batch.  Again,
the only answer the patch has for this is to try not to use enough
of work_mem for it to make a difference.  Seems like instead the initial
nbatch estimate needs to account for that.

...Robert

#65Lawrence, Ramon
ramon.lawrence@ubc.ca
In reply to: Robert Haas (#64)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

I think you missed the point of the performance questions.  It wasn't
about avoiding extra simple if-tests in the per-tuple loops; a few of
those are certainly not going to add measurable cost given how complex
the code is already.  (I really don't think you should be duplicating
hunks of code to avoid adding such tests.)  Rather, the concern was that
if we are dedicating a fraction of available work_mem to this purpose,
that reduces the overall efficiency of the regular non-IM code path,
principally by forcing the creation of more batches than would otherwise
be needed.  It's not clear whether the savings for IM tuples always
exceeds this additional cost.

I misunderstood the concern. So, there is no issue with the patch when it is disabled (single batch case or multi-batch with no skew)? There is no memory allocated when the optimization is off, so these cases will not affect the number of batches or re-partitioning.

* The IM hashtable is only needed during the first-batch processing;
once we've completed the first pass over the outer relation there is
no longer any need for it, unless I'm misunderstanding things
completely.  Therefore it really only competes for space with the
regular first batch.  However the damage to nbatches will already have
been done; in effect, we can expect that each subsequent batch will
probably only use (100 - IM_WORK_MEM_PERCENT)% of work_mem.  The patch
seems to try to deal with this by keeping IM_WORK_MEM_PERCENT negligibly
small, but surely that's mostly equivalent to fighting with one hand
tied behind your back.   I wonder if it'd be better to dedicate all of
work_mem to the MCV hash values during the first pass, rather than
allowing them to compete with the first regular batch.

The IM hash table doesn't need to be very large in order to produce a
substantial benefit, because there are only going to be ~100 MCVs in
the probe table and each of those may well be unique in the build
table. But no matter what size you choose for it, there's some danger
that it will push us over the edge into more batches, and if the skew
doesn't turn out to be enough to make up for that, you lose. I'm not
sure there's any way to completely eliminate that unpleasant
possibility.

Correct - the IM table only competes with the first batch during processing and is removed after the first pass. It also tends to be VERY small: the default of 100 MCVs usually results in only 100 tuples in the IM table, which is normally much less than 2% of work_mem. We get almost all of the benefit with 100-10000 MCVs and little downside risk. Making the IM table larger (the size of work_mem) is both impossible (there are not that many MCVs) and carries a bigger downside risk if we get it wrong.
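
To put rough numbers on that claim, here is a small back-of-envelope sketch (not code from the patch); the 1 MB work_mem and ~150-byte per-tuple footprint are assumed figures for illustration only:

#include <stdio.h>

int main(void)
{
	long	work_mem_bytes = 1024L * 1024L;	/* assumed work_mem = 1 MB */
	int		im_percent = 2;					/* IM_WORK_MEM_PERCENT */
	long	im_allowed = work_mem_bytes * im_percent / 100;
	int		n_mcvs = 100;					/* default statistics target */
	long	tuple_bytes = 150;				/* assumed per-tuple footprint */
	long	im_used = (long) n_mcvs * tuple_bytes;

	printf("IM budget: %ld bytes, estimated IM usage: %ld bytes (%.2f%% of work_mem)\n",
		   im_allowed, im_used, 100.0 * im_used / work_mem_bytes);
	return 0;
}

With those assumptions the IM table holds roughly 15 kB against a ~20 kB budget, i.e. well under 2% of work_mem.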

* The IM hashtable creates an additional reason why nbatch might
increase during the initial scan of the inner relation; in fact, since
it's an effect not modeled in the initial choice of nbatch, it's
probably going to be a major reason for that to happen.  Increasing
nbatch on the fly isn't good because it results in extra I/O for tuples
that were previously assigned to what is now the wrong batch.  Again,
the only answer the patch has for this is to try not to use enough
of work_mem for it to make a difference.  Seems like instead the initial
nbatch estimate needs to account for that.

The possibility of the 1-2% IM_WORK_MEM_PERCENT causing a re-batch exists but is very small. The number of batches is calculated in ExecChooseHashTableSize (costsize.c) as ceil(inner_rel_bytes/work_mem) rounded up to the next power of 2. Thus, hash join already "wastes" some of its work_mem allocation due to rounding. For instance, if nbatch is calculated as 3 then rounded up to 4, only 75% of work_mem is used for each batch. This leaves 25% of work_mem "unaccounted for" which may be used by the IM table (and also to compensate for build skew). Clearly, if nbatch is exactly 4, then this unaccounted space is not present and if the optimizer is exact in its estimates, the extra 1-2% may force a re-partition.

A solution may be to re-calculate nbatch factoring in the extra 1-2% during ExecHashTableCreate (nodeHashjoin.c) which calls ExecChooseHashTableSize again before execution. The decision is whether to modify ExecChooseHashTableSize itself (which is used during costing) or to make a modified ExecChooseHashTableSize function that is only used once in ExecHashTableCreate.
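
A simplified sketch of that arithmetic (this is not the actual ExecChooseHashTableSize code; the sizes below are assumed figures) shows both the rounding slack and one way the IM reserve could be folded into the estimate:

#include <math.h>
#include <stdio.h>

/*
 * Divide the build size by the memory available per batch and round up
 * to the next power of 2.  im_percent illustrates reserving part of
 * work_mem for the IM table before dividing.
 */
static int
estimate_nbatch(double inner_rel_bytes, double work_mem_bytes, int im_percent)
{
	double	usable = work_mem_bytes * (100 - im_percent) / 100.0;
	int		nbatch = (int) ceil(inner_rel_bytes / usable);
	int		pow2 = 1;

	while (pow2 < nbatch)
		pow2 <<= 1;
	return pow2;
}

int main(void)
{
	double	work_mem = 1024.0 * 1024.0;

	/* 3 x work_mem of build data: nbatch of 3 rounds up to 4 either way */
	printf("nbatch ignoring IM reserve:  %d\n",
		   estimate_nbatch(3 * work_mem, work_mem, 0));
	printf("nbatch reserving 2%% for IM: %d\n",
		   estimate_nbatch(3 * work_mem, work_mem, 2));
	return 0;
}

In this example the 2% reserve does not change the rounded batch count, which matches the observation that the power-of-2 rounding usually leaves enough slack to absorb it.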

We have tried to change the original code as little as possible, but it is possible to modify ExecChooseHashTableSize and the hash join cost function to be skew optimization aware.

--
Ramon Lawrence

#66Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bryce Cutt (#62)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

Bryce Cutt <pandasuit@gmail.com> writes:

Here is the new patch.

Applied with revisions. I undid some of the "optimizations" that
cluttered the code in order to save a cycle or two per tuple --- as per
previous discussion, that's not what the performance questions were
about. Also, I did not like the terminology "in-memory"/"IM"; it seemed
confusing since the main hash table is in-memory too. I revised the
code to consistently refer to the additional hash table as a "skew"
hashtable and the optimization in general as skew optimization. Hope
that seems reasonable to you --- we could search-and-replace it to
something else if you'd prefer.

For the moment, I didn't really do anything about teaching the planner
to account for this optimization in its cost estimates. The initial
estimate of the number of MCVs that will be specially treated seems to
me to be too high (it's only accurate if the inner relation is unique),
but getting a more accurate estimate seems pretty hard, and it's not
clear it's worth the trouble. Without that, though, you can't tell
what fraction of outer tuples will get the short-circuit treatment.

regards, tom lane

#67Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#66)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Fri, Mar 20, 2009 at 8:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Bryce Cutt <pandasuit@gmail.com> writes:

Here is the new patch.

Applied with revisions.  I undid some of the "optimizations" that
cluttered the code in order to save a cycle or two per tuple --- as per
previous discussion, that's not what the performance questions were
about.  Also, I did not like the terminology "in-memory"/"IM"; it seemed
confusing since the main hash table is in-memory too.  I revised the
code to consistently refer to the additional hash table as a "skew"
hashtable and the optimization in general as skew optimization.  Hope
that seems reasonable to you --- we could search-and-replace it to
something else if you'd prefer.

For the moment, I didn't really do anything about teaching the planner
to account for this optimization in its cost estimates.  The initial
estimate of the number of MCVs that will be specially treated seems to
me to be too high (it's only accurate if the inner relation is unique),
but getting a more accurate estimate seems pretty hard, and it's not
clear it's worth the trouble.  Without that, though, you can't tell
what fraction of outer tuples will get the short-circuit treatment.

If the inner relation isn't fairly close to unique you shouldn't be
using this optimization in the first place.

...Robert

#68Bryce Cutt
pandasuit@gmail.com
In reply to: Robert Haas (#67)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

Not necessarily true. Seeing as (when the statistics are correct) we
know each of these inner tuples will match with the largest amount of
outer tuples it is just as much of a win per inner tuple as when they
are unique. There is just a chance you will have to give up on the
optimization part way through if too many inner tuples fall into the
new "skew buckets" (formerly IM buckets) and dump the tuples back into
the main buckets. The potential win is still pretty high though.

- Bryce Cutt

On Fri, Mar 20, 2009 at 5:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 20, 2009 at 8:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Bryce Cutt <pandasuit@gmail.com> writes:

Here is the new patch.

Applied with revisions.  I undid some of the "optimizations" that
cluttered the code in order to save a cycle or two per tuple --- as per
previous discussion, that's not what the performance questions were
about.  Also, I did not like the terminology "in-memory"/"IM"; it seemed
confusing since the main hash table is in-memory too.  I revised the
code to consistently refer to the additional hash table as a "skew"
hashtable and the optimization in general as skew optimization.  Hope
that seems reasonable to you --- we could search-and-replace it to
something else if you'd prefer.

For the moment, I didn't really do anything about teaching the planner
to account for this optimization in its cost estimates.  The initial
estimate of the number of MCVs that will be specially treated seems to
me to be too high (it's only accurate if the inner relation is unique),
but getting a more accurate estimate seems pretty hard, and it's not
clear it's worth the trouble.  Without that, though, you can't tell
what fraction of outer tuples will get the short-circuit treatment.

If the inner relation isn't fairly close to unique you shouldn't be
using this optimization in the first place.

...Robert

#69Robert Haas
robertmhaas@gmail.com
In reply to: Bryce Cutt (#68)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Fri, Mar 20, 2009 at 8:45 PM, Bryce Cutt <pandasuit@gmail.com> wrote:

On Fri, Mar 20, 2009 at 5:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:

If the inner relation isn't fairly close to unique you shouldn't be
using this optimization in the first place.

Not necessarily true.  Seeing as (when the statistics are correct) we
know each of these inner tuples will match with the largest amount of
outer tuples it is just as much of a win per inner tuple as when they
are unique.  There is just a chance you will have to give up on the
optimization part way through if too many inner tuples fall into the
new "skew buckets" (formerly IM buckets) and dump the tuples back into
the main buckets.  The potential win is still pretty high though.

- Bryce Cutt

Maybe I'm remembering wrong, but I thought the estimating functions
assumed that the inner relation was unique. So if there turn out to
be 2, 3, 4, or more copies of each value, the chances of blowing out
the skew hash table are almost 100%, I would think... am I wrong?

...Robert

#71Bryce Cutt
pandasuit@gmail.com
In reply to: Robert Haas (#69)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

The estimation functions assume the inner relation join column is
unique. But the patch freezes (flushes back to the main hash table) one skew
bucket at a time, in order of least importance, so if 100 inner tuples
can fit in the skew buckets then the skew buckets are only fully blown
out if the best tuple (the single most common value) occurs more than
100 times in the inner relation. And up until that point you still
have in memory the tuples that give the best return per tuple's worth of
memory. But yes, after that point (more than 100 tuples of that best
MCV) the entire effort was wasted. The skew buckets are dynamically
flushed just like buckets in a dynamic hash join would be.
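
A conceptual sketch of that freeze-in-order behaviour (the struct, field names, and bookkeeping here are hypothetical and much simpler than the patch, which moves the frozen bucket's build tuples back into the main hash table rather than just adjusting counters):

#include <stdio.h>

/* Hypothetical, simplified stand-in for a skew (IM) bucket. */
typedef struct SkewBucketSketch
{
	long	bytes_used;		/* memory held by this bucket's build tuples */
	int		frozen;			/* already flushed back to the main table? */
} SkewBucketSketch;

/*
 * Freeze skew buckets one at a time, least important first, until
 * memory use drops back under the skew budget.  buckets[] is assumed
 * to be ordered from least to most common MCV.
 */
static void
freeze_until_under_budget(SkewBucketSketch *buckets, int nbuckets,
						  long *space_used, long space_allowed)
{
	int		i;

	for (i = 0; i < nbuckets && *space_used > space_allowed; i++)
	{
		if (buckets[i].frozen)
			continue;
		/* the real patch moves this bucket's tuples back into the main table */
		*space_used -= buckets[i].bytes_used;
		buckets[i].frozen = 1;
	}
}

int main(void)
{
	SkewBucketSketch buckets[3] = {{4000, 0}, {6000, 0}, {9000, 0}};
	long	space_used = 19000;
	long	space_allowed = 12000;

	freeze_until_under_budget(buckets, 3, &space_used, space_allowed);
	printf("space used after freezing: %ld (budget %ld)\n",
		   space_used, space_allowed);
	return 0;
}

Only the least-common buckets get frozen, so the most common values stay in memory as long as possible, which is what preserves most of the benefit even when the skew table is under pressure.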

- Bryce Cutt

On Fri, Mar 20, 2009 at 5:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 20, 2009 at 8:45 PM, Bryce Cutt <pandasuit@gmail.com> wrote:

On Fri, Mar 20, 2009 at 5:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:

If the inner relation isn't fairly close to unique you shouldn't be
using this optimization in the first place.

Not necessarily true.  Seeing as (when the statistics are correct) we
know each of these inner tuples will match with the largest amount of
outer tuples it is just as much of a win per inner tuple as when they
are unique.  There is just a chance you will have to give up on the
optimization part way through if too many inner tuples fall into the
new "skew buckets" (formerly IM buckets) and dump the tuples back into
the main buckets.  The potential win is still pretty high though.

- Bryce Cutt

Maybe I'm remembering wrong, but I thought the estimating functions
assumed that the inner relation was unique.  So if there turn out to
be 2, 3, 4, or more copies of each value, the chances of blowing out
the skew hash table are almost 100%, I would think...  am I wrong?

...Robert