Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
We propose a patch that improves hybrid hash join's performance for
large multi-batch joins where the probe relation has skew.
Project name: Histojoin
Patch file: histojoin_v1.patch
This patch implements the Histojoin join algorithm as an optional
feature added to the standard Hybrid Hash Join (HHJ). A flag is used to
enable or disable the Histojoin features. When Histojoin is disabled,
HHJ acts as normal. The Histojoin features allow HHJ to use
PostgreSQL's statistics to perform skew-aware partitioning. The basic idea
is to keep in a small in-memory hash table the build relation tuples whose
join values occur most frequently in the probe relation. For skewed data
sets, this improves the performance of HHJ by 10% to 50% when multiple
batches are used. The performance improvements of this patch
can be seen in the paper (pages 25-30) at:
http://people.ok.ubc.ca/rlawrenc/histojoin2.pdf
All generators and materials needed to verify these results can be
provided.
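To make the target scenario concrete, here is a sketch (it assumes
TPC-H-style part and lineitem tables with a skewed l_partkey distribution;
the query is only an illustration and is not taken from the paper):

-- A multi-batch hash join with part as the build relation and lineitem as
-- the skewed probe relation; Histojoin keeps the build tuples matching the
-- most common l_partkey values in memory instead of writing them to batches.
SELECT p.p_name, sum(l.l_extendedprice)
FROM part p
JOIN lineitem l ON l.l_partkey = p.p_partkey
GROUP BY p.p_name;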
This is a patch against the HEAD of the repository.
This patch does not contain platform specific code. It compiles and has
been tested on our machines in both Windows (MSVC++) and Linux (GCC).
Currently the Histojoin feature is enabled by default and is used
whenever HHJ is used and there are Most Common Value (MCV) statistics
available on the probe-side base relation of the join. To disable this
feature, set the enable_hashjoin_usestatmcvs flag to off in the
database configuration file or at run time with the SET command.
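For example (a sketch only; enable_hashjoin_usestatmcvs is the GUC added by
this patch, while lineitem and l_partkey are assumed example names), the
feature can be toggled per session, and the MCV statistics it depends on can
be inspected through pg_stats:

SET enable_hashjoin_usestatmcvs = off;  -- disable the Histojoin behaviour
SET enable_hashjoin_usestatmcvs = on;   -- re-enable it
-- MCVs are collected by ANALYZE; this shows whether the probe-side join
-- column has any for the patch to use:
ANALYZE lineitem;
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'lineitem' AND attname = 'l_partkey';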
One potential improvement not included in the patch: Most Common
Value (MCV) statistics are currently only retrieved when the probe relation
is produced by a scan operator on a base relation. There is a benefit to
using MCVs even when the probe relation is not a base scan, but we were
unable to determine how to find the statistics of a base relation after
other operators have been applied to it.
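As a hypothetical illustration of the limitation (table names are assumed
from TPC-H), in a query such as the following the probe side of the upper
hash join is itself a join result rather than a base-relation scan, so the
patch currently cannot locate MCV statistics for it:

-- If the planner probes the join to part with the result of
-- (lineitem JOIN orders), that probe input is not a sequential scan of a
-- base relation, so the MCV lookup is skipped.
SELECT p.p_name, o.o_orderdate
FROM lineitem l
JOIN orders o ON o.o_orderkey = l.l_orderkey
JOIN part p ON p.p_partkey = l.l_partkey;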
This patch was created by Bryce Cutt as part of his work on his M.Sc.
thesis.
--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon.lawrence@ubc.ca
Attachments:
histojoin_v1.patch (application/octet-stream)
Index: src/backend/executor/nodeHash.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHash.c,v
retrieving revision 1.116
diff -c -r1.116 nodeHash.c
*** src/backend/executor/nodeHash.c 1 Jan 2008 19:45:49 -0000 1.116
--- src/backend/executor/nodeHash.c 17 Oct 2008 23:47:20 -0000
***************
*** 54,59 ****
--- 54,86 ----
}
/* ----------------------------------------------------------------
+ * isAMostCommonValue
+ *
+ * is the value one of the most common key values?
+ * ----------------------------------------------------------------
+ */
+ bool isAMostCommonValue(HashJoinTable hashtable, uint32 hashvalue, int *partitionNumber)
+ {
+ int bucket = hashvalue & (hashtable->nMostCommonTuplePartitionHashBuckets - 1);
+
+ while (hashtable->mostCommonTuplePartition[bucket].hashvalue != 0
+ && hashtable->mostCommonTuplePartition[bucket].hashvalue != hashvalue)
+ {
+ bucket = (bucket + 1) & (hashtable->nMostCommonTuplePartitionHashBuckets - 1);
+ }
+
+ if (hashtable->mostCommonTuplePartition[bucket].hashvalue == hashvalue)
+ {
+ *partitionNumber = bucket;
+ return true;
+ }
+
+ //must have run into an empty slot which means this is not an MCV
+ *partitionNumber = MCV_INVALID_PARTITION;
+ return false;
+ }
+
+ /* ----------------------------------------------------------------
* MultiExecHash
*
* build hash table for hashjoin, doing partitioning if more
***************
*** 69,74 ****
--- 96,103 ----
TupleTableSlot *slot;
ExprContext *econtext;
uint32 hashvalue;
+ MinimalTuple mintuple;
+ int partitionNumber;
/* must provide our own instrumentation support */
if (node->ps.instrument)
***************
*** 99,106 ****
if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
&hashvalue))
{
! ExecHashTableInsert(hashtable, slot, hashvalue);
! hashtable->totalTuples += 1;
}
}
--- 128,163 ----
if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
&hashvalue))
{
! partitionNumber = MCV_INVALID_PARTITION;
!
! if (hashtable->usingMostCommonValues && isAMostCommonValue(hashtable, hashvalue, &partitionNumber))
! {
! HashJoinTuple hashTuple;
! int hashTupleSize;
!
! mintuple = ExecFetchSlotMinimalTuple(slot);
! hashTupleSize = HJTUPLE_OVERHEAD + mintuple->t_len;
! hashTuple = (HashJoinTuple) palloc(hashTupleSize);
! hashTuple->hashvalue = hashvalue;
! memcpy(HJTUPLE_MINTUPLE(hashTuple), mintuple, mintuple->t_len);
!
! hashTuple->next = hashtable->mostCommonTuplePartition[partitionNumber].tuples;
! hashtable->mostCommonTuplePartition[partitionNumber].tuples = hashTuple;
!
! hashtable->spaceUsed += hashTupleSize;
!
! if (hashtable->spaceUsed > hashtable->spaceAllowed) {
! ExecHashIncreaseNumBatches(hashtable);
! }
!
! hashtable->mostCommonTuplesStored++;
! }
!
! if (partitionNumber == MCV_INVALID_PARTITION)
! {
! ExecHashTableInsert(hashtable, slot, hashvalue);
! hashtable->totalTuples += 1;
! }
}
}
***************
*** 798,803 ****
--- 855,921 ----
}
/*
+ * ExecScanHashMostCommonTuples
+ * scan a hash bucket for matches to the current outer tuple
+ *
+ * The current outer tuple must be stored in econtext->ecxt_outertuple.
+ */
+ HashJoinTuple
+ ExecScanHashMostCommonTuples(HashJoinState *hjstate,
+ ExprContext *econtext)
+ {
+ List *hjclauses = hjstate->hashclauses;
+ HashJoinTable hashtable = hjstate->hj_HashTable;
+ HashJoinTuple hashTuple = hjstate->hj_CurTuple;
+ uint32 hashvalue = hjstate->hj_CurHashValue;
+
+ /*
+ * hj_CurTuple is NULL to start scanning a new bucket, or the address of
+ * the last tuple returned from the current bucket.
+ */
+ if (hashTuple == NULL)
+ {
+ //painstakingly make sure this is a valid partition index
+ Assert(hjstate->hj_OuterTupleMostCommonValuePartition > MCV_INVALID_PARTITION);
+ Assert(hjstate->hj_OuterTupleMostCommonValuePartition < hashtable->nMostCommonTuplePartitions);
+
+ hashTuple = hashtable->mostCommonTuplePartition[hjstate->hj_OuterTupleMostCommonValuePartition].tuples;
+ }
+ else
+ hashTuple = hashTuple->next;
+
+ while (hashTuple != NULL)
+ {
+ if (hashTuple->hashvalue == hashvalue)
+ {
+ TupleTableSlot *inntuple;
+
+ /* insert hashtable's tuple into exec slot so ExecQual sees it */
+ inntuple = ExecStoreMinimalTuple(HJTUPLE_MINTUPLE(hashTuple),
+ hjstate->hj_HashTupleSlot,
+ false); /* do not pfree */
+ econtext->ecxt_innertuple = inntuple;
+
+ /* reset temp memory each time to avoid leaks from qual expr */
+ ResetExprContext(econtext);
+
+ if (ExecQual(hjclauses, econtext, false))
+ {
+ hjstate->hj_CurTuple = hashTuple;
+ return hashTuple;
+ }
+ }
+
+ hashTuple = hashTuple->next;
+ }
+
+ /*
+ * no match
+ */
+ return NULL;
+ }
+
+ /*
* ExecScanHashBucket
* scan a hash bucket for matches to the current outer tuple
*
Index: src/backend/executor/nodeHashjoin.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHashjoin.c,v
retrieving revision 1.95
diff -c -r1.95 nodeHashjoin.c
*** src/backend/executor/nodeHashjoin.c 15 Aug 2008 19:20:42 -0000 1.95
--- src/backend/executor/nodeHashjoin.c 18 Oct 2008 01:47:57 -0000
***************
*** 20,25 ****
--- 20,30 ----
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
#include "utils/memutils.h"
+ #include "optimizer/cost.h"
+ #include "utils/syscache.h"
+ #include "utils/lsyscache.h"
+ #include "parser/parsetree.h"
+ #include "catalog/pg_statistic.h"
/* Returns true for JOIN_LEFT and JOIN_ANTI jointypes */
***************
*** 34,39 ****
--- 39,146 ----
TupleTableSlot *tupleSlot);
static int ExecHashJoinNewBatch(HashJoinState *hjstate);
+ /*
+ * getMostCommonValues
+ *
+ *
+ */
+ void getMostCommonValues(EState *estate, HashJoinState *hjstate)
+ {
+ HeapTupleData *statsTuple;
+ FuncExprState *clause;
+ ExprState *argstate;
+ Var *variable;
+
+ Datum *values;
+ int nvalues;
+ float4 *numbers;
+ int nnumbers;
+
+ Oid relid;
+ AttrNumber relattnum;
+ Oid atttype;
+ int32 atttypmod;
+
+ int i;
+
+ //is it a join on more than one key?
+ if (hjstate->hashclauses->length != 1)
+ return; //histojoin is not defined for more than one join key so run away
+
+ //make sure the outer node is a seq scan on a base relation otherwise we can't get MCVs at the moment and should not bother trying
+ if (outerPlanState(hjstate)->type != T_SeqScanState)
+ return;
+
+ //grab the relation object id of the outer relation
+ relid = getrelid(((SeqScan *) ((SeqScanState *) outerPlanState(hjstate))->ps.plan)->scanrelid, estate->es_range_table);
+ clause = (FuncExprState *) lfirst(list_head(hjstate->hashclauses));
+ argstate = (ExprState *) lfirst(list_head(clause->args));
+ variable = (Var *) argstate->expr;
+
+ //grab the necessary properties of the join variable
+ relattnum = variable->varattno;
+ atttype = variable->vartype;
+ atttypmod = variable->vartypmod;
+
+ statsTuple = SearchSysCache(STATRELATT,
+ ObjectIdGetDatum(relid),
+ Int16GetDatum(relattnum),
+ 0, 0);
+
+ if (HeapTupleIsValid(statsTuple))
+ {
+ if (get_attstatsslot(statsTuple,
+ atttype, atttypmod,
+ STATISTIC_KIND_MCV, InvalidOid,
+ &values, &nvalues,
+ &numbers, &nnumbers))
+ {
+ HashJoinTable hashtable;
+ FmgrInfo *hashfunctions;
+ //MCV Partitions is an open addressing hashtable with a power of 2 size greater than the number of MCV values
+ int nbuckets = 2;
+ uint32 collisionsWhileHashing = 0;
+ while (nbuckets <= nvalues)
+ {
+ nbuckets <<= 1;
+ }
+ //use two more bits just to help avoid collisions
+ nbuckets <<= 2;
+
+ hashtable = hjstate->hj_HashTable;
+ hashtable->usingMostCommonValues = true;
+ hashtable->nMostCommonTuplePartitionHashBuckets = nbuckets;
+ hashtable->mostCommonTuplePartition = palloc0(nbuckets * sizeof(HashJoinMostCommonValueTuplePartition));
+ hashfunctions = hashtable->outer_hashfunctions;
+
+ //create the partitions
+ for (i = 0; i < nvalues; i++)
+ {
+ uint32 hashvalue = DatumGetUInt32(FunctionCall1(&hashfunctions[0], values[i]));
+ int bucket = hashvalue & (nbuckets - 1);
+
+ while (hashtable->mostCommonTuplePartition[bucket].hashvalue != 0
+ && hashtable->mostCommonTuplePartition[bucket].hashvalue != hashvalue)
+ {
+ bucket = (bucket + 1) & (nbuckets - 1);
+ collisionsWhileHashing++;
+ }
+
+ //leave partition alone if it has the same hashvalue as current MCV. we only want one partition per hashvalue
+ if (hashtable->mostCommonTuplePartition[bucket].hashvalue != hashvalue)
+ {
+ hashtable->mostCommonTuplePartition[bucket].tuples = NULL;
+ hashtable->mostCommonTuplePartition[bucket].hashvalue = hashvalue;
+ hashtable->nMostCommonTuplePartitions++;
+ }
+ }
+
+ free_attstatsslot(atttype, values, nvalues, numbers, nnumbers);
+ }
+
+ ReleaseSysCache(statsTuple);
+ }
+ }
/* ----------------------------------------------------------------
* ExecHashJoin
***************
*** 146,151 ****
--- 253,267 ----
hashtable = ExecHashTableCreate((Hash *) hashNode->ps.plan,
node->hj_HashOperators);
node->hj_HashTable = hashtable;
+
+ hashtable->usingMostCommonValues = false;
+ hashtable->nMostCommonTuplePartitions = 0;
+ hashtable->nMostCommonTuplePartitionHashBuckets = 0;
+ hashtable->mostCommonTuplesStored = 0;
+ hashtable->mostCommonTuplePartition = NULL;
+
+ if (enable_hashjoin_usestatmcvs)
+ getMostCommonValues(estate, node);
/*
* execute the Hash node, to build the hash table
***************
*** 157,163 ****
* If the inner relation is completely empty, and we're not doing an
* outer join, we can quit without scanning the outer relation.
*/
! if (hashtable->totalTuples == 0 && !HASHJOIN_IS_OUTER(node))
return NULL;
/*
--- 273,279 ----
* If the inner relation is completely empty, and we're not doing an
* outer join, we can quit without scanning the outer relation.
*/
! if (hashtable->totalTuples == 0 && hashtable->mostCommonTuplesStored == 0 && !HASHJOIN_IS_OUTER(node))
return NULL;
/*
***************
*** 206,228 ****
ExecHashGetBucketAndBatch(hashtable, hashvalue,
&node->hj_CurBucketNo, &batchno);
node->hj_CurTuple = NULL;
!
! /*
! * Now we've got an outer tuple and the corresponding hash bucket,
! * but this tuple may not belong to the current batch.
! */
! if (batchno != hashtable->curbatch)
{
/*
! * Need to postpone this outer tuple to a later batch. Save it
! * in the corresponding outer-batch file.
*/
! Assert(batchno > hashtable->curbatch);
! ExecHashJoinSaveTuple(ExecFetchSlotMinimalTuple(outerTupleSlot),
! hashvalue,
! &hashtable->outerBatchFile[batchno]);
! node->hj_NeedNewOuter = true;
! continue; /* loop around for a new outer tuple */
}
}
--- 322,350 ----
ExecHashGetBucketAndBatch(hashtable, hashvalue,
&node->hj_CurBucketNo, &batchno);
node->hj_CurTuple = NULL;
!
! node->hj_OuterTupleMostCommonValuePartition = MCV_INVALID_PARTITION;
!
!
! if (!(hashtable->usingMostCommonValues && isAMostCommonValue(hashtable, hashvalue, &node->hj_OuterTupleMostCommonValuePartition)))
{
/*
! * Now we've got an outer tuple and the corresponding hash bucket,
! * but this tuple may not belong to the current batch.
*/
! if (batchno != hashtable->curbatch)
! {
! /*
! * Need to postpone this outer tuple to a later batch. Save it
! * in the corresponding outer-batch file.
! */
! Assert(batchno > hashtable->curbatch);
! ExecHashJoinSaveTuple(ExecFetchSlotMinimalTuple(outerTupleSlot),
! hashvalue,
! &hashtable->outerBatchFile[batchno]);
! node->hj_NeedNewOuter = true;
! continue; /* loop around for a new outer tuple */
! }
}
}
***************
*** 231,237 ****
*/
for (;;)
{
! curtuple = ExecScanHashBucket(node, econtext);
if (curtuple == NULL)
break; /* out of matches */
--- 353,366 ----
*/
for (;;)
{
! if (node->hj_OuterTupleMostCommonValuePartition != MCV_INVALID_PARTITION)
! {
! curtuple = ExecScanHashMostCommonTuples(node, econtext);
! }
! else
! {
! curtuple = ExecScanHashBucket(node, econtext);
! }
if (curtuple == NULL)
break; /* out of matches */
Index: src/backend/optimizer/path/costsize.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/optimizer/path/costsize.c,v
retrieving revision 1.199
diff -c -r1.199 costsize.c
*** src/backend/optimizer/path/costsize.c 17 Oct 2008 20:27:24 -0000 1.199
--- src/backend/optimizer/path/costsize.c 17 Oct 2008 23:07:05 -0000
***************
*** 108,113 ****
--- 108,115 ----
bool enable_mergejoin = true;
bool enable_hashjoin = true;
+ bool enable_hashjoin_usestatmcvs = true;
+
typedef struct
{
PlannerInfo *root;
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.475
diff -c -r1.475 guc.c
*** src/backend/utils/misc/guc.c 6 Oct 2008 13:05:36 -0000 1.475
--- src/backend/utils/misc/guc.c 9 Oct 2008 19:56:17 -0000
***************
*** 625,630 ****
--- 625,638 ----
true, NULL, NULL
},
{
+ {"enable_hashjoin_usestatmcvs", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the hash join's use of the MCVs stored in pg_statistic."),
+ NULL
+ },
+ &enable_hashjoin_usestatmcvs,
+ true, NULL, NULL
+ },
+ {
{"constraint_exclusion", PGC_USERSET, QUERY_TUNING_OTHER,
gettext_noop("Enables the planner to use constraints to optimize queries."),
gettext_noop("Child table scans will be skipped if their "
Index: src/include/executor/hashjoin.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/hashjoin.h,v
retrieving revision 1.48
diff -c -r1.48 hashjoin.h
*** src/include/executor/hashjoin.h 1 Jan 2008 19:45:57 -0000 1.48
--- src/include/executor/hashjoin.h 17 Oct 2008 23:48:46 -0000
***************
*** 72,77 ****
--- 72,84 ----
#define HJTUPLE_MINTUPLE(hjtup) \
((MinimalTuple) ((char *) (hjtup) + HJTUPLE_OVERHEAD))
+ typedef struct HashJoinMostCommonValueTuplePartition
+ {
+ uint32 hashvalue;
+ HashJoinTuple tuples;
+ } HashJoinMostCommonValueTuplePartition;
+
+ #define MCV_INVALID_PARTITION -1
typedef struct HashJoinTableData
{
***************
*** 116,121 ****
--- 123,134 ----
MemoryContext hashCxt; /* context for whole-hash-join storage */
MemoryContext batchCxt; /* context for this-batch-only storage */
+
+ bool usingMostCommonValues;
+ HashJoinMostCommonValueTuplePartition *mostCommonTuplePartition;
+ int nMostCommonTuplePartitionHashBuckets;
+ int nMostCommonTuplePartitions;
+ uint32 mostCommonTuplesStored;
} HashJoinTableData;
#endif /* HASHJOIN_H */
Index: src/include/executor/nodeHash.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/nodeHash.h,v
retrieving revision 1.45
diff -c -r1.45 nodeHash.h
*** src/include/executor/nodeHash.h 1 Jan 2008 19:45:57 -0000 1.45
--- src/include/executor/nodeHash.h 30 Sep 2008 20:31:35 -0000
***************
*** 45,48 ****
--- 45,51 ----
int *numbuckets,
int *numbatches);
+ extern HashJoinTuple ExecScanHashMostCommonTuples(HashJoinState *hjstate, ExprContext *econtext);
+ extern bool isAMostCommonValue(HashJoinTable hashtable, uint32 hashvalue, int *partitionNumber);
+
#endif /* NODEHASH_H */
Index: src/include/executor/nodeHashjoin.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/nodeHashjoin.h,v
retrieving revision 1.37
diff -c -r1.37 nodeHashjoin.h
*** src/include/executor/nodeHashjoin.h 1 Jan 2008 19:45:57 -0000 1.37
--- src/include/executor/nodeHashjoin.h 30 Sep 2008 20:32:05 -0000
***************
*** 26,29 ****
--- 26,31 ----
extern void ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
BufFile **fileptr);
+ extern void getMostCommonValues(EState *estate, HashJoinState *hjstate);
+
#endif /* NODEHASHJOIN_H */
Index: src/include/nodes/execnodes.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/nodes/execnodes.h,v
retrieving revision 1.190
diff -c -r1.190 execnodes.h
*** src/include/nodes/execnodes.h 7 Oct 2008 19:27:04 -0000 1.190
--- src/include/nodes/execnodes.h 17 Oct 2008 23:07:14 -0000
***************
*** 1365,1370 ****
--- 1365,1371 ----
bool hj_NeedNewOuter;
bool hj_MatchedOuter;
bool hj_OuterNotEmpty;
+ int hj_OuterTupleMostCommonValuePartition;
} HashJoinState;
Index: src/include/optimizer/cost.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/optimizer/cost.h,v
retrieving revision 1.93
diff -c -r1.93 cost.h
*** src/include/optimizer/cost.h 4 Oct 2008 21:56:55 -0000 1.93
--- src/include/optimizer/cost.h 7 Oct 2008 18:31:42 -0000
***************
*** 52,57 ****
--- 52,58 ----
extern bool enable_nestloop;
extern bool enable_mergejoin;
extern bool enable_hashjoin;
+ extern bool enable_hashjoin_usestatmcvs;
extern bool constraint_exclusion;
extern double clamp_row_est(double nrows);
On Mon, Oct 20, 2008 at 4:42 PM, Lawrence, Ramon <ramon.lawrence@ubc.ca> wrote:
I'm interested in trying to review this patch. Having not done patch
review before, I can't exactly promise grand results, but could you
provide me with the data to check your results? In the meantime I'll
go read the paper.
- Josh / eggyknap
Joshua,
Thank you for offering to review the patch.
The easiest way to test would be to generate your own TPC-H data and
load it into a database for testing. I have posted the TPC-H generator
at:
http://people.ok.ubc.ca/rlawrenc/TPCHSkew.zip
The generator can produce skewed data sets. It was produced by
Microsoft Research.
After unzipping, on a Windows machine, you can just run the command:
dbgen -s 1 -z 1
This will produce a TPC-H database of scale 1 GB with a Zipfian skew of
z=1. More information on the generator is in the document README-S.DOC.
Source is provided for the generator, so you should be able to run it on
other operating systems as well.
The schema DDL is at:
http://people.ok.ubc.ca/rlawrenc/tpch_pg_ddl.txt
Note that the load time for 1G data is 1-2 hours and for 10G data is
about 24 hours. I recommend you do not add the foreign keys until after
the data is loaded.
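As a rough sketch of the load (the file paths are placeholders, and the raw
dbgen files may need their trailing delimiters trimmed first), COPY the
tables, run ANALYZE so the MCV statistics are available, and only then add
the foreign keys:

COPY nation   FROM '/path/to/nation.tbl'   WITH DELIMITER '|';
COPY supplier FROM '/path/to/supplier.tbl' WITH DELIMITER '|';
COPY part     FROM '/path/to/part.tbl'     WITH DELIMITER '|';
COPY lineitem FROM '/path/to/lineitem.tbl' WITH DELIMITER '|';
ANALYZE;  -- gathers the MCVs the patch relies on
ALTER TABLE supplier ADD CONSTRAINT supplier_fk1
  FOREIGN KEY (s_nationkey) REFERENCES nation (n_nationkey);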
The other alternative is to do a pg_dump on our data sets. However, the
download size would be quite large, and it will take a couple of days
for us to get you the data in that form.
--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon.lawrence@ubc.ca
On Sun, Nov 2, 2008 at 4:48 PM, Lawrence, Ramon <ramon.lawrence@ubc.ca> wrote:
I'll try out the TPC-H generator first :) Thanks.
- Josh
"Lawrence, Ramon" <ramon.lawrence@ubc.ca> writes:
The easiest way to test would be to generate your own TPC-H data and
load it into a database for testing. I have posted the TPC-H generator
at:
http://people.ok.ubc.ca/rlawrenc/TPCHSkew.zip
The generator can produce skewed data sets. It was produced by
Microsoft Research.
What alternatives are there for people who do not run Windows?
regards, tom lane
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
What alternatives are there for people who do not run Windows?
regards, tom lane
The TPC-H generator is a standard code base provided at
http://www.tpc.org/tpch/. We have been able to compile this code on
Linux.
However, we were unable to get the Microsoft modifications to this code
to compile on Linux (although they are supposed to be portable). So, we
just used the Windows version with wine on our test Debian machine.
I have also posted the text files for the TPC-H 1G 1Z data set at:
http://people.ok.ubc.ca/rlawrenc/tpch1g1z.zip
Note that you need to trim the extra characters at the end of the lines
for PostgreSQL to read them properly.
Since the data takes a while to generate and load, we can also provide a
compressed version of the PostgreSQL data directory of the databases
with the data already loaded.
--
Ramon Lawrence
On Mon, Oct 20, 2008 at 03:42:49PM -0700, Lawrence, Ramon wrote:
We propose a patch that improves hybrid hash join's performance for large
multi-batch joins where the probe relation has skew.
I'm running into problems with this patch. It applies cleanly, and the
technique you provided for generating sample data works just fine
(though I admit I haven't verified that the expected skew exists in the
data). But the server crashes when I try to load the data. The backtrace
is below, labeled "Backtrace 1"; since it happens in
ExecScanHashMostCommonTuples, I figure it's because of the patch and not
something else odd (unless perhaps my hardware is flakey -- I'll try it
on other hardware as soon as I can, to verify). Note that I'm running
this on Ubuntu 8.10, 32-bit x86, running a kernel Ubuntu labels as
"2.6.27-7-generic #1 SMP". The statement in execution at the time was
"ALTER TABLE SUPPLIER ADD CONSTRAINT SUPPLIER_FK1 FOREIGN KEY
(S_NATIONKEY) references NATION (N_NATIONKEY);"
Further, when I go back into the database in psql, simply issuing a "\d"
command crashes the backend with a similar backtrace, labeled Backtrace
2, below. The query underlying \d and its EXPLAIN output are also
included, just for kicks.
- Josh
*****************************************
BACKTRACE 1
****************************************
Core was generated by `postgres: jtolley jtolley [local] ALTE'.
Program terminated with signal 6, Aborted.
[New process 20407]
#0 0xb80b0430 in __kernel_vsyscall ()
(gdb) bt
#0 0xb80b0430 in __kernel_vsyscall ()
#1 0xb7f22880 in raise () from /lib/tls/i686/cmov/libc.so.6
#2 0xb7f24248 in abort () from /lib/tls/i686/cmov/libc.so.6
#3 0x0831540e in ExceptionalCondition (
conditionName=0x8433274
"!(hjstate->hj_OuterTupleMostCommonValuePartition <
hashtable->nMostCommonTuplePartitions)",
errorType=0x834b66d "FailedAssertion", fileName=0x84331d9
"nodeHash.c", lineNumber=880) at assert.c:57
#4 0x081b457b in ExecScanHashMostCommonTuples (hjstate=0x8720a6c,
econtext=0x8720af8) at nodeHash.c:880
#5 0x081b60de in ExecHashJoin (node=0x8720a6c) at nodeHashjoin.c:357
#6 0x081a4748 in ExecProcNode (node=0x8720a6c) at execProcnode.c:406
#7 0x081a242b in standard_ExecutorRun (queryDesc=0x870957c,
direction=ForwardScanDirection, count=1) at execMain.c:1343
#8 0x081c2036 in _SPI_execute_plan (plan=0x87181bc, paramLI=0x0,
snapshot=0x8485300, crosscheck_snapshot=0x0, read_only=1 '\001',
fire_triggers=0 '\0', tcount=1) at spi.c:1976
#9 0x081c2350 in SPI_execute_snapshot (plan=0x87181bc, Values=0x0,
Nulls=0x0, snapshot=0x8485300, crosscheck_snapshot=0x0,
read_only=<value optimized out>, fire_triggers=<value optimized
out>, tcount=1) at spi.c:408
#10 0x082e1921 in RI_Initial_Check (trigger=0xbfeb0afc,
fk_rel=0xb5a21938, pk_rel=0xb5a20754) at ri_triggers.c:2763
#11 0x08178613 in ATRewriteTables (wqueue=0xbfeb0d88) at
tablecmds.c:5026
#12 0x0817ef36 in ATController (rel=0xb5a21938, cmds=<value optimized
out>, recurse=<value optimized out>) at tablecmds.c:2294
#13 0x08261dd5 in ProcessUtility (parsetree=0x86ca17c,
queryString=0x86c96ec "ALTER TABLE SUPPLIER\nADD CONSTRAINT
SUPPLIER_FK1 FOREIGN KEY (S_NATIONKEY) references NATION
(N_NATIONKEY);",
params=0x0, isTopLevel=1 '\001', dest=0x86ca2b4,
completionTag=0xbfeb0fc8 "") at utility.c:569
#14 0x0825e2ae in PortalRunUtility (portal=0x86fadfc,
utilityStmt=0x86ca17c, isTopLevel=<value optimized out>, dest=0x86ca2b4,
completionTag=0xbfeb0fc8 "") at pquery.c:1176
#15 0x0825f2c0 in PortalRunMulti (portal=0x86fadfc, isTopLevel=<value
optimized out>, dest=0x86ca2b4, altdest=0x86ca2b4,
completionTag=0xbfeb0fc8 "") at pquery.c:1281
#16 0x0825fb54 in PortalRun (portal=0x86fadfc, count=2147483647,
isTopLevel=6 '\006', dest=0x86ca2b4, altdest=0x86ca2b4,
completionTag=0xbfeb0fc8 "") at pquery.c:812
#17 0x0825a757 in exec_simple_query (
query_string=0x86c96ec "ALTER TABLE SUPPLIER\nADD CONSTRAINT
SUPPLIER_FK1 FOREIGN KEY (S_NATIONKEY) references NATION
(N_NATIONKEY);")
at postgres.c:992
#18 0x0825bfff in PostgresMain (argc=4, argv=0x8667b08,
username=0x8667ae0 "jtolley") at postgres.c:3569
#19 0x082261cf in ServerLoop () at postmaster.c:3258
#20 0x08227190 in PostmasterMain (argc=1, argv=0x8664250) at
postmaster.c:1031
#21 0x081cc126 in main (argc=1, argv=0x8664250) at main.c:188
(gdb)
*****************************************
BACKTRACE 2
****************************************
Core was generated by `postgres: jtolley jtolley [local] SELE'.
Program terminated with signal 6, Aborted.
[New process 20967]
#0 0xb80b0430 in __kernel_vsyscall ()
(gdb) bt
#0 0xb80b0430 in __kernel_vsyscall ()
#1 0xb7f22880 in raise () from /lib/tls/i686/cmov/libc.so.6
#2 0xb7f24248 in abort () from /lib/tls/i686/cmov/libc.so.6
#3 0x0831540e in ExceptionalCondition (
conditionName=0x8433274
"!(hjstate->hj_OuterTupleMostCommonValuePartition <
hashtable->nMostCommonTuplePartitions)",
errorType=0x834b66d "FailedAssertion", fileName=0x84331d9
"nodeHash.c", lineNumber=880) at assert.c:57
#4 0x081b457b in ExecScanHashMostCommonTuples (hjstate=0x86fb320,
econtext=0x86fb3ac) at nodeHash.c:880
#5 0x081b60de in ExecHashJoin (node=0x86fb320) at nodeHashjoin.c:357
#6 0x081a4748 in ExecProcNode (node=0x86fb320) at execProcnode.c:406
#7 0x081bb2a1 in ExecSort (node=0x86fb294) at nodeSort.c:102
#8 0x081a4718 in ExecProcNode (node=0x86fb294) at execProcnode.c:417
#9 0x081a242b in standard_ExecutorRun (queryDesc=0x8706e1c,
direction=ForwardScanDirection, count=0) at execMain.c:1343
#10 0x0825e64c in PortalRunSelect (portal=0x8700e0c, forward=1 '\001',
count=0, dest=0x871db14) at pquery.c:942
#11 0x0825f9ae in PortalRun (portal=0x8700e0c, count=2147483647,
isTopLevel=1 '\001', dest=0x871db14, altdest=0x871db14,
completionTag=0xbfeb0fc8 "") at pquery.c:796
#12 0x0825a757 in exec_simple_query (
query_string=0x86cb6f4 "SELECT n.nspname as \"Schema\",\n c.relname
as \"Name\",\n CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN
'view' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 's' THEN
'special' END as \"Type\",\n "...) at postgres.c:992
#13 0x0825bfff in PostgresMain (argc=4, argv=0x8667f58,
username=0x8667f30 "jtolley") at postgres.c:3569
#14 0x082261cf in ServerLoop () at postmaster.c:3258
#15 0x08227190 in PostmasterMain (argc=1, argv=0x8664250) at
postmaster.c:1031
#16 0x081cc126 in main (argc=1, argv=0x8664250) at main.c:188
*****************************************
\d EXPLAIN output
****************************************
jtolley=# explain SELECT n.nspname as "Schema",
jtolley-# c.relname as "Name",
jtolley-# CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view'
WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 's' THEN 'special'
END as "Type",
jtolley-# pg_catalog.pg_get_userbyid(c.relowner) as "Owner"
jtolley-# FROM pg_catalog.pg_class c
jtolley-# LEFT JOIN pg_catalog.pg_namespace n ON n.oid =
c.relnamespace
jtolley-# WHERE c.relkind IN ('r','v','S','')
jtolley-# AND n.nspname <> 'pg_catalog'
jtolley-# AND n.nspname !~ '^pg_toast'
jtolley-# AND pg_catalog.pg_table_is_visible(c.oid)
jtolley-# ORDER BY 1,2;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Sort (cost=13.02..13.10 rows=35 width=133)
Sort Key: n.nspname, c.relname
-> Hash Join (cost=1.14..12.12 rows=35 width=133)
Hash Cond: (c.relnamespace = n.oid)
-> Seq Scan on pg_class c (cost=0.00..9.97 rows=35 width=73)
Filter: (pg_table_is_visible(oid) AND (relkind = ANY
('{r,v,S,""}'::"char"[])))
-> Hash (cost=1.09..1.09 rows=4 width=68)
-> Seq Scan on pg_namespace n (cost=0.00..1.09 rows=4
width=68)
Filter: ((nspname <> 'pg_catalog'::name) AND
(nspname !~ '^pg_toast'::text))
(9 rows)
On Mon, Oct 20, 2008 at 03:42:49PM -0700, Lawrence, Ramon wrote:
We propose a patch that improves hybrid hash join's performance for large
multi-batch joins where the probe relation has skew.
I also recommend modifying docs/src/sgml/config.sgml to include the
enable_hashjoin_usestatmcvs option.
- Josh / eggyknap
Joshua Tolley <eggyknap@gmail.com> writes:
On Mon, Oct 20, 2008 at 03:42:49PM -0700, Lawrence, Ramon wrote:
We propose a patch that improves hybrid hash join's performance for large
multi-batch joins where the probe relation has skew.
I also recommend modifying docs/src/sgml/config.sgml to include the
enable_hashjoin_usestatmcvs option.
If the patch is actually a win, why would we bother with such a GUC
at all?
regards, tom lane
On Wed, Nov 5, 2008 at 8:20 AM, Tom Lane wrote:
Joshua Tolley writes:
On Mon, Oct 20, 2008 at 03:42:49PM -0700, Lawrence, Ramon wrote:
We propose a patch that improves hybrid hash join's performance for large
multi-batch joins where the probe relation has skew.
I also recommend modifying docs/src/sgml/config.sgml to include the
enable_hashjoin_usestatmcvs option.
If the patch is actually a win, why would we bother with such a GUC
at all?
regards, tom lane
Good point. Leaving it in place for patch review purposes is useful,
but we can probably lose it in the end.
- Josh / eggyknap
The error is caused by me asserting against the wrong variable. I
never noticed this as I apparently did not have assertions turned on
on my development machine. That is fixed now, and with the new patch
version I have attached, all assertions pass with your query and
my test queries. I added another assertion to that section of the
code so that it is a bit more vigorous in confirming the hash table
partition is correct. It does not change the operation of the code.
There are two partition counts. One holds the maximum number of
buckets in the hash table and the other counts the number of actual
buckets created for hash values. I was incorrectly testing against
the second one because that was valid before I started using a hash
table to store the buckets.
The enable_hashjoin_usestatmcvs flag was valuable for my own research
and tests and likely useful for your review but Tom is correct that it
can be removed in the final version.
- Bryce Cutt
Attachments:
histojoin_v2.patch (application/octet-stream)
Index: src/backend/executor/nodeHash.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHash.c,v
retrieving revision 1.116
diff -c -r1.116 nodeHash.c
*** src/backend/executor/nodeHash.c 1 Jan 2008 19:45:49 -0000 1.116
--- src/backend/executor/nodeHash.c 5 Nov 2008 22:26:53 -0000
***************
*** 54,59 ****
--- 54,86 ----
}
/* ----------------------------------------------------------------
+ * isAMostCommonValue
+ *
+ * is the value one of the most common key values?
+ * ----------------------------------------------------------------
+ */
+ bool isAMostCommonValue(HashJoinTable hashtable, uint32 hashvalue, int *partitionNumber)
+ {
+ int bucket = hashvalue & (hashtable->nMostCommonTuplePartitionHashBuckets - 1);
+
+ while (hashtable->mostCommonTuplePartition[bucket].hashvalue != 0
+ && hashtable->mostCommonTuplePartition[bucket].hashvalue != hashvalue)
+ {
+ bucket = (bucket + 1) & (hashtable->nMostCommonTuplePartitionHashBuckets - 1);
+ }
+
+ if (hashtable->mostCommonTuplePartition[bucket].hashvalue == hashvalue)
+ {
+ *partitionNumber = bucket;
+ return true;
+ }
+
+ //must have run into an empty slot which means this is not an MCV
+ *partitionNumber = MCV_INVALID_PARTITION;
+ return false;
+ }
+
+ /* ----------------------------------------------------------------
* MultiExecHash
*
* build hash table for hashjoin, doing partitioning if more
***************
*** 69,74 ****
--- 96,103 ----
TupleTableSlot *slot;
ExprContext *econtext;
uint32 hashvalue;
+ MinimalTuple mintuple;
+ int partitionNumber;
/* must provide our own instrumentation support */
if (node->ps.instrument)
***************
*** 99,106 ****
if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
&hashvalue))
{
! ExecHashTableInsert(hashtable, slot, hashvalue);
! hashtable->totalTuples += 1;
}
}
--- 128,163 ----
if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
&hashvalue))
{
! partitionNumber = MCV_INVALID_PARTITION;
!
! if (hashtable->usingMostCommonValues && isAMostCommonValue(hashtable, hashvalue, &partitionNumber))
! {
! HashJoinTuple hashTuple;
! int hashTupleSize;
!
! mintuple = ExecFetchSlotMinimalTuple(slot);
! hashTupleSize = HJTUPLE_OVERHEAD + mintuple->t_len;
! hashTuple = (HashJoinTuple) palloc(hashTupleSize);
! hashTuple->hashvalue = hashvalue;
! memcpy(HJTUPLE_MINTUPLE(hashTuple), mintuple, mintuple->t_len);
!
! hashTuple->next = hashtable->mostCommonTuplePartition[partitionNumber].tuples;
! hashtable->mostCommonTuplePartition[partitionNumber].tuples = hashTuple;
!
! hashtable->spaceUsed += hashTupleSize;
!
! if (hashtable->spaceUsed > hashtable->spaceAllowed) {
! ExecHashIncreaseNumBatches(hashtable);
! }
!
! hashtable->mostCommonTuplesStored++;
! }
!
! if (partitionNumber == MCV_INVALID_PARTITION)
! {
! ExecHashTableInsert(hashtable, slot, hashvalue);
! hashtable->totalTuples += 1;
! }
}
}
***************
*** 798,803 ****
--- 855,922 ----
}
/*
+ * ExecScanHashMostCommonTuples
+ * scan a hash bucket for matches to the current outer tuple
+ *
+ * The current outer tuple must be stored in econtext->ecxt_outertuple.
+ */
+ HashJoinTuple
+ ExecScanHashMostCommonTuples(HashJoinState *hjstate,
+ ExprContext *econtext)
+ {
+ List *hjclauses = hjstate->hashclauses;
+ HashJoinTable hashtable = hjstate->hj_HashTable;
+ HashJoinTuple hashTuple = hjstate->hj_CurTuple;
+ uint32 hashvalue = hjstate->hj_CurHashValue;
+
+ /*
+ * hj_CurTuple is NULL to start scanning a new bucket, or the address of
+ * the last tuple returned from the current bucket.
+ */
+ if (hashTuple == NULL)
+ {
+ //painstakingly make sure this is a valid partition index
+ Assert(hjstate->hj_OuterTupleMostCommonValuePartition > MCV_INVALID_PARTITION);
+ Assert(hjstate->hj_OuterTupleMostCommonValuePartition < hashtable->nMostCommonTuplePartitionHashBuckets);
+ Assert(hashtable->mostCommonTuplePartition[hjstate->hj_OuterTupleMostCommonValuePartition].hashvalue != 0);
+
+ hashTuple = hashtable->mostCommonTuplePartition[hjstate->hj_OuterTupleMostCommonValuePartition].tuples;
+ }
+ else
+ hashTuple = hashTuple->next;
+
+ while (hashTuple != NULL)
+ {
+ if (hashTuple->hashvalue == hashvalue)
+ {
+ TupleTableSlot *inntuple;
+
+ /* insert hashtable's tuple into exec slot so ExecQual sees it */
+ inntuple = ExecStoreMinimalTuple(HJTUPLE_MINTUPLE(hashTuple),
+ hjstate->hj_HashTupleSlot,
+ false); /* do not pfree */
+ econtext->ecxt_innertuple = inntuple;
+
+ /* reset temp memory each time to avoid leaks from qual expr */
+ ResetExprContext(econtext);
+
+ if (ExecQual(hjclauses, econtext, false))
+ {
+ hjstate->hj_CurTuple = hashTuple;
+ return hashTuple;
+ }
+ }
+
+ hashTuple = hashTuple->next;
+ }
+
+ /*
+ * no match
+ */
+ return NULL;
+ }
+
+ /*
* ExecScanHashBucket
* scan a hash bucket for matches to the current outer tuple
*
Index: src/backend/executor/nodeHashjoin.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHashjoin.c,v
retrieving revision 1.96
diff -c -r1.96 nodeHashjoin.c
*** src/backend/executor/nodeHashjoin.c 23 Oct 2008 14:34:34 -0000 1.96
--- src/backend/executor/nodeHashjoin.c 5 Nov 2008 22:56:59 -0000
***************
*** 20,25 ****
--- 20,30 ----
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
#include "utils/memutils.h"
+ #include "optimizer/cost.h"
+ #include "utils/syscache.h"
+ #include "utils/lsyscache.h"
+ #include "parser/parsetree.h"
+ #include "catalog/pg_statistic.h"
/* Returns true for JOIN_LEFT and JOIN_ANTI jointypes */
***************
*** 34,39 ****
--- 39,146 ----
TupleTableSlot *tupleSlot);
static int ExecHashJoinNewBatch(HashJoinState *hjstate);
+ /*
+ * getMostCommonValues
+ *
+ *
+ */
+ void getMostCommonValues(EState *estate, HashJoinState *hjstate)
+ {
+ HeapTupleData *statsTuple;
+ FuncExprState *clause;
+ ExprState *argstate;
+ Var *variable;
+
+ Datum *values;
+ int nvalues;
+ float4 *numbers;
+ int nnumbers;
+
+ Oid relid;
+ AttrNumber relattnum;
+ Oid atttype;
+ int32 atttypmod;
+
+ int i;
+
+ //is it a join on more than one key?
+ if (hjstate->hashclauses->length != 1)
+ return; //histojoin is not defined for more than one join key so run away
+
+ //make sure the outer node is a seq scan on a base relation otherwise we can't get MCVs at the moment and should not bother trying
+ if (outerPlanState(hjstate)->type != T_SeqScanState)
+ return;
+
+ //grab the relation object id of the outer relation
+ relid = getrelid(((SeqScan *) ((SeqScanState *) outerPlanState(hjstate))->ps.plan)->scanrelid, estate->es_range_table);
+ clause = (FuncExprState *) lfirst(list_head(hjstate->hashclauses));
+ argstate = (ExprState *) lfirst(list_head(clause->args));
+ variable = (Var *) argstate->expr;
+
+ //grab the necessary properties of the join variable
+ relattnum = variable->varattno;
+ atttype = variable->vartype;
+ atttypmod = variable->vartypmod;
+
+ statsTuple = SearchSysCache(STATRELATT,
+ ObjectIdGetDatum(relid),
+ Int16GetDatum(relattnum),
+ 0, 0);
+
+ if (HeapTupleIsValid(statsTuple))
+ {
+ if (get_attstatsslot(statsTuple,
+ atttype, atttypmod,
+ STATISTIC_KIND_MCV, InvalidOid,
+ &values, &nvalues,
+ &numbers, &nnumbers))
+ {
+ HashJoinTable hashtable;
+ FmgrInfo *hashfunctions;
+ //MCV Partitions is an open addressing hashtable with a power of 2 size greater than the number of MCV values
+ int nbuckets = 2;
+ uint32 collisionsWhileHashing = 0;
+ while (nbuckets <= nvalues)
+ {
+ nbuckets <<= 1;
+ }
+ //use two more bits just to help avoid collisions
+ nbuckets <<= 2;
+
+ hashtable = hjstate->hj_HashTable;
+ hashtable->usingMostCommonValues = true;
+ hashtable->nMostCommonTuplePartitionHashBuckets = nbuckets;
+ hashtable->mostCommonTuplePartition = palloc0(nbuckets * sizeof(HashJoinMostCommonValueTuplePartition));
+ hashfunctions = hashtable->outer_hashfunctions;
+
+ //create the partitions
+ for (i = 0; i < nvalues; i++)
+ {
+ uint32 hashvalue = DatumGetUInt32(FunctionCall1(&hashfunctions[0], values[i]));
+ int bucket = hashvalue & (nbuckets - 1);
+
+ while (hashtable->mostCommonTuplePartition[bucket].hashvalue != 0
+ && hashtable->mostCommonTuplePartition[bucket].hashvalue != hashvalue)
+ {
+ bucket = (bucket + 1) & (nbuckets - 1);
+ collisionsWhileHashing++;
+ }
+
+ //leave partition alone if it has the same hashvalue as current MCV. we only want one partition per hashvalue
+ if (hashtable->mostCommonTuplePartition[bucket].hashvalue != hashvalue)
+ {
+ hashtable->mostCommonTuplePartition[bucket].tuples = NULL;
+ hashtable->mostCommonTuplePartition[bucket].hashvalue = hashvalue;
+ hashtable->nMostCommonTuplePartitions++;
+ }
+ }
+
+ free_attstatsslot(atttype, values, nvalues, numbers, nnumbers);
+ }
+
+ ReleaseSysCache(statsTuple);
+ }
+ }
/* ----------------------------------------------------------------
* ExecHashJoin
***************
*** 146,151 ****
--- 253,267 ----
hashtable = ExecHashTableCreate((Hash *) hashNode->ps.plan,
node->hj_HashOperators);
node->hj_HashTable = hashtable;
+
+ hashtable->usingMostCommonValues = false;
+ hashtable->nMostCommonTuplePartitions = 0;
+ hashtable->nMostCommonTuplePartitionHashBuckets = 0;
+ hashtable->mostCommonTuplesStored = 0;
+ hashtable->mostCommonTuplePartition = NULL;
+
+ if (enable_hashjoin_usestatmcvs)
+ getMostCommonValues(estate, node);
/*
* execute the Hash node, to build the hash table
***************
*** 157,163 ****
* If the inner relation is completely empty, and we're not doing an
* outer join, we can quit without scanning the outer relation.
*/
! if (hashtable->totalTuples == 0 && !HASHJOIN_IS_OUTER(node))
return NULL;
/*
--- 273,279 ----
* If the inner relation is completely empty, and we're not doing an
* outer join, we can quit without scanning the outer relation.
*/
! if (hashtable->totalTuples == 0 && hashtable->mostCommonTuplesStored == 0 && !HASHJOIN_IS_OUTER(node))
return NULL;
/*
***************
*** 205,227 ****
ExecHashGetBucketAndBatch(hashtable, hashvalue,
&node->hj_CurBucketNo, &batchno);
node->hj_CurTuple = NULL;
!
! /*
! * Now we've got an outer tuple and the corresponding hash bucket,
! * but this tuple may not belong to the current batch.
! */
! if (batchno != hashtable->curbatch)
{
/*
! * Need to postpone this outer tuple to a later batch. Save it
! * in the corresponding outer-batch file.
*/
! Assert(batchno > hashtable->curbatch);
! ExecHashJoinSaveTuple(ExecFetchSlotMinimalTuple(outerTupleSlot),
! hashvalue,
! &hashtable->outerBatchFile[batchno]);
! node->hj_NeedNewOuter = true;
! continue; /* loop around for a new outer tuple */
}
}
--- 321,349 ----
ExecHashGetBucketAndBatch(hashtable, hashvalue,
&node->hj_CurBucketNo, &batchno);
node->hj_CurTuple = NULL;
!
! node->hj_OuterTupleMostCommonValuePartition = MCV_INVALID_PARTITION;
!
!
! if (!(hashtable->usingMostCommonValues && isAMostCommonValue(hashtable, hashvalue, &node->hj_OuterTupleMostCommonValuePartition)))
{
/*
! * Now we've got an outer tuple and the corresponding hash bucket,
! * but this tuple may not belong to the current batch.
*/
! if (batchno != hashtable->curbatch)
! {
! /*
! * Need to postpone this outer tuple to a later batch. Save it
! * in the corresponding outer-batch file.
! */
! Assert(batchno > hashtable->curbatch);
! ExecHashJoinSaveTuple(ExecFetchSlotMinimalTuple(outerTupleSlot),
! hashvalue,
! &hashtable->outerBatchFile[batchno]);
! node->hj_NeedNewOuter = true;
! continue; /* loop around for a new outer tuple */
! }
}
}
***************
*** 230,236 ****
*/
for (;;)
{
! curtuple = ExecScanHashBucket(node, econtext);
if (curtuple == NULL)
break; /* out of matches */
--- 352,365 ----
*/
for (;;)
{
! if (node->hj_OuterTupleMostCommonValuePartition != MCV_INVALID_PARTITION)
! {
! curtuple = ExecScanHashMostCommonTuples(node, econtext);
! }
! else
! {
! curtuple = ExecScanHashBucket(node, econtext);
! }
if (curtuple == NULL)
break; /* out of matches */
Index: src/backend/optimizer/path/costsize.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/optimizer/path/costsize.c,v
retrieving revision 1.200
diff -c -r1.200 costsize.c
*** src/backend/optimizer/path/costsize.c 21 Oct 2008 20:42:52 -0000 1.200
--- src/backend/optimizer/path/costsize.c 5 Nov 2008 22:57:01 -0000
***************
*** 109,114 ****
--- 109,116 ----
bool enable_mergejoin = true;
bool enable_hashjoin = true;
+ bool enable_hashjoin_usestatmcvs = true;
+
typedef struct
{
PlannerInfo *root;
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.475
diff -c -r1.475 guc.c
*** src/backend/utils/misc/guc.c 6 Oct 2008 13:05:36 -0000 1.475
--- src/backend/utils/misc/guc.c 9 Oct 2008 19:56:17 -0000
***************
*** 625,630 ****
--- 625,638 ----
true, NULL, NULL
},
{
+ {"enable_hashjoin_usestatmcvs", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the hash join's use of the MCVs stored in pg_statistic."),
+ NULL
+ },
+ &enable_hashjoin_usestatmcvs,
+ true, NULL, NULL
+ },
+ {
{"constraint_exclusion", PGC_USERSET, QUERY_TUNING_OTHER,
gettext_noop("Enables the planner to use constraints to optimize queries."),
gettext_noop("Child table scans will be skipped if their "
Index: src/include/executor/hashjoin.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/hashjoin.h,v
retrieving revision 1.48
diff -c -r1.48 hashjoin.h
*** src/include/executor/hashjoin.h 1 Jan 2008 19:45:57 -0000 1.48
--- src/include/executor/hashjoin.h 17 Oct 2008 23:48:46 -0000
***************
*** 72,77 ****
--- 72,84 ----
#define HJTUPLE_MINTUPLE(hjtup) \
((MinimalTuple) ((char *) (hjtup) + HJTUPLE_OVERHEAD))
+ typedef struct HashJoinMostCommonValueTuplePartition
+ {
+ uint32 hashvalue;
+ HashJoinTuple tuples;
+ } HashJoinMostCommonValueTuplePartition;
+
+ #define MCV_INVALID_PARTITION -1
typedef struct HashJoinTableData
{
***************
*** 116,121 ****
--- 123,134 ----
MemoryContext hashCxt; /* context for whole-hash-join storage */
MemoryContext batchCxt; /* context for this-batch-only storage */
+
+ bool usingMostCommonValues;
+ HashJoinMostCommonValueTuplePartition *mostCommonTuplePartition;
+ int nMostCommonTuplePartitionHashBuckets;
+ int nMostCommonTuplePartitions;
+ uint32 mostCommonTuplesStored;
} HashJoinTableData;
#endif /* HASHJOIN_H */
Index: src/include/executor/nodeHash.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/nodeHash.h,v
retrieving revision 1.45
diff -c -r1.45 nodeHash.h
*** src/include/executor/nodeHash.h 1 Jan 2008 19:45:57 -0000 1.45
--- src/include/executor/nodeHash.h 30 Sep 2008 20:31:35 -0000
***************
*** 45,48 ****
--- 45,51 ----
int *numbuckets,
int *numbatches);
+ extern HashJoinTuple ExecScanHashMostCommonTuples(HashJoinState *hjstate, ExprContext *econtext);
+ extern bool isAMostCommonValue(HashJoinTable hashtable, uint32 hashvalue, int *partitionNumber);
+
#endif /* NODEHASH_H */
Index: src/include/executor/nodeHashjoin.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/nodeHashjoin.h,v
retrieving revision 1.37
diff -c -r1.37 nodeHashjoin.h
*** src/include/executor/nodeHashjoin.h 1 Jan 2008 19:45:57 -0000 1.37
--- src/include/executor/nodeHashjoin.h 30 Sep 2008 20:32:05 -0000
***************
*** 26,29 ****
--- 26,31 ----
extern void ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
BufFile **fileptr);
+ extern void getMostCommonValues(EState *estate, HashJoinState *hjstate);
+
#endif /* NODEHASHJOIN_H */
Index: src/include/nodes/execnodes.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/nodes/execnodes.h,v
retrieving revision 1.194
diff -c -r1.194 execnodes.h
*** src/include/nodes/execnodes.h 31 Oct 2008 19:37:56 -0000 1.194
--- src/include/nodes/execnodes.h 5 Nov 2008 22:57:08 -0000
***************
*** 1386,1391 ****
--- 1386,1392 ----
bool hj_NeedNewOuter;
bool hj_MatchedOuter;
bool hj_OuterNotEmpty;
+ int hj_OuterTupleMostCommonValuePartition;
} HashJoinState;
Index: src/include/optimizer/cost.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/optimizer/cost.h,v
retrieving revision 1.93
diff -c -r1.93 cost.h
*** src/include/optimizer/cost.h 4 Oct 2008 21:56:55 -0000 1.93
--- src/include/optimizer/cost.h 7 Oct 2008 18:31:42 -0000
***************
*** 52,57 ****
--- 52,58 ----
extern bool enable_nestloop;
extern bool enable_mergejoin;
extern bool enable_hashjoin;
+ extern bool enable_hashjoin_usestatmcvs;
extern bool constraint_exclusion;
extern double clamp_row_est(double nrows);
On Wed, Nov 05, 2008 at 04:06:11PM -0800, Bryce Cutt wrote:
Thanks for the new patch; I'll take a look as soon as I can (prolly
tomorrow).
- Josh
Well, that builds nicely, lets me import the data, and I've seen a
performance improvement with enable_hashjoin_usestatmcvs on vs. off. I
plan to test that more formally (though probably not fully to the
extent you did in your paper; just enough to feel comfortable that I'm
getting similar results). Then I'll spend some time poking in the
code, for the relatively little good I feel I can do in that capacity,
and I'll also investigate scenarios with particularly inaccurate
statistics. Stay tuned.
- Josh
On Thu, 2008-11-06 at 15:33 -0700, Joshua Tolley wrote:
Stay tuned.
Minor question on this patch. AFAICS there is another patch that seems
to be aiming at exactly the same use case. Jonah's Bloom filter patch.
Shouldn't we have a dust off to see which one is best? Or at least a
discussion to test whether they overlap? Perhaps you already did that
and I missed it because I'm not very tuned in on this thread.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
We haven't had that discussion AFAIK, and definitely should. First
glance suggests they could coexist peacefully, with proper coaxing. If
I understand things properly, Jonah's patch filters tuples early in
the join process, and this patch tries to ensure that hash join
batches are kept in RAM when they're most likely to be used. So
they're orthogonal in purpose, and the patches actually apply *almost*
cleanly together. Jonah, any comments? If I continue to have some time
to devote, and get through all I think I can do to review this patch,
I'll gladly look at Jonah's too, FWIW.
- Josh
The skew patch and bloom filter patch are orthogonal and can both be
applied. The bloom filter patch is a great idea, and it is used in many
other database systems. You can use the TPC-H data set to demonstrate
that the bloom filter patch will significantly improve performance of
multi-batch joins (with or without data skew).
Any query that filters a build table before joining on the probe table
will show improvements with a bloom filter. For example,
select * from customer, orders where customer.c_nationkey = 10 and
customer.c_custkey = orders.o_custkey
The bloom filter on customer would allow us to avoid probing with orders
tuples that cannot possibly find a match due to the selection criteria.
This is especially beneficial for multi-batch joins where an orders
tuple must be written to disk if its corresponding customer batch is not
the in-memory batch.
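As a rough, hypothetical illustration against the standard TPC-H schema,
the fraction of customer rows surviving the build-side filter roughly
indicates what fraction of orders tuples must still be processed; the
rest could be discarded by the Bloom filter (ignoring false positives):
select (select count(*) from customer where c_nationkey = 10)::float8
       / (select count(*) from customer) as build_side_selectivity;
The smaller this ratio, the more probe tuples can be skipped before they
are written to batch files.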
I have no experience reviewing patches, but I would be happy to help
contribute/review the bloom filter patch as best I can.
--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon.lawrence@ubc.ca
I've no patch review experience, either -- this is my first one. See
http://wiki.postgresql.org/wiki/Reviewing_a_Patch for details on what
a reviewer ought to do in general; various patch review discussions on
the -hackers list have also proven helpful. As regards this patch
specifically, it seems we could merge the two patches into one and
consider them together. However, the bloom filter patch is listed as a
"Work in Progress" on
http://wiki.postgresql.org/wiki/CommitFest_2008-11. Perhaps it needs
more work before being considered seriously? Jonah, what do you think
would be most helpful?
- Josh / eggyknap
Well, this version seems to work as advertised. Skewed data sets tend to
hash join more quickly with this turned on, and data sets with
deliberately bad statistics don't perform much differently than with the
feature turned off. The patch applies cleanly to CVS HEAD.
I don't consider myself qualified to do a decent code review. However, I
noticed that the comments are all done with // instead of /* ... */.
That should probably be changed.
To those familiar with code review: is there more I should do to review
this?
- Josh / eggyknap
"Lawrence, Ramon" <ramon.lawrence@ubc.ca> writes:
We propose a patch that improves hybrid hash join's performance for
large multi-batch joins where the probe relation has skew.
...
The basic idea
is to keep build relation tuples in a small in-memory hash table that
have join values that are frequently occurring in the probe relation.
I looked at this patch a little.
I'm a tad worried about what happens when the values that are frequently
occurring in the outer relation are also frequently occurring in the
inner (which hardly seems an improbable case). Don't you stand a severe
risk of blowing out the in-memory hash table? It doesn't appear to me
that the code has any way to back off once it's decided that a certain
set of join key values are to be treated in-memory. Splitting the main
join into more batches certainly doesn't help with that.
Also, AFAICS the benefit of this patch comes entirely from avoiding dump
and reload of tuples bearing the most common values, which means it's a
significant waste of cycles when there's only one batch. It'd be better
to avoid doing any of the extra work in the single-batch case.
One thought that might address that point as well as the difficulty of
getting stats in nontrivial cases is to wait until we've overrun memory
and are forced to start batching, and at that point determine on-the-fly
which are the most common hash values from inspection of the hash table
as we dump it out. This would amount to optimizing on the basis of
frequency in the *inner* relation not the outer, but offhand I don't see
any strong theoretical basis why that wouldn't be just as good. It
could lose if the first work_mem worth of inner tuples isn't
representative of what follows; but this hardly seems more dangerous
than depending on MCV stats that are for the whole outer relation rather
than the portion of it being selected.
regards, tom lane
You are correct with both observations. The patch only has a benefit
when there is more than one batch. Also, there is a potential issue
with MCV hash table overflows if the number of tuples that match the
MCVs in the build relation is very large.
Bryce has created a patch (attached) that disables the code for
one-batch joins. This patch also checks for MCV hash table overflows and
handles them by "flushing" from the MCV hash table back to the main hash
table. The main hash table will then resolve overflows as usual. Note
that this will cause the worst case of a build table with all the same
values to be handled the same as the current hash code, i.e., it will
attempt to re-partition until it eventually gives up and then allocates
the entire partition in memory. There may be a better way to handle
this case, but the new patch will remain consistent with the current
hash join implementation.
The issue with determining and using the MCV stats is more challenging
than it appears. First, knowing the MCVs of the build table will not
help us. What we need are the MCVs of the probe table because by
knowing those values we will keep the tuples with those values in the
build relation in memory. For example, consider a join between tables
Part and LineItem. Assume 1 popular part accounts for 10% of all
LineItems. If Part is the build relation and LineItem is the probe
relation, then by keeping that 1 part record in memory, we will
guarantee that we do not need to write out 10% of LineItem. If a
selection occurs on LineItem before the join, it may change the
distribution of LineItem (the MCVs) but it is probable that they are
still a good estimate of the MCVs in the derived LineItem relation. (We
did experiments on trying to sample the first few thousand tuples of the
probe relation to dynamically determine the MCVs but generally found
this was inaccurate due to non-random samples.) In essence, the goal is
to smartly pick the tuples that remain in the in-memory batch before
probing begins. Since the number of MCVs is small, incorrectly
selecting build tuples to remain in memory has negligible cost.
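For reference, the probe-side MCVs in question can be inspected through
the pg_stats view; a hypothetical example for the TPC-H lineitem table,
assuming it has been ANALYZEd, is:
select most_common_vals, most_common_freqs
  from pg_stats
 where tablename = 'lineitem' and attname = 'l_partkey';
The patch itself reads the same statistics through the syscache and
get_attstatsslot() rather than this view.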
If we assume that LineItem has been filtered so much that it is now
smaller than Part and is the build relation then the MCV approach does
not apply. There is no skew in Part on partkey (since it is the PK) and
knowing the MCV partkeys in LineItem does not help us because they each
only join with a single tuple in Part. In this case, the MCV approach
should not be used because no benefit is possible, and it will not be
used because there will be no MCVs for Part.partkey.
The bad case with MCV hash table overflow requires a many-to-many join
between the two relations which would not occur on the more typical
PK-FK joins.
--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon.lawrence@ubc.ca
Attachments:
histojoin_v3.patchapplication/octet-stream; name=histojoin_v3.patchDownload
Index: src/backend/executor/nodeHash.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHash.c,v
retrieving revision 1.116
diff -c -r1.116 nodeHash.c
*** src/backend/executor/nodeHash.c 1 Jan 2008 19:45:49 -0000 1.116
--- src/backend/executor/nodeHash.c 24 Nov 2008 12:32:13 -0000
***************
*** 54,59 ****
--- 54,165 ----
}
/* ----------------------------------------------------------------
+ * isAMostCommonValue
+ *
+ * is the value one of the most common key values?
+ * ----------------------------------------------------------------
+ */
+ bool isAMostCommonValue(HashJoinTable hashtable, uint32 hashvalue, int *partitionNumber)
+ {
+ int bucket = hashvalue & (hashtable->nMostCommonTuplePartitionHashBuckets - 1);
+
+ while (hashtable->mostCommonTuplePartition[bucket].hashvalue != 0
+ && hashtable->mostCommonTuplePartition[bucket].hashvalue != hashvalue)
+ {
+ bucket = (bucket + 1) & (hashtable->nMostCommonTuplePartitionHashBuckets - 1);
+ }
+
+ if (!hashtable->mostCommonTuplePartition[bucket].frozen && hashtable->mostCommonTuplePartition[bucket].hashvalue == hashvalue)
+ {
+ *partitionNumber = bucket;
+ return true;
+ }
+
+ /* must have run into an empty slot which means this is not an MCV*/
+ *partitionNumber = MCV_INVALID_PARTITION;
+ return false;
+ }
+
+ /*
+ * freezeNextMCVPartiton
+ *
+ * flush the tuples of the next MCV partition by pushing them into the main hashtable
+ */
+ bool freezeNextMCVPartiton(HashJoinTable hashtable) {
+ int partitionToFlush = hashtable->nMostCommonTuplePartitions - 1 - hashtable->nMostCommonTuplePartitionsFlushed;
+ if (partitionToFlush < 0)
+ return false;
+ else
+ {
+
+ int bucketno;
+ int batchno;
+ uint32 hashvalue;
+ HashJoinTuple hashTuple;
+ HashJoinTuple nextHashTuple;
+ HashJoinMostCommonValueTuplePartition *partition;
+ MinimalTuple mintuple;
+
+ partition = hashtable->flushOrderedMostCommonTuplePartition[partitionToFlush];
+ hashvalue = partition->hashvalue;
+
+ Assert(hashvalue != 0);
+
+ hashTuple = partition->tuples;
+
+ ExecHashGetBucketAndBatch(hashtable, hashvalue,
+ &bucketno, &batchno);
+
+ while (hashTuple != NULL)
+ {
+ /* decide whether to put the tuples in the hash table or a temp file */
+ if (batchno == hashtable->curbatch)
+ {
+ /* put the tuples in hash table */
+ nextHashTuple = hashTuple->next;
+
+ hashTuple->next = hashtable->buckets[bucketno];
+ hashtable->buckets[bucketno] = hashTuple;
+
+ hashTuple = nextHashTuple;
+ hashtable->totalTuples++;
+ hashtable->mostCommonTuplesStored--;
+
+ if (hashtable->spaceUsed > hashtable->spaceAllowed)
+ {
+ ExecHashIncreaseNumBatches(hashtable);
+ /* likely changed due to increase in batches */
+ ExecHashGetBucketAndBatch(hashtable, hashvalue,
+ &bucketno, &batchno);
+ }
+ }
+ else
+ {
+ /* put the tuples into a temp file for later batches */
+ Assert(batchno > hashtable->curbatch);
+ mintuple = HJTUPLE_MINTUPLE(hashTuple);
+ ExecHashJoinSaveTuple(mintuple,
+ hashvalue,
+ &hashtable->innerBatchFile[batchno]);
+ hashtable->spaceUsed -= HJTUPLE_OVERHEAD + mintuple->t_len;
+ nextHashTuple = hashTuple->next;
+ pfree(hashTuple);
+ hashTuple = nextHashTuple;
+ hashtable->inTupIOs++;
+ hashtable->totalTuples++;
+ hashtable->mostCommonTuplesStored--;
+ }
+ }
+
+ partition->frozen = true;
+ partition->tuples = NULL;
+ hashtable->nMostCommonTuplePartitionsFlushed++;
+
+ return true;
+ }
+ }
+
+ /* ----------------------------------------------------------------
* MultiExecHash
*
* build hash table for hashjoin, doing partitioning if more
***************
*** 69,74 ****
--- 175,182 ----
TupleTableSlot *slot;
ExprContext *econtext;
uint32 hashvalue;
+ MinimalTuple mintuple;
+ int partitionNumber;
/* must provide our own instrumentation support */
if (node->ps.instrument)
***************
*** 99,106 ****
if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
&hashvalue))
{
! ExecHashTableInsert(hashtable, slot, hashvalue);
! hashtable->totalTuples += 1;
}
}
--- 207,240 ----
if (ExecHashGetHashValue(hashtable, econtext, hashkeys, false, false,
&hashvalue))
{
! partitionNumber = MCV_INVALID_PARTITION;
!
! if (hashtable->usingMostCommonValues && isAMostCommonValue(hashtable, hashvalue, &partitionNumber))
! {
! HashJoinTuple hashTuple;
! int hashTupleSize;
!
! mintuple = ExecFetchSlotMinimalTuple(slot);
! hashTupleSize = HJTUPLE_OVERHEAD + mintuple->t_len;
! hashTuple = (HashJoinTuple) palloc(hashTupleSize);
! hashTuple->hashvalue = hashvalue;
! memcpy(HJTUPLE_MINTUPLE(hashTuple), mintuple, mintuple->t_len);
!
! hashTuple->next = hashtable->mostCommonTuplePartition[partitionNumber].tuples;
! hashtable->mostCommonTuplePartition[partitionNumber].tuples = hashTuple;
!
! hashtable->spaceUsed += hashTupleSize;
!
! hashtable->mostCommonTuplesStored++;
!
! while (hashtable->spaceUsed > hashtable->spaceAllowed && freezeNextMCVPartiton(hashtable)) {}
! }
!
! if (partitionNumber == MCV_INVALID_PARTITION)
! {
! ExecHashTableInsert(hashtable, slot, hashvalue);
! hashtable->totalTuples += 1;
! }
}
}
***************
*** 461,466 ****
--- 595,606 ----
BufFileClose(hashtable->outerBatchFile[i]);
}
+ if (hashtable->usingMostCommonValues)
+ {
+ pfree(hashtable->mostCommonTuplePartition);
+ pfree(hashtable->flushOrderedMostCommonTuplePartition);
+ }
+
/* Release working memory (batchCxt is a child, so it goes away too) */
MemoryContextDelete(hashtable->hashCxt);
***************
*** 798,803 ****
--- 938,1005 ----
}
/*
+ * ExecScanHashMostCommonTuples
+ * scan a hash bucket for matches to the current outer tuple
+ *
+ * The current outer tuple must be stored in econtext->ecxt_outertuple.
+ */
+ HashJoinTuple
+ ExecScanHashMostCommonTuples(HashJoinState *hjstate,
+ ExprContext *econtext)
+ {
+ List *hjclauses = hjstate->hashclauses;
+ HashJoinTable hashtable = hjstate->hj_HashTable;
+ HashJoinTuple hashTuple = hjstate->hj_CurTuple;
+ uint32 hashvalue = hjstate->hj_CurHashValue;
+
+ /*
+ * hj_CurTuple is NULL to start scanning a new bucket, or the address of
+ * the last tuple returned from the current bucket.
+ */
+ if (hashTuple == NULL)
+ {
+ //painstakingly make sure this is a valid partition index
+ Assert(hjstate->hj_OuterTupleMostCommonValuePartition > MCV_INVALID_PARTITION);
+ Assert(hjstate->hj_OuterTupleMostCommonValuePartition < hashtable->nMostCommonTuplePartitionHashBuckets);
+ Assert(hashtable->mostCommonTuplePartition[hjstate->hj_OuterTupleMostCommonValuePartition].hashvalue != 0);
+
+ hashTuple = hashtable->mostCommonTuplePartition[hjstate->hj_OuterTupleMostCommonValuePartition].tuples;
+ }
+ else
+ hashTuple = hashTuple->next;
+
+ while (hashTuple != NULL)
+ {
+ if (hashTuple->hashvalue == hashvalue)
+ {
+ TupleTableSlot *inntuple;
+
+ /* insert hashtable's tuple into exec slot so ExecQual sees it */
+ inntuple = ExecStoreMinimalTuple(HJTUPLE_MINTUPLE(hashTuple),
+ hjstate->hj_HashTupleSlot,
+ false); /* do not pfree */
+ econtext->ecxt_innertuple = inntuple;
+
+ /* reset temp memory each time to avoid leaks from qual expr */
+ ResetExprContext(econtext);
+
+ if (ExecQual(hjclauses, econtext, false))
+ {
+ hjstate->hj_CurTuple = hashTuple;
+ return hashTuple;
+ }
+ }
+
+ hashTuple = hashTuple->next;
+ }
+
+ /*
+ * no match
+ */
+ return NULL;
+ }
+
+ /*
* ExecScanHashBucket
* scan a hash bucket for matches to the current outer tuple
*
Index: src/backend/executor/nodeHashjoin.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHashjoin.c,v
retrieving revision 1.96
diff -c -r1.96 nodeHashjoin.c
*** src/backend/executor/nodeHashjoin.c 23 Oct 2008 14:34:34 -0000 1.96
--- src/backend/executor/nodeHashjoin.c 24 Nov 2008 12:35:29 -0000
***************
*** 20,25 ****
--- 20,30 ----
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
#include "utils/memutils.h"
+ #include "optimizer/cost.h"
+ #include "utils/syscache.h"
+ #include "utils/lsyscache.h"
+ #include "parser/parsetree.h"
+ #include "catalog/pg_statistic.h"
/* Returns true for JOIN_LEFT and JOIN_ANTI jointypes */
***************
*** 34,39 ****
--- 39,149 ----
TupleTableSlot *tupleSlot);
static int ExecHashJoinNewBatch(HashJoinState *hjstate);
+ /*
+ * getMostCommonValues
+ *
+ *
+ */
+ void getMostCommonValues(EState *estate, HashJoinState *hjstate)
+ {
+ HeapTupleData *statsTuple;
+ FuncExprState *clause;
+ ExprState *argstate;
+ Var *variable;
+
+ Datum *values;
+ int nvalues;
+ float4 *numbers;
+ int nnumbers;
+
+ Oid relid;
+ AttrNumber relattnum;
+ Oid atttype;
+ int32 atttypmod;
+
+ int i;
+
+ //is it a join on more than one key?
+ if (hjstate->hashclauses->length != 1)
+ return; //histojoin is not defined for more than one join key so run away
+
+ //make sure the outer node is a seq scan on a base relation otherwise we cant get MCVs at the moment and should not bother trying
+ if (outerPlanState(hjstate)->type != T_SeqScanState)
+ return;
+
+ //grab the relation object id of the outer relation
+ relid = getrelid(((SeqScan *) ((SeqScanState *) outerPlanState(hjstate))->ps.plan)->scanrelid, estate->es_range_table);
+ clause = (FuncExprState *) lfirst(list_head(hjstate->hashclauses));
+ argstate = (ExprState *) lfirst(list_head(clause->args));
+ variable = (Var *) argstate->expr;
+
+ //grab the necessary properties of the join variable
+ relattnum = variable->varattno;
+ atttype = variable->vartype;
+ atttypmod = variable->vartypmod;
+
+ statsTuple = SearchSysCache(STATRELATT,
+ ObjectIdGetDatum(relid),
+ Int16GetDatum(relattnum),
+ 0, 0);
+
+ if (HeapTupleIsValid(statsTuple))
+ {
+ if (get_attstatsslot(statsTuple,
+ atttype, atttypmod,
+ STATISTIC_KIND_MCV, InvalidOid,
+ &values, &nvalues,
+ &numbers, &nnumbers))
+ {
+ HashJoinTable hashtable;
+ FmgrInfo *hashfunctions;
+ //MCV Partitions is an open addressing hashtable with a power of 2 size greater than the number of MCV values
+ int nbuckets = 2;
+ uint32 collisionsWhileHashing = 0;
+ while (nbuckets <= nvalues)
+ {
+ nbuckets <<= 1;
+ }
+ //use two more bit just to help avoid collisions
+ nbuckets <<= 2;
+
+ hashtable = hjstate->hj_HashTable;
+ hashtable->usingMostCommonValues = true;
+ hashtable->nMostCommonTuplePartitionHashBuckets = nbuckets;
+ hashtable->mostCommonTuplePartition = palloc0(nbuckets * sizeof(HashJoinMostCommonValueTuplePartition));
+ hashtable->flushOrderedMostCommonTuplePartition = palloc0(nvalues * sizeof(HashJoinMostCommonValueTuplePartition*));
+ hashfunctions = hashtable->outer_hashfunctions;
+
+ //create the partitions
+ for (i = 0; i < nvalues; i++)
+ {
+ uint32 hashvalue = DatumGetUInt32(FunctionCall1(&hashfunctions[0], values[i]));
+ int bucket = hashvalue & (nbuckets - 1);
+
+ while (hashtable->mostCommonTuplePartition[bucket].hashvalue != 0
+ && hashtable->mostCommonTuplePartition[bucket].hashvalue != hashvalue)
+ {
+ bucket = (bucket + 1) & (nbuckets - 1);
+ collisionsWhileHashing++;
+ }
+
+ //leave partition alone if it has the same hashvalue as current MCV. we only want one partition per hashvalue
+ if (hashtable->mostCommonTuplePartition[bucket].hashvalue != hashvalue)
+ {
+ hashtable->mostCommonTuplePartition[bucket].tuples = NULL;
+ hashtable->mostCommonTuplePartition[bucket].hashvalue = hashvalue;
+ hashtable->mostCommonTuplePartition[bucket].frozen = false;
+ + hashtable->flushOrderedMostCommonTuplePartition[hashtable->nMostCommonTuplePartitions] = &hashtable->mostCommonTuplePartition[bucket];
+ hashtable->nMostCommonTuplePartitions++;
+ }
+ }
+
+ free_attstatsslot(atttype, values, nvalues, numbers, nnumbers);
+ }
+
+ ReleaseSysCache(statsTuple);
+ }
+ }
/* ----------------------------------------------------------------
* ExecHashJoin
***************
*** 146,151 ****
--- 256,271 ----
hashtable = ExecHashTableCreate((Hash *) hashNode->ps.plan,
node->hj_HashOperators);
node->hj_HashTable = hashtable;
+
+ hashtable->usingMostCommonValues = false;
+ hashtable->nMostCommonTuplePartitions = 0;
+ hashtable->nMostCommonTuplePartitionHashBuckets = 0;
+ hashtable->mostCommonTuplesStored = 0;
+ hashtable->mostCommonTuplePartition = NULL;
+ hashtable->nMostCommonTuplePartitionsFlushed = 0;
+
+ if (hashtable->nbatch > 1 && enable_hashjoin_usestatmcvs)
+ getMostCommonValues(estate, node);
/*
* execute the Hash node, to build the hash table
***************
*** 157,163 ****
* If the inner relation is completely empty, and we're not doing an
* outer join, we can quit without scanning the outer relation.
*/
! if (hashtable->totalTuples == 0 && !HASHJOIN_IS_OUTER(node))
return NULL;
/*
--- 277,283 ----
* If the inner relation is completely empty, and we're not doing an
* outer join, we can quit without scanning the outer relation.
*/
! if (hashtable->totalTuples == 0 && hashtable->mostCommonTuplesStored == 0 && !HASHJOIN_IS_OUTER(node))
return NULL;
/*
***************
*** 205,227 ****
ExecHashGetBucketAndBatch(hashtable, hashvalue,
&node->hj_CurBucketNo, &batchno);
node->hj_CurTuple = NULL;
!
! /*
! * Now we've got an outer tuple and the corresponding hash bucket,
! * but this tuple may not belong to the current batch.
! */
! if (batchno != hashtable->curbatch)
{
/*
! * Need to postpone this outer tuple to a later batch. Save it
! * in the corresponding outer-batch file.
*/
! Assert(batchno > hashtable->curbatch);
! ExecHashJoinSaveTuple(ExecFetchSlotMinimalTuple(outerTupleSlot),
! hashvalue,
! &hashtable->outerBatchFile[batchno]);
! node->hj_NeedNewOuter = true;
! continue; /* loop around for a new outer tuple */
}
}
--- 325,353 ----
ExecHashGetBucketAndBatch(hashtable, hashvalue,
&node->hj_CurBucketNo, &batchno);
node->hj_CurTuple = NULL;
!
! node->hj_OuterTupleMostCommonValuePartition = MCV_INVALID_PARTITION;
!
!
! if (!(hashtable->usingMostCommonValues && isAMostCommonValue(hashtable, hashvalue, &node->hj_OuterTupleMostCommonValuePartition)))
{
/*
! * Now we've got an outer tuple and the corresponding hash bucket,
! * but this tuple may not belong to the current batch.
*/
! if (batchno != hashtable->curbatch)
! {
! /*
! * Need to postpone this outer tuple to a later batch. Save it
! * in the corresponding outer-batch file.
! */
! Assert(batchno > hashtable->curbatch);
! ExecHashJoinSaveTuple(ExecFetchSlotMinimalTuple(outerTupleSlot),
! hashvalue,
! &hashtable->outerBatchFile[batchno]);
! node->hj_NeedNewOuter = true;
! continue; /* loop around for a new outer tuple */
! }
}
}
***************
*** 230,236 ****
*/
for (;;)
{
! curtuple = ExecScanHashBucket(node, econtext);
if (curtuple == NULL)
break; /* out of matches */
--- 356,369 ----
*/
for (;;)
{
! if (node->hj_OuterTupleMostCommonValuePartition != MCV_INVALID_PARTITION)
! {
! curtuple = ExecScanHashMostCommonTuples(node, econtext);
! }
! else
! {
! curtuple = ExecScanHashBucket(node, econtext);
! }
if (curtuple == NULL)
break; /* out of matches */
Index: src/backend/optimizer/path/costsize.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/optimizer/path/costsize.c,v
retrieving revision 1.201
diff -c -r1.201 costsize.c
*** src/backend/optimizer/path/costsize.c 22 Nov 2008 22:47:05 -0000 1.201
--- src/backend/optimizer/path/costsize.c 24 Nov 2008 12:15:00 -0000
***************
*** 109,114 ****
--- 109,116 ----
bool enable_mergejoin = true;
bool enable_hashjoin = true;
+ bool enable_hashjoin_usestatmcvs = true;
+
typedef struct
{
PlannerInfo *root;
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.481
diff -c -r1.481 guc.c
*** src/backend/utils/misc/guc.c 21 Nov 2008 20:14:27 -0000 1.481
--- src/backend/utils/misc/guc.c 24 Nov 2008 12:15:05 -0000
***************
*** 636,641 ****
--- 636,649 ----
true, NULL, NULL
},
{
+ {"enable_hashjoin_usestatmcvs", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the hash join's use of the MCVs stored in pg_statistic."),
+ NULL
+ },
+ &enable_hashjoin_usestatmcvs,
+ true, NULL, NULL
+ },
+ {
{"constraint_exclusion", PGC_USERSET, QUERY_TUNING_OTHER,
gettext_noop("Enables the planner to use constraints to optimize queries."),
gettext_noop("Child table scans will be skipped if their "
Index: src/include/executor/hashjoin.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/hashjoin.h,v
retrieving revision 1.48
diff -c -r1.48 hashjoin.h
*** src/include/executor/hashjoin.h 1 Jan 2008 19:45:57 -0000 1.48
--- src/include/executor/hashjoin.h 24 Nov 2008 12:40:18 -0000
***************
*** 72,77 ****
--- 72,85 ----
#define HJTUPLE_MINTUPLE(hjtup) \
((MinimalTuple) ((char *) (hjtup) + HJTUPLE_OVERHEAD))
+ typedef struct HashJoinMostCommonValueTuplePartition
+ {
+ uint32 hashvalue;
+ bool frozen;
+ HashJoinTuple tuples;
+ } HashJoinMostCommonValueTuplePartition;
+
+ #define MCV_INVALID_PARTITION -1
typedef struct HashJoinTableData
{
***************
*** 116,121 ****
--- 124,137 ----
MemoryContext hashCxt; /* context for whole-hash-join storage */
MemoryContext batchCxt; /* context for this-batch-only storage */
+
+ bool usingMostCommonValues;
+ HashJoinMostCommonValueTuplePartition *mostCommonTuplePartition;
+ HashJoinMostCommonValueTuplePartition **flushOrderedMostCommonTuplePartition;
+ int nMostCommonTuplePartitionHashBuckets;
+ int nMostCommonTuplePartitions;
+ int nMostCommonTuplePartitionsFlushed;
+ uint32 mostCommonTuplesStored;
} HashJoinTableData;
#endif /* HASHJOIN_H */
Index: src/include/executor/nodeHash.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/nodeHash.h,v
retrieving revision 1.45
diff -c -r1.45 nodeHash.h
*** src/include/executor/nodeHash.h 1 Jan 2008 19:45:57 -0000 1.45
--- src/include/executor/nodeHash.h 30 Sep 2008 20:31:35 -0000
***************
*** 45,48 ****
--- 45,51 ----
int *numbuckets,
int *numbatches);
+ extern HashJoinTuple ExecScanHashMostCommonTuples(HashJoinState *hjstate, ExprContext *econtext);
+ extern bool isAMostCommonValue(HashJoinTable hashtable, uint32 hashvalue, int *partitionNumber);
+
#endif /* NODEHASH_H */
Index: src/include/executor/nodeHashjoin.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/nodeHashjoin.h,v
retrieving revision 1.37
diff -c -r1.37 nodeHashjoin.h
*** src/include/executor/nodeHashjoin.h 1 Jan 2008 19:45:57 -0000 1.37
--- src/include/executor/nodeHashjoin.h 30 Sep 2008 20:32:05 -0000
***************
*** 26,29 ****
--- 26,31 ----
extern void ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
BufFile **fileptr);
+ extern void getMostCommonValues(EState *estate, HashJoinState *hjstate);
+
#endif /* NODEHASHJOIN_H */
Index: src/include/nodes/execnodes.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/nodes/execnodes.h,v
retrieving revision 1.196
diff -c -r1.196 execnodes.h
*** src/include/nodes/execnodes.h 16 Nov 2008 17:34:28 -0000 1.196
--- src/include/nodes/execnodes.h 17 Nov 2008 20:05:27 -0000
***************
*** 1392,1397 ****
--- 1392,1398 ----
bool hj_NeedNewOuter;
bool hj_MatchedOuter;
bool hj_OuterNotEmpty;
+ int hj_OuterTupleMostCommonValuePartition;
} HashJoinState;
Index: src/include/optimizer/cost.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/optimizer/cost.h,v
retrieving revision 1.93
diff -c -r1.93 cost.h
*** src/include/optimizer/cost.h 4 Oct 2008 21:56:55 -0000 1.93
--- src/include/optimizer/cost.h 7 Oct 2008 18:31:42 -0000
***************
*** 52,57 ****
--- 52,58 ----
extern bool enable_nestloop;
extern bool enable_mergejoin;
extern bool enable_hashjoin;
+ extern bool enable_hashjoin_usestatmcvs;
extern bool constraint_exclusion;
extern double clamp_row_est(double nrows);
I have to admit that I haven't fully grokked what this patch is about
just yet, so what follows is mostly a coding style review at this
point. It would help a lot if you could add some comments to the new
functions that are being added to explain the purpose of each at a
very high level. There's clearly been a lot of thought put into some
parts of this logic, so it would be worth explaining the reasoning
behind that logic.
This patch applies cleanly against CVS HEAD, but does not compile
(please fix the warning, too).
nodeHash.c:88: warning: no previous prototype for 'freezeNextMCVPartiton'
nodeHash.c: In function 'freezeNextMCVPartiton':
nodeHash.c:148: error: 'struct HashJoinTableData' has no member named 'inTupIOs'
I commented out the offending line. It errored out again here:
nodeHashjoin.c: In function 'getMostCommonValues':
nodeHashjoin.c:136: error: wrong type argument to unary plus
After removing the stray + sign, it compiled, but failed the
"rangefuncs" regression test. If you're going to keep the
enable_hashjoin_usestatmcvs GUC around, you need to patch
rangefuncs.out so that the regression tests pass. I think, however,
that there was some discussion of removing that before the patch is
committed; if so, please do that instead. Keeping the GUC would also
require patching the documentation, which the current patch does not
do.
getMostCommonValues() isn't a good name for a non-static function
because there's nothing to tip the reader off to the fact that it has
something to do with hash joins; compare with the other function names
defined in the same header file. On the flip side, that function has
only one call site, so it should probably be made static and not
declared in the header file at all. Some of the other new functions
need similar treatment. I am also a little suspicious of this bit of
code:
relid = getrelid(((SeqScan *) ((SeqScanState *)
outerPlanState(hjstate))->ps.plan)->scanrelid,
estate->es_range_table);
clause = (FuncExprState *) lfirst(list_head(hjstate->hashclauses));
argstate = (ExprState *) lfirst(list_head(clause->args));
variable = (Var *) argstate->expr;
I'm not very familiar with the hash join code, but it seems like there
are a lot of assumptions being made there about what things are
pointing to what other things. Is this actually safe? And if it
is, perhaps a comment explaining why?
getMostCommonValues() also appears to be creating and maintaining a
counter called collisionsWhileHashing, but nothing is ever done with
the counter. On a similar note, the variables relattnum, atttype, and
atttypmod don't appear to be necessary; 2 out of 3 of them are only
used once, so maybe inlining the reference and dropping the variable
would make more sense. Also, the if (HeapTupleIsValid(statsTuple))
block encompasses the whole rest of the function, maybe if
(!HeapTupleIsValid(statsTuple)) return?
I don't understand why
hashtable->mostCommonTuplePartition[bucket].tuples and .frozen need to
be initialized to 0. It looks to me like those are in a zero-filled
array that was just allocated, so it shouldn't be necessary to re-zero
them, unless I'm missing something.
freezeNextMCVPartiton is mis-spelled consistently throughout (the last
three letters should be "ion"). I also don't think it makes sense to
enclose everything but the first two lines of that function in an
else-block.
There is some initialization code in ExecHashJoin() that looks like it
belongs in ExecHashTableCreate.
It appears to me that the interface to isAMostCommonValue() could be
simplified by just making it return the partition number. It could
perhaps be renamed something like ExecHashGetMCVPartition().
Does ExecHashTableDestroy() need to explicitly pfree
hashtable->mostCommonTuplePartition and
hashtable->flushOrderedMostCommonTuplePartition? I would think those
would be allocated out of hashCxt - if they aren't, they probably
should be.
Department of minor nitpicks: (1) C++-style comments are not
permitted, (2) function names need to be capitalized like_this() or
LikeThis() but not likeThis(), (3) when defining a function, the
return type should be placed on the line preceding the actual function
name, so that the function name is at the beginning of the line, (4)
curly braces should be avoided around a block containing only one
statement, (5) excessive blank lines should be avoided (for example,
the one in costsize.c is clearly unnecessary, and there's at least one
place where you add two consecutive blank lines), and (6) I believe
the accepted way to write an empty loop is an indented semi-colon on
the next line, rather than {} on the same line as the while.
I will try to do some more substantive testing of this as well.
...Robert
Dr. Lawrence:
I'm still working on reviewing this patch. I've managed to load the
sample TPCH data from tpch1g1z.zip after changing the line endings to
UNIX-style and chopping off the trailing vertical bars. (If anyone is
interested, I have the results of pg_dump | bzip2 -9 on the resulting
database, which I would be happy to upload if someone has server
space. It is about 250MB.)
But, I'm not sure quite what to do in terms of generating queries.
TPCHSkew contains QGEN.EXE, but that seems to require that you provide
template queries as input, and I'm not sure where to get the
templates.
Any suggestions?
Thanks,
...Robert
Robert,
You do not need to use qgen.exe to generate queries as you are not
running the TPC-H benchmark test. Attached is an example set of the 22
TPC-H queries as specified by the benchmark.
We have not tested using the TPC-H queries for this particular patch and
only use the TPC-H database as a large, skewed data set. The simpler
queries we test involve joins of Part-Lineitem or Supplier-Lineitem such
as:
Select * from part, lineitem where p_partkey = l_partkey
OR
Select count(*) from part, lineitem where p_partkey = l_partkey
The count(*) version is usually more useful for comparisons as the
generation of output tuples on the client side (say with pgadmin)
dominates the actual time to complete the query.
To isolate query costs, we also test using a simple server-side
function. I have also attached the setup description.
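A minimal sketch of such a function (hypothetical; the actual one is in
the attached setup description) is:
create or replace function timed_join() returns interval as $$
declare
    t0 timestamptz := clock_timestamp();
    n  bigint;
begin
    select count(*) into n from part, lineitem where p_partkey = l_partkey;
    return clock_timestamp() - t0;
end;
$$ language plpgsql;
Calling "select timed_join();" then reports only the elapsed join time,
with no result rows shipped to the client.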
I would be happy to help in any way I can.
Bryce is currently working on an updated patch according to your
suggestions.
--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon.lawrence@ubc.ca
Robert,
I thoroughly appreciate the constructive criticism.
The compile errors are due to my development process being convoluted.
I will endeavor to not waste your time in the future with errors
caused by my development process.
I have updated the code to follow the conventions and suggestions
given. I am now working on adding the requested documentation. I
will not submit the next patch until that is done. The functionality
has not changed so you can performance test with the patch you have.
As for that particularly ugly piece of code: I figured that out while
digging through the selfuncs code. Basically I needed a way to get
the stats tuple for the outer relation join column of the join but to
do that I needed to figure out how to get the actual relation id and
attribute number that was being joined.
I have not yet figured out a better way to do this but I am sure there
is someone on the mailing list with far more knowledge of this than I
have.
I could possibly be more vigorous in testing to make sure the things I
am casting are exactly what I expect. My tests have always been
consistent so far.
I am essentially doing what is done in selfuncs. I believe I could
use the examine_variable() function in selfuncs.c except I would first
need a PlannerInfo and I don't think I can get that from inside the
join initialization code.
- Bryce Cutt
[Some performance testing.]
I ran this query 10x with this patch applied, and then 10x again with
enable_hashjoin_usestatmcvs set to false to disable the optimization:
select sum(1) from (select * from part, lineitem where p_partkey = l_partkey) x;
With the optimization enabled, the query took between 26.6 and 38.3
seconds with an average of 31.6. With the optimization disabled, the
query took between 48.3 and 69.0 seconds with an average of 60.0
seconds.
It appears that the 100 entries in pg_statistic cover about 32% of l_partkey:
tpch=# WITH x AS (
SELECT stanumbers1, array_length(stanumbers1, 1) AS len
FROM pg_statistic WHERE starelid='lineitem'::regclass
AND staattnum = (SELECT attnum FROM pg_attribute
WHERE attrelid='lineitem'::regclass AND
attname='l_partkey')
)
SELECT sum(x.stanumbers1[y.g]) FROM x,
(select generate_series(1, x.len) g from x) y;
sum
--------
0.3276
(1 row)
(there's probably a better way to write that query...)
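A possibly simpler variant, assuming a server new enough to have
unnest(), sums the frequencies exposed by pg_stats:
select sum(freq) as mcv_coverage
  from (select unnest(most_common_freqs) as freq
          from pg_stats
         where tablename = 'lineitem' and attname = 'l_partkey') s;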
stadistinct for l_partkey is 23,050; the actual number of distinct
values is 199,919. IOW, 0.05% of the distinct values account for
32.76% of the table. That's a lot of skew, but not unrealistic - I've
seen tables where more than half of the rows were covered by a single
value.
...Robert
On Sun, Dec 21, 2008 at 10:25:59PM -0500, Robert Haas wrote:
[Some performance testing.]
I (finally!) have a chance to post my performance testing results... my
apologies for the really long delay. <Excuses omitted>
Unfortunately I'm not seeing wonderful speedups with the particular
queries I did in this case. I generated three 1GB datasets, with skews
set at 1, 2, and 3. The test script I wrote turns on
enable_hashjoin_usestatmcvs and runs EXPLAIN ANALYZE on the same query
five times. Then it turns enable_hashjoin_usestatmcvs off, and runs the
same query five more times. It does
this with each of the three datasets in turn, and then starts over at
the beginning until I tell it to quit. My results showed a statistically
significant improvement in speed only on the skew == 3 dataset.
I did the same tests twice, once with default_statistics_target set to
10, and once with it set to 100. I've attached boxplots of the total
query times as reported by EXPLAIN ANALYZE ("dst10" in the filename
indicates default_statistics_target was 10, and so on), my results
parsed out of the EXPLAIN ANALYZE output (test.filtered.10 and
test.filtered.100), the results of one-tailed Student's T tests of the
result set (ttests), and the R code to run the tests if anyone's really
interested (t.test.R).
The results data includes six columns: the skew value, whether
enable_hashjoin_usestatmcvs was on or not (represented by a 1 or 0), total times
for each of the three joins that made up the query, and total time for
the query itself. The results above pay attention only to the total
query time.
Finally, the query involved:
SELECT * FROM lineitem l LEFT JOIN part p ON (p.p_partkey = l.l_partkey)
LEFT JOIN orders o ON (o.o_orderkey = l.l_orderkey) LEFT JOIN customer c
ON (c.c_custkey = o.o_custkey);
- Josh / eggyknap
Attachments:
boxplot-dst10.pngimage/pngDownload
!!!"""###$$$%%%&&&'''((()))***+++,,,///000111222333444555666777888999:::;;;<<<>>>???@@@AAABBBCCCDDDEEEFFFGGGHHHJJJKKKLLLMMMNNNOOOPPPQQQRRRSSSTTTUUUVVVWWWXXXZZZ[[[\\\]]]^^^___```bbbdddeeefffggghhhiiikkklllmmmnnnooopppqqqrrrsssuuuvvvwwwxxxyyyzzz{{{|||}}}~~~������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������"