bad estimation together with large work_mem generates terrible slow hash joins

tv@fuzzy.cz

over 11 years ago

In reply to: Pavel Stehule (#1)

1 attachment(s)

Re: bad estimation together with large work_mem generates terrible slow hash joins

Hi,

Dne 2014-06-26 14:10, Pavel Stehule napsal:

Hello all,

today I had to work with one slow query - when I checked different
scenarios I found a dependency on work_mem size - but it is atypical -
bigger work_mem increased query execution 31 minutes (600MB work mem)
and 1 minute (1MB).

The problem described in Pavel's emails (illustrated by the awful
explain plans) is in this part:

-> Hash (cost=881801.58..881801.58 rows=61735 width=8) (actual
time=9076.153..9076.153 rows=3310768 loops=1)

That is, the estimated number of rows is ~60k, but in reality it's
~3.3M.
This then leads to a hash table with small number of buckets (8192)
containing
large number of tuples (~400 in this case) in a linked list. Which
significantly
slows down the lookups during the hash join.

This issue is actually quite common - all it takes is a significant
underestimation of the hashed relation, either because it's a complex
join
(thus inherently difficult to estimate) or because the WHERE conditions
are
not independent (see the example below).

The attached patch (early WIP, after ~30 minutes of hacking) addresses
this by
increasing the number of bucket whenever the average number of tuples
per item
exceeds 1.5x NTUP_PER_BUCKET.

======= Example ========

create table table_a (id int, fk_id int);
create table table_b (fk_id int, val_a int, val_b int);

insert into table_a select i, i from generate_series(1,10000000) s(i);
insert into table_b select i, mod(i,1000), mod(i,1000) from
generate_series(1,10000000) s(i);

analyze table_a;
analyze table_b;

explain analyze select count(*) from table_a join table_b on (table_a.id
= table_b.fk_id) where val_a < 10 and val_b < 10;

without the patch:

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=385834.56..385834.57 rows=1 width=0) (actual
time=49543.263..49543.264 rows=1 loops=1)
-> Hash Join (cost=204069.89..385831.16 rows=1359 width=0) (actual
time=923.751..49531.554 rows=100000 loops=1)
Hash Cond: (table_a.id = table_b.fk_id)
-> Seq Scan on table_a (cost=0.00..144247.77 rows=9999977
width=4) (actual time=0.104..967.090 rows=10000000 loops=1)
-> Hash (cost=204052.90..204052.90 rows=1359 width=4) (actual
time=923.615..923.615 rows=100000 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 3516kB
-> Seq Scan on table_b (cost=0.00..204052.90 rows=1359
width=4) (actual time=0.086..910.656 rows=100000 loops=1)
Filter: ((val_a < 10) AND (val_b < 10))
Rows Removed by Filter: 9900000
Planning time: 0.271 ms
Execution time: 49545.053 ms
(11 rows)

with the patch:

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=385834.56..385834.57 rows=1 width=0) (actual
time=9780.346..9780.347 rows=1 loops=1)
-> Hash Join (cost=204069.89..385831.16 rows=1359 width=0) (actual
time=939.297..9772.256 rows=100000 loops=1)
Hash Cond: (table_a.id = table_b.fk_id)
-> Seq Scan on table_a (cost=0.00..144247.77 rows=9999977
width=4) (actual time=0.103..962.446 rows=10000000 loops=1)
-> Hash (cost=204052.90..204052.90 rows=1359 width=4) (actual
time=939.172..939.172 rows=100000 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 3516kB
-> Seq Scan on table_b (cost=0.00..204052.90 rows=1359
width=4) (actual time=0.064..909.191 rows=100000 loops=1)
Filter: ((val_a < 10) AND (val_b < 10))
Rows Removed by Filter: 9900000
Planning time: 0.276 ms
Execution time: 9782.295 ms
(11 rows)

Time: 9784.392 ms

So the duration improved significantly - from ~52 seconds to ~10
seconds.

The patch is not perfect, it needs a bit more love, as illustrated by
the FIXME/TODO items. Feel free to comment.

regards
Tomas

Attachments:

hashjoin-nbuckets-growth-v1.patchtext/x-diff; name=hashjoin-nbuckets-growth-v1.patchDownload

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..d71b8b9 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -39,6 +39,7 @@
 
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -682,6 +683,77 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	int			oldnbuckets = hashtable->nbuckets;
+	int			i;
+    HashJoinTuple * oldbuckets = hashtable->buckets;
+
+    /* TODO Maybe it'd be better to resize the buckets in place (should be possible,
+     * but when I tried it I always ended up with a strange infinite loop). */
+
+	MemoryContext oldcxt;
+
+	/* FIXME do nothing if we've decided to shut off bucket count growth */
+
+    /* FIXME safety checks here */
+
+    /* update the hashtable info, so that we can compute buckets etc. */
+    hashtable->log2_nbuckets += 1; /* we've multiplied by 2 */
+	hashtable->nbuckets *= 2;
+
+	Assert(hashtable->nbuckets > 1);
+    Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+    /* grow the bucket list */
+    oldcxt = MemoryContextSwitchTo(hashtable->batchCxt);
+    
+	hashtable->buckets = (HashJoinTuple *) palloc0(hashtable->nbuckets * sizeof(HashJoinTuple));
+    
+    MemoryContextSwitchTo(oldcxt);
+
+    /* walk through the old buckets, check if tuples are in the right bucket */
+	for (i = 0; i < oldnbuckets; i++)
+	{
+		HashJoinTuple tuple;
+
+		tuple = oldbuckets[i];
+
+		while (tuple != NULL)
+		{
+			/* save link in case we delete */
+			HashJoinTuple nexttuple = tuple->next;
+			int			bucketno;
+			int			batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+									  &bucketno, &batchno);
+
+            /* move it to the correct bucket */
+            tuple->next = hashtable->buckets[bucketno];
+            hashtable->buckets[bucketno] = tuple;
+
+            /* process the next tuple */
+			tuple = nexttuple;
+
+		}
+
+	}
+    
+    pfree(oldbuckets);
+
+    /* TODO disable growth if nbuckets * NTUP_PER_BUCKET reaches work_mem (or something
+     * like that) */
+
+}
+
+
+/*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
  *		it may just go to a temp file for later batches
@@ -740,6 +812,17 @@ ExecHashTableInsert(HashJoinTable hashtable,
 			hashtable->spacePeak = hashtable->spaceUsed;
 		if (hashtable->spaceUsed > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
+            
+        /* check average number of tuples per bucket, increase (2*nbuckets) if needed */
+        if (hashtable->spaceUsed / hashTupleSize / hashtable->nbuckets > 1.5 * NTUP_PER_BUCKET) {
+
+#ifdef HJDEBUG
+	printf("Increasing nbucket to %d because average per bucket = %d\n",
+		   nbuckets,  hashtable->spaceUsed / hashTupleSize / hashtable->nbuckets);
+#endif
+            ExecHashIncreaseNumBuckets(hashtable);
+        }
+
 	}
 	else
 	{

tv@fuzzy.cz

over 11 years ago

In reply to: Tomas Vondra (#2)

1 attachment(s)

Re: bad estimation together with large work_mem generates terrible slow hash joins

Attached is v2 of the patch, with some cleanups / minor improvements:

* improved comments, whitespace fixed / TODOs etc.

* tracking inital # of buckets (similar to initial # of batches)

* adding info about buckets to EXPLAIN ANALYZE, similar to batches - I
didn't want to make it overly complex, so the info about initial
bucket/batch count is added if at least one them is modified

* modified threshold triggering the growth, to get NTUP_PER_BUCKET on
average (see the NTUP_GROW_THRESHOLD comment nodeHash.c)

* there's a single FIXME, related to counting tuples in the

One thing that's important to note is the difference between # of
batches and # of buckets. While one # of batches is "global" the # of
buckets is 'within a batch'. So theoretically each batch can use
different number of buckets.

However the value is reused between batches, so it only grows. That
means this is possible:

initial: 1024 buckets (before 1st batch)
batch 1: 1024 buckets
batch 2: 1024 buckets
batch 3: 4096 buckets
batch 4: 8192 buckets

while this is not:

initial: 1024 buckets (before 1st batch)
batch 1: 1024 buckets
batch 2: 4096 buckets
batch 3: 1024 buckets
batch 4: 8192 buckets

However in practice I expect the first batch will to do all the work,
and the following batches will just reuse the same number of buckets.
This of course assumes the batches have similar tuple sizes etc.

So the first batch will do all the reshuffling the tables, and the
following batches will reuse the 'right' number of buckets from the start.

regards
Tomas

Attachments:

hashjoin-nbuckets-growth-v2.patchtext/x-diff; name=hashjoin-nbuckets-growth-v2.patchDownload

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d9663c..db3a953 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..879b336 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -39,6 +39,7 @@
 
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -271,6 +272,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
@@ -386,6 +388,22 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 /* Target bucket loading (tuples per bucket) */
 #define NTUP_PER_BUCKET			10
 
+/* Multiple of NTUP_PER_BUCKET triggering the increase of nbuckets.
+ * 
+ * Once we reach the threshold we double the number of buckets, and we
+ * want to get 1.0 on average (to get NTUP_PER_BUCKET on average). That
+ * means these two equations should hold
+ * 
+ *   b = 2a         (growth)
+ *   (a + b)/2 = 1  (oscillate around NTUP_PER_BUCKET)
+ * 
+ * which means b=1.3333 (a = b/2). If we wanted higher threshold, we
+ * could grow the nbuckets to (4*nbuckets), thus using (b=4a) for
+ * growth, leading to (b=1.6). Or (b=8a) giving 1.7777 etc.
+ * 
+ * Let's start with doubling the bucket count, i.e. 1.333. */
+#define NTUP_GROW_THRESHOLD     1.333
+
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						int *numbuckets,
@@ -682,6 +700,92 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	int			i;
+	int         ntuples = 0;
+	int			oldnbuckets = hashtable->nbuckets;
+	HashJoinTuple  *oldbuckets = hashtable->buckets;
+	MemoryContext   oldcxt;
+
+	/* XXX Not sure if we should update the info about used space here.
+	 * The code seems to ignore the space used for 'buckets' and we're not
+	 * allocating more space for tuples (just shuffling them to the new
+	 * buckets). And the amount of memory used for buckets is quite small
+	 * (just an array of pointers, thus ~8kB per 1k buckets on 64-bit). */
+
+	/* XXX Should we disable growth if (nbuckets * NTUP_PER_BUCKET)
+	 * reaches work_mem (or something like that)? We shouldn't really
+	 * get into such position (should be handled by increasing the
+	 * number of batches, which is called right before this). */
+
+	/* XXX Maybe adding info into hashjoin explain output (e.g. initial
+	 * nbuckets, time spent growing the table) would be appropriate. */
+
+	/* update the hashtable info, so that we can compute buckets etc. */
+	hashtable->log2_nbuckets += 1;
+	hashtable->nbuckets *= 2;
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/* TODO Maybe it'd be better to resize the buckets in place (should be possible,
+	 * but when I tried it I always ended up with a strange infinite loop). */
+
+	/* allocate a new bucket list (use the batch context as before) */
+	oldcxt = MemoryContextSwitchTo(hashtable->batchCxt);
+
+	hashtable->buckets = (HashJoinTuple *) palloc0(hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/* walk through the old buckets, move the buckets into the new table */
+	for (i = 0; i < oldnbuckets; i++)
+	{
+
+		HashJoinTuple tuple = oldbuckets[i];
+
+		while (tuple != NULL)
+		{
+			/* save link in case we delete */
+			HashJoinTuple nexttuple = tuple->next;
+			int			bucketno;
+			int			batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* move it to the correct bucket */
+			tuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = tuple;
+
+			/* process the next tuple */
+			tuple = nexttuple;
+
+			ntuples++;
+		}
+	}
+
+	pfree(oldbuckets);
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, (float)ntuples / hashtable->nbuckets);
+#endif
+
+}
+
+
+/*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
  *		it may just go to a temp file for later batches
@@ -740,6 +844,22 @@ ExecHashTableInsert(HashJoinTable hashtable,
 			hashtable->spacePeak = hashtable->spaceUsed;
 		if (hashtable->spaceUsed > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
+
+		/* check average number of tuples per bucket, increase (2*nbuckets) if needed */
+		/* FIXME This attempts to compute number of tuples in the hashtable using hashTupleSize,
+		 * 		 i.e. the size of the current tuple, which is not really accurate. Need to track
+		 * 		 the count properly.
+		 */
+		if (hashtable->spaceUsed / hashTupleSize / hashtable->nbuckets > NTUP_GROW_THRESHOLD * NTUP_PER_BUCKET) {
+
+#ifdef HJDEBUG
+			printf("Increasing nbucket to %d because average per bucket = %d\n",
+				   nbuckets,  hashtable->spaceUsed / hashTupleSize / hashtable->nbuckets);
+#endif
+
+			ExecHashIncreaseNumBuckets(hashtable);
+		}
+
 	}
 	else
 	{
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 3beae40..03e91a6 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -106,6 +106,7 @@ typedef struct HashSkewBucket
 typedef struct HashJoinTableData
 {
 	int			nbuckets;		/* # buckets in the in-memory hash table */
+	int			nbuckets_original;	/* # buckets when starting the first hash */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */

tv@fuzzy.cz

over 11 years ago

In reply to: Tomas Vondra (#3)

1 attachment(s)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 26.6.2014 20:43, Tomas Vondra wrote:

Attached is v2 of the patch, with some cleanups / minor improvements:

* there's a single FIXME, related to counting tuples in the

Meh, I couldn't resist resolving this FIXME, so attached is v3 of the
patch. This just adds a proper 'batch tuples' counter to the hash table.

All comments, measurements on different queries etc. welcome. We'll
certainly do a lot of testing, because this was a big issue for us.

regards
Tomas

Attachments:

hashjoin-nbuckets-growth-v3.patchtext/x-diff; name=hashjoin-nbuckets-growth-v3.patchDownload

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d9663c..db3a953 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..b32fb6b 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -39,6 +39,7 @@
 
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -271,6 +272,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
@@ -285,6 +287,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->batchTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -386,6 +389,22 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 /* Target bucket loading (tuples per bucket) */
 #define NTUP_PER_BUCKET			10
 
+/* Multiple of NTUP_PER_BUCKET triggering the increase of nbuckets.
+ * 
+ * Once we reach the threshold we double the number of buckets, and we
+ * want to get 1.0 on average (to get NTUP_PER_BUCKET on average). That
+ * means these two equations should hold
+ * 
+ *   b = 2a         (growth)
+ *   (a + b)/2 = 1  (oscillate around NTUP_PER_BUCKET)
+ * 
+ * which means b=1.3333 (a = b/2). If we wanted higher threshold, we
+ * could grow the nbuckets to (4*nbuckets), thus using (b=4a) for
+ * growth, leading to (b=1.6). Or (b=8a) giving 1.7777 etc.
+ * 
+ * Let's start with doubling the bucket count, i.e. 1.333. */
+#define NTUP_GROW_THRESHOLD     1.333
+
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						int *numbuckets,
@@ -651,6 +670,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 				/* prevtuple doesn't change */
 				hashtable->spaceUsed -=
 					HJTUPLE_OVERHEAD + HJTUPLE_MINTUPLE(tuple)->t_len;
+				hashtable->batchTuples -= 1;
 				pfree(tuple);
 				nfreed++;
 			}
@@ -682,6 +702,92 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	int			i;
+	int         ntuples = 0;
+	int			oldnbuckets = hashtable->nbuckets;
+	HashJoinTuple  *oldbuckets = hashtable->buckets;
+	MemoryContext   oldcxt;
+
+	/* XXX Not sure if we should update the info about used space here.
+	 * The code seems to ignore the space used for 'buckets' and we're not
+	 * allocating more space for tuples (just shuffling them to the new
+	 * buckets). And the amount of memory used for buckets is quite small
+	 * (just an array of pointers, thus ~8kB per 1k buckets on 64-bit). */
+
+	/* XXX Should we disable growth if (nbuckets * NTUP_PER_BUCKET)
+	 * reaches work_mem (or something like that)? We shouldn't really
+	 * get into such position (should be handled by increasing the
+	 * number of batches, which is called right before this). */
+
+	/* XXX Maybe adding info into hashjoin explain output (e.g. initial
+	 * nbuckets, time spent growing the table) would be appropriate. */
+
+	/* update the hashtable info, so that we can compute buckets etc. */
+	hashtable->log2_nbuckets += 1;
+	hashtable->nbuckets *= 2;
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/* TODO Maybe it'd be better to resize the buckets in place (should be possible,
+	 * but when I tried it I always ended up with a strange infinite loop). */
+
+	/* allocate a new bucket list (use the batch context as before) */
+	oldcxt = MemoryContextSwitchTo(hashtable->batchCxt);
+
+	hashtable->buckets = (HashJoinTuple *) palloc0(hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/* walk through the old buckets, move the buckets into the new table */
+	for (i = 0; i < oldnbuckets; i++)
+	{
+
+		HashJoinTuple tuple = oldbuckets[i];
+
+		while (tuple != NULL)
+		{
+			/* save link in case we delete */
+			HashJoinTuple nexttuple = tuple->next;
+			int			bucketno;
+			int			batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* move it to the correct bucket */
+			tuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = tuple;
+
+			/* process the next tuple */
+			tuple = nexttuple;
+
+			ntuples++;
+		}
+	}
+
+	pfree(oldbuckets);
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, (float)ntuples / hashtable->nbuckets);
+#endif
+
+}
+
+
+/*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
  *		it may just go to a temp file for later batches
@@ -734,12 +840,28 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashTuple->next = hashtable->buckets[bucketno];
 		hashtable->buckets[bucketno] = hashTuple;
 
+		/* increase the number of tuples in the batch */
+		hashtable->batchTuples += 1;
+
 		/* Account for space used, and back off if we've used too much */
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
 		if (hashtable->spaceUsed > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
+
+		/* If average number of tuples per bucket, double the number of buckets. */
+		if (((float)hashtable->batchTuples / hashtable->nbuckets) > NTUP_GROW_THRESHOLD * NTUP_PER_BUCKET) {
+
+#ifdef HJDEBUG
+			printf("Increasing nbucket to %d because average per bucket = %.1f\n",
+				   nbuckets,  (float)hashtable->batchTuples / hashtable->nbuckets);
+#endif
+
+			ExecHashIncreaseNumBuckets(hashtable);
+
+		}
+
 	}
 	else
 	{
@@ -1066,6 +1188,7 @@ ExecHashTableReset(HashJoinTable hashtable)
 		palloc0(nbuckets * sizeof(HashJoinTuple));
 
 	hashtable->spaceUsed = 0;
+	hashtable->batchTuples = 0;
 
 	MemoryContextSwitchTo(oldcxt);
 }
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 3beae40..2858f82 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -106,6 +106,7 @@ typedef struct HashSkewBucket
 typedef struct HashJoinTableData
 {
 	int			nbuckets;		/* # buckets in the in-memory hash table */
+	int			nbuckets_original;	/* # buckets when starting the first hash */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
@@ -129,6 +130,7 @@ typedef struct HashJoinTableData
 	bool		growEnabled;	/* flag to shut off nbatch increases */
 
 	double		totalTuples;	/* # tuples obtained from inner plan */
+	int64		batchTuples;	/* # tuples in the current batch */
 
 	/*
 	 * These arrays are allocated for the life of the hash join, but only if

tv@fuzzy.cz

over 11 years ago

In reply to: Tomas Vondra (#4)

3 attachment(s)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 26.6.2014 23:48, Tomas Vondra wrote:

On 26.6.2014 20:43, Tomas Vondra wrote:

Attached is v2 of the patch, with some cleanups / minor improvements:

* there's a single FIXME, related to counting tuples in the

Meh, I couldn't resist resolving this FIXME, so attached is v3 of the
patch. This just adds a proper 'batch tuples' counter to the hash table.

All comments, measurements on different queries etc. welcome. We'll
certainly do a lot of testing, because this was a big issue for us.

Attached is v4 of the patch, with a few minor improvements. The only
thing worth mentioning is overflow protection, similar to what's done in
the ExecChooseHashTableSize() function. Otherwise it's mostly about
improving comments.

Also attached is a v4 with GUC, making it easier to compare effect of
the patch, by simply setting "enable_hashjoin_bucket" to "off" (original
behaviour) or "on" (new behaviour).

And finally there's an SQL script demonstrating the effect of the patch
with various work_mem settings. For example what I see on my desktop is
this (averages from 3 runs):

===== SMALL WORK MEM (2MB) =====
no dynamic buckets dynamic buckets
query A 5945 ms 5969 ms
query B 6080 ms 5943 ms
query C 6531 ms 6822 ms
query D 6962 ms 6618 ms

===== MEDIUM WORK MEM (16MB) =====
no dynamic buckets dynamic buckets
query A 7955 ms 7944 ms
query B 9970 ms 7569 ms
query C 8643 ms 8560 ms
query D 33908 ms 7700 ms

===== LARGE WORK MEM (64MB) =====
no dynamic buckets dynamic buckets
query A 10235 ms 10233 ms
query B 32229 ms 9318 ms
query C 14500 ms 10554 ms
query D 213344 ms 9145 ms

Where "A" is "exactly estimated" and the other queries suffer by various
underestimates. My observations from this are:

(1) For small work_mem values it does not really matter, thanks to the
caching effects (the whole hash table fits into L2 CPU cache).

(2) For medium work_mem values (not really huge, but exceeding CPU
caches), the differences are negligible, except for the last query
with most severe underestimate. In that case the new behaviour is
much faster.

(3) For large work_mem values, the speedup is pretty obvious and
dependent on the underestimate.

The question is why to choose large work_mem values when the smaller
values actually perform better. Well, the example tables are not
perfectly representative. In case the outer table is much larger and
does not fit into RAM that easily (which is the case of large fact
tables or joins), the rescans (because of more batches) are more
expensive and outweight the caching benefits.

Also, the work_mem is shared with other nodes, e.g. aggregates, and
decreasing it because of hash joins would hurt them.

regards
Tomas

Attachments:

hashjoin-nbuckets-growth-v4.patchtext/x-diff; name=hashjoin-nbuckets-growth-v4.patchDownload

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d9663c..db3a953 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..96fdd68 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -39,6 +39,7 @@
 
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -271,6 +272,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
@@ -285,6 +287,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->batchTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -386,6 +389,23 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 /* Target bucket loading (tuples per bucket) */
 #define NTUP_PER_BUCKET			10
 
+/* Multiple of NTUP_PER_BUCKET triggering the increase of nbuckets.
+ * 
+ * Once we reach the threshold we double the number of buckets, and we
+ * want to get 1.0 on average (to get NTUP_PER_BUCKET on average). That
+ * means these two equations should hold
+ * 
+ *   b = 2a         (growth)
+ *   (a + b)/2 = 1  (oscillate around NTUP_PER_BUCKET)
+ * 
+ * which means b=1.3333 (a = b/2). If we wanted higher threshold, we
+ * could grow the nbuckets to (4*nbuckets), thus using (b=4a) for
+ * growth, leading to (b=1.6). Or (b=8a) giving 1.7777 etc.
+ * 
+ * Let's start with doubling the bucket count, i.e. 1.333. */
+#define NTUP_GROW_COEFFICIENT	1.333
+#define NTUP_GROW_THRESHOLD		(NTUP_PER_BUCKET * NTUP_GROW_COEFFICIENT)
+
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						int *numbuckets,
@@ -651,6 +671,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 				/* prevtuple doesn't change */
 				hashtable->spaceUsed -=
 					HJTUPLE_OVERHEAD + HJTUPLE_MINTUPLE(tuple)->t_len;
+				hashtable->batchTuples -= 1;
 				pfree(tuple);
 				nfreed++;
 			}
@@ -682,6 +703,111 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	int			i;
+	int         ntuples = 0;
+	int			oldnbuckets = hashtable->nbuckets;
+	HashJoinTuple  *oldbuckets = hashtable->buckets;
+	MemoryContext   oldcxt;
+
+	/*
+	 * Safety check to avoid overflow. This should only happen with very large
+	 * work_mem values, because (INT_MAX/2) means ~8GB only for the buckets.
+	 * With tuples, the hash table would require tens of GBs of work_mem.
+	 * 
+	 * XXX Technically there's also a limit for buckets fitting into work_mem
+	 * (with NTUP_PER_BUCKET tuples), but this can't be really exceeded
+	 * because when filling work_mem, another hash will be added (thus the
+	 * number of tuples will drop and more buckets won't be needed anymore).
+	 * 
+	 * That is, something like this will be enforced implicitly:
+	 * 
+	 *    work_mem * 1024L >= (nbuckets * tupsize * NTUP_GROW_THRESHOLD)
+	 * 
+	 * So it's enough to check only the overflow here.
+	 */
+	if (oldnbuckets >= (INT_MAX/2))
+		return;
+
+	/* XXX Not sure if we should update the info about used space here.
+	 * The code seems to ignore the space used for 'buckets' and we're not
+	 * allocating more space for tuples (just shuffling them to the new
+	 * buckets). And the amount of memory used for buckets is quite small
+	 * (just an array of pointers, thus ~8kB per 1k buckets on 64-bit). */
+
+	/* XXX Should we disable growth if (nbuckets * NTUP_PER_BUCKET)
+	 * reaches work_mem (or something like that)? We shouldn't really
+	 * get into such position (should be handled by increasing the
+	 * number of batches, which is called right before this). */
+
+	/* XXX Maybe adding info into hashjoin explain output (e.g. initial
+	 * nbuckets, time spent growing the table) would be appropriate. */
+
+	/* update the hashtable info, so that we can compute buckets etc. */
+	hashtable->log2_nbuckets += 1;
+	hashtable->nbuckets *= 2;
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/* TODO Maybe it'd be better to resize the buckets in place (should be possible,
+	 * but when I tried it I always ended up with a strange infinite loop). */
+
+	/* allocate a new bucket list (use the batch context as before) */
+	oldcxt = MemoryContextSwitchTo(hashtable->batchCxt);
+
+	hashtable->buckets = (HashJoinTuple *) palloc0(hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/* walk through the old buckets, move the buckets into the new table */
+	for (i = 0; i < oldnbuckets; i++)
+	{
+
+		HashJoinTuple tuple = oldbuckets[i];
+
+		while (tuple != NULL)
+		{
+			/* save link in case we delete */
+			HashJoinTuple nexttuple = tuple->next;
+			int			bucketno;
+			int			batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* move it to the correct bucket */
+			tuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = tuple;
+
+			/* process the next tuple */
+			tuple = nexttuple;
+
+			ntuples++;
+		}
+	}
+
+	pfree(oldbuckets);
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, (float)ntuples / hashtable->nbuckets);
+#endif
+
+}
+
+
+/*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
  *		it may just go to a temp file for later batches
@@ -734,12 +860,29 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashTuple->next = hashtable->buckets[bucketno];
 		hashtable->buckets[bucketno] = hashTuple;
 
+		/* increase the number of tuples in the batch */
+		hashtable->batchTuples += 1;
+
 		/* Account for space used, and back off if we've used too much */
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
 		if (hashtable->spaceUsed > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
+
+		/* If average number of tuples per bucket, double the number of buckets (unless
+		 * nbuckets growth already disabled). */
+		if (hashtable->batchTuples > (hashtable->nbuckets * NTUP_GROW_THRESHOLD)) {
+
+#ifdef HJDEBUG
+			printf("Increasing nbucket to %d because average per bucket = %.1f\n",
+				   nbuckets,  (float)hashtable->batchTuples / hashtable->nbuckets);
+#endif
+
+			ExecHashIncreaseNumBuckets(hashtable);
+
+		}
+
 	}
 	else
 	{
@@ -1066,6 +1209,7 @@ ExecHashTableReset(HashJoinTable hashtable)
 		palloc0(nbuckets * sizeof(HashJoinTuple));
 
 	hashtable->spaceUsed = 0;
+	hashtable->batchTuples = 0;
 
 	MemoryContextSwitchTo(oldcxt);
 }
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 3beae40..2858f82 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -106,6 +106,7 @@ typedef struct HashSkewBucket
 typedef struct HashJoinTableData
 {
 	int			nbuckets;		/* # buckets in the in-memory hash table */
+	int			nbuckets_original;	/* # buckets when starting the first hash */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
@@ -129,6 +130,7 @@ typedef struct HashJoinTableData
 	bool		growEnabled;	/* flag to shut off nbatch increases */
 
 	double		totalTuples;	/* # tuples obtained from inner plan */
+	int64		batchTuples;	/* # tuples in the current batch */
 
 	/*
 	 * These arrays are allocated for the life of the hash join, but only if

hashjoin-nbuckets-growth-v4-with-guc.patchtext/x-diff; name=hashjoin-nbuckets-growth-v4-with-guc.patchDownload

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d9663c..db3a953 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..4cf6454 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,10 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
+bool enable_hashjoin_bucket = true;
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -271,6 +273,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
@@ -285,6 +288,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->batchTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -386,6 +390,23 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 /* Target bucket loading (tuples per bucket) */
 #define NTUP_PER_BUCKET			10
 
+/* Multiple of NTUP_PER_BUCKET triggering the increase of nbuckets.
+ * 
+ * Once we reach the threshold we double the number of buckets, and we
+ * want to get 1.0 on average (to get NTUP_PER_BUCKET on average). That
+ * means these two equations should hold
+ * 
+ *   b = 2a         (growth)
+ *   (a + b)/2 = 1  (oscillate around NTUP_PER_BUCKET)
+ * 
+ * which means b=1.3333 (a = b/2). If we wanted higher threshold, we
+ * could grow the nbuckets to (4*nbuckets), thus using (b=4a) for
+ * growth, leading to (b=1.6). Or (b=8a) giving 1.7777 etc.
+ * 
+ * Let's start with doubling the bucket count, i.e. 1.333. */
+#define NTUP_GROW_COEFFICIENT	1.333
+#define NTUP_GROW_THRESHOLD		(NTUP_PER_BUCKET * NTUP_GROW_COEFFICIENT)
+
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						int *numbuckets,
@@ -651,6 +672,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 				/* prevtuple doesn't change */
 				hashtable->spaceUsed -=
 					HJTUPLE_OVERHEAD + HJTUPLE_MINTUPLE(tuple)->t_len;
+				hashtable->batchTuples -= 1;
 				pfree(tuple);
 				nfreed++;
 			}
@@ -682,6 +704,111 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	int			i;
+	int         ntuples = 0;
+	int			oldnbuckets = hashtable->nbuckets;
+	HashJoinTuple  *oldbuckets = hashtable->buckets;
+	MemoryContext   oldcxt;
+
+	/*
+	 * Safety check to avoid overflow. This should only happen with very large
+	 * work_mem values, because (INT_MAX/2) means ~8GB only for the buckets.
+	 * With tuples, the hash table would require tens of GBs of work_mem.
+	 * 
+	 * XXX Technically there's also a limit for buckets fitting into work_mem
+	 * (with NTUP_PER_BUCKET tuples), but this can't be really exceeded
+	 * because when filling work_mem, another hash will be added (thus the
+	 * number of tuples will drop and more buckets won't be needed anymore).
+	 * 
+	 * That is, something like this will be enforced implicitly:
+	 * 
+	 *    work_mem * 1024L >= (nbuckets * tupsize * NTUP_GROW_THRESHOLD)
+	 * 
+	 * So it's enough to check only the overflow here.
+	 */
+	if (oldnbuckets >= (INT_MAX/2))
+		return;
+
+	/* XXX Not sure if we should update the info about used space here.
+	 * The code seems to ignore the space used for 'buckets' and we're not
+	 * allocating more space for tuples (just shuffling them to the new
+	 * buckets). And the amount of memory used for buckets is quite small
+	 * (just an array of pointers, thus ~8kB per 1k buckets on 64-bit). */
+
+	/* XXX Should we disable growth if (nbuckets * NTUP_PER_BUCKET)
+	 * reaches work_mem (or something like that)? We shouldn't really
+	 * get into such position (should be handled by increasing the
+	 * number of batches, which is called right before this). */
+
+	/* XXX Maybe adding info into hashjoin explain output (e.g. initial
+	 * nbuckets, time spent growing the table) would be appropriate. */
+
+	/* update the hashtable info, so that we can compute buckets etc. */
+	hashtable->log2_nbuckets += 1;
+	hashtable->nbuckets *= 2;
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/* TODO Maybe it'd be better to resize the buckets in place (should be possible,
+	 * but when I tried it I always ended up with a strange infinite loop). */
+
+	/* allocate a new bucket list (use the batch context as before) */
+	oldcxt = MemoryContextSwitchTo(hashtable->batchCxt);
+
+	hashtable->buckets = (HashJoinTuple *) palloc0(hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/* walk through the old buckets, move the buckets into the new table */
+	for (i = 0; i < oldnbuckets; i++)
+	{
+
+		HashJoinTuple tuple = oldbuckets[i];
+
+		while (tuple != NULL)
+		{
+			/* save link in case we delete */
+			HashJoinTuple nexttuple = tuple->next;
+			int			bucketno;
+			int			batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* move it to the correct bucket */
+			tuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = tuple;
+
+			/* process the next tuple */
+			tuple = nexttuple;
+
+			ntuples++;
+		}
+	}
+
+	pfree(oldbuckets);
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, (float)ntuples / hashtable->nbuckets);
+#endif
+
+}
+
+
+/*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
  *		it may just go to a temp file for later batches
@@ -734,12 +861,30 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashTuple->next = hashtable->buckets[bucketno];
 		hashtable->buckets[bucketno] = hashTuple;
 
+		/* increase the number of tuples in the batch */
+		hashtable->batchTuples += 1;
+
 		/* Account for space used, and back off if we've used too much */
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
 		if (hashtable->spaceUsed > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
+
+		/* If average number of tuples per bucket, double the number of buckets (unless
+		 * nbuckets growth already disabled). */
+		if (enable_hashjoin_bucket)
+		if (hashtable->batchTuples > (hashtable->nbuckets * NTUP_GROW_THRESHOLD)) {
+
+#ifdef HJDEBUG
+			printf("Increasing nbucket to %d because average per bucket = %.1f\n",
+				   nbuckets,  (float)hashtable->batchTuples / hashtable->nbuckets);
+#endif
+
+			ExecHashIncreaseNumBuckets(hashtable);
+
+		}
+
 	}
 	else
 	{
@@ -1066,6 +1211,7 @@ ExecHashTableReset(HashJoinTable hashtable)
 		palloc0(nbuckets * sizeof(HashJoinTuple));
 
 	hashtable->spaceUsed = 0;
+	hashtable->batchTuples = 0;
 
 	MemoryContextSwitchTo(oldcxt);
 }
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3a31a75..c92cc26 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -788,6 +788,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 	{
+		{"enable_hashjoin_bucket", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables dynamic increase of hash buckets."),
+			NULL
+		},
+		&enable_hashjoin_bucket,
+		true,
+		NULL, NULL, NULL
+	},
+	{
 		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
 			gettext_noop("Enables genetic query optimization."),
 			gettext_noop("This algorithm attempts to do planning without "
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 3beae40..2858f82 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -106,6 +106,7 @@ typedef struct HashSkewBucket
 typedef struct HashJoinTableData
 {
 	int			nbuckets;		/* # buckets in the in-memory hash table */
+	int			nbuckets_original;	/* # buckets when starting the first hash */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
@@ -129,6 +130,7 @@ typedef struct HashJoinTableData
 	bool		growEnabled;	/* flag to shut off nbatch increases */
 
 	double		totalTuples;	/* # tuples obtained from inner plan */
+	int64		batchTuples;	/* # tuples in the current batch */
 
 	/*
 	 * These arrays are allocated for the life of the hash join, but only if
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 75be5bd..15604cb 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -50,4 +50,6 @@ extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						int *num_skew_mcvs);
 extern int	ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue);
 
+extern bool enable_hashjoin_bucket;
+
 #endif   /* NODEHASH_H */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 75e2afb..60b8da8 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -62,6 +62,7 @@ extern bool enable_material;
 extern bool enable_mergejoin;
 extern bool enable_hashjoin;
 extern int	constraint_exclusion;
+extern bool enable_hashjoin_bucket;
 
 extern double clamp_row_est(double nrows);
 extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,

hashjoin.sqlapplication/sql; name=hashjoin.sqlDownload

tv@fuzzy.cz

over 11 years ago

In reply to: Tomas Vondra (#5)

1 attachment(s)

Re: bad estimation together with large work_mem generates terrible slow hash joins

Hi,

attached is v5 of the patch. The main change is that scaling the number
of buckets is done only once, after the initial hash table is build. The
main advantage of this is lower price. This also allowed some cleanup of
unecessary code.

However, this new patch causes warning like this:

WARNING: temporary file leak: File 231 still referenced

I've been eyeballing the code for a while now, but I still fail to see
where this comes from :-( Any ideas?

regards
Tomas

Attachments:

hashjoin-nbuckets-growth-v5-with-guc.patchtext/x-diff; name=hashjoin-nbuckets-growth-v5-with-guc.patchDownload

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d9663c..db3a953 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..fbd721d 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,10 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
+bool enable_hashjoin_bucket = true;
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -48,6 +50,33 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 
+/*
+ * Compute appropriate size for hashtable given the estimated size of the
+ * relation to be hashed (number of rows and average row width).
+ *
+ * This is exported so that the planner's costsize.c can use it.
+ */
+
+/* Target bucket loading (tuples per bucket) */
+#define NTUP_PER_BUCKET			10
+
+/* Multiple of NTUP_PER_BUCKET triggering the increase of nbuckets.
+ * 
+ * Once we reach the threshold we double the number of buckets, and we
+ * want to get 1.0 on average (to get NTUP_PER_BUCKET on average). That
+ * means these two equations should hold
+ * 
+ *   b = 2a         (growth)
+ *   (a + b)/2 = 1  (oscillate around NTUP_PER_BUCKET)
+ * 
+ * which means b=1.3333 (a = b/2). If we wanted higher threshold, we
+ * could grow the nbuckets to (4*nbuckets), thus using (b=4a) for
+ * growth, leading to (b=1.6). Or (b=8a) giving 1.7777 etc.
+ * 
+ * Let's start with doubling the bucket count, i.e. 1.333. */
+#define NTUP_GROW_COEFFICIENT	1.333
+#define NTUP_GROW_THRESHOLD		(NTUP_PER_BUCKET * NTUP_GROW_COEFFICIENT)
+
 /* ----------------------------------------------------------------
  *		ExecHash
  *
@@ -126,6 +155,23 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* If average number of tuples per bucket is over the defined threshold,
+	 * increase the number of buckets to get below it. */
+	if (enable_hashjoin_bucket) {
+
+		double batchTuples = (hashtable->totalTuples / hashtable->nbatch);
+
+		if (batchTuples > (hashtable->nbuckets * NTUP_GROW_THRESHOLD)) {
+
+#ifdef HJDEBUG
+			printf("Increasing nbucket to %d (average per bucket = %.1f)\n",
+				   nbuckets,  (batchTuples / hashtable->nbuckets));
+#endif
+			ExecHashIncreaseNumBuckets(hashtable);
+
+		}
+	}
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -271,6 +317,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
@@ -375,17 +422,6 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	return hashtable;
 }
 
-
-/*
- * Compute appropriate size for hashtable given the estimated size of the
- * relation to be hashed (number of rows and average row width).
- *
- * This is exported so that the planner's costsize.c can use it.
- */
-
-/* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET			10
-
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						int *numbuckets,
@@ -682,6 +718,116 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	int			i;
+	int			oldnbuckets = hashtable->nbuckets;
+	HashJoinTuple  *oldbuckets = hashtable->buckets;
+	MemoryContext   oldcxt;
+	double		batchTuples = (hashtable->totalTuples / hashtable->nbatch);
+
+	/*
+	 * Determine the proper number of buckets, i.e. stop once the average
+	 * per bucket gets below the threshold (1.33 * NTUP_PER_BUCKET).
+	 * 
+	 * Also, check for overflow - this can only happen with extremely large
+	 * work_mem values, because (INT_MAX/2) means ~8GB only for the buckets.
+	 * With tuples, the hash table would require tens of GBs of work_mem.
+	 * 
+	 * XXX Technically there's also a limit for buckets fitting into work_mem
+	 * (with NTUP_PER_BUCKET tuples), but this can't be really exceeded
+	 * because when filling work_mem, another batch will be added (thus the
+	 * number of tuples will drop and more buckets won't be needed anymore).
+	 * 
+	 * That is, something like this will be enforced implicitly:
+	 * 
+	 *    work_mem * 1024L >= (nbuckets * tupsize * NTUP_GROW_THRESHOLD)
+	 * 
+	 * So it's enough to check only the overflow here.
+	 */
+
+	/* double the number of buckets until we get below the growth threshold, or
+	 * until we hit the overflow protection */
+	while ((batchTuples > (hashtable->nbuckets * NTUP_GROW_THRESHOLD))
+			&& (hashtable->nbuckets <= (INT_MAX/2))) {
+		hashtable->nbuckets *= 2;
+		hashtable->log2_nbuckets += 1;
+	}
+
+	/* XXX Not sure if we should update the info about used space here.
+	 * The code seems to ignore the space used for 'buckets' and we're not
+	 * allocating more space for tuples (just shuffling them to the new
+	 * buckets). And the amount of memory used for buckets is quite small
+	 * (just an array of pointers, thus ~8kB per 1k buckets on 64-bit). */
+
+	/* XXX Should we disable growth if (nbuckets * NTUP_PER_BUCKET)
+	 * reaches work_mem (or something like that)? We shouldn't really
+	 * get into such position (should be handled by increasing the
+	 * number of batches, which is called right before this). */
+
+	/* XXX Maybe adding info into hashjoin explain output (e.g. initial
+	 * nbuckets, time spent growing the table) would be appropriate. */
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets <= (INT_MAX/2));
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/* TODO Maybe it'd be better to resize the buckets in place (should be possible,
+	 * but when I tried it I always ended up with a strange infinite loop). */
+
+	/* allocate a new bucket list (use the batch context as before) */
+	oldcxt = MemoryContextSwitchTo(hashtable->batchCxt);
+
+	hashtable->buckets = (HashJoinTuple *)
+		palloc0(hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/* walk through the old buckets, move the buckets into the new table */
+	for (i = 0; i < oldnbuckets; i++)
+	{
+
+		HashJoinTuple tuple = oldbuckets[i];
+
+		while (tuple != NULL)
+		{
+			/* save link in case we delete */
+			HashJoinTuple nexttuple = tuple->next;
+			int			bucketno;
+			int			batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* move it to the correct bucket (in the new bucket array) */
+			tuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = tuple;
+
+			/* process the next tuple */
+			tuple = nexttuple;
+		}
+	}
+
+	pfree(oldbuckets);
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, batchTuples / hashtable->nbuckets);
+#endif
+	
+}
+
+
+/*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
  *		it may just go to a temp file for later batches
@@ -740,6 +886,7 @@ ExecHashTableInsert(HashJoinTable hashtable,
 			hashtable->spacePeak = hashtable->spaceUsed;
 		if (hashtable->spaceUsed > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
+
 	}
 	else
 	{
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3a31a75..c92cc26 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -788,6 +788,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 	{
+		{"enable_hashjoin_bucket", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables dynamic increase of hash buckets."),
+			NULL
+		},
+		&enable_hashjoin_bucket,
+		true,
+		NULL, NULL, NULL
+	},
+	{
 		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
 			gettext_noop("Enables genetic query optimization."),
 			gettext_noop("This algorithm attempts to do planning without "
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 3beae40..03e91a6 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -106,6 +106,7 @@ typedef struct HashSkewBucket
 typedef struct HashJoinTableData
 {
 	int			nbuckets;		/* # buckets in the in-memory hash table */
+	int			nbuckets_original;	/* # buckets when starting the first hash */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 75be5bd..15604cb 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -50,4 +50,6 @@ extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						int *num_skew_mcvs);
 extern int	ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue);
 
+extern bool enable_hashjoin_bucket;
+
 #endif   /* NODEHASH_H */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 75e2afb..60b8da8 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -62,6 +62,7 @@ extern bool enable_material;
 extern bool enable_mergejoin;
 extern bool enable_hashjoin;
 extern int	constraint_exclusion;
+extern bool enable_hashjoin_bucket;
 
 extern double clamp_row_est(double nrows);
 extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,

tv@fuzzy.cz

over 11 years ago

In reply to: Tomas Vondra (#6)

1 attachment(s)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 30.6.2014 23:12, Tomas Vondra wrote:

Hi,

attached is v5 of the patch. The main change is that scaling the number
of buckets is done only once, after the initial hash table is build. The
main advantage of this is lower price. This also allowed some cleanup of
unecessary code.

However, this new patch causes warning like this:

WARNING: temporary file leak: File 231 still referenced

I've been eyeballing the code for a while now, but I still fail to see
where this comes from :-( Any ideas?

Meh, the patches are wrong - I haven't realized the tight coupling
between buckets/batches in ExecHashGetBucketAndBatch:

*bucketno = hashvalue & (nbuckets - 1);
*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);

The previous patches worked mostly by pure luck (the nbuckets was set
only in the first batch), but once I moved the code at the end, it
started failing. And by "worked" I mean "didn't throw an error, but
probably returned bogus results with (nbatch>1)".

However, ISTM this nbuckets-nbatch coupling is not really necessary,
it's only constructed this way to assign independent batch/bucket. So I
went and changed the code like this:

*bucketno = hashvalue & (nbuckets - 1);
*batchno = (hashvalue >> (32 - hashtable->log2_nbatch));

I.e. using the top bits for batchno, low bits for bucketno (as before).

Hopefully I got it right this time. At least it seems to be working for
cases that failed before (no file leaks, proper rowcounts so far).

regards
Tomas

Attachments:

hashjoin-nbuckets-growth-v6-with-guc.patchtext/x-diff; name=hashjoin-nbuckets-growth-v6-with-guc.patchDownload

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d9663c..db3a953 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..48cca62 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,10 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
+bool enable_hashjoin_bucket = true;
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -48,6 +50,33 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 
+/*
+ * Compute appropriate size for hashtable given the estimated size of the
+ * relation to be hashed (number of rows and average row width).
+ *
+ * This is exported so that the planner's costsize.c can use it.
+ */
+
+/* Target bucket loading (tuples per bucket) */
+#define NTUP_PER_BUCKET			10
+
+/* Multiple of NTUP_PER_BUCKET triggering the increase of nbuckets.
+ * 
+ * Once we reach the threshold we double the number of buckets, and we
+ * want to get 1.0 on average (to get NTUP_PER_BUCKET on average). That
+ * means these two equations should hold
+ * 
+ *   b = 2a         (growth)
+ *   (a + b)/2 = 1  (oscillate around NTUP_PER_BUCKET)
+ * 
+ * which means b=1.3333 (a = b/2). If we wanted higher threshold, we
+ * could grow the nbuckets to (4*nbuckets), thus using (b=4a) for
+ * growth, leading to (b=1.6). Or (b=8a) giving 1.7777 etc.
+ * 
+ * Let's start with doubling the bucket count, i.e. 1.333. */
+#define NTUP_GROW_COEFFICIENT	1.333
+#define NTUP_GROW_THRESHOLD		(NTUP_PER_BUCKET * NTUP_GROW_COEFFICIENT)
+
 /* ----------------------------------------------------------------
  *		ExecHash
  *
@@ -116,6 +145,7 @@ MultiExecHash(HashState *node)
 				/* It's a skew tuple, so put it into that hash table */
 				ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 										bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -126,6 +156,23 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* If average number of tuples per bucket is over the defined threshold,
+	 * increase the number of buckets to get below it. */
+	if (enable_hashjoin_bucket) {
+
+		/* consider only tuples in the non-skew buckets */
+		double nonSkewTuples = (hashtable->totalTuples - hashtable->skewTuples);
+
+		if ((nonSkewTuples / hashtable->nbatch) > (hashtable->nbuckets * NTUP_GROW_THRESHOLD)) {
+
+#ifdef HJDEBUG
+			printf("Increasing nbucket to %d (average per bucket = %.1f)\n",
+				   nbuckets,  (batchTuples / hashtable->nbuckets));
+#endif
+			ExecHashIncreaseNumBuckets(hashtable);
+		}
+	}
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -239,6 +286,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	int			nbatch;
 	int			num_skew_mcvs;
 	int			log2_nbuckets;
+	int			log2_nbatch;
 	int			nkeys;
 	int			i;
 	ListCell   *ho;
@@ -263,6 +311,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	log2_nbuckets = my_log2(nbuckets);
 	Assert(nbuckets == (1 << log2_nbuckets));
 
+	/* nbatch must be a power of 2 */
+	log2_nbatch = my_log2(nbatch);
+	Assert(nbuckets == (1 << log2_nbatch));
+
 	/*
 	 * Initialize the hash table control block.
 	 *
@@ -271,6 +323,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
@@ -281,10 +334,12 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->skewBucketNums = NULL;
 	hashtable->nbatch = nbatch;
 	hashtable->curbatch = 0;
+	hashtable->log2_nbatch = log2_nbatch;
 	hashtable->nbatch_original = nbatch;
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -375,17 +430,6 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	return hashtable;
 }
 
-
-/*
- * Compute appropriate size for hashtable given the estimated size of the
- * relation to be hashed (number of rows and average row width).
- *
- * This is exported so that the planner's costsize.c can use it.
- */
-
-/* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET			10
-
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						int *numbuckets,
@@ -605,6 +649,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	MemoryContextSwitchTo(oldcxt);
 
 	hashtable->nbatch = nbatch;
+	hashtable->log2_nbatch += 1;
 
 	/*
 	 * Scan through the existing hash table entries and dump out any that are
@@ -682,6 +727,116 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	int			i;
+	int			oldnbuckets = hashtable->nbuckets;
+	HashJoinTuple  *oldbuckets = hashtable->buckets;
+	MemoryContext   oldcxt;
+	double		batchTuples = (hashtable->totalTuples / hashtable->nbatch);
+
+	/*
+	 * Determine the proper number of buckets, i.e. stop once the average
+	 * per bucket gets below the threshold (1.33 * NTUP_PER_BUCKET).
+	 * 
+	 * Also, check for overflow - this can only happen with extremely large
+	 * work_mem values, because (INT_MAX/2) means ~8GB only for the buckets.
+	 * With tuples, the hash table would require tens of GBs of work_mem.
+	 * 
+	 * XXX Technically there's also a limit for buckets fitting into work_mem
+	 * (with NTUP_PER_BUCKET tuples), but this can't be really exceeded
+	 * because when filling work_mem, another batch will be added (thus the
+	 * number of tuples will drop and more buckets won't be needed anymore).
+	 * 
+	 * That is, something like this will be enforced implicitly:
+	 * 
+	 *    work_mem * 1024L >= (nbuckets * tupsize * NTUP_GROW_THRESHOLD)
+	 * 
+	 * So it's enough to check only the overflow here.
+	 */
+
+	/* double the number of buckets until we get below the growth threshold, or
+	 * until we hit the overflow protection */
+	while ((batchTuples > (hashtable->nbuckets * NTUP_GROW_THRESHOLD))
+			&& (hashtable->nbuckets <= (INT_MAX/2))) {
+		hashtable->nbuckets *= 2;
+		hashtable->log2_nbuckets += 1;
+	}
+
+	/* XXX Not sure if we should update the info about used space here.
+	 * The code seems to ignore the space used for 'buckets' and we're not
+	 * allocating more space for tuples (just shuffling them to the new
+	 * buckets). And the amount of memory used for buckets is quite small
+	 * (just an array of pointers, thus ~8kB per 1k buckets on 64-bit). */
+
+	/* XXX Should we disable growth if (nbuckets * NTUP_PER_BUCKET)
+	 * reaches work_mem (or something like that)? We shouldn't really
+	 * get into such position (should be handled by increasing the
+	 * number of batches, which is called right before this). */
+
+	/* XXX Maybe adding info into hashjoin explain output (e.g. initial
+	 * nbuckets, time spent growing the table) would be appropriate. */
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets <= (INT_MAX/2));
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/* TODO Maybe it'd be better to resize the buckets in place (should be possible,
+	 * but when I tried it I always ended up with a strange infinite loop). */
+
+	/* allocate a new bucket list (use the batch context as before) */
+	oldcxt = MemoryContextSwitchTo(hashtable->batchCxt);
+
+	hashtable->buckets = (HashJoinTuple *)
+		palloc0(hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/* walk through the old buckets, move the buckets into the new table */
+	for (i = 0; i < oldnbuckets; i++)
+	{
+
+		HashJoinTuple tuple = oldbuckets[i];
+
+		while (tuple != NULL)
+		{
+			/* save link in case we delete */
+			HashJoinTuple nexttuple = tuple->next;
+			int			bucketno;
+			int			batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* move it to the correct bucket (in the new bucket array) */
+			tuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = tuple;
+
+			/* process the next tuple */
+			tuple = nexttuple;
+		}
+	}
+
+	pfree(oldbuckets);
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, batchTuples / hashtable->nbuckets);
+#endif
+	
+}
+
+
+/*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
  *		it may just go to a temp file for later batches
@@ -740,6 +895,7 @@ ExecHashTableInsert(HashJoinTable hashtable,
 			hashtable->spacePeak = hashtable->spaceUsed;
 		if (hashtable->spaceUsed > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
+
 	}
 	else
 	{
@@ -856,14 +1012,12 @@ ExecHashGetHashValue(HashJoinTable hashtable,
  * chains), and must only cause the batch number to remain the same or
  * increase.  Our algorithm is
  *		bucketno = hashvalue MOD nbuckets
- *		batchno = (hashvalue DIV nbuckets) MOD nbatch
+ *		batchno =  highers log2(nbatch) bits
  * where nbuckets and nbatch are both expected to be powers of 2, so we can
  * do the computations by shifting and masking.  (This assumes that all hash
  * functions are good about randomizing all their output bits, else we are
  * likely to have very skewed bucket or batch occupancy.)
  *
- * nbuckets doesn't change over the course of the join.
- *
  * nbatch is always a power of 2; we increase it only by doubling it.  This
  * effectively adds one more bit to the top of the batchno.
  */
@@ -880,7 +1034,7 @@ ExecHashGetBucketAndBatch(HashJoinTable hashtable,
 	{
 		/* we can do MOD by masking, DIV by shifting */
 		*bucketno = hashvalue & (nbuckets - 1);
-		*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
+		*batchno = (hashvalue >> (32 - hashtable->log2_nbatch));
 	}
 	else
 	{
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3a31a75..c92cc26 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -788,6 +788,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 	{
+		{"enable_hashjoin_bucket", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables dynamic increase of hash buckets."),
+			NULL
+		},
+		&enable_hashjoin_bucket,
+		true,
+		NULL, NULL, NULL
+	},
+	{
 		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
 			gettext_noop("Enables genetic query optimization."),
 			gettext_noop("This algorithm attempts to do planning without "
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 3beae40..82648ab 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -106,6 +106,7 @@ typedef struct HashSkewBucket
 typedef struct HashJoinTableData
 {
 	int			nbuckets;		/* # buckets in the in-memory hash table */
+	int			nbuckets_original;	/* # buckets when starting the first hash */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
@@ -122,6 +123,7 @@ typedef struct HashJoinTableData
 
 	int			nbatch;			/* number of batches */
 	int			curbatch;		/* current batch #; 0 during 1st pass */
+	int			log2_nbatch;	/* its log2 (nbuckets must be a power of 2) */
 
 	int			nbatch_original;	/* nbatch when we started inner scan */
 	int			nbatch_outstart;	/* nbatch when we started outer scan */
@@ -129,6 +131,7 @@ typedef struct HashJoinTableData
 	bool		growEnabled;	/* flag to shut off nbatch increases */
 
 	double		totalTuples;	/* # tuples obtained from inner plan */
+	double		skewTuples;		/* # tuples inserted into skew tuples */
 
 	/*
 	 * These arrays are allocated for the life of the hash join, but only if
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 75be5bd..15604cb 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -50,4 +50,6 @@ extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						int *num_skew_mcvs);
 extern int	ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue);
 
+extern bool enable_hashjoin_bucket;
+
 #endif   /* NODEHASH_H */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 75e2afb..60b8da8 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -62,6 +62,7 @@ extern bool enable_material;
 extern bool enable_mergejoin;
 extern bool enable_hashjoin;
 extern int	constraint_exclusion;
+extern bool enable_hashjoin_bucket;
 
 extern double clamp_row_est(double nrows);
 extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,

Atri Sharma

atri.jiit@gmail.com

over 11 years ago

In reply to: Tomas Vondra (#7)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On Tue, Jul 1, 2014 at 4:54 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

On 30.6.2014 23:12, Tomas Vondra wrote:

Hi,

attached is v5 of the patch. The main change is that scaling the number
of buckets is done only once, after the initial hash table is build. The
main advantage of this is lower price. This also allowed some cleanup of
unecessary code.

However, this new patch causes warning like this:

WARNING: temporary file leak: File 231 still referenced

I've been eyeballing the code for a while now, but I still fail to see
where this comes from :-( Any ideas?

Meh, the patches are wrong - I haven't realized the tight coupling
between buckets/batches in ExecHashGetBucketAndBatch:

*bucketno = hashvalue & (nbuckets - 1);
*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);

The previous patches worked mostly by pure luck (the nbuckets was set
only in the first batch), but once I moved the code at the end, it
started failing. And by "worked" I mean "didn't throw an error, but
probably returned bogus results with (nbatch>1)".

However, ISTM this nbuckets-nbatch coupling is not really necessary,
it's only constructed this way to assign independent batch/bucket. So I
went and changed the code like this:

*bucketno = hashvalue & (nbuckets - 1);
*batchno = (hashvalue >> (32 - hashtable->log2_nbatch));

I.e. using the top bits for batchno, low bits for bucketno (as before).

Hopefully I got it right this time. At least it seems to be working for
cases that failed before (no file leaks, proper rowcounts so far).

Hi,

I had a look at the patch and I was wondering if there is a way we can set
a cap on the resized size, and instead increase the number of batches
instead? Since this patch evaluates the growth of tuples vs increase of
space, we could look at increasing the number of batches itself instead of
number of buckets, if the resized number of buckets exceeds a threshold. It
shouldnt be too hard, AIUI it would involve a call in MultiExecHash right
after the new cost evaluation the patch adds.

It isnt a target of the current patch, but I think that it is a logical
extension to the same.

Thoughts/Comments?

Regards,

Atri
--
Regards,

Atri
*l'apprenant*

tv@fuzzy.cz

over 11 years ago

In reply to: Tomas Vondra (#7)

1 attachment(s)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 1.7.2014 01:24, Tomas Vondra wrote:

On 30.6.2014 23:12, Tomas Vondra wrote:

Hi,

Hopefully I got it right this time. At least it seems to be working for
cases that failed before (no file leaks, proper rowcounts so far).

Attached v7, fixing nbatch/ntuples in an assert.

regards
Tomas

Attachments:

hashjoin-nbuckets-growth-v7.patchtext/x-diff; name=hashjoin-nbuckets-growth-v7.patchDownload

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d9663c..db3a953 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..53642d1 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,10 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
+bool enable_hashjoin_bucket = true;
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -48,6 +50,33 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 
+/*
+ * Compute appropriate size for hashtable given the estimated size of the
+ * relation to be hashed (number of rows and average row width).
+ *
+ * This is exported so that the planner's costsize.c can use it.
+ */
+
+/* Target bucket loading (tuples per bucket) */
+#define NTUP_PER_BUCKET			10
+
+/* Multiple of NTUP_PER_BUCKET triggering the increase of nbuckets.
+ * 
+ * Once we reach the threshold we double the number of buckets, and we
+ * want to get 1.0 on average (to get NTUP_PER_BUCKET on average). That
+ * means these two equations should hold
+ * 
+ *   b = 2a         (growth)
+ *   (a + b)/2 = 1  (oscillate around NTUP_PER_BUCKET)
+ * 
+ * which means b=1.3333 (a = b/2). If we wanted higher threshold, we
+ * could grow the nbuckets to (4*nbuckets), thus using (b=4a) for
+ * growth, leading to (b=1.6). Or (b=8a) giving 1.7777 etc.
+ * 
+ * Let's start with doubling the bucket count, i.e. 1.333. */
+#define NTUP_GROW_COEFFICIENT	1.333
+#define NTUP_GROW_THRESHOLD		(NTUP_PER_BUCKET * NTUP_GROW_COEFFICIENT)
+
 /* ----------------------------------------------------------------
  *		ExecHash
  *
@@ -116,6 +145,7 @@ MultiExecHash(HashState *node)
 				/* It's a skew tuple, so put it into that hash table */
 				ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 										bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -126,6 +156,23 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* If average number of tuples per bucket is over the defined threshold,
+	 * increase the number of buckets to get below it. */
+	if (enable_hashjoin_bucket) {
+
+		/* consider only tuples in the non-skew buckets */
+		double nonSkewTuples = (hashtable->totalTuples - hashtable->skewTuples);
+
+		if ((nonSkewTuples / hashtable->nbatch) > (hashtable->nbuckets * NTUP_GROW_THRESHOLD)) {
+
+#ifdef HJDEBUG
+			printf("Increasing nbucket to %d (average per bucket = %.1f)\n",
+				   nbuckets,  (batchTuples / hashtable->nbuckets));
+#endif
+			ExecHashIncreaseNumBuckets(hashtable);
+		}
+	}
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -239,6 +286,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	int			nbatch;
 	int			num_skew_mcvs;
 	int			log2_nbuckets;
+	int			log2_nbatch;
 	int			nkeys;
 	int			i;
 	ListCell   *ho;
@@ -263,6 +311,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	log2_nbuckets = my_log2(nbuckets);
 	Assert(nbuckets == (1 << log2_nbuckets));
 
+	/* nbatch must be a power of 2 */
+	log2_nbatch = my_log2(nbatch);
+	Assert(nbatch == (1 << log2_nbatch));
+
 	/*
 	 * Initialize the hash table control block.
 	 *
@@ -271,6 +323,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
@@ -281,10 +334,12 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->skewBucketNums = NULL;
 	hashtable->nbatch = nbatch;
 	hashtable->curbatch = 0;
+	hashtable->log2_nbatch = log2_nbatch;
 	hashtable->nbatch_original = nbatch;
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -375,17 +430,6 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	return hashtable;
 }
 
-
-/*
- * Compute appropriate size for hashtable given the estimated size of the
- * relation to be hashed (number of rows and average row width).
- *
- * This is exported so that the planner's costsize.c can use it.
- */
-
-/* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET			10
-
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						int *numbuckets,
@@ -605,6 +649,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	MemoryContextSwitchTo(oldcxt);
 
 	hashtable->nbatch = nbatch;
+	hashtable->log2_nbatch += 1;
 
 	/*
 	 * Scan through the existing hash table entries and dump out any that are
@@ -682,6 +727,116 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	int			i;
+	int			oldnbuckets = hashtable->nbuckets;
+	HashJoinTuple  *oldbuckets = hashtable->buckets;
+	MemoryContext   oldcxt;
+	double		batchTuples = (hashtable->totalTuples / hashtable->nbatch);
+
+	/*
+	 * Determine the proper number of buckets, i.e. stop once the average
+	 * per bucket gets below the threshold (1.33 * NTUP_PER_BUCKET).
+	 * 
+	 * Also, check for overflow - this can only happen with extremely large
+	 * work_mem values, because (INT_MAX/2) means ~8GB only for the buckets.
+	 * With tuples, the hash table would require tens of GBs of work_mem.
+	 * 
+	 * XXX Technically there's also a limit for buckets fitting into work_mem
+	 * (with NTUP_PER_BUCKET tuples), but this can't be really exceeded
+	 * because when filling work_mem, another batch will be added (thus the
+	 * number of tuples will drop and more buckets won't be needed anymore).
+	 * 
+	 * That is, something like this will be enforced implicitly:
+	 * 
+	 *    work_mem * 1024L >= (nbuckets * tupsize * NTUP_GROW_THRESHOLD)
+	 * 
+	 * So it's enough to check only the overflow here.
+	 */
+
+	/* double the number of buckets until we get below the growth threshold, or
+	 * until we hit the overflow protection */
+	while ((batchTuples > (hashtable->nbuckets * NTUP_GROW_THRESHOLD))
+			&& (hashtable->nbuckets <= (INT_MAX/2))) {
+		hashtable->nbuckets *= 2;
+		hashtable->log2_nbuckets += 1;
+	}
+
+	/* XXX Not sure if we should update the info about used space here.
+	 * The code seems to ignore the space used for 'buckets' and we're not
+	 * allocating more space for tuples (just shuffling them to the new
+	 * buckets). And the amount of memory used for buckets is quite small
+	 * (just an array of pointers, thus ~8kB per 1k buckets on 64-bit). */
+
+	/* XXX Should we disable growth if (nbuckets * NTUP_PER_BUCKET)
+	 * reaches work_mem (or something like that)? We shouldn't really
+	 * get into such position (should be handled by increasing the
+	 * number of batches, which is called right before this). */
+
+	/* XXX Maybe adding info into hashjoin explain output (e.g. initial
+	 * nbuckets, time spent growing the table) would be appropriate. */
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets <= (INT_MAX/2));
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/* TODO Maybe it'd be better to resize the buckets in place (should be possible,
+	 * but when I tried it I always ended up with a strange infinite loop). */
+
+	/* allocate a new bucket list (use the batch context as before) */
+	oldcxt = MemoryContextSwitchTo(hashtable->batchCxt);
+
+	hashtable->buckets = (HashJoinTuple *)
+		palloc0(hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/* walk through the old buckets, move the buckets into the new table */
+	for (i = 0; i < oldnbuckets; i++)
+	{
+
+		HashJoinTuple tuple = oldbuckets[i];
+
+		while (tuple != NULL)
+		{
+			/* save link in case we delete */
+			HashJoinTuple nexttuple = tuple->next;
+			int			bucketno;
+			int			batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* move it to the correct bucket (in the new bucket array) */
+			tuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = tuple;
+
+			/* process the next tuple */
+			tuple = nexttuple;
+		}
+	}
+
+	pfree(oldbuckets);
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, batchTuples / hashtable->nbuckets);
+#endif
+	
+}
+
+
+/*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
  *		it may just go to a temp file for later batches
@@ -740,6 +895,7 @@ ExecHashTableInsert(HashJoinTable hashtable,
 			hashtable->spacePeak = hashtable->spaceUsed;
 		if (hashtable->spaceUsed > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
+
 	}
 	else
 	{
@@ -856,14 +1012,12 @@ ExecHashGetHashValue(HashJoinTable hashtable,
  * chains), and must only cause the batch number to remain the same or
  * increase.  Our algorithm is
  *		bucketno = hashvalue MOD nbuckets
- *		batchno = (hashvalue DIV nbuckets) MOD nbatch
+ *		batchno =  highers log2(nbatch) bits
  * where nbuckets and nbatch are both expected to be powers of 2, so we can
  * do the computations by shifting and masking.  (This assumes that all hash
  * functions are good about randomizing all their output bits, else we are
  * likely to have very skewed bucket or batch occupancy.)
  *
- * nbuckets doesn't change over the course of the join.
- *
  * nbatch is always a power of 2; we increase it only by doubling it.  This
  * effectively adds one more bit to the top of the batchno.
  */
@@ -880,7 +1034,7 @@ ExecHashGetBucketAndBatch(HashJoinTable hashtable,
 	{
 		/* we can do MOD by masking, DIV by shifting */
 		*bucketno = hashvalue & (nbuckets - 1);
-		*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
+		*batchno = (hashvalue >> (32 - hashtable->log2_nbatch));
 	}
 	else
 	{
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3a31a75..c92cc26 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -788,6 +788,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 	{
+		{"enable_hashjoin_bucket", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables dynamic increase of hash buckets."),
+			NULL
+		},
+		&enable_hashjoin_bucket,
+		true,
+		NULL, NULL, NULL
+	},
+	{
 		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
 			gettext_noop("Enables genetic query optimization."),
 			gettext_noop("This algorithm attempts to do planning without "
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 3beae40..82648ab 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -106,6 +106,7 @@ typedef struct HashSkewBucket
 typedef struct HashJoinTableData
 {
 	int			nbuckets;		/* # buckets in the in-memory hash table */
+	int			nbuckets_original;	/* # buckets when starting the first hash */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
@@ -122,6 +123,7 @@ typedef struct HashJoinTableData
 
 	int			nbatch;			/* number of batches */
 	int			curbatch;		/* current batch #; 0 during 1st pass */
+	int			log2_nbatch;	/* its log2 (nbuckets must be a power of 2) */
 
 	int			nbatch_original;	/* nbatch when we started inner scan */
 	int			nbatch_outstart;	/* nbatch when we started outer scan */
@@ -129,6 +131,7 @@ typedef struct HashJoinTableData
 	bool		growEnabled;	/* flag to shut off nbatch increases */
 
 	double		totalTuples;	/* # tuples obtained from inner plan */
+	double		skewTuples;		/* # tuples inserted into skew tuples */
 
 	/*
 	 * These arrays are allocated for the life of the hash join, but only if
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 75be5bd..15604cb 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -50,4 +50,6 @@ extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						int *num_skew_mcvs);
 extern int	ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue);
 
+extern bool enable_hashjoin_bucket;
+
 #endif   /* NODEHASH_H */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 75e2afb..60b8da8 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -62,6 +62,7 @@ extern bool enable_material;
 extern bool enable_mergejoin;
 extern bool enable_hashjoin;
 extern int	constraint_exclusion;
+extern bool enable_hashjoin_bucket;
 
 extern double clamp_row_est(double nrows);
 extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,

#10

tv@fuzzy.cz

over 11 years ago

In reply to: Atri Sharma (#8)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 3.7.2014 15:42, Atri Sharma wrote:

On Tue, Jul 1, 2014 at 4:54 AM, Tomas Vondra <tv@fuzzy.cz
<mailto:tv@fuzzy.cz>> wrote:

On 30.6.2014 23:12, Tomas Vondra wrote:

Hi,

attached is v5 of the patch. The main change is that scaling the

number

of buckets is done only once, after the initial hash table is

build. The

main advantage of this is lower price. This also allowed some

cleanup of

unecessary code.

However, this new patch causes warning like this:

WARNING: temporary file leak: File 231 still referenced

I've been eyeballing the code for a while now, but I still fail to see
where this comes from :-( Any ideas?

Meh, the patches are wrong - I haven't realized the tight coupling
between buckets/batches in ExecHashGetBucketAndBatch:

*bucketno = hashvalue & (nbuckets - 1);
*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);

The previous patches worked mostly by pure luck (the nbuckets was set
only in the first batch), but once I moved the code at the end, it
started failing. And by "worked" I mean "didn't throw an error, but
probably returned bogus results with (nbatch>1)".

However, ISTM this nbuckets-nbatch coupling is not really necessary,
it's only constructed this way to assign independent batch/bucket. So I
went and changed the code like this:

*bucketno = hashvalue & (nbuckets - 1);
*batchno = (hashvalue >> (32 - hashtable->log2_nbatch));

I.e. using the top bits for batchno, low bits for bucketno (as before).

Hopefully I got it right this time. At least it seems to be working for
cases that failed before (no file leaks, proper rowcounts so far).

Hi,

I had a look at the patch and I was wondering if there is a way we can
set a cap on the resized size, and instead increase the number of
batches instead? Since this patch evaluates the growth of tuples vs
increase of space, we could look at increasing the number of batches
itself instead of number of buckets, if the resized number of buckets
exceeds a threshold. It shouldnt be too hard, AIUI it would involve a
call in MultiExecHash right after the new cost evaluation the patch adds.

So you propose to fill the hash table, and when hitting NTUP_PER_BUCKET
(i.e. after adding 'nbuckets * NTUP_PER_BUCKET' tuples) increase the
number of batches?

Hmmm, that'd be easy to implement. I don't think it's possible to do
that in MultiExecHash (that's too late). But it's a matter of changing
this condition in ExecHashTableInsert:

if (hashtable->spaceUsed > hashtable->spaceAllowed)
ExecHashIncreaseNumBatches(hashtable);

to something like this

if ((hashtable->spaceUsed > hashtable->spaceAllowed) ||
(hashtable->totalTuples / hashtable->nbatch
== NTUP_PER_BUCKET * hashtable->nbuckets))
ExecHashIncreaseNumBatches(hashtable);

But I don't think increasing number of batches is a good approach, as it
means more rescans. I think using memory is more efficient (after all,
that's what work_mem is for, right?).

regards
Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

tv@fuzzy.cz

over 11 years ago

In reply to: Tomas Vondra (#9)

1 attachment(s)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 3.7.2014 19:36, Tomas Vondra wrote:

On 1.7.2014 01:24, Tomas Vondra wrote:

On 30.6.2014 23:12, Tomas Vondra wrote:

Hi,

Hopefully I got it right this time. At least it seems to be working
for cases that failed before (no file leaks, proper rowcounts so
far).

Attached v7, fixing nbatch/ntuples in an assert.

regards
Tomas

Attached is v8 of this patch, significantly reworked based on the
related discussion (mostly in 'tweaking NTUP_PER_BUCKET').

With the patch applied, the hash table works like this:

initial sizing (ExecChooseHashTableSize)
----------------------------------------
- size nbuckets for NTUP_PER_BUCKET=1
- account for memory allocated for buckets

building the hash table
-----------------------
- while adding tuples, keep track of optimal number of buckets (for the
current number of tuples per batch)
- don't resize until all the tuples are added (by increasing nbatch the
number of buckets may decrease)
- after adding all tuples, consider resizing (optimal nbuckets !=
initial nbuckets)
- do the resize

This means that for well estimated plans (reasonably exact tuple count
for the inner relation), the hash table is properly sized and no resize
is needed.

For badly estimated plans, this works about the same as the previous
patches (we're accounting for memory needed for buckets etc.).

More batches or less buckets?
-----------------------------
In the related threads, I repeatedly mentioned that I don't know a good
answer to "Should I add more batches or decrease the number of buckets?"
while sizing the hash table. The current code (as in HEAD) does not face
this dillema because (a) it does not count buckets into work_mem and (b)
does not allow changing nbuckets.

I still don't have a good answer to this question, but I think the patch
can stand even without it. If you want more/less batches, use
smaller/larger work_mem - same as with the current code. Also, with the
'dense allocation' patch [1]/messages/by-id/53C2DECC.2010408@fuzzy.cz, it's possible to use larger work_mem
values (and thus compensate for counting buckets into work_mem).

Changes since v7
----------------

(a) remove NTUP_GROW_THRESHOLD (just use NTUP_PER_BUCKET directly)
(b) set NTUP_PER_BUCKET=1
(c) include buckets (optimal) when considering nbatch increase
(d) track optimal number of buckets (for NTUP_PER_BUCKET)
(e) after loading all the tuples, resize the hash table to get
nbuckets_optimal (if needed)
(f) renamed GUC to enable_hash_resize (makes a bit more sense)

Comments
--------

(a) NTUP_GROW_THRESHOLD was overcomplicated and mostly a result of
misunderstanding how NTUP_PER_BUCKET works (it's upper threshold,
not average) and is not really needed.

(b) The value "1" gives the best performance, but also requires more
memory for the buckets. This should be more than compensated by
densely allocating the tuples (see hashjoin-alloc-v3.patch
in the "tweaking NTUP_PER_BUCKET" thread [1]/messages/by-id/53C2DECC.2010408@fuzzy.cz).

(c,d) Originally, the memory needed for buckets was not included in
'spaceUsed', but because the NTUP_PER_BUCKET=1 change this is
not really acceptable (needs much more memory than before).
So now it's included both in the initial sizing of the hash
table, and when adding more tuples (nbuckets_optimal).

Still, spaceUsed does not include palloc overhead (which may be
a significant amount of memory). This is tackled by [1]/messages/by-id/53C2DECC.2010408@fuzzy.cz.

(e) No change here.

(f) A bit better GUC name. Anyway, this is for easier development
only, and won't be included in the final patch.

regards
Tomas

[1]: /messages/by-id/53C2DECC.2010408@fuzzy.cz

Attachments:

hashjoin-nbuckets-growth-v8.patchtext/x-diff; name=hashjoin-nbuckets-growth-v8.patchDownload

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d9663c..db3a953 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..1befed5 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,10 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
+bool enable_hash_resize = true;
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -48,6 +50,9 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(void*))
+
 /* ----------------------------------------------------------------
  *		ExecHash
  *
@@ -116,6 +121,7 @@ MultiExecHash(HashState *node)
 				/* It's a skew tuple, so put it into that hash table */
 				ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 										bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -126,6 +132,31 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* If average number of tuples per bucket is over the defined threshold,
+	 * increase the number of buckets to get below it. */
+	if (enable_hash_resize) {
+
+		elog(WARNING, "nbuckets=%d optimal=%d tuples=%.0f", hashtable->nbuckets, hashtable->nbuckets_optimal,
+			hashtable->totalTuples - hashtable->skewTuples);
+
+		/* Do we actually need to resize? */
+		if (hashtable->nbuckets != hashtable->nbuckets_optimal) {
+
+#ifdef HJDEBUG
+			printf("Increasing nbuckets %d => %d\n",
+				   hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+			ExecHashIncreaseNumBuckets(hashtable);
+		}
+	}
+
+	elog(WARNING, "buckets space = %ld", BUCKETS_SPACE(hashtable));
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += BUCKETS_SPACE(hashtable);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+			hashtable->spacePeak = hashtable->spaceUsed;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -239,6 +270,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	int			nbatch;
 	int			num_skew_mcvs;
 	int			log2_nbuckets;
+	int			log2_nbatch;
 	int			nkeys;
 	int			i;
 	ListCell   *ho;
@@ -263,6 +295,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	log2_nbuckets = my_log2(nbuckets);
 	Assert(nbuckets == (1 << log2_nbuckets));
 
+	/* nbatch must be a power of 2 */
+	log2_nbatch = my_log2(nbatch);
+	Assert(nbatch == (1 << log2_nbatch));
+
 	/*
 	 * Initialize the hash table control block.
 	 *
@@ -271,6 +307,8 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
+	hashtable->nbuckets_optimal = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
@@ -281,10 +319,12 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->skewBucketNums = NULL;
 	hashtable->nbatch = nbatch;
 	hashtable->curbatch = 0;
+	hashtable->log2_nbatch = log2_nbatch;
 	hashtable->nbatch_original = nbatch;
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -375,7 +415,6 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	return hashtable;
 }
 
-
 /*
  * Compute appropriate size for hashtable given the estimated size of the
  * relation to be hashed (number of rows and average row width).
@@ -384,7 +423,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
  */
 
 /* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET			10
+#define NTUP_PER_BUCKET			1
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
@@ -394,6 +433,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 {
 	int			tupsize;
 	double		inner_rel_bytes;
+	double		buckets_bytes;
 	long		hash_table_bytes;
 	long		skew_table_bytes;
 	long		max_pointers;
@@ -415,6 +455,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		MAXALIGN(tupwidth);
 	inner_rel_bytes = ntuples * tupsize;
 
+	/* Estimate memory needed for buckets, with the target NTUP_PER_BUCKET */
+	buckets_bytes = sizeof(void*) * my_log2(ntuples / NTUP_PER_BUCKET);
+
 	/*
 	 * Target in-memory hashtable size is work_mem kilobytes.
 	 */
@@ -466,23 +509,48 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/* also ensure we avoid integer overflow in nbatch and nbuckets */
 	max_pointers = Min(max_pointers, INT_MAX / 2);
 
-	if (inner_rel_bytes > hash_table_bytes)
+	if ((inner_rel_bytes + buckets_bytes) > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
 		long		lbuckets;
 		double		dbatch;
 		int			minbatch;
-
-		lbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
-		lbuckets = Min(lbuckets, max_pointers);
-		nbuckets = (int) lbuckets;
+		double		batch_bytes;
 
 		dbatch = ceil(inner_rel_bytes / hash_table_bytes);
 		dbatch = Min(dbatch, max_pointers);
 		minbatch = (int) dbatch;
 		nbatch = 2;
+
+		/* increase nbatch until the hashtable fits into work_mem */
 		while (nbatch < minbatch)
 			nbatch <<= 1;
+
+		/* now, we need to see if the buckets fit into work_mem too */
+		lbuckets = 1 << (my_log2(ntuples / nbatch / NTUP_PER_BUCKET));
+
+		/* assuming equally-sized batches, compute bytes of hash table and buckets */
+		batch_bytes = inner_rel_bytes / nbatch;
+		buckets_bytes = lbuckets * sizeof(void*);
+
+		/* we need to get both into work_mem */
+		while (batch_bytes + buckets_bytes > hash_table_bytes) {
+
+			/* increment number of batches, and re-evaluate the amount of memory */
+			nbatch <<= 1;
+
+			/* now, we need to see if the buckets fit into work_mem too */
+			lbuckets = 1 << (my_log2(ntuples / nbatch / NTUP_PER_BUCKET));
+
+			/* assuming equally-sized batches, compute bytes of hash table and buckets */
+			batch_bytes = inner_rel_bytes / nbatch;
+			buckets_bytes = lbuckets * sizeof(void*);
+
+		}
+
+		/* protect against nbucket overflow */
+		lbuckets = Min(lbuckets, max_pointers);
+		nbuckets = (int) lbuckets;
 	}
 	else
 	{
@@ -560,6 +628,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
+	double		batchTuples;
 
 	/* do nothing if we've decided to shut off growth */
 	if (!hashtable->growEnabled)
@@ -605,6 +674,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	MemoryContextSwitchTo(oldcxt);
 
 	hashtable->nbatch = nbatch;
+	hashtable->log2_nbatch += 1;
 
 	/*
 	 * Scan through the existing hash table entries and dump out any that are
@@ -679,8 +749,109 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 		printf("Disabling further increase of nbatch\n");
 #endif
 	}
+
+	/*
+	 * After increasing the number of batches, the number of tuples per batch
+	 * is lower (~1/2), so the optimal number of buckets might be lower too.
+	 * It's probably (nbuckets_optimal/2) but let's compute it anyway.
+	 */
+	hashtable->nbuckets_optimal = 10;
+	batchTuples = (hashtable->totalTuples - hashtable->skewTuples) / hashtable->nbatch;
+	while ((batchTuples / hashtable->nbuckets_optimal > NTUP_PER_BUCKET)
+		&& (hashtable->nbuckets_optimal <= (INT_MAX/2))) {
+		hashtable->nbuckets_optimal *= 2;
+	}
+}
+
+/*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	int			i;
+	int			oldnbuckets = hashtable->nbuckets;
+	HashJoinTuple  *oldbuckets = hashtable->buckets;
+	MemoryContext   oldcxt;
+	double		batchTuples = (hashtable->totalTuples - hashtable->skewTuples) / hashtable->nbatch;
+
+	/*
+	 * We already know the optimal number of buckets, so let's just
+	 * compute the log2_nbuckets for it.
+	 */
+	hashtable->nbuckets = hashtable->nbuckets_optimal;
+	hashtable->log2_nbuckets = my_log2(hashtable->nbuckets_optimal);
+
+	/* XXX Not sure if we should update the info about used space here.
+	 * The code seems to ignore the space used for 'buckets' and we're not
+	 * allocating more space for tuples (just shuffling them to the new
+	 * buckets). And the amount of memory used for buckets is quite small
+	 * (just an array of pointers, thus ~8kB per 1k buckets on 64-bit). */
+
+	/* XXX Should we disable growth if (nbuckets * NTUP_PER_BUCKET)
+	 * reaches work_mem (or something like that)? We shouldn't really
+	 * get into such position (should be handled by increasing the
+	 * number of batches, which is called right before this). */
+
+	/* XXX Maybe adding info into hashjoin explain output (e.g. initial
+	 * nbuckets, time spent growing the table) would be appropriate. */
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets <= (INT_MAX/2));
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/* TODO Maybe it'd be better to resize the buckets in place (should be possible,
+	 * but when I tried it I always ended up with a strange infinite loop). */
+
+	/* allocate a new bucket list (use the batch context as before) */
+	oldcxt = MemoryContextSwitchTo(hashtable->batchCxt);
+
+	hashtable->buckets = (HashJoinTuple *)
+		palloc0(hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/* walk through the old buckets, move the buckets into the new table */
+	for (i = 0; i < oldnbuckets; i++)
+	{
+
+		HashJoinTuple tuple = oldbuckets[i];
+
+		while (tuple != NULL)
+		{
+			/* save link in case we delete */
+			HashJoinTuple nexttuple = tuple->next;
+			int			bucketno;
+			int			batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* move it to the correct bucket (in the new bucket array) */
+			tuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = tuple;
+
+			/* process the next tuple */
+			tuple = nexttuple;
+		}
+	}
+
+	pfree(oldbuckets);
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, batchTuples / hashtable->nbuckets);
+#endif
+	
 }
 
+
 /*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
@@ -734,12 +905,26 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashTuple->next = hashtable->buckets[bucketno];
 		hashtable->buckets[bucketno] = hashTuple;
 
+		/* Increase the number of buckets (optimal) if we exceeded NTUP_PER_BUCKET. */
+		if (((hashtable->totalTuples - hashtable->skewTuples) / hashtable->nbatch)
+			>= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET)) {
+
+			/* overflow protection */
+			if (hashtable->nbuckets_optimal <= (INT_MAX/2))
+				hashtable->nbuckets_optimal *= 2;
+
+		}
+
 		/* Account for space used, and back off if we've used too much */
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
-		if (hashtable->spaceUsed > hashtable->spaceAllowed)
+
+		/* Do we need to increase number of batches? Account for space required
+		 * by buckets (optimal number). */
+		if (hashtable->spaceUsed + BUCKETS_SPACE(hashtable) > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
+
 	}
 	else
 	{
@@ -856,14 +1041,12 @@ ExecHashGetHashValue(HashJoinTable hashtable,
  * chains), and must only cause the batch number to remain the same or
  * increase.  Our algorithm is
  *		bucketno = hashvalue MOD nbuckets
- *		batchno = (hashvalue DIV nbuckets) MOD nbatch
+ *		batchno =  highers log2(nbatch) bits
  * where nbuckets and nbatch are both expected to be powers of 2, so we can
  * do the computations by shifting and masking.  (This assumes that all hash
  * functions are good about randomizing all their output bits, else we are
  * likely to have very skewed bucket or batch occupancy.)
  *
- * nbuckets doesn't change over the course of the join.
- *
  * nbatch is always a power of 2; we increase it only by doubling it.  This
  * effectively adds one more bit to the top of the batchno.
  */
@@ -880,7 +1063,7 @@ ExecHashGetBucketAndBatch(HashJoinTable hashtable,
 	{
 		/* we can do MOD by masking, DIV by shifting */
 		*bucketno = hashvalue & (nbuckets - 1);
-		*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
+		*batchno = (hashvalue >> (32 - hashtable->log2_nbatch));
 	}
 	else
 	{
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3a31a75..f1cce9f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -788,6 +788,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 	{
+		{"enable_hash_resize", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables dynamic increase of hash buckets."),
+			NULL
+		},
+		&enable_hash_resize,
+		true,
+		NULL, NULL, NULL
+	},
+	{
 		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
 			gettext_noop("Enables genetic query optimization."),
 			gettext_noop("This algorithm attempts to do planning without "
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 3beae40..acabb48 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -106,8 +106,11 @@ typedef struct HashSkewBucket
 typedef struct HashJoinTableData
 {
 	int			nbuckets;		/* # buckets in the in-memory hash table */
+	int			nbuckets_original;	/* # buckets when starting the first hash */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
+	int			nbuckets_optimal;	/* optimal number of buckets (for batch) */
+
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
 	struct HashJoinTupleData **buckets;
 	/* buckets array is per-batch storage, as are all the tuples */
@@ -122,6 +125,7 @@ typedef struct HashJoinTableData
 
 	int			nbatch;			/* number of batches */
 	int			curbatch;		/* current batch #; 0 during 1st pass */
+	int			log2_nbatch;	/* its log2 (nbuckets must be a power of 2) */
 
 	int			nbatch_original;	/* nbatch when we started inner scan */
 	int			nbatch_outstart;	/* nbatch when we started outer scan */
@@ -129,6 +133,7 @@ typedef struct HashJoinTableData
 	bool		growEnabled;	/* flag to shut off nbatch increases */
 
 	double		totalTuples;	/* # tuples obtained from inner plan */
+	double		skewTuples;		/* # tuples inserted into skew tuples */
 
 	/*
 	 * These arrays are allocated for the life of the hash join, but only if
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 75e2afb..e7c5f50 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -62,6 +62,7 @@ extern bool enable_material;
 extern bool enable_mergejoin;
 extern bool enable_hashjoin;
 extern int	constraint_exclusion;
+extern bool enable_hash_resize;
 
 extern double clamp_row_est(double nrows);
 extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,

#12

tv@fuzzy.cz

over 11 years ago

In reply to: Tomas Vondra (#11)

2 attachment(s)

Re: bad estimation together with large work_mem generates terrible slow hash joins

Attached v9 of the patch. Aside from a few minor fixes, the main change
is that this is assumed to be combined with the "dense allocation" patch.

It also rewrites the ExecHashIncreaseNumBuckets to follow the same
pattern as ExecHashIncreaseNumBatches (i.e. scanning chunks directly,
instead of buckets). It's cleaner and more consistent.

hashjoin-nbuckets-growth-v9.patch contains only this patch, so you need
to grab the hashjoin-alloc-v4.patch from a different thread and apply it
first)

hashjoin-nbuckets-growth-v9-combined.patch contains both patches combined

regards
Tomas Vondra

Attachments:

hashjoin-nbuckets-growth-v9-combined.patchtext/x-diff; name=hashjoin-nbuckets-growth-v9-combined.patchDownload

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..ee3fffb 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..014b6f1 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,10 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
+bool enable_hash_resize = true;
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -47,6 +49,10 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 						int bucketNumber);
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
+static char * chunk_alloc(HashJoinTable hashtable, int tupleSize);
+
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(void*))
 
 /* ----------------------------------------------------------------
  *		ExecHash
@@ -116,6 +122,7 @@ MultiExecHash(HashState *node)
 				/* It's a skew tuple, so put it into that hash table */
 				ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 										bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -126,10 +133,34 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* If average number of tuples per bucket is over the defined threshold,
+	 * increase the number of buckets to get below it. */
+	if (enable_hash_resize) {
+
+		/* 
+		 * Do we actually need to resize? Only scale up, don't scale down (for example
+		 * after adding a batch, it's possible that nbuckets > nbuckets_optimal).
+		 */
+		if (hashtable->nbuckets < hashtable->nbuckets_optimal) {
+
+#ifdef HJDEBUG
+			printf("Increasing nbuckets %d => %d\n",
+				   hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+			ExecHashIncreaseNumBuckets(hashtable);
+		}
+
+	}
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += BUCKETS_SPACE(hashtable);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+			hashtable->spacePeak = hashtable->spaceUsed;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
-
+	
 	/*
 	 * We do not return the hash table directly because it's not a subtype of
 	 * Node, and so would violate the MultiExecProcNode API.  Instead, our
@@ -223,7 +254,6 @@ ExecEndHash(HashState *node)
 	ExecEndNode(outerPlan);
 }
 
-
 /* ----------------------------------------------------------------
  *		ExecHashTableCreate
  *
@@ -239,6 +269,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	int			nbatch;
 	int			num_skew_mcvs;
 	int			log2_nbuckets;
+	int			log2_nbatch;
 	int			nkeys;
 	int			i;
 	ListCell   *ho;
@@ -263,6 +294,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	log2_nbuckets = my_log2(nbuckets);
 	Assert(nbuckets == (1 << log2_nbuckets));
 
+	/* nbatch must be a power of 2 */
+	log2_nbatch = my_log2(nbatch);
+	Assert(nbatch == (1 << log2_nbatch));
+
 	/*
 	 * Initialize the hash table control block.
 	 *
@@ -271,6 +306,8 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
+	hashtable->nbuckets_optimal = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
@@ -281,10 +318,12 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->skewBucketNums = NULL;
 	hashtable->nbatch = nbatch;
 	hashtable->curbatch = 0;
+	hashtable->log2_nbatch = log2_nbatch;
 	hashtable->nbatch_original = nbatch;
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -294,6 +333,8 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->spaceAllowedSkew =
 		hashtable->spaceAllowed * SKEW_WORK_MEM_PERCENT / 100;
 
+	hashtable->chunks_batch = NULL;
+
 	/*
 	 * Get info about the hash functions to be used for each hash key. Also
 	 * remember whether the join operators are strict.
@@ -375,7 +416,6 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	return hashtable;
 }
 
-
 /*
  * Compute appropriate size for hashtable given the estimated size of the
  * relation to be hashed (number of rows and average row width).
@@ -384,7 +424,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
  */
 
 /* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET			10
+#define NTUP_PER_BUCKET			1
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
@@ -394,6 +434,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 {
 	int			tupsize;
 	double		inner_rel_bytes;
+	double		buckets_bytes;
 	long		hash_table_bytes;
 	long		skew_table_bytes;
 	long		max_pointers;
@@ -415,6 +456,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		MAXALIGN(tupwidth);
 	inner_rel_bytes = ntuples * tupsize;
 
+	/* Estimate memory needed for buckets, with the target NTUP_PER_BUCKET */
+	buckets_bytes = sizeof(void*) * my_log2(ntuples / NTUP_PER_BUCKET);
+
 	/*
 	 * Target in-memory hashtable size is work_mem kilobytes.
 	 */
@@ -466,23 +510,48 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/* also ensure we avoid integer overflow in nbatch and nbuckets */
 	max_pointers = Min(max_pointers, INT_MAX / 2);
 
-	if (inner_rel_bytes > hash_table_bytes)
+	if ((inner_rel_bytes + buckets_bytes) > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
 		long		lbuckets;
 		double		dbatch;
 		int			minbatch;
-
-		lbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
-		lbuckets = Min(lbuckets, max_pointers);
-		nbuckets = (int) lbuckets;
+		double		batch_bytes;
 
 		dbatch = ceil(inner_rel_bytes / hash_table_bytes);
 		dbatch = Min(dbatch, max_pointers);
 		minbatch = (int) dbatch;
 		nbatch = 2;
+
+		/* increase nbatch until the hashtable fits into work_mem */
 		while (nbatch < minbatch)
 			nbatch <<= 1;
+
+		/* now, we need to see if the buckets fit into work_mem too */
+		lbuckets = 1 << (my_log2(ntuples / nbatch / NTUP_PER_BUCKET));
+
+		/* assuming equally-sized batches, compute bytes of hash table and buckets */
+		batch_bytes = inner_rel_bytes / nbatch;
+		buckets_bytes = lbuckets * sizeof(void*);
+
+		/* we need to get both into work_mem */
+		while (batch_bytes + buckets_bytes > hash_table_bytes) {
+
+			/* increment number of batches, and re-evaluate the amount of memory */
+			nbatch <<= 1;
+
+			/* now, we need to see if the buckets fit into work_mem too */
+			lbuckets = 1 << (my_log2(ntuples / nbatch / NTUP_PER_BUCKET));
+
+			/* assuming equally-sized batches, compute bytes of hash table and buckets */
+			batch_bytes = inner_rel_bytes / nbatch;
+			buckets_bytes = lbuckets * sizeof(void*);
+
+		}
+
+		/* protect against nbucket overflow */
+		lbuckets = Min(lbuckets, max_pointers);
+		nbuckets = (int) lbuckets;
 	}
 	else
 	{
@@ -556,10 +625,12 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	int			oldnbatch = hashtable->nbatch;
 	int			curbatch = hashtable->curbatch;
 	int			nbatch;
-	int			i;
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
+	double		batchTuples;
+
+	HashChunk	oldchunks = hashtable->chunks_batch;
 
 	/* do nothing if we've decided to shut off growth */
 	if (!hashtable->growEnabled)
@@ -605,6 +676,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	MemoryContextSwitchTo(oldcxt);
 
 	hashtable->nbatch = nbatch;
+	hashtable->log2_nbatch += 1;
 
 	/*
 	 * Scan through the existing hash table entries and dump out any that are
@@ -612,51 +684,70 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 */
 	ninmemory = nfreed = 0;
 
-	for (i = 0; i < hashtable->nbuckets; i++)
-	{
-		HashJoinTuple prevtuple;
-		HashJoinTuple tuple;
+	/* we're going to scan through the chunks directly, not buckets, so we can reset the
+	 * buckets and reinsert all the tuples there - just set them to NULL */
+	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
 
-		prevtuple = NULL;
-		tuple = hashtable->buckets[i];
+	/* reset the chunks (we have a copy in oldchunks) - this way we'll allocate */
+	hashtable->chunks_batch = NULL;
 
-		while (tuple != NULL)
-		{
-			/* save link in case we delete */
-			HashJoinTuple nexttuple = tuple->next;
-			int			bucketno;
-			int			batchno;
+	/* so, let's scan through the old chunks, and all tuples in each chunk */
+	while (oldchunks != NULL) {
+
+		/* pointer to the next memory chunk (may be NULL for the last chunk) */
+		HashChunk nextchunk = oldchunks->next;
+
+		/* position within the buffer (up to oldchunks->used) */
+		size_t idx = 0;	
+
+		/* process all tuples stored in this chunk (and then free it) */
+		while (idx < oldchunks->used) {
+
+			/* get the hashjoin tuple and mintuple stored right after it */
+			HashJoinTuple hashTuple = (HashJoinTuple)(oldchunks->data + idx);
+			MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+
+			int		hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+			int		bucketno;
+			int		batchno;
 
 			ninmemory++;
-			ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
 									  &bucketno, &batchno);
-			Assert(bucketno == i);
+
 			if (batchno == curbatch)
 			{
-				/* keep tuple */
-				prevtuple = tuple;
+				/* keep tuple - this means we need to copy it into the new chunks */
+				HashJoinTuple copyTuple = (HashJoinTuple) chunk_alloc(hashtable, hashTupleSize);
+				memcpy(copyTuple, hashTuple, hashTupleSize);
+
+				/* and of course add it to the new buckets - just push the copy it onto the front
+				 * of the bucket's list */
+				copyTuple->next = hashtable->buckets[bucketno];
+				hashtable->buckets[bucketno] = copyTuple;
 			}
 			else
 			{
 				/* dump it out */
 				Assert(batchno > curbatch);
-				ExecHashJoinSaveTuple(HJTUPLE_MINTUPLE(tuple),
-									  tuple->hashvalue,
+				ExecHashJoinSaveTuple(HJTUPLE_MINTUPLE(hashTuple),
+									  hashTuple->hashvalue,
 									  &hashtable->innerBatchFile[batchno]);
-				/* and remove from hash table */
-				if (prevtuple)
-					prevtuple->next = nexttuple;
-				else
-					hashtable->buckets[i] = nexttuple;
-				/* prevtuple doesn't change */
-				hashtable->spaceUsed -=
-					HJTUPLE_OVERHEAD + HJTUPLE_MINTUPLE(tuple)->t_len;
-				pfree(tuple);
+
+				hashtable->spaceUsed -= hashTupleSize;
 				nfreed++;
 			}
 
-			tuple = nexttuple;
+			/* bytes occupied in memory HJ tuple overhead + actual tuple length */
+			idx += hashTupleSize;
+
 		}
+
+		/* so, we're done with this chunk - free it and proceed to the next one */
+		pfree(oldchunks);
+		oldchunks = nextchunk;
+
 	}
 
 #ifdef HJDEBUG
@@ -679,9 +770,123 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 		printf("Disabling further increase of nbatch\n");
 #endif
 	}
+
+	/*
+	 * After increasing the number of batches, the number of tuples per batch
+	 * is lower (~1/2), so the optimal number of buckets might be lower too.
+	 * It's probably (nbuckets_optimal/2) but let's compute it anyway
+	 * (let's start with nbuckets_original - we already have that memory
+	 * allocated, so let's use it).
+	 */
+	hashtable->nbuckets_optimal = hashtable->nbuckets_optimal;
+	batchTuples = (hashtable->totalTuples - hashtable->skewTuples) / hashtable->nbatch;
+	while ((batchTuples / hashtable->nbuckets_optimal > NTUP_PER_BUCKET)
+		&& (hashtable->nbuckets_optimal <= (INT_MAX/2))) {
+		hashtable->nbuckets_optimal *= 2;
+	}
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	HashChunk	chunk;
+
+	/* do nothing if not an increase (it's called increase for a reason) */
+	if (hashtable->nbuckets >= hashtable->nbuckets_optimal)
+		return;
+
+	/*
+	 * We already know the optimal number of buckets, so let's just
+	 * compute the log2_nbuckets for it.
+	 */
+	hashtable->nbuckets = hashtable->nbuckets_optimal;
+	hashtable->log2_nbuckets = my_log2(hashtable->nbuckets_optimal);
+
+	/* XXX Not sure if we should update the info about used space here.
+	 * The code seems to ignore the space used for 'buckets' and we're not
+	 * allocating more space for tuples (just shuffling them to the new
+	 * buckets). And the amount of memory used for buckets is quite small
+	 * (just an array of pointers, thus ~8kB per 1k buckets on 64-bit). */
+
+	/* XXX Should we disable growth if (nbuckets * NTUP_PER_BUCKET)
+	 * reaches work_mem (or something like that)? We shouldn't really
+	 * get into such position (should be handled by increasing the
+	 * number of batches, which is called right before this). */
+
+	/* XXX Maybe adding info into hashjoin explain output (e.g. initial
+	 * nbuckets, time spent growing the table) would be appropriate. */
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets <= (INT_MAX/2));
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/* Just reallocate the proper number of buckets - we don't need to
+	 * walk through them - we can walk the dense-allocated chunks
+	 * (just like in ExecHashIncreaseNumBatches, but without all the
+	 * copying into new chunks)
+	 */
+
+	/* just resize the existing bucket list to the right size */
+	hashtable->buckets = (HashJoinTuple *) repalloc(hashtable->buckets,
+								hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	/* XXX Maybe not needed? */
+	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
+
+	/* so, let's scan through the chunks, and all tuples in each chunk */
+	chunk = hashtable->chunks_batch;
+	while (chunk != NULL) {
+
+		/* position within the buffer (up to chunk->used) */
+		size_t idx = 0;	
+
+		/* process all tuples stored in this chunk (and then free it) */
+		while (idx < chunk->used) {
+
+			/* get the hashjoin tuple and mintuple stored right after it */
+			HashJoinTuple hashTuple = (HashJoinTuple)(chunk->data + idx);
+			MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+
+			int		hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+			int		bucketno;
+			int		batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* just add the tuple to the proper bucket */
+			hashTuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = hashTuple;
+
+			/* bytes occupied in memory HJ tuple overhead + actual tuple length */
+			idx += hashTupleSize;
+
+		}
+
+		/* proceed to the next chunk */
+		chunk = chunk->next;
+
+	}
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, batchTuples / hashtable->nbuckets);
+#endif
+	
+}
+
+
+/*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
  *		it may just go to a temp file for later batches
@@ -717,8 +922,8 @@ ExecHashTableInsert(HashJoinTable hashtable,
 
 		/* Create the HashJoinTuple */
 		hashTupleSize = HJTUPLE_OVERHEAD + tuple->t_len;
-		hashTuple = (HashJoinTuple) MemoryContextAlloc(hashtable->batchCxt,
-													   hashTupleSize);
+		hashTuple = (HashJoinTuple) chunk_alloc(hashtable, hashTupleSize);
+
 		hashTuple->hashvalue = hashvalue;
 		memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
 
@@ -734,12 +939,26 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashTuple->next = hashtable->buckets[bucketno];
 		hashtable->buckets[bucketno] = hashTuple;
 
+		/* Increase the number of buckets (optimal) if we exceeded NTUP_PER_BUCKET. */
+		if (((hashtable->totalTuples - hashtable->skewTuples) / hashtable->nbatch)
+			>= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET)) {
+
+			/* overflow protection */
+			if (hashtable->nbuckets_optimal <= (INT_MAX/2))
+				hashtable->nbuckets_optimal *= 2;
+
+		}
+
 		/* Account for space used, and back off if we've used too much */
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
-		if (hashtable->spaceUsed > hashtable->spaceAllowed)
+
+		/* Do we need to increase number of batches? Account for space required
+		 * by buckets (optimal number). */
+		if (hashtable->spaceUsed + BUCKETS_SPACE(hashtable) > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
+
 	}
 	else
 	{
@@ -856,14 +1075,12 @@ ExecHashGetHashValue(HashJoinTable hashtable,
  * chains), and must only cause the batch number to remain the same or
  * increase.  Our algorithm is
  *		bucketno = hashvalue MOD nbuckets
- *		batchno = (hashvalue DIV nbuckets) MOD nbatch
+ *		batchno =  highers log2(nbatch) bits
  * where nbuckets and nbatch are both expected to be powers of 2, so we can
  * do the computations by shifting and masking.  (This assumes that all hash
  * functions are good about randomizing all their output bits, else we are
  * likely to have very skewed bucket or batch occupancy.)
  *
- * nbuckets doesn't change over the course of the join.
- *
  * nbatch is always a power of 2; we increase it only by doubling it.  This
  * effectively adds one more bit to the top of the batchno.
  */
@@ -880,7 +1097,7 @@ ExecHashGetBucketAndBatch(HashJoinTable hashtable,
 	{
 		/* we can do MOD by masking, DIV by shifting */
 		*bucketno = hashvalue & (nbuckets - 1);
-		*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
+		*batchno = (hashvalue >> (32 - hashtable->log2_nbatch));
 	}
 	else
 	{
@@ -1068,6 +1285,11 @@ ExecHashTableReset(HashJoinTable hashtable)
 	hashtable->spaceUsed = 0;
 
 	MemoryContextSwitchTo(oldcxt);
+
+	/* reset the chunks too (the memory was allocated within batchCxt, so it's
+	 * already freed by resetting the batch context -just set it to NULL) */
+	hashtable->chunks_batch = NULL;
+
 }
 
 /*
@@ -1318,6 +1540,61 @@ ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue)
 	return INVALID_SKEW_BUCKET_NO;
 }
 
+static
+char * chunk_alloc(HashJoinTable hashtable, int size) {
+
+	/* maybe we should use MAXALIGN(size) here ... */
+
+	/* if tuple size is larger than of 1/8 of chunk size, allocate a separate chunk */
+	if (size > (HASH_CHUNK_SIZE/8)) {
+
+		/* allocate new chunk and put it at the beginning of the list */
+		HashChunk newChunk = (HashChunk)MemoryContextAlloc(hashtable->batchCxt, offsetof(HashChunkData, data) + size);
+
+		newChunk->maxlen = size;
+		newChunk->used = 0;
+		newChunk->ntuples = 0;
+
+		/* If there already is a chunk, add if after it so we can still use the space in it */
+		if (hashtable->chunks_batch != NULL) {
+			newChunk->next = hashtable->chunks_batch->next;
+			hashtable->chunks_batch->next = newChunk;
+		} else {
+			newChunk->next = hashtable->chunks_batch;
+			hashtable->chunks_batch = newChunk;
+		}
+
+		newChunk->used += size;
+		newChunk->ntuples += 1;
+
+		return newChunk->data;
+
+	}
+
+	/* ok, it's within chunk size, let's see if we have enough space for it in the current
+	 * chunk => if not, allocate a new chunk (chunks==NULL means this is the first chunk) */
+	if ((hashtable->chunks_batch == NULL) || (hashtable->chunks_batch->maxlen - hashtable->chunks_batch->used) < size) {
+
+		/* allocate new chunk and put it at the beginning of the list */
+		HashChunk newChunk = (HashChunk)MemoryContextAlloc(hashtable->batchCxt, offsetof(HashChunkData, data) + HASH_CHUNK_SIZE);
+
+		newChunk->maxlen = HASH_CHUNK_SIZE;
+		newChunk->used = 0;
+		newChunk->ntuples = 0;
+
+		newChunk->next = hashtable->chunks_batch;
+		hashtable->chunks_batch = newChunk;
+	}
+
+	/* OK, we have enough space in the chunk, let's add the tuple */
+	hashtable->chunks_batch->used += size;
+	hashtable->chunks_batch->ntuples += 1;
+
+	/* allocate pointer to the start of the tuple memory */
+	return hashtable->chunks_batch->data + (hashtable->chunks_batch->used - size);
+
+}
+
 /*
  * ExecHashSkewTableInsert
  *
@@ -1340,6 +1617,7 @@ ExecHashSkewTableInsert(HashJoinTable hashtable,
 	hashTupleSize = HJTUPLE_OVERHEAD + tuple->t_len;
 	hashTuple = (HashJoinTuple) MemoryContextAlloc(hashtable->batchCxt,
 												   hashTupleSize);
+
 	hashTuple->hashvalue = hashvalue;
 	memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
 	HeapTupleHeaderClearMatch(HJTUPLE_MINTUPLE(hashTuple));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6c52db8..9dae612 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -788,6 +788,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 	{
+		{"enable_hash_resize", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables dynamic increase of hash buckets."),
+			NULL
+		},
+		&enable_hash_resize,
+		true,
+		NULL, NULL, NULL
+	},
+	{
 		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
 			gettext_noop("Enables genetic query optimization."),
 			gettext_noop("This algorithm attempts to do planning without "
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 3beae40..9d03de2 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -102,12 +102,31 @@ typedef struct HashSkewBucket
 #define SKEW_WORK_MEM_PERCENT  2
 #define SKEW_MIN_OUTER_FRACTION  0.01
 
+#define HASH_CHUNK_SIZE			(16*1024L)	/* 16kB chunks by default */
+#define HASH_SKEW_CHUNK_SIZE	(4*1024L)	/* 4kB chunks for skew buckets */
+
+typedef struct HashChunkData
+{
+	int			ntuples;	/* number of tuples stored in this chunk */
+	size_t		maxlen;		/* length of the buffer */
+	size_t		used;		/* number of chunk bytes already used */
+
+	struct HashChunkData   *next;	/* pointer to the next chunk (linked list) */
+
+	char		data[1];	/* buffer allocated at the end */
+} HashChunkData;
+
+typedef struct HashChunkData* HashChunk;
+
 
 typedef struct HashJoinTableData
 {
 	int			nbuckets;		/* # buckets in the in-memory hash table */
+	int			nbuckets_original;	/* # buckets when starting the first hash */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
+	int			nbuckets_optimal;	/* optimal number of buckets (for batch) */
+
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
 	struct HashJoinTupleData **buckets;
 	/* buckets array is per-batch storage, as are all the tuples */
@@ -122,6 +141,7 @@ typedef struct HashJoinTableData
 
 	int			nbatch;			/* number of batches */
 	int			curbatch;		/* current batch #; 0 during 1st pass */
+	int			log2_nbatch;	/* its log2 (nbuckets must be a power of 2) */
 
 	int			nbatch_original;	/* nbatch when we started inner scan */
 	int			nbatch_outstart;	/* nbatch when we started outer scan */
@@ -129,6 +149,7 @@ typedef struct HashJoinTableData
 	bool		growEnabled;	/* flag to shut off nbatch increases */
 
 	double		totalTuples;	/* # tuples obtained from inner plan */
+	double		skewTuples;		/* # tuples inserted into skew tuples */
 
 	/*
 	 * These arrays are allocated for the life of the hash join, but only if
@@ -157,6 +178,10 @@ typedef struct HashJoinTableData
 
 	MemoryContext hashCxt;		/* context for whole-hash-join storage */
 	MemoryContext batchCxt;		/* context for this-batch-only storage */
+
+	/* used for dense allocation of tuples (into linked chunks) */
+	HashChunk	chunks_batch;	/*  one list for the whole batch */
+
 }	HashJoinTableData;
 
 #endif   /* HASHJOIN_H */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 75e2afb..e7c5f50 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -62,6 +62,7 @@ extern bool enable_material;
 extern bool enable_mergejoin;
 extern bool enable_hashjoin;
 extern int	constraint_exclusion;
+extern bool enable_hash_resize;
 
 extern double clamp_row_est(double nrows);
 extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,

hashjoin-nbuckets-growth-v9.patchtext/x-diff; name=hashjoin-nbuckets-growth-v9.patchDownload

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..ee3fffb 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6a9d956..014b6f1 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,10 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
+bool enable_hash_resize = true;
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -49,6 +51,9 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 static char * chunk_alloc(HashJoinTable hashtable, int tupleSize);
 
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(void*))
+
 /* ----------------------------------------------------------------
  *		ExecHash
  *
@@ -117,6 +122,7 @@ MultiExecHash(HashState *node)
 				/* It's a skew tuple, so put it into that hash table */
 				ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 										bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -127,6 +133,30 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* If average number of tuples per bucket is over the defined threshold,
+	 * increase the number of buckets to get below it. */
+	if (enable_hash_resize) {
+
+		/* 
+		 * Do we actually need to resize? Only scale up, don't scale down (for example
+		 * after adding a batch, it's possible that nbuckets > nbuckets_optimal).
+		 */
+		if (hashtable->nbuckets < hashtable->nbuckets_optimal) {
+
+#ifdef HJDEBUG
+			printf("Increasing nbuckets %d => %d\n",
+				   hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+			ExecHashIncreaseNumBuckets(hashtable);
+		}
+
+	}
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += BUCKETS_SPACE(hashtable);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+			hashtable->spacePeak = hashtable->spaceUsed;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -239,6 +269,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	int			nbatch;
 	int			num_skew_mcvs;
 	int			log2_nbuckets;
+	int			log2_nbatch;
 	int			nkeys;
 	int			i;
 	ListCell   *ho;
@@ -263,6 +294,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	log2_nbuckets = my_log2(nbuckets);
 	Assert(nbuckets == (1 << log2_nbuckets));
 
+	/* nbatch must be a power of 2 */
+	log2_nbatch = my_log2(nbatch);
+	Assert(nbatch == (1 << log2_nbatch));
+
 	/*
 	 * Initialize the hash table control block.
 	 *
@@ -271,6 +306,8 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
+	hashtable->nbuckets_optimal = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
@@ -281,10 +318,12 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->skewBucketNums = NULL;
 	hashtable->nbatch = nbatch;
 	hashtable->curbatch = 0;
+	hashtable->log2_nbatch = log2_nbatch;
 	hashtable->nbatch_original = nbatch;
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -377,7 +416,6 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	return hashtable;
 }
 
-
 /*
  * Compute appropriate size for hashtable given the estimated size of the
  * relation to be hashed (number of rows and average row width).
@@ -386,7 +424,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
  */
 
 /* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET			10
+#define NTUP_PER_BUCKET			1
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
@@ -396,6 +434,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 {
 	int			tupsize;
 	double		inner_rel_bytes;
+	double		buckets_bytes;
 	long		hash_table_bytes;
 	long		skew_table_bytes;
 	long		max_pointers;
@@ -417,6 +456,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		MAXALIGN(tupwidth);
 	inner_rel_bytes = ntuples * tupsize;
 
+	/* Estimate memory needed for buckets, with the target NTUP_PER_BUCKET */
+	buckets_bytes = sizeof(void*) * my_log2(ntuples / NTUP_PER_BUCKET);
+
 	/*
 	 * Target in-memory hashtable size is work_mem kilobytes.
 	 */
@@ -468,23 +510,48 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/* also ensure we avoid integer overflow in nbatch and nbuckets */
 	max_pointers = Min(max_pointers, INT_MAX / 2);
 
-	if (inner_rel_bytes > hash_table_bytes)
+	if ((inner_rel_bytes + buckets_bytes) > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
 		long		lbuckets;
 		double		dbatch;
 		int			minbatch;
-
-		lbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
-		lbuckets = Min(lbuckets, max_pointers);
-		nbuckets = (int) lbuckets;
+		double		batch_bytes;
 
 		dbatch = ceil(inner_rel_bytes / hash_table_bytes);
 		dbatch = Min(dbatch, max_pointers);
 		minbatch = (int) dbatch;
 		nbatch = 2;
+
+		/* increase nbatch until the hashtable fits into work_mem */
 		while (nbatch < minbatch)
 			nbatch <<= 1;
+
+		/* now, we need to see if the buckets fit into work_mem too */
+		lbuckets = 1 << (my_log2(ntuples / nbatch / NTUP_PER_BUCKET));
+
+		/* assuming equally-sized batches, compute bytes of hash table and buckets */
+		batch_bytes = inner_rel_bytes / nbatch;
+		buckets_bytes = lbuckets * sizeof(void*);
+
+		/* we need to get both into work_mem */
+		while (batch_bytes + buckets_bytes > hash_table_bytes) {
+
+			/* increment number of batches, and re-evaluate the amount of memory */
+			nbatch <<= 1;
+
+			/* now, we need to see if the buckets fit into work_mem too */
+			lbuckets = 1 << (my_log2(ntuples / nbatch / NTUP_PER_BUCKET));
+
+			/* assuming equally-sized batches, compute bytes of hash table and buckets */
+			batch_bytes = inner_rel_bytes / nbatch;
+			buckets_bytes = lbuckets * sizeof(void*);
+
+		}
+
+		/* protect against nbucket overflow */
+		lbuckets = Min(lbuckets, max_pointers);
+		nbuckets = (int) lbuckets;
 	}
 	else
 	{
@@ -561,6 +628,8 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
+	double		batchTuples;
+
 	HashChunk	oldchunks = hashtable->chunks_batch;
 
 	/* do nothing if we've decided to shut off growth */
@@ -607,6 +676,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	MemoryContextSwitchTo(oldcxt);
 
 	hashtable->nbatch = nbatch;
+	hashtable->log2_nbatch += 1;
 
 	/*
 	 * Scan through the existing hash table entries and dump out any that are
@@ -700,9 +770,123 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 		printf("Disabling further increase of nbatch\n");
 #endif
 	}
+
+	/*
+	 * After increasing the number of batches, the number of tuples per batch
+	 * is lower (~1/2), so the optimal number of buckets might be lower too.
+	 * It's probably (nbuckets_optimal/2) but let's compute it anyway
+	 * (let's start with nbuckets_original - we already have that memory
+	 * allocated, so let's use it).
+	 */
+	hashtable->nbuckets_optimal = hashtable->nbuckets_optimal;
+	batchTuples = (hashtable->totalTuples - hashtable->skewTuples) / hashtable->nbatch;
+	while ((batchTuples / hashtable->nbuckets_optimal > NTUP_PER_BUCKET)
+		&& (hashtable->nbuckets_optimal <= (INT_MAX/2))) {
+		hashtable->nbuckets_optimal *= 2;
+	}
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	HashChunk	chunk;
+
+	/* do nothing if not an increase (it's called increase for a reason) */
+	if (hashtable->nbuckets >= hashtable->nbuckets_optimal)
+		return;
+
+	/*
+	 * We already know the optimal number of buckets, so let's just
+	 * compute the log2_nbuckets for it.
+	 */
+	hashtable->nbuckets = hashtable->nbuckets_optimal;
+	hashtable->log2_nbuckets = my_log2(hashtable->nbuckets_optimal);
+
+	/* XXX Not sure if we should update the info about used space here.
+	 * The code seems to ignore the space used for 'buckets' and we're not
+	 * allocating more space for tuples (just shuffling them to the new
+	 * buckets). And the amount of memory used for buckets is quite small
+	 * (just an array of pointers, thus ~8kB per 1k buckets on 64-bit). */
+
+	/* XXX Should we disable growth if (nbuckets * NTUP_PER_BUCKET)
+	 * reaches work_mem (or something like that)? We shouldn't really
+	 * get into such position (should be handled by increasing the
+	 * number of batches, which is called right before this). */
+
+	/* XXX Maybe adding info into hashjoin explain output (e.g. initial
+	 * nbuckets, time spent growing the table) would be appropriate. */
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets <= (INT_MAX/2));
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/* Just reallocate the proper number of buckets - we don't need to
+	 * walk through them - we can walk the dense-allocated chunks
+	 * (just like in ExecHashIncreaseNumBatches, but without all the
+	 * copying into new chunks)
+	 */
+
+	/* just resize the existing bucket list to the right size */
+	hashtable->buckets = (HashJoinTuple *) repalloc(hashtable->buckets,
+								hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	/* XXX Maybe not needed? */
+	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
+
+	/* so, let's scan through the chunks, and all tuples in each chunk */
+	chunk = hashtable->chunks_batch;
+	while (chunk != NULL) {
+
+		/* position within the buffer (up to chunk->used) */
+		size_t idx = 0;	
+
+		/* process all tuples stored in this chunk (and then free it) */
+		while (idx < chunk->used) {
+
+			/* get the hashjoin tuple and mintuple stored right after it */
+			HashJoinTuple hashTuple = (HashJoinTuple)(chunk->data + idx);
+			MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+
+			int		hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+			int		bucketno;
+			int		batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* just add the tuple to the proper bucket */
+			hashTuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = hashTuple;
+
+			/* bytes occupied in memory HJ tuple overhead + actual tuple length */
+			idx += hashTupleSize;
+
+		}
+
+		/* proceed to the next chunk */
+		chunk = chunk->next;
+
+	}
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, batchTuples / hashtable->nbuckets);
+#endif
+	
+}
+
+
+/*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
  *		it may just go to a temp file for later batches
@@ -755,12 +939,26 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashTuple->next = hashtable->buckets[bucketno];
 		hashtable->buckets[bucketno] = hashTuple;
 
+		/* Increase the number of buckets (optimal) if we exceeded NTUP_PER_BUCKET. */
+		if (((hashtable->totalTuples - hashtable->skewTuples) / hashtable->nbatch)
+			>= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET)) {
+
+			/* overflow protection */
+			if (hashtable->nbuckets_optimal <= (INT_MAX/2))
+				hashtable->nbuckets_optimal *= 2;
+
+		}
+
 		/* Account for space used, and back off if we've used too much */
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
-		if (hashtable->spaceUsed > hashtable->spaceAllowed)
+
+		/* Do we need to increase number of batches? Account for space required
+		 * by buckets (optimal number). */
+		if (hashtable->spaceUsed + BUCKETS_SPACE(hashtable) > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
+
 	}
 	else
 	{
@@ -877,14 +1075,12 @@ ExecHashGetHashValue(HashJoinTable hashtable,
  * chains), and must only cause the batch number to remain the same or
  * increase.  Our algorithm is
  *		bucketno = hashvalue MOD nbuckets
- *		batchno = (hashvalue DIV nbuckets) MOD nbatch
+ *		batchno =  highers log2(nbatch) bits
  * where nbuckets and nbatch are both expected to be powers of 2, so we can
  * do the computations by shifting and masking.  (This assumes that all hash
  * functions are good about randomizing all their output bits, else we are
  * likely to have very skewed bucket or batch occupancy.)
  *
- * nbuckets doesn't change over the course of the join.
- *
  * nbatch is always a power of 2; we increase it only by doubling it.  This
  * effectively adds one more bit to the top of the batchno.
  */
@@ -901,7 +1097,7 @@ ExecHashGetBucketAndBatch(HashJoinTable hashtable,
 	{
 		/* we can do MOD by masking, DIV by shifting */
 		*bucketno = hashvalue & (nbuckets - 1);
-		*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
+		*batchno = (hashvalue >> (32 - hashtable->log2_nbatch));
 	}
 	else
 	{
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6c52db8..9dae612 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -788,6 +788,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 	{
+		{"enable_hash_resize", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables dynamic increase of hash buckets."),
+			NULL
+		},
+		&enable_hash_resize,
+		true,
+		NULL, NULL, NULL
+	},
+	{
 		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
 			gettext_noop("Enables genetic query optimization."),
 			gettext_noop("This algorithm attempts to do planning without "
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index c80a4d1..9d03de2 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -122,8 +122,11 @@ typedef struct HashChunkData* HashChunk;
 typedef struct HashJoinTableData
 {
 	int			nbuckets;		/* # buckets in the in-memory hash table */
+	int			nbuckets_original;	/* # buckets when starting the first hash */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
+	int			nbuckets_optimal;	/* optimal number of buckets (for batch) */
+
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
 	struct HashJoinTupleData **buckets;
 	/* buckets array is per-batch storage, as are all the tuples */
@@ -138,6 +141,7 @@ typedef struct HashJoinTableData
 
 	int			nbatch;			/* number of batches */
 	int			curbatch;		/* current batch #; 0 during 1st pass */
+	int			log2_nbatch;	/* its log2 (nbuckets must be a power of 2) */
 
 	int			nbatch_original;	/* nbatch when we started inner scan */
 	int			nbatch_outstart;	/* nbatch when we started outer scan */
@@ -145,6 +149,7 @@ typedef struct HashJoinTableData
 	bool		growEnabled;	/* flag to shut off nbatch increases */
 
 	double		totalTuples;	/* # tuples obtained from inner plan */
+	double		skewTuples;		/* # tuples inserted into skew tuples */
 
 	/*
 	 * These arrays are allocated for the life of the hash join, but only if
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 75e2afb..e7c5f50 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -62,6 +62,7 @@ extern bool enable_material;
 extern bool enable_mergejoin;
 extern bool enable_hashjoin;
 extern int	constraint_exclusion;
+extern bool enable_hash_resize;
 
 extern double clamp_row_est(double nrows);
 extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,

#13

tv@fuzzy.cz

over 11 years ago

In reply to: Tomas Vondra (#12)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 20.7.2014 18:29, Tomas Vondra wrote:

Attached v9 of the patch. Aside from a few minor fixes, the main change
is that this is assumed to be combined with the "dense allocation" patch.

It also rewrites the ExecHashIncreaseNumBuckets to follow the same
pattern as ExecHashIncreaseNumBatches (i.e. scanning chunks directly,
instead of buckets). It's cleaner and more consistent.

hashjoin-nbuckets-growth-v9.patch contains only this patch, so you need
to grab the hashjoin-alloc-v4.patch from a different thread and apply it
first)

hashjoin-nbuckets-growth-v9-combined.patch contains both patches combined

I just noticed that the changes made to ExecHashGetBucketAndBatch may
not be perfectly fine - it does not "break" the hash join (i.e. the
results are still correct), but I believe it may cause more batches. But
I'm not entirely sure about it, or how to fix that.

First, an intro into ExecHashGetBucketAndBatch internals, and how the
patch changes that. If not interested, skip to the "problem" section below.

The "old" ExecHashGetBucketAndBatch did this:

*bucketno = hashvalue & (nbuckets - 1);
*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);

i.e. given the 32-bit hash value, the lowest log2_nbuckets bits are used
to determine bucket number. The rest of the hash (after removing the
bits used for buckets) is used for batch. I.e. something like this

31 23 15 7 0
|----------------|----------------|----------------|----------------|
| | <- batches | buckets |

This worked fine, because the number of bits for buckets was fixed, and
only the number of batches was growing. This was done by adding
most-significant bits (as illustrated by the tiny arrow. So when there
were 4 bits for batch number, after adding another bit (doubling
nbatches) batch '0000' would be split either into '00000' or '10000'.

With dynamic bucket count this does not work, because the batches and
buckets need to be independend (i.e. derived from non-overlapping parts
of the hash). The simplest approach of 'moving batches around' does not
work, because that would break this assert:

Assert(batchno > curbatch);

In other words "don't move tuples to already-processed batches". So the
batch number for a tuple needs to be sufficiently stable, and only ever
increase (never decrease).

So what I did is this:

31 23 15 7 0
|----------------|----------------|----------------|----------------|
| batches -> | | <- buckets |

and this is what happens in ExecHashGetBucketAndBatch:

*bucketno = hashvalue & (nbuckets - 1);
*batchno = (hashvalue >> (32 - hashtable->log2_nbatch));

So the "bucketno" expression is exactly the same (but now the nbuckets
may change dynamically), but "batchno" is now computed from the other
end of the hash (highest bits), and grows by adding a least-significant bit.

Problem
-------

The original implementation had the nice feature, that each increase of
number of batches (nbatches *= 2) split each batch in half. Half the
tuples stayed in the batch, half the tuples moved to one of the newly
created batches, thanks to adding a most-significant bit. The tuples
getting 0 bit stayed in the same batch, tuples getting 1 moved to the
new one (and it was guaranteed to be one of the new ones).

It's slightly more complicated because of the lazy behavior when
repeatedly incrementing the number of batches, but this is the
principle. Keep 1/2 the tuples, move 1/2 to another bucket.

The new implementation changes this, because the batch number grows in
the opposite direction - adds a lest-significant bit, so for example
batch '1111' gets split into '11111' and '11110'.

It's rougly equal to

(batchno << 1) + K

where K is either 0 or 1, which is always >= than the old batchno. So it
does not violate the Assert (which is why I haven't noticed this earlier).

But it breaks the desirable behavior that 1/2 the tuples remain in the
current batch, and instead moves a lot of them to batches further down
the road, and needlessly increases the number of batches.

At least that's how I understand it ...

Possible solutions
------------------

Adding least-significant bit does not work, we need get back to adding
the most-significant one. Not sure what's the least complex way to do
that, though.

I'm thinking about computing the nbuckets limit (how many buckets may
fit into work_mem):

tupsize = HJTUPLE_OVERHEAD +
MAXALIGN(sizeof(MinimalTupleData)) +
MAXALIGN(tupwidth);

nbuckets_bits_max = my_log2(work_mem / tupsize)

and then start with nbatches from this bit, like this:

31 23 max 15 7 0
|----------------|----------------|----------------|----------------|
| | <- batches | free | <- buckets |

That might work, I guess. But maybe there's something better (using some
secret bitwise-operator-fu ...).

Further concerns
----------------

I'm wondering whether we should deploy some safe-guards against
exceeding the 32 bits in the hash. It probably was not a big issue
before, but with large work_mem values and NTUP_PER_BUCKET=1 this might
become a problem.

With work_mem=1GB, and 50B tuples (including overhead):

buckets = 21474836

which means ~25 bits reserved for nbucketno. That leaves only 7 bits for
batch number. Yeah, that's ~128 batches (so ~128GB hash table), but
maybe it's not that far.

Also, what if the tupwidth estimate is too low. In that case the
nbuckets_bits_max would be artificially high (and we'd have to do more
batches).

regards
Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14

robertmhaas@gmail.com

over 11 years ago

In reply to: Tomas Vondra (#13)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On Sat, Aug 9, 2014 at 9:13 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

Adding least-significant bit does not work, we need get back to adding
the most-significant one. Not sure what's the least complex way to do
that, though.

I'm thinking about computing the nbuckets limit (how many buckets may
fit into work_mem):

tupsize = HJTUPLE_OVERHEAD +
MAXALIGN(sizeof(MinimalTupleData)) +
MAXALIGN(tupwidth);

nbuckets_bits_max = my_log2(work_mem / tupsize)

and then start with nbatches from this bit, like this:

31 23 max 15 7 0
|----------------|----------------|----------------|----------------|
| | <- batches | free | <- buckets |

That doesn't seem unreasonable provide there's enough bit space, but,
as you point out, the day might not be far off when there isn't.

It also strikes me that when there's only 1 batch, the set of bits
that map onto the batch number is zero-width, and one zero-width bit
range is as good as another. In other words, if you're only planning
to do one batch, you can easily grow the number of buckets on the fly.
Growing the number of buckets only becomes difficult once you have
more than one batch.

And I mention that because, in the scenario mentioned in your original
post ("a hash table with small number of buckets (8192) containing
large number of tuples [~3.3M]"), presumably what happens is you start
out thinking you are going to have one batch with 8K buckets. Then,
as more tuples continue to pour in, you see the load factor rising and
responding by bumping up the size of the hash table. Now eventually
you realize, gosh, this isn't even going to fit into work_mem, so you
need to start batching it. But by that point you've presumably done
as much as you're going to do in terms of increasing the number of
buckets; after that, you're just going to add more batches as
necessary. So not being able to further increase the number of
buckets once the number of batches is no longer 1 wouldn't be a
problem in that case.

Now if you start out planning for multiple batches, and then you
discover you'd like to keep the same number of batches but have more
buckets per batch, that's gonna be hard. It's not impossible, because
you could steal some of the unused high-order bits above the number of
batches, and then concatenate them with the low-order bits that you
already had, but that seems like kind of an ugly kludge, and AFAICS it
only helps if the new tuples are narrower by enough to justify the
cost of splitting all the buckets. I won't say that couldn't happen,
but I think it would be better to keep that complexity out of the
patch unless we're absolutely certain it's justified.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15

tv@fuzzy.cz

over 11 years ago

In reply to: Robert Haas (#14)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 11.8.2014 20:25, Robert Haas wrote:

On Sat, Aug 9, 2014 at 9:13 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

Adding least-significant bit does not work, we need get back to adding
the most-significant one. Not sure what's the least complex way to do
that, though.

I'm thinking about computing the nbuckets limit (how many buckets may
fit into work_mem):

tupsize = HJTUPLE_OVERHEAD +
MAXALIGN(sizeof(MinimalTupleData)) +
MAXALIGN(tupwidth);

nbuckets_bits_max = my_log2(work_mem / tupsize)

and then start with nbatches from this bit, like this:

31 23 max 15 7 0
|----------------|----------------|----------------|----------------|
| | <- batches | free | <- buckets |

That doesn't seem unreasonable provide there's enough bit space, but,
as you point out, the day might not be far off when there isn't.

Thinking about this a bit more, I believe the danger is not that
imminent. 32 bits mean the Hash node should handle 2^32 tuples in total
(possibly split into multiple batches).

It used to be "2^32 buckets" thanks to NTUP_PER_BUCKET=10, but the patch
lowers this to 1 so it's "tuples" now.

But while we're we're a bit closer to exhausting the 32 bits, I think
it's not really that big issue. Not only hashing a table with ~4e9 rows
is going to be a hellish experience, but if we really want to support
it, we should probably switch to 64-bit hashes.

Because adding some checks is not going to help - it may detect an
issue, but it won't fix it. And actually, even if the two values get
overlap, it does not mean the hashjoin will stop working. There'll be
dependency between batches and buckets, causing non-uniform usage of the
buckets, but it won't blow up.

So I don't think we need to worry about exhausting the 32 bits for now.

It also strikes me that when there's only 1 batch, the set of bits
that map onto the batch number is zero-width, and one zero-width bit
range is as good as another. In other words, if you're only planning
to do one batch, you can easily grow the number of buckets on the fly.
Growing the number of buckets only becomes difficult once you have
more than one batch.

Yes, that's true. It's also roughly what the patch does IMHO.

If you know you'll need more batches, you can start with the maximum
number of buckets right away (because you expect there's more data than
work_mem, so shoot for

nbuckets = mylog2(work_mem/tuple_size)

or something like that. It's also true that if you're starting with a
single batch, you can do whatever you want with nbuckets until you need
to do (nbatches *= 2).

It's also true that once you're done with the first batch, all the other
batches will be just fine with the number of buckets, because the
batches be about equally sized.

But I think this is pretty much what the patch does, in the end.

And I mention that because, in the scenario mentioned in your original
post ("a hash table with small number of buckets (8192) containing
large number of tuples [~3.3M]"), presumably what happens is you start
out thinking you are going to have one batch with 8K buckets. Then,
as more tuples continue to pour in, you see the load factor rising and
responding by bumping up the size of the hash table. Now eventually
you realize, gosh, this isn't even going to fit into work_mem, so you
need to start batching it. But by that point you've presumably done
as much as you're going to do in terms of increasing the number of
buckets; after that, you're just going to add more batches as
necessary. So not being able to further increase the number of
buckets once the number of batches is no longer 1 wouldn't be a
problem in that case.

Now if you start out planning for multiple batches, and then you
discover you'd like to keep the same number of batches but have more
buckets per batch, that's gonna be hard. It's not impossible, because
you could steal some of the unused high-order bits above the number of
batches, and then concatenate them with the low-order bits that you
already had, but that seems like kind of an ugly kludge, and AFAICS it
only helps if the new tuples are narrower by enough to justify the
cost of splitting all the buckets. I won't say that couldn't happen,
but I think it would be better to keep that complexity out of the
patch unless we're absolutely certain it's justified.

I'm definitely in favor of keeping the patch as simple as possible.

I was considering using reversing the bits of the hash, because that's
pretty much the simplest solution. But I think you're right it might
actually work like this:

* Are more batches needed?

(yes) => just use nbuckets = my_log2(work_mem / tuple_size)

(no) => go ahead, until processing all tuples or hitting work_mem

(work_mem) => meh, use the same nbuckets above

(all tuples) => compute optimal nbuckets / resize

But I need to think about this a bit. So far it seems to me there's no
way additional batches might benefit from increasing nbuckets further.

Each batch should contain about the same number of tuples, or more
correctly, the same number of distinct hash values. But if there are
multiple tuples with the same hash value, increasing the number of
buckets is not going to help (it's still going to end up in the same
bucket).

And the number of tuples in a batch may only get lower, while adding
batches (e.g. if a batch somewhere in the middle gets a lot of tuples
with the same hash value, exceeding work_mem).

I need to think about this a bit more, but so far it seems like a good
clean solution to me.

regards
Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16

tv@fuzzy.cz

over 11 years ago

In reply to: Tomas Vondra (#15)

2 attachment(s)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 12.8.2014 00:30, Tomas Vondra wrote:

On 11.8.2014 20:25, Robert Haas wrote:

It also strikes me that when there's only 1 batch, the set of bits
that map onto the batch number is zero-width, and one zero-width bit
range is as good as another. In other words, if you're only planning
to do one batch, you can easily grow the number of buckets on the fly.
Growing the number of buckets only becomes difficult once you have
more than one batch.

...

I was considering using reversing the bits of the hash, because that's
pretty much the simplest solution. But I think you're right it might
actually work like this:

* Are more batches needed?

(yes) => just use nbuckets = my_log2(work_mem / tuple_size)

(no) => go ahead, until processing all tuples or hitting work_mem

(work_mem) => meh, use the same nbuckets above

(all tuples) => compute optimal nbuckets / resize

But I need to think about this a bit. So far it seems to me there's no
way additional batches might benefit from increasing nbuckets further.

I think this is a simple and solid solution, solving the batchno
computation issues quite nicely. Attached is v10 patch (bare and
combined with the dense allocation), that does this:

1) when we know we'll need batching, buckets are sized for full work_mem
(using the estimated tuple width, etc.)

2) without the batching, we estimate the 'right number of buckets' for
the estimated number of tuples, and keep track of the optimal number
as tuples are added to the hash table

- if we discover we need to start batching, we keep the current
optimal value (which should be the same as the max number of
buckets) and don't mess with it anymore (making it possible to
compute batch IDs just like before)

- also, on the fist rebatch (nbatch=1 => nbatch=2) the hash table
is resized as part of the rebatch

- if the hash build completes without batching, we do the resize

I believe the patch is pretty much perfect. I plan to do more thorough
testing on a wide range of queries in the next few days.

I also removed the 'enable_hash_resize' GUC, because it would be more
complex to implement this properly after doing the resize as part of
rebatch etc.. So either it would make the patch more complex, or it
wouldn't do what the name promises.

regards
Tomas

Attachments:

hashjoin-nbuckets-growth-v10-combined.patchtext/x-diff; name=hashjoin-nbuckets-growth-v10-combined.patchDownload

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..ee3fffb 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..a5d4812 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,8 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
-
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -47,6 +47,10 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 						int bucketNumber);
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
+static char * chunk_alloc(HashJoinTable hashtable, int tupleSize);
+
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(HashJoinTuple))
 
 /* ----------------------------------------------------------------
  *		ExecHash
@@ -116,6 +120,7 @@ MultiExecHash(HashState *node)
 				/* It's a skew tuple, so put it into that hash table */
 				ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 										bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -126,10 +131,30 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
+	if (hashtable->nbuckets != hashtable->nbuckets_optimal) {
+
+		/* We never decrease the number of buckets. */
+		Assert(hashtable->nbuckets_optimal > hashtable->nbuckets);
+
+#ifdef HJDEBUG
+		printf("Increasing nbuckets %d => %d\n",
+			   hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+
+		ExecHashIncreaseNumBuckets(hashtable);
+
+	}
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += BUCKETS_SPACE(hashtable);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+			hashtable->spacePeak = hashtable->spaceUsed;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
-
+	
 	/*
 	 * We do not return the hash table directly because it's not a subtype of
 	 * Node, and so would violate the MultiExecProcNode API.  Instead, our
@@ -223,7 +248,6 @@ ExecEndHash(HashState *node)
 	ExecEndNode(outerPlan);
 }
 
-
 /* ----------------------------------------------------------------
  *		ExecHashTableCreate
  *
@@ -271,7 +295,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
+	hashtable->nbuckets_optimal = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
+	hashtable->log2_nbuckets_optimal = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
 	hashtable->skewEnabled = false;
@@ -285,6 +312,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -294,6 +322,8 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->spaceAllowedSkew =
 		hashtable->spaceAllowed * SKEW_WORK_MEM_PERCENT / 100;
 
+	hashtable->chunks_batch = NULL;
+
 	/*
 	 * Get info about the hash functions to be used for each hash key. Also
 	 * remember whether the join operators are strict.
@@ -375,7 +405,6 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	return hashtable;
 }
 
-
 /*
  * Compute appropriate size for hashtable given the estimated size of the
  * relation to be hashed (number of rows and average row width).
@@ -384,7 +413,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
  */
 
 /* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET			10
+#define NTUP_PER_BUCKET			1
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
@@ -394,6 +423,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 {
 	int			tupsize;
 	double		inner_rel_bytes;
+	double		buckets_bytes;
 	long		hash_table_bytes;
 	long		skew_table_bytes;
 	long		max_pointers;
@@ -416,6 +446,13 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	inner_rel_bytes = ntuples * tupsize;
 
 	/*
+	 * Estimate memory needed for buckets, assuming all the tuples fit into
+	 * a single batch (consider NTUP_PER_BUCKET tuples per bucket) - buckets
+	 * are just usual 'HashJoinTuple' (pointers to HashJoinTupleData).
+	 */
+	buckets_bytes = sizeof(HashJoinTuple) * my_log2(ntuples / NTUP_PER_BUCKET);
+
+	/*
 	 * Target in-memory hashtable size is work_mem kilobytes.
 	 */
 	hash_table_bytes = work_mem * 1024L;
@@ -466,23 +503,56 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/* also ensure we avoid integer overflow in nbatch and nbuckets */
 	max_pointers = Min(max_pointers, INT_MAX / 2);
 
-	if (inner_rel_bytes > hash_table_bytes)
+	if ((inner_rel_bytes + buckets_bytes) > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
 		long		lbuckets;
 		double		dbatch;
 		int			minbatch;
-
-		lbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
-		lbuckets = Min(lbuckets, max_pointers);
-		nbuckets = (int) lbuckets;
+		double		batch_bytes;
 
 		dbatch = ceil(inner_rel_bytes / hash_table_bytes);
 		dbatch = Min(dbatch, max_pointers);
 		minbatch = (int) dbatch;
 		nbatch = 2;
+
+		/* increase nbatch until the hashtable fits into work_mem */
 		while (nbatch < minbatch)
 			nbatch <<= 1;
+
+		/* assuming equally-sized batches, compute bytes of hash table */
+		batch_bytes = inner_rel_bytes / nbatch;
+
+		/* when batching, for the buckets, assume full work_mem */
+		lbuckets = 1 << my_log2(hash_table_bytes / (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple)));
+		buckets_bytes = lbuckets * sizeof(HashJoinTuple);
+
+		/*
+		 * XXX the while loop should not really repeat more than once, or maybe twice,
+		 * becase at this point (batch_bytes < work_mem) and it only may not fit
+		 * if (batch_bytes + buckets_bytes > work_mem). But the tuples are generally
+		 * much larger than buckets (which are just pointers).
+		 *
+		 * Adding another batch reduces the _expected_ number of tuples in the batch,
+		 * however the buckets are sized for full work_mem utilization (e.g. because
+		 * of underestimate, which is a common problem). So the while loop only tweak
+		 * batch_bytes, not the number of buckets (or buckets_bytes).
+		 */
+
+		/* we need to get both tuples and buckets into work_mem */
+		while (batch_bytes + buckets_bytes > hash_table_bytes) {
+
+			/* increment number of batches, and re-evaluate the amount of memory */
+			nbatch <<= 1;
+
+			/* assuming equally-sized batches, compute bytes of hash table and buckets */
+			batch_bytes = inner_rel_bytes / nbatch;
+
+		}
+
+		/* protect against nbucket overflow */
+		lbuckets = Min(lbuckets, max_pointers);
+		nbuckets = (int) lbuckets;
 	}
 	else
 	{
@@ -556,11 +626,12 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	int			oldnbatch = hashtable->nbatch;
 	int			curbatch = hashtable->curbatch;
 	int			nbatch;
-	int			i;
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
 
+	HashChunk	oldchunks = hashtable->chunks_batch;
+
 	/* do nothing if we've decided to shut off growth */
 	if (!hashtable->growEnabled)
 		return;
@@ -612,51 +683,81 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 */
 	ninmemory = nfreed = 0;
 
-	for (i = 0; i < hashtable->nbuckets; i++)
-	{
-		HashJoinTuple prevtuple;
-		HashJoinTuple tuple;
+	/* we're going to scan through the chunks directly, not buckets, so we can reset the
+	 * buckets and reinsert all the tuples there - just set them to NULL */
+	memset(hashtable->buckets, 0, sizeof(HashJoinTuple) * hashtable->nbuckets);
 
-		prevtuple = NULL;
-		tuple = hashtable->buckets[i];
+	/* if we need to do a resize of buckets, we can do it right during the rebatch */
+	if (hashtable->nbuckets_optimal != hashtable->nbuckets) {
 
-		while (tuple != NULL)
-		{
-			/* save link in case we delete */
-			HashJoinTuple nexttuple = tuple->next;
-			int			bucketno;
-			int			batchno;
+		hashtable->nbuckets = hashtable->nbuckets_optimal;
+		hashtable->log2_nbuckets = hashtable->log2_nbuckets_optimal;
+
+		hashtable->buckets = repalloc(hashtable->buckets,
+									  sizeof(HashJoinTuple) * hashtable->nbuckets);
+
+	}
+
+	/* reset the chunks (we have a copy in oldchunks) - this way we'll allocate */
+	hashtable->chunks_batch = NULL;
+
+	/* so, let's scan through the old chunks, and all tuples in each chunk */
+	while (oldchunks != NULL) {
+
+		/* pointer to the next memory chunk (may be NULL for the last chunk) */
+		HashChunk nextchunk = oldchunks->next;
+
+		/* position within the buffer (up to oldchunks->used) */
+		size_t idx = 0;
+
+		/* process all tuples stored in this chunk (and then free it) */
+		while (idx < oldchunks->used) {
+
+			/* get the hashjoin tuple and mintuple stored right after it */
+			HashJoinTuple hashTuple = (HashJoinTuple)(oldchunks->data + idx);
+			MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+
+			int		hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+			int		bucketno;
+			int		batchno;
 
 			ninmemory++;
-			ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
 									  &bucketno, &batchno);
-			Assert(bucketno == i);
+
 			if (batchno == curbatch)
 			{
-				/* keep tuple */
-				prevtuple = tuple;
+				/* keep tuple - this means we need to copy it into the new chunks */
+				HashJoinTuple copyTuple = (HashJoinTuple) chunk_alloc(hashtable, hashTupleSize);
+				memcpy(copyTuple, hashTuple, hashTupleSize);
+
+				/* and of course add it to the new buckets - just push the copy it onto the front
+				 * of the bucket's list */
+				copyTuple->next = hashtable->buckets[bucketno];
+				hashtable->buckets[bucketno] = copyTuple;
 			}
 			else
 			{
 				/* dump it out */
 				Assert(batchno > curbatch);
-				ExecHashJoinSaveTuple(HJTUPLE_MINTUPLE(tuple),
-									  tuple->hashvalue,
+				ExecHashJoinSaveTuple(HJTUPLE_MINTUPLE(hashTuple),
+									  hashTuple->hashvalue,
 									  &hashtable->innerBatchFile[batchno]);
-				/* and remove from hash table */
-				if (prevtuple)
-					prevtuple->next = nexttuple;
-				else
-					hashtable->buckets[i] = nexttuple;
-				/* prevtuple doesn't change */
-				hashtable->spaceUsed -=
-					HJTUPLE_OVERHEAD + HJTUPLE_MINTUPLE(tuple)->t_len;
-				pfree(tuple);
+
+				hashtable->spaceUsed -= hashTupleSize;
 				nfreed++;
 			}
 
-			tuple = nexttuple;
+			/* bytes occupied in memory HJ tuple overhead + actual tuple length */
+			idx += hashTupleSize;
+
 		}
+
+		/* so, we're done with this chunk - free it and proceed to the next one */
+		pfree(oldchunks);
+		oldchunks = nextchunk;
+
 	}
 
 #ifdef HJDEBUG
@@ -682,6 +783,106 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	HashChunk	chunk;
+
+	/* do nothing if not an increase (it's called increase for a reason) */
+	if (hashtable->nbuckets >= hashtable->nbuckets_optimal)
+		return;
+
+	/*
+	 * We already know the optimal number of buckets, so let's just
+	 * compute the log2_nbuckets for it.
+	 */
+	hashtable->nbuckets = hashtable->nbuckets_optimal;
+	hashtable->log2_nbuckets = my_log2(hashtable->nbuckets_optimal);
+
+	/* XXX Not sure if we should update the info about used space here.
+	 * The code seems to ignore the space used for 'buckets' and we're not
+	 * allocating more space for tuples (just shuffling them to the new
+	 * buckets). And the amount of memory used for buckets is quite small
+	 * (just an array of pointers, thus ~8kB per 1k buckets on 64-bit). */
+
+	/* XXX Should we disable growth if (nbuckets * NTUP_PER_BUCKET)
+	 * reaches work_mem (or something like that)? We shouldn't really
+	 * get into such position (should be handled by increasing the
+	 * number of batches, which is called right before this). */
+
+	/* XXX Maybe adding info into hashjoin explain output (e.g. initial
+	 * nbuckets, time spent growing the table) would be appropriate. */
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets <= (INT_MAX/2));
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/* Just reallocate the proper number of buckets - we don't need to
+	 * walk through them - we can walk the dense-allocated chunks
+	 * (just like in ExecHashIncreaseNumBatches, but without all the
+	 * copying into new chunks)
+	 */
+
+	/* just resize the existing bucket list to the right size */
+	hashtable->buckets = (HashJoinTuple *) repalloc(hashtable->buckets,
+								hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	/* XXX Maybe not needed? */
+	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
+
+	/* so, let's scan through the chunks, and all tuples in each chunk */
+	chunk = hashtable->chunks_batch;
+	while (chunk != NULL) {
+
+		/* position within the buffer (up to chunk->used) */
+		size_t idx = 0;
+
+		/* process all tuples stored in this chunk (and then free it) */
+		while (idx < chunk->used) {
+
+			/* get the hashjoin tuple and mintuple stored right after it */
+			HashJoinTuple hashTuple = (HashJoinTuple)(chunk->data + idx);
+			MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+
+			int		hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+			int		bucketno;
+			int		batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* just add the tuple to the proper bucket */
+			hashTuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = hashTuple;
+
+			/* bytes occupied in memory HJ tuple overhead + actual tuple length */
+			idx += hashTupleSize;
+
+		}
+
+		/* proceed to the next chunk */
+		chunk = chunk->next;
+
+	}
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, batchTuples / hashtable->nbuckets);
+#endif
+
+}
+
+
+/*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
  *		it may just go to a temp file for later batches
@@ -717,8 +918,8 @@ ExecHashTableInsert(HashJoinTable hashtable,
 
 		/* Create the HashJoinTuple */
 		hashTupleSize = HJTUPLE_OVERHEAD + tuple->t_len;
-		hashTuple = (HashJoinTuple) MemoryContextAlloc(hashtable->batchCxt,
-													   hashTupleSize);
+		hashTuple = (HashJoinTuple) chunk_alloc(hashtable, hashTupleSize);
+
 		hashTuple->hashvalue = hashvalue;
 		memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
 
@@ -734,12 +935,30 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashTuple->next = hashtable->buckets[bucketno];
 		hashtable->buckets[bucketno] = hashTuple;
 
+		/* Increase the (optimal) number of buckets if we just exceeded NTUP_PER_BUCKET,
+		 * but only when there's still a single batch at this moment. */
+		if ((hashtable->nbatch == 1) &&
+			((hashtable->totalTuples - hashtable->skewTuples)
+				>= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET))) {
+
+			/* overflow protection */
+			if (hashtable->nbuckets_optimal <= (INT_MAX/2)) {
+				hashtable->nbuckets_optimal *= 2;
+				hashtable->log2_nbuckets_optimal += 1;
+			}
+		}
+
 		/* Account for space used, and back off if we've used too much */
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
-		if (hashtable->spaceUsed > hashtable->spaceAllowed)
+
+		/* Do we need to increase number of batches? Account for space required
+		 * by buckets (optimal number). */
+		if (hashtable->spaceUsed + BUCKETS_SPACE(hashtable) > hashtable->spaceAllowed) {
 			ExecHashIncreaseNumBatches(hashtable);
+		}
+
 	}
 	else
 	{
@@ -862,7 +1081,10 @@ ExecHashGetHashValue(HashJoinTable hashtable,
  * functions are good about randomizing all their output bits, else we are
  * likely to have very skewed bucket or batch occupancy.)
  *
- * nbuckets doesn't change over the course of the join.
+ * The nbuckets/log2_nbuckets may change until (nbatch>1) because of dynamic
+ * buckets growth. Once we start batching, the value is fixed and does not
+ * change over the course of join (making it possible to compute batchno
+ * the way we do here).
  *
  * nbatch is always a power of 2; we increase it only by doubling it.  This
  * effectively adds one more bit to the top of the batchno.
@@ -1068,6 +1290,11 @@ ExecHashTableReset(HashJoinTable hashtable)
 	hashtable->spaceUsed = 0;
 
 	MemoryContextSwitchTo(oldcxt);
+
+	/* reset the chunks too (the memory was allocated within batchCxt, so it's
+	 * already freed by resetting the batch context -just set it to NULL) */
+	hashtable->chunks_batch = NULL;
+
 }
 
 /*
@@ -1318,6 +1545,61 @@ ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue)
 	return INVALID_SKEW_BUCKET_NO;
 }
 
+static
+char * chunk_alloc(HashJoinTable hashtable, int size) {
+
+	/* maybe we should use MAXALIGN(size) here ... */
+
+	/* if tuple size is larger than of 1/8 of chunk size, allocate a separate chunk */
+	if (size > (HASH_CHUNK_SIZE/8)) {
+
+		/* allocate new chunk and put it at the beginning of the list */
+		HashChunk newChunk = (HashChunk)MemoryContextAlloc(hashtable->batchCxt, offsetof(HashChunkData, data) + size);
+
+		newChunk->maxlen = size;
+		newChunk->used = 0;
+		newChunk->ntuples = 0;
+
+		/* If there already is a chunk, add if after it so we can still use the space in it */
+		if (hashtable->chunks_batch != NULL) {
+			newChunk->next = hashtable->chunks_batch->next;
+			hashtable->chunks_batch->next = newChunk;
+		} else {
+			newChunk->next = hashtable->chunks_batch;
+			hashtable->chunks_batch = newChunk;
+		}
+
+		newChunk->used += size;
+		newChunk->ntuples += 1;
+
+		return newChunk->data;
+
+	}
+
+	/* ok, it's within chunk size, let's see if we have enough space for it in the current
+	 * chunk => if not, allocate a new chunk (chunks==NULL means this is the first chunk) */
+	if ((hashtable->chunks_batch == NULL) || (hashtable->chunks_batch->maxlen - hashtable->chunks_batch->used) < size) {
+
+		/* allocate new chunk and put it at the beginning of the list */
+		HashChunk newChunk = (HashChunk)MemoryContextAlloc(hashtable->batchCxt, offsetof(HashChunkData, data) + HASH_CHUNK_SIZE);
+
+		newChunk->maxlen = HASH_CHUNK_SIZE;
+		newChunk->used = 0;
+		newChunk->ntuples = 0;
+
+		newChunk->next = hashtable->chunks_batch;
+		hashtable->chunks_batch = newChunk;
+	}
+
+	/* OK, we have enough space in the chunk, let's add the tuple */
+	hashtable->chunks_batch->used += size;
+	hashtable->chunks_batch->ntuples += 1;
+
+	/* allocate pointer to the start of the tuple memory */
+	return hashtable->chunks_batch->data + (hashtable->chunks_batch->used - size);
+
+}
+
 /*
  * ExecHashSkewTableInsert
  *
@@ -1340,6 +1622,7 @@ ExecHashSkewTableInsert(HashJoinTable hashtable,
 	hashTupleSize = HJTUPLE_OVERHEAD + tuple->t_len;
 	hashTuple = (HashJoinTuple) MemoryContextAlloc(hashtable->batchCxt,
 												   hashTupleSize);
+
 	hashTuple->hashvalue = hashvalue;
 	memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
 	HeapTupleHeaderClearMatch(HJTUPLE_MINTUPLE(hashTuple));
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 3beae40..75f86c9 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -102,12 +102,32 @@ typedef struct HashSkewBucket
 #define SKEW_WORK_MEM_PERCENT  2
 #define SKEW_MIN_OUTER_FRACTION  0.01
 
+#define HASH_CHUNK_SIZE			(16*1024L)	/* 16kB chunks by default */
+#define HASH_SKEW_CHUNK_SIZE	(4*1024L)	/* 4kB chunks for skew buckets */
+
+typedef struct HashChunkData
+{
+	int			ntuples;	/* number of tuples stored in this chunk */
+	size_t		maxlen;		/* length of the buffer */
+	size_t		used;		/* number of chunk bytes already used */
+
+	struct HashChunkData   *next;	/* pointer to the next chunk (linked list) */
+
+	char		data[1];	/* buffer allocated at the end */
+} HashChunkData;
+
+typedef struct HashChunkData* HashChunk;
+
 
 typedef struct HashJoinTableData
 {
 	int			nbuckets;		/* # buckets in the in-memory hash table */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
+	int			nbuckets_original;	/* # buckets when starting the first hash */
+	int			nbuckets_optimal;	/* optimal # buckets (per batch) */
+	int			log2_nbuckets_optimal;	/* same as log2_nbuckets optimal */
+
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
 	struct HashJoinTupleData **buckets;
 	/* buckets array is per-batch storage, as are all the tuples */
@@ -129,6 +149,7 @@ typedef struct HashJoinTableData
 	bool		growEnabled;	/* flag to shut off nbatch increases */
 
 	double		totalTuples;	/* # tuples obtained from inner plan */
+	double		skewTuples;		/* # tuples inserted into skew tuples */
 
 	/*
 	 * These arrays are allocated for the life of the hash join, but only if
@@ -157,6 +178,10 @@ typedef struct HashJoinTableData
 
 	MemoryContext hashCxt;		/* context for whole-hash-join storage */
 	MemoryContext batchCxt;		/* context for this-batch-only storage */
+
+	/* used for dense allocation of tuples (into linked chunks) */
+	HashChunk	chunks_batch;	/*  one list for the whole batch */
+
 }	HashJoinTableData;
 
 #endif   /* HASHJOIN_H */

hashjoin-nbuckets-growth-v10.patchtext/x-diff; name=hashjoin-nbuckets-growth-v10.patchDownload

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..ee3fffb 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6a9d956..a5d4812 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,8 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
-
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -49,6 +49,9 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 static char * chunk_alloc(HashJoinTable hashtable, int tupleSize);
 
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(HashJoinTuple))
+
 /* ----------------------------------------------------------------
  *		ExecHash
  *
@@ -117,6 +120,7 @@ MultiExecHash(HashState *node)
 				/* It's a skew tuple, so put it into that hash table */
 				ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 										bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -127,6 +131,26 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
+	if (hashtable->nbuckets != hashtable->nbuckets_optimal) {
+
+		/* We never decrease the number of buckets. */
+		Assert(hashtable->nbuckets_optimal > hashtable->nbuckets);
+
+#ifdef HJDEBUG
+		printf("Increasing nbuckets %d => %d\n",
+			   hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+
+		ExecHashIncreaseNumBuckets(hashtable);
+
+	}
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += BUCKETS_SPACE(hashtable);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+			hashtable->spacePeak = hashtable->spaceUsed;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -271,7 +295,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
+	hashtable->nbuckets_optimal = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
+	hashtable->log2_nbuckets_optimal = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
 	hashtable->skewEnabled = false;
@@ -285,6 +312,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -377,7 +405,6 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	return hashtable;
 }
 
-
 /*
  * Compute appropriate size for hashtable given the estimated size of the
  * relation to be hashed (number of rows and average row width).
@@ -386,7 +413,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
  */
 
 /* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET			10
+#define NTUP_PER_BUCKET			1
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
@@ -396,6 +423,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 {
 	int			tupsize;
 	double		inner_rel_bytes;
+	double		buckets_bytes;
 	long		hash_table_bytes;
 	long		skew_table_bytes;
 	long		max_pointers;
@@ -418,6 +446,13 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	inner_rel_bytes = ntuples * tupsize;
 
 	/*
+	 * Estimate memory needed for buckets, assuming all the tuples fit into
+	 * a single batch (consider NTUP_PER_BUCKET tuples per bucket) - buckets
+	 * are just usual 'HashJoinTuple' (pointers to HashJoinTupleData).
+	 */
+	buckets_bytes = sizeof(HashJoinTuple) * my_log2(ntuples / NTUP_PER_BUCKET);
+
+	/*
 	 * Target in-memory hashtable size is work_mem kilobytes.
 	 */
 	hash_table_bytes = work_mem * 1024L;
@@ -468,23 +503,56 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/* also ensure we avoid integer overflow in nbatch and nbuckets */
 	max_pointers = Min(max_pointers, INT_MAX / 2);
 
-	if (inner_rel_bytes > hash_table_bytes)
+	if ((inner_rel_bytes + buckets_bytes) > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
 		long		lbuckets;
 		double		dbatch;
 		int			minbatch;
-
-		lbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
-		lbuckets = Min(lbuckets, max_pointers);
-		nbuckets = (int) lbuckets;
+		double		batch_bytes;
 
 		dbatch = ceil(inner_rel_bytes / hash_table_bytes);
 		dbatch = Min(dbatch, max_pointers);
 		minbatch = (int) dbatch;
 		nbatch = 2;
+
+		/* increase nbatch until the hashtable fits into work_mem */
 		while (nbatch < minbatch)
 			nbatch <<= 1;
+
+		/* assuming equally-sized batches, compute bytes of hash table */
+		batch_bytes = inner_rel_bytes / nbatch;
+
+		/* when batching, for the buckets, assume full work_mem */
+		lbuckets = 1 << my_log2(hash_table_bytes / (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple)));
+		buckets_bytes = lbuckets * sizeof(HashJoinTuple);
+
+		/*
+		 * XXX the while loop should not really repeat more than once, or maybe twice,
+		 * becase at this point (batch_bytes < work_mem) and it only may not fit
+		 * if (batch_bytes + buckets_bytes > work_mem). But the tuples are generally
+		 * much larger than buckets (which are just pointers).
+		 *
+		 * Adding another batch reduces the _expected_ number of tuples in the batch,
+		 * however the buckets are sized for full work_mem utilization (e.g. because
+		 * of underestimate, which is a common problem). So the while loop only tweak
+		 * batch_bytes, not the number of buckets (or buckets_bytes).
+		 */
+
+		/* we need to get both tuples and buckets into work_mem */
+		while (batch_bytes + buckets_bytes > hash_table_bytes) {
+
+			/* increment number of batches, and re-evaluate the amount of memory */
+			nbatch <<= 1;
+
+			/* assuming equally-sized batches, compute bytes of hash table and buckets */
+			batch_bytes = inner_rel_bytes / nbatch;
+
+		}
+
+		/* protect against nbucket overflow */
+		lbuckets = Min(lbuckets, max_pointers);
+		nbuckets = (int) lbuckets;
 	}
 	else
 	{
@@ -561,6 +629,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
+
 	HashChunk	oldchunks = hashtable->chunks_batch;
 
 	/* do nothing if we've decided to shut off growth */
@@ -616,7 +685,18 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 
 	/* we're going to scan through the chunks directly, not buckets, so we can reset the
 	 * buckets and reinsert all the tuples there - just set them to NULL */
-	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
+	memset(hashtable->buckets, 0, sizeof(HashJoinTuple) * hashtable->nbuckets);
+
+	/* if we need to do a resize of buckets, we can do it right during the rebatch */
+	if (hashtable->nbuckets_optimal != hashtable->nbuckets) {
+
+		hashtable->nbuckets = hashtable->nbuckets_optimal;
+		hashtable->log2_nbuckets = hashtable->log2_nbuckets_optimal;
+
+		hashtable->buckets = repalloc(hashtable->buckets,
+									  sizeof(HashJoinTuple) * hashtable->nbuckets);
+
+	}
 
 	/* reset the chunks (we have a copy in oldchunks) - this way we'll allocate */
 	hashtable->chunks_batch = NULL;
@@ -628,7 +708,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 		HashChunk nextchunk = oldchunks->next;
 
 		/* position within the buffer (up to oldchunks->used) */
-		size_t idx = 0;	
+		size_t idx = 0;
 
 		/* process all tuples stored in this chunk (and then free it) */
 		while (idx < oldchunks->used) {
@@ -703,6 +783,106 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	HashChunk	chunk;
+
+	/* do nothing if not an increase (it's called increase for a reason) */
+	if (hashtable->nbuckets >= hashtable->nbuckets_optimal)
+		return;
+
+	/*
+	 * We already know the optimal number of buckets, so let's just
+	 * compute the log2_nbuckets for it.
+	 */
+	hashtable->nbuckets = hashtable->nbuckets_optimal;
+	hashtable->log2_nbuckets = my_log2(hashtable->nbuckets_optimal);
+
+	/* XXX Not sure if we should update the info about used space here.
+	 * The code seems to ignore the space used for 'buckets' and we're not
+	 * allocating more space for tuples (just shuffling them to the new
+	 * buckets). And the amount of memory used for buckets is quite small
+	 * (just an array of pointers, thus ~8kB per 1k buckets on 64-bit). */
+
+	/* XXX Should we disable growth if (nbuckets * NTUP_PER_BUCKET)
+	 * reaches work_mem (or something like that)? We shouldn't really
+	 * get into such position (should be handled by increasing the
+	 * number of batches, which is called right before this). */
+
+	/* XXX Maybe adding info into hashjoin explain output (e.g. initial
+	 * nbuckets, time spent growing the table) would be appropriate. */
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets <= (INT_MAX/2));
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/* Just reallocate the proper number of buckets - we don't need to
+	 * walk through them - we can walk the dense-allocated chunks
+	 * (just like in ExecHashIncreaseNumBatches, but without all the
+	 * copying into new chunks)
+	 */
+
+	/* just resize the existing bucket list to the right size */
+	hashtable->buckets = (HashJoinTuple *) repalloc(hashtable->buckets,
+								hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	/* XXX Maybe not needed? */
+	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
+
+	/* so, let's scan through the chunks, and all tuples in each chunk */
+	chunk = hashtable->chunks_batch;
+	while (chunk != NULL) {
+
+		/* position within the buffer (up to chunk->used) */
+		size_t idx = 0;
+
+		/* process all tuples stored in this chunk (and then free it) */
+		while (idx < chunk->used) {
+
+			/* get the hashjoin tuple and mintuple stored right after it */
+			HashJoinTuple hashTuple = (HashJoinTuple)(chunk->data + idx);
+			MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+
+			int		hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+			int		bucketno;
+			int		batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* just add the tuple to the proper bucket */
+			hashTuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = hashTuple;
+
+			/* bytes occupied in memory HJ tuple overhead + actual tuple length */
+			idx += hashTupleSize;
+
+		}
+
+		/* proceed to the next chunk */
+		chunk = chunk->next;
+
+	}
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, batchTuples / hashtable->nbuckets);
+#endif
+
+}
+
+
+/*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
  *		it may just go to a temp file for later batches
@@ -755,12 +935,30 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashTuple->next = hashtable->buckets[bucketno];
 		hashtable->buckets[bucketno] = hashTuple;
 
+		/* Increase the (optimal) number of buckets if we just exceeded NTUP_PER_BUCKET,
+		 * but only when there's still a single batch at this moment. */
+		if ((hashtable->nbatch == 1) &&
+			((hashtable->totalTuples - hashtable->skewTuples)
+				>= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET))) {
+
+			/* overflow protection */
+			if (hashtable->nbuckets_optimal <= (INT_MAX/2)) {
+				hashtable->nbuckets_optimal *= 2;
+				hashtable->log2_nbuckets_optimal += 1;
+			}
+		}
+
 		/* Account for space used, and back off if we've used too much */
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
-		if (hashtable->spaceUsed > hashtable->spaceAllowed)
+
+		/* Do we need to increase number of batches? Account for space required
+		 * by buckets (optimal number). */
+		if (hashtable->spaceUsed + BUCKETS_SPACE(hashtable) > hashtable->spaceAllowed) {
 			ExecHashIncreaseNumBatches(hashtable);
+		}
+
 	}
 	else
 	{
@@ -883,7 +1081,10 @@ ExecHashGetHashValue(HashJoinTable hashtable,
  * functions are good about randomizing all their output bits, else we are
  * likely to have very skewed bucket or batch occupancy.)
  *
- * nbuckets doesn't change over the course of the join.
+ * The nbuckets/log2_nbuckets may change until (nbatch>1) because of dynamic
+ * buckets growth. Once we start batching, the value is fixed and does not
+ * change over the course of join (making it possible to compute batchno
+ * the way we do here).
  *
  * nbatch is always a power of 2; we increase it only by doubling it.  This
  * effectively adds one more bit to the top of the batchno.
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index c80a4d1..75f86c9 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -124,6 +124,10 @@ typedef struct HashJoinTableData
 	int			nbuckets;		/* # buckets in the in-memory hash table */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
+	int			nbuckets_original;	/* # buckets when starting the first hash */
+	int			nbuckets_optimal;	/* optimal # buckets (per batch) */
+	int			log2_nbuckets_optimal;	/* same as log2_nbuckets optimal */
+
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
 	struct HashJoinTupleData **buckets;
 	/* buckets array is per-batch storage, as are all the tuples */
@@ -145,6 +149,7 @@ typedef struct HashJoinTableData
 	bool		growEnabled;	/* flag to shut off nbatch increases */
 
 	double		totalTuples;	/* # tuples obtained from inner plan */
+	double		skewTuples;		/* # tuples inserted into skew tuples */
 
 	/*
 	 * These arrays are allocated for the life of the hash join, but only if

#17

robertmhaas@gmail.com

over 11 years ago

In reply to: Tomas Vondra (#16)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On Sat, Aug 16, 2014 at 9:31 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

On 12.8.2014 00:30, Tomas Vondra wrote:

On 11.8.2014 20:25, Robert Haas wrote:

It also strikes me that when there's only 1 batch, the set of bits
that map onto the batch number is zero-width, and one zero-width bit
range is as good as another. In other words, if you're only planning
to do one batch, you can easily grow the number of buckets on the fly.
Growing the number of buckets only becomes difficult once you have
more than one batch.

...

I was considering using reversing the bits of the hash, because that's
pretty much the simplest solution. But I think you're right it might
actually work like this:

* Are more batches needed?

(yes) => just use nbuckets = my_log2(work_mem / tuple_size)

(no) => go ahead, until processing all tuples or hitting work_mem

(work_mem) => meh, use the same nbuckets above

(all tuples) => compute optimal nbuckets / resize

But I need to think about this a bit. So far it seems to me there's no
way additional batches might benefit from increasing nbuckets further.

I think this is a simple and solid solution, solving the batchno
computation issues quite nicely. Attached is v10 patch (bare and
combined with the dense allocation), that does this:

1) when we know we'll need batching, buckets are sized for full work_mem
(using the estimated tuple width, etc.)

2) without the batching, we estimate the 'right number of buckets' for
the estimated number of tuples, and keep track of the optimal number
as tuples are added to the hash table

- if we discover we need to start batching, we keep the current
optimal value (which should be the same as the max number of
buckets) and don't mess with it anymore (making it possible to
compute batch IDs just like before)

- also, on the fist rebatch (nbatch=1 => nbatch=2) the hash table
is resized as part of the rebatch

- if the hash build completes without batching, we do the resize

I believe the patch is pretty much perfect. I plan to do more thorough
testing on a wide range of queries in the next few days.

I also removed the 'enable_hash_resize' GUC, because it would be more
complex to implement this properly after doing the resize as part of
rebatch etc.. So either it would make the patch more complex, or it
wouldn't do what the name promises.

A variety of trivial comments on this:

PostgreSQL style is un-cuddled curly braces. Also, multi-line
comments need to start with a line containing only /* and end with a
line containing only */. In one place you've added curly braces
around a single-line block that is otherwise unmodified; please don't
do that. In one place, you have "becase" instead of "because". In
another place, you write "add if after it" but it should say "add it
after it" or maybe better "add the new one after it". Avoid using
punctuation like "=>" in comments to illustrate the connection between
sentences; instead, use a connecting word like "then" or "therefore"
or whatever is appropriate; in this instance, a period followed by the
start of a new sentence seems sufficient. Revert the removal of a
single line of whitespace near the top of nodeHash.c.

There are too many things marked XXX in this patch. They should
either be fixed, if they are real problems, or they should be
commented in a way that doesn't give rise to the idea that they're
problems if they aren't.

OK, now on to some more substantive stuff:

1. It's not clear to me what the overall effect of this patch on
memory utilization is. Reducing NTUP_PER_BUCKET from 10 to 1 is going
to use, on the average, 10x as much bucket-header memory per tuple.
Specifically, I think it means we'll use about 8 bytes of
bucket-header memory per tuple instead of 0.8 bytes per tuple. If the
tuples are narrow, that could be significant; concerns have been
expressed about that here in the past. Increasing the number of
buckets could also increase memory usage. On the other hand, the
dense allocation stuff probably saves a ton of memory, so maybe we end
up overall, but I'm not sure. Your thoughts, and maybe some test
results with narrow and wide tuples, would be appreciated.

2. But, on the positive side, modulo the memory utilization questions
mentioned above, I would expect the impact on hash join performance to
be positive. Going from 10 tuples per bucket to just 1 should help,
and on cases where the actual load factor would have ended up much
higher because of poor estimation, increasing the number of buckets on
the fly should help even more. I haven't tested this, though.

I haven't had a chance to completely go through this yet, so these are
just some initial thoughts.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18

/messages/by-id/53BEEA9E.2080009@fuzzy.cz

tv@fuzzy.cz

over 11 years ago

In reply to: Robert Haas (#17)

2 attachment(s)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 19.8.2014 19:05, Robert Haas wrote:

On Sat, Aug 16, 2014 at 9:31 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

On 12.8.2014 00:30, Tomas Vondra wrote:

On 11.8.2014 20:25, Robert Haas wrote:

It also strikes me that when there's only 1 batch, the set of bits
that map onto the batch number is zero-width, and one zero-width bit
range is as good as another. In other words, if you're only planning
to do one batch, you can easily grow the number of buckets on the fly.
Growing the number of buckets only becomes difficult once you have
more than one batch.

...

I was considering using reversing the bits of the hash, because that's
pretty much the simplest solution. But I think you're right it might
actually work like this:

* Are more batches needed?

(yes) => just use nbuckets = my_log2(work_mem / tuple_size)

(no) => go ahead, until processing all tuples or hitting work_mem

(work_mem) => meh, use the same nbuckets above

(all tuples) => compute optimal nbuckets / resize

But I need to think about this a bit. So far it seems to me there's no
way additional batches might benefit from increasing nbuckets further.

I think this is a simple and solid solution, solving the batchno
computation issues quite nicely. Attached is v10 patch (bare and
combined with the dense allocation), that does this:

1) when we know we'll need batching, buckets are sized for full work_mem
(using the estimated tuple width, etc.)

2) without the batching, we estimate the 'right number of buckets' for
the estimated number of tuples, and keep track of the optimal number
as tuples are added to the hash table

- if we discover we need to start batching, we keep the current
optimal value (which should be the same as the max number of
buckets) and don't mess with it anymore (making it possible to
compute batch IDs just like before)

- also, on the fist rebatch (nbatch=1 => nbatch=2) the hash table
is resized as part of the rebatch

- if the hash build completes without batching, we do the resize

I believe the patch is pretty much perfect. I plan to do more thorough
testing on a wide range of queries in the next few days.

I also removed the 'enable_hash_resize' GUC, because it would be more
complex to implement this properly after doing the resize as part of
rebatch etc.. So either it would make the patch more complex, or it
wouldn't do what the name promises.

A variety of trivial comments on this:

PostgreSQL style is un-cuddled curly braces. Also, multi-line
comments need to start with a line containing only /* and end with a
line containing only */. In one place you've added curly braces
around a single-line block that is otherwise unmodified; please don't
do that. In one place, you have "becase" instead of "because". In
another place, you write "add if after it" but it should say "add it
after it" or maybe better "add the new one after it". Avoid using
punctuation like "=>" in comments to illustrate the connection between
sentences; instead, use a connecting word like "then" or "therefore"
or whatever is appropriate; in this instance, a period followed by the
start of a new sentence seems sufficient. Revert the removal of a
single line of whitespace near the top of nodeHash.c.

There are too many things marked XXX in this patch. They should
either be fixed, if they are real problems, or they should be
commented in a way that doesn't give rise to the idea that they're
problems if they aren't.

OK, thanks for pointing this out. Attached is v11 of the patch (both
separate and combined with the dense allocation, as before).

I fixed as many of those issues as possible. All the XXX items were
obsolete, except for one in the chunk_alloc function.

I have also removed one constant

OK, now on to some more substantive stuff:

1. It's not clear to me what the overall effect of this patch on
memory utilization is. Reducing NTUP_PER_BUCKET from 10 to 1 is going
to use, on the average, 10x as much bucket-header memory per tuple.
Specifically, I think it means we'll use about 8 bytes of
bucket-header memory per tuple instead of 0.8 bytes per tuple. If the
tuples are narrow, that could be significant; concerns have been
expressed about that here in the past. Increasing the number of
buckets could also increase memory usage. On the other hand, the
dense allocation stuff probably saves a ton of memory, so maybe we end
up overall, but I'm not sure. Your thoughts, and maybe some test
results with narrow and wide tuples, would be appreciated.

The effect of the dense allocation was briefly discussed in this thread,
along with some quick measurements:

The dense allocation removes pretty much all the palloc overhead. For a
40B tuple, I did get this before the dense allocation

HashBatchContext: 1451221040 total in 182 blocks; 2826592 free (11
chunks); 1448394448 used

and this with the dense allocation patch applied

HashBatchContext: 729651520 total in 21980 blocks; 0 free (0 chunks);
729651520 used

So it's pretty much 50% reduction in memory consumption.

Now, the palloc header is >=16B, and removing this more than compensates
the increased number of buckets (in the worst case, there may be ~2x
buckets per tuple, which is 2x8B pointers).

I did a number of tests, and the results are quite consistent with this.

2. But, on the positive side, modulo the memory utilization questions
mentioned above, I would expect the impact on hash join performance to
be positive. Going from 10 tuples per bucket to just 1 should help,
and on cases where the actual load factor would have ended up much
higher because of poor estimation, increasing the number of buckets on
the fly should help even more. I haven't tested this, though.

Yes, that's the results I was getting.

I've done a number of tests, and not a single one showed a slowdown so
far. Most well-estimated queries were 2-3x faster, and the poorly
estimated ones were orders of magnitude faster.

We're working on getting this fixed on a range of workloads, I'll post
some results once I have that.

I haven't had a chance to completely go through this yet, so these are
just some initial thoughts.

Thanks!

Tomas

Attachments:

hashjoin-nbuckets-growth-v11.patchtext/x-diff; name=hashjoin-nbuckets-growth-v11.patchDownload

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..d128e08 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,21 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) ||
+				 (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6a9d956..688b6b4 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,8 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
-
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -49,6 +49,9 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 static char * chunk_alloc(HashJoinTable hashtable, int tupleSize);
 
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(HashJoinTuple))
+
 /* ----------------------------------------------------------------
  *		ExecHash
  *
@@ -117,6 +120,7 @@ MultiExecHash(HashState *node)
 				/* It's a skew tuple, so put it into that hash table */
 				ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 										bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -127,10 +131,31 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
+	if (hashtable->nbuckets != hashtable->nbuckets_optimal)
+	{
+
+		/* We never decrease the number of buckets. */
+		Assert(hashtable->nbuckets_optimal > hashtable->nbuckets);
+
+#ifdef HJDEBUG
+		printf("Increasing nbuckets %d => %d\n",
+			   hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+
+		ExecHashIncreaseNumBuckets(hashtable);
+
+	}
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += BUCKETS_SPACE(hashtable);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+			hashtable->spacePeak = hashtable->spaceUsed;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
-	
+
 	/*
 	 * We do not return the hash table directly because it's not a subtype of
 	 * Node, and so would violate the MultiExecProcNode API.  Instead, our
@@ -224,6 +249,7 @@ ExecEndHash(HashState *node)
 	ExecEndNode(outerPlan);
 }
 
+
 /* ----------------------------------------------------------------
  *		ExecHashTableCreate
  *
@@ -271,7 +297,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
+	hashtable->nbuckets_optimal = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
+	hashtable->log2_nbuckets_optimal = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
 	hashtable->skewEnabled = false;
@@ -285,6 +314,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -386,7 +416,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
  */
 
 /* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET			10
+#define NTUP_PER_BUCKET			1
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
@@ -396,6 +426,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 {
 	int			tupsize;
 	double		inner_rel_bytes;
+	double		buckets_bytes;
 	long		hash_table_bytes;
 	long		skew_table_bytes;
 	long		max_pointers;
@@ -418,6 +449,13 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	inner_rel_bytes = ntuples * tupsize;
 
 	/*
+	 * Estimate memory needed for buckets, assuming all the tuples fit into
+	 * a single batch (consider NTUP_PER_BUCKET tuples per bucket) - buckets
+	 * are just usual 'HashJoinTuple' (pointers to HashJoinTupleData).
+	 */
+	buckets_bytes = sizeof(HashJoinTuple) * my_log2(ntuples / NTUP_PER_BUCKET);
+
+	/*
 	 * Target in-memory hashtable size is work_mem kilobytes.
 	 */
 	hash_table_bytes = work_mem * 1024L;
@@ -468,16 +506,13 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/* also ensure we avoid integer overflow in nbatch and nbuckets */
 	max_pointers = Min(max_pointers, INT_MAX / 2);
 
-	if (inner_rel_bytes > hash_table_bytes)
+	if ((inner_rel_bytes + buckets_bytes) > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
 		long		lbuckets;
 		double		dbatch;
 		int			minbatch;
-
-		lbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
-		lbuckets = Min(lbuckets, max_pointers);
-		nbuckets = (int) lbuckets;
+		double		batch_bytes;
 
 		dbatch = ceil(inner_rel_bytes / hash_table_bytes);
 		dbatch = Min(dbatch, max_pointers);
@@ -485,6 +520,40 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch = 2;
 		while (nbatch < minbatch)
 			nbatch <<= 1;
+
+		/* assuming equally-sized batches, compute bytes of hash table */
+		batch_bytes = inner_rel_bytes / nbatch;
+
+		/* when batching, for the buckets, assume full work_mem */
+		lbuckets = 1 << my_log2(hash_table_bytes / (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple)));
+		buckets_bytes = lbuckets * sizeof(HashJoinTuple);
+
+		/*
+		 * Increase the nbatch until we get both tuples and buckets into work_mem.
+		 * 
+		 * The loop should not execute more than once in most cases, becase tuples are
+		 * usually much wider than buckets (usually 8B pointers), so by using only
+		 * (batch_bytes/2) should get us below work_mem.
+		 *
+		 * The worst case is that (nbuckets == 2*ntuples-1), giving us about twice the
+		 * number of buckets, i.e. about 2*sizeof(void*) per tuple. But that's
+		 * the consequence of how NTUP_PER_BUCKET is chosen, and work_mem limit.
+		 */
+
+		while (batch_bytes + buckets_bytes > hash_table_bytes)
+		{
+
+			/* increment number of batches, and re-evaluate the amount of memory */
+			nbatch <<= 1;
+
+			/* assuming equally-sized batches, compute bytes of hash table and buckets */
+			batch_bytes = inner_rel_bytes / nbatch;
+
+		}
+
+		/* protect against nbucket overflow */
+		lbuckets = Min(lbuckets, max_pointers);
+		nbuckets = (int) lbuckets;
 	}
 	else
 	{
@@ -561,6 +630,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
+
 	HashChunk	oldchunks = hashtable->chunks_batch;
 
 	/* do nothing if we've decided to shut off growth */
@@ -614,24 +684,38 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 */
 	ninmemory = nfreed = 0;
 
-	/* we're going to scan through the chunks directly, not buckets, so we can reset the
-	 * buckets and reinsert all the tuples there - just set them to NULL */
-	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
+	/* 
+	 * We will scan the tuples through chunks, not buckets, so we can reset the
+	 * buckets and reinsert all the tuples there. So resetting buckets is OK.
+	 */
+	memset(hashtable->buckets, 0, sizeof(HashJoinTuple) * hashtable->nbuckets);
+
+	/* If we need to do a resize of buckets, we can do it while rebatching. */
+	if (hashtable->nbuckets_optimal != hashtable->nbuckets)
+	{
+		hashtable->nbuckets = hashtable->nbuckets_optimal;
+		hashtable->log2_nbuckets = hashtable->log2_nbuckets_optimal;
 
-	/* reset the chunks (we have a copy in oldchunks) - this way we'll allocate */
+		hashtable->buckets = repalloc(hashtable->buckets,
+									  sizeof(HashJoinTuple) * hashtable->nbuckets);
+	}
+
+	/* Reset the chunks, so that we allocate everything into a new list of chunks. */
 	hashtable->chunks_batch = NULL;
 
-	/* so, let's scan through the old chunks, and all tuples in each chunk */
-	while (oldchunks != NULL) {
+	/* Scan through the old chunks, keep only tuples from the current batch. */
+	while (oldchunks != NULL)
+	{
 
 		/* pointer to the next memory chunk (may be NULL for the last chunk) */
 		HashChunk nextchunk = oldchunks->next;
 
-		/* position within the buffer (up to oldchunks->used) */
-		size_t idx = 0;	
+		/* position within the chunk (up to oldchunks->used) */
+		size_t idx = 0;
 
 		/* process all tuples stored in this chunk (and then free it) */
-		while (idx < oldchunks->used) {
+		while (idx < oldchunks->used)
+		{
 
 			/* get the hashjoin tuple and mintuple stored right after it */
 			HashJoinTuple hashTuple = (HashJoinTuple)(oldchunks->data + idx);
@@ -703,6 +787,93 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	HashChunk	chunk;
+
+	/* do nothing if not an increase (it's called increase for a reason) */
+	if (hashtable->nbuckets >= hashtable->nbuckets_optimal)
+		return;
+
+	/*
+	 * We already know the optimal number of buckets, so let's just
+	 * compute the log2_nbuckets for it.
+	 */
+	hashtable->nbuckets = hashtable->nbuckets_optimal;
+	hashtable->log2_nbuckets = my_log2(hashtable->nbuckets_optimal);
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets <= (INT_MAX/2));
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/* 
+	 * Just reallocate the proper number of buckets - we don't need to
+	 * walk through them - we can walk the dense-allocated chunks
+	 * (just like in ExecHashIncreaseNumBatches, but without all the
+	 * copying into new chunks)
+	 */
+
+	hashtable->buckets = (HashJoinTuple *) repalloc(hashtable->buckets,
+								hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
+
+	/* scan through all tuples in all chunks, rebuild the hash table */
+	chunk = hashtable->chunks_batch;
+	while (chunk != NULL)
+	{
+
+		/* position within the buffer (up to chunk->used) */
+		size_t idx = 0;
+
+		/* process all tuples stored in this chunk (and then free it) */
+		while (idx < chunk->used)
+		{
+
+			/* get the hashjoin tuple and mintuple stored right after it */
+			HashJoinTuple hashTuple = (HashJoinTuple)(chunk->data + idx);
+			MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+
+			int		hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+			int		bucketno;
+			int		batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* just add the tuple to the proper bucket */
+			hashTuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = hashTuple;
+
+			/* bytes occupied in memory HJ tuple overhead + actual tuple length */
+			idx += hashTupleSize;
+
+		}
+
+		/* proceed to the next chunk */
+		chunk = chunk->next;
+
+	}
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, batchTuples / hashtable->nbuckets);
+#endif
+
+}
+
+
+/*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
  *		it may just go to a temp file for later batches
@@ -755,12 +926,34 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashTuple->next = hashtable->buckets[bucketno];
 		hashtable->buckets[bucketno] = hashTuple;
 
+		/*
+		 * Increase the (optimal) number of buckets if we just exceeded NTUP_PER_BUCKET,
+		 * but only when there's still a single batch at this moment.
+		 */
+		if ((hashtable->nbatch == 1) &&
+			((hashtable->totalTuples - hashtable->skewTuples)
+				>= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET)))
+		{
+			/* overflow protection */
+			if (hashtable->nbuckets_optimal <= (INT_MAX/2))
+			{
+				hashtable->nbuckets_optimal *= 2;
+				hashtable->log2_nbuckets_optimal += 1;
+			}
+		}
+
 		/* Account for space used, and back off if we've used too much */
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
-		if (hashtable->spaceUsed > hashtable->spaceAllowed)
+
+		/*
+		 * Do we need to increase number of batches? Account for space required
+		 * by buckets (optimal number).
+		 */
+		if (hashtable->spaceUsed + BUCKETS_SPACE(hashtable) > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
+
 	}
 	else
 	{
@@ -883,7 +1076,10 @@ ExecHashGetHashValue(HashJoinTable hashtable,
  * functions are good about randomizing all their output bits, else we are
  * likely to have very skewed bucket or batch occupancy.)
  *
- * nbuckets doesn't change over the course of the join.
+ * The nbuckets/log2_nbuckets may change while (nbatch==1) because of dynamic
+ * buckets growth. Once we start batching, the value is fixed and does not
+ * change over the course of join (making it possible to compute batch number
+ * the way we do here).
  *
  * nbatch is always a power of 2; we increase it only by doubling it.  This
  * effectively adds one more bit to the top of the batchno.
@@ -1090,8 +1286,10 @@ ExecHashTableReset(HashJoinTable hashtable)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	/* reset the chunks too (the memory was allocated within batchCxt, so it's
-	 * already freed by resetting the batch context -just set it to NULL) */
+	/*
+	 * Reset the chunks too (the memory was allocated within batchCxt, so it's
+	 * already freed by resetting the batch context -just set it to NULL).
+	 */
 	hashtable->chunks_batch = NULL;
 
 }
@@ -1347,23 +1545,32 @@ ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue)
 static
 char * chunk_alloc(HashJoinTable hashtable, int size) {
 
-	/* maybe we should use MAXALIGN(size) here ... */
+	/* XXX maybe we should use MAXALIGN(size) here ... */
 
-	/* if tuple size is larger than of 1/8 of chunk size, allocate a separate chunk */
-	if (size > (HASH_CHUNK_SIZE/8)) {
+	/* if requested size is larger than of 1/8 of chunk size, allocate a separate chunk */
+	if (size > (HASH_CHUNK_SIZE/8))
+	{
 
 		/* allocate new chunk and put it at the beginning of the list */
-		HashChunk newChunk = (HashChunk)MemoryContextAlloc(hashtable->batchCxt, offsetof(HashChunkData, data) + size);
+		HashChunk newChunk
+			= (HashChunk)MemoryContextAlloc(hashtable->batchCxt,
+											offsetof(HashChunkData, data) + size);
 
 		newChunk->maxlen = size;
 		newChunk->used = 0;
 		newChunk->ntuples = 0;
 
-		/* If there already is a chunk, add if after it so we can still use the space in it */
-		if (hashtable->chunks_batch != NULL) {
+		/*
+		 * If there already is a chunk, add the new one after it, so we can still use the
+		 * space in the existing one.
+		 */
+		if (hashtable->chunks_batch != NULL)
+		{
 			newChunk->next = hashtable->chunks_batch->next;
 			hashtable->chunks_batch->next = newChunk;
-		} else {
+		}
+		else
+		{
 			newChunk->next = hashtable->chunks_batch;
 			hashtable->chunks_batch = newChunk;
 		}
@@ -1375,12 +1582,18 @@ char * chunk_alloc(HashJoinTable hashtable, int size) {
 
 	}
 
-	/* ok, it's within chunk size, let's see if we have enough space for it in the current
-	 * chunk => if not, allocate a new chunk (chunks==NULL means this is the first chunk) */
-	if ((hashtable->chunks_batch == NULL) || (hashtable->chunks_batch->maxlen - hashtable->chunks_batch->used) < size) {
+	/*
+	 * Requested size is less than 1/8 of a chunk, so place it in the current chunk if there
+	 * is enough free space. If not, allocate a new chunk and add it there.
+	 */
+	if ((hashtable->chunks_batch == NULL) ||
+		(hashtable->chunks_batch->maxlen - hashtable->chunks_batch->used) < size)
+	{
 
 		/* allocate new chunk and put it at the beginning of the list */
-		HashChunk newChunk = (HashChunk)MemoryContextAlloc(hashtable->batchCxt, offsetof(HashChunkData, data) + HASH_CHUNK_SIZE);
+		HashChunk newChunk
+			= (HashChunk)MemoryContextAlloc(hashtable->batchCxt,
+											offsetof(HashChunkData, data) + HASH_CHUNK_SIZE);
 
 		newChunk->maxlen = HASH_CHUNK_SIZE;
 		newChunk->used = 0;
@@ -1421,7 +1634,6 @@ ExecHashSkewTableInsert(HashJoinTable hashtable,
 	hashTupleSize = HJTUPLE_OVERHEAD + tuple->t_len;
 	hashTuple = (HashJoinTuple) MemoryContextAlloc(hashtable->batchCxt,
 												   hashTupleSize);
-
 	hashTuple->hashvalue = hashvalue;
 	memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
 	HeapTupleHeaderClearMatch(HJTUPLE_MINTUPLE(hashTuple));
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index c80a4d1..eda7502 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -103,7 +103,6 @@ typedef struct HashSkewBucket
 #define SKEW_MIN_OUTER_FRACTION  0.01
 
 #define HASH_CHUNK_SIZE			(16*1024L)	/* 16kB chunks by default */
-#define HASH_SKEW_CHUNK_SIZE	(4*1024L)	/* 4kB chunks for skew buckets */
 
 typedef struct HashChunkData
 {
@@ -124,6 +123,10 @@ typedef struct HashJoinTableData
 	int			nbuckets;		/* # buckets in the in-memory hash table */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
+	int			nbuckets_original;	/* # buckets when starting the first hash */
+	int			nbuckets_optimal;	/* optimal # buckets (per batch) */
+	int			log2_nbuckets_optimal;	/* same as log2_nbuckets optimal */
+
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
 	struct HashJoinTupleData **buckets;
 	/* buckets array is per-batch storage, as are all the tuples */
@@ -145,6 +148,7 @@ typedef struct HashJoinTableData
 	bool		growEnabled;	/* flag to shut off nbatch increases */
 
 	double		totalTuples;	/* # tuples obtained from inner plan */
+	double		skewTuples;		/* # tuples inserted into skew tuples */
 
 	/*
 	 * These arrays are allocated for the life of the hash join, but only if

hashjoin-nbuckets-growth-v11-combined.patchtext/x-diff; name=hashjoin-nbuckets-growth-v11-combined.patchDownload

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..d128e08 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,21 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) ||
+				 (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..688b6b4 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,8 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
-
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -47,6 +47,10 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 						int bucketNumber);
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
+static char * chunk_alloc(HashJoinTable hashtable, int tupleSize);
+
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(HashJoinTuple))
 
 /* ----------------------------------------------------------------
  *		ExecHash
@@ -116,6 +120,7 @@ MultiExecHash(HashState *node)
 				/* It's a skew tuple, so put it into that hash table */
 				ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 										bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -126,6 +131,27 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
+	if (hashtable->nbuckets != hashtable->nbuckets_optimal)
+	{
+
+		/* We never decrease the number of buckets. */
+		Assert(hashtable->nbuckets_optimal > hashtable->nbuckets);
+
+#ifdef HJDEBUG
+		printf("Increasing nbuckets %d => %d\n",
+			   hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+
+		ExecHashIncreaseNumBuckets(hashtable);
+
+	}
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += BUCKETS_SPACE(hashtable);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+			hashtable->spacePeak = hashtable->spaceUsed;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -271,7 +297,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
+	hashtable->nbuckets_optimal = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
+	hashtable->log2_nbuckets_optimal = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
 	hashtable->skewEnabled = false;
@@ -285,6 +314,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -294,6 +324,8 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->spaceAllowedSkew =
 		hashtable->spaceAllowed * SKEW_WORK_MEM_PERCENT / 100;
 
+	hashtable->chunks_batch = NULL;
+
 	/*
 	 * Get info about the hash functions to be used for each hash key. Also
 	 * remember whether the join operators are strict.
@@ -384,7 +416,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
  */
 
 /* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET			10
+#define NTUP_PER_BUCKET			1
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
@@ -394,6 +426,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 {
 	int			tupsize;
 	double		inner_rel_bytes;
+	double		buckets_bytes;
 	long		hash_table_bytes;
 	long		skew_table_bytes;
 	long		max_pointers;
@@ -416,6 +449,13 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	inner_rel_bytes = ntuples * tupsize;
 
 	/*
+	 * Estimate memory needed for buckets, assuming all the tuples fit into
+	 * a single batch (consider NTUP_PER_BUCKET tuples per bucket) - buckets
+	 * are just usual 'HashJoinTuple' (pointers to HashJoinTupleData).
+	 */
+	buckets_bytes = sizeof(HashJoinTuple) * my_log2(ntuples / NTUP_PER_BUCKET);
+
+	/*
 	 * Target in-memory hashtable size is work_mem kilobytes.
 	 */
 	hash_table_bytes = work_mem * 1024L;
@@ -466,16 +506,13 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/* also ensure we avoid integer overflow in nbatch and nbuckets */
 	max_pointers = Min(max_pointers, INT_MAX / 2);
 
-	if (inner_rel_bytes > hash_table_bytes)
+	if ((inner_rel_bytes + buckets_bytes) > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
 		long		lbuckets;
 		double		dbatch;
 		int			minbatch;
-
-		lbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
-		lbuckets = Min(lbuckets, max_pointers);
-		nbuckets = (int) lbuckets;
+		double		batch_bytes;
 
 		dbatch = ceil(inner_rel_bytes / hash_table_bytes);
 		dbatch = Min(dbatch, max_pointers);
@@ -483,6 +520,40 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch = 2;
 		while (nbatch < minbatch)
 			nbatch <<= 1;
+
+		/* assuming equally-sized batches, compute bytes of hash table */
+		batch_bytes = inner_rel_bytes / nbatch;
+
+		/* when batching, for the buckets, assume full work_mem */
+		lbuckets = 1 << my_log2(hash_table_bytes / (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple)));
+		buckets_bytes = lbuckets * sizeof(HashJoinTuple);
+
+		/*
+		 * Increase the nbatch until we get both tuples and buckets into work_mem.
+		 * 
+		 * The loop should not execute more than once in most cases, becase tuples are
+		 * usually much wider than buckets (usually 8B pointers), so by using only
+		 * (batch_bytes/2) should get us below work_mem.
+		 *
+		 * The worst case is that (nbuckets == 2*ntuples-1), giving us about twice the
+		 * number of buckets, i.e. about 2*sizeof(void*) per tuple. But that's
+		 * the consequence of how NTUP_PER_BUCKET is chosen, and work_mem limit.
+		 */
+
+		while (batch_bytes + buckets_bytes > hash_table_bytes)
+		{
+
+			/* increment number of batches, and re-evaluate the amount of memory */
+			nbatch <<= 1;
+
+			/* assuming equally-sized batches, compute bytes of hash table and buckets */
+			batch_bytes = inner_rel_bytes / nbatch;
+
+		}
+
+		/* protect against nbucket overflow */
+		lbuckets = Min(lbuckets, max_pointers);
+		nbuckets = (int) lbuckets;
 	}
 	else
 	{
@@ -556,11 +627,12 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	int			oldnbatch = hashtable->nbatch;
 	int			curbatch = hashtable->curbatch;
 	int			nbatch;
-	int			i;
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
 
+	HashChunk	oldchunks = hashtable->chunks_batch;
+
 	/* do nothing if we've decided to shut off growth */
 	if (!hashtable->growEnabled)
 		return;
@@ -612,51 +684,84 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 */
 	ninmemory = nfreed = 0;
 
-	for (i = 0; i < hashtable->nbuckets; i++)
+	/* 
+	 * We will scan the tuples through chunks, not buckets, so we can reset the
+	 * buckets and reinsert all the tuples there. So resetting buckets is OK.
+	 */
+	memset(hashtable->buckets, 0, sizeof(HashJoinTuple) * hashtable->nbuckets);
+
+	/* If we need to do a resize of buckets, we can do it while rebatching. */
+	if (hashtable->nbuckets_optimal != hashtable->nbuckets)
+	{
+		hashtable->nbuckets = hashtable->nbuckets_optimal;
+		hashtable->log2_nbuckets = hashtable->log2_nbuckets_optimal;
+
+		hashtable->buckets = repalloc(hashtable->buckets,
+									  sizeof(HashJoinTuple) * hashtable->nbuckets);
+	}
+
+	/* Reset the chunks, so that we allocate everything into a new list of chunks. */
+	hashtable->chunks_batch = NULL;
+
+	/* Scan through the old chunks, keep only tuples from the current batch. */
+	while (oldchunks != NULL)
 	{
-		HashJoinTuple prevtuple;
-		HashJoinTuple tuple;
 
-		prevtuple = NULL;
-		tuple = hashtable->buckets[i];
+		/* pointer to the next memory chunk (may be NULL for the last chunk) */
+		HashChunk nextchunk = oldchunks->next;
+
+		/* position within the chunk (up to oldchunks->used) */
+		size_t idx = 0;
 
-		while (tuple != NULL)
+		/* process all tuples stored in this chunk (and then free it) */
+		while (idx < oldchunks->used)
 		{
-			/* save link in case we delete */
-			HashJoinTuple nexttuple = tuple->next;
-			int			bucketno;
-			int			batchno;
+
+			/* get the hashjoin tuple and mintuple stored right after it */
+			HashJoinTuple hashTuple = (HashJoinTuple)(oldchunks->data + idx);
+			MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+
+			int		hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+			int		bucketno;
+			int		batchno;
 
 			ninmemory++;
-			ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
 									  &bucketno, &batchno);
-			Assert(bucketno == i);
+
 			if (batchno == curbatch)
 			{
-				/* keep tuple */
-				prevtuple = tuple;
+				/* keep tuple - this means we need to copy it into the new chunks */
+				HashJoinTuple copyTuple = (HashJoinTuple) chunk_alloc(hashtable, hashTupleSize);
+				memcpy(copyTuple, hashTuple, hashTupleSize);
+
+				/* and of course add it to the new buckets - just push the copy it onto the front
+				 * of the bucket's list */
+				copyTuple->next = hashtable->buckets[bucketno];
+				hashtable->buckets[bucketno] = copyTuple;
 			}
 			else
 			{
 				/* dump it out */
 				Assert(batchno > curbatch);
-				ExecHashJoinSaveTuple(HJTUPLE_MINTUPLE(tuple),
-									  tuple->hashvalue,
+				ExecHashJoinSaveTuple(HJTUPLE_MINTUPLE(hashTuple),
+									  hashTuple->hashvalue,
 									  &hashtable->innerBatchFile[batchno]);
-				/* and remove from hash table */
-				if (prevtuple)
-					prevtuple->next = nexttuple;
-				else
-					hashtable->buckets[i] = nexttuple;
-				/* prevtuple doesn't change */
-				hashtable->spaceUsed -=
-					HJTUPLE_OVERHEAD + HJTUPLE_MINTUPLE(tuple)->t_len;
-				pfree(tuple);
+
+				hashtable->spaceUsed -= hashTupleSize;
 				nfreed++;
 			}
 
-			tuple = nexttuple;
+			/* bytes occupied in memory HJ tuple overhead + actual tuple length */
+			idx += hashTupleSize;
+
 		}
+
+		/* so, we're done with this chunk - free it and proceed to the next one */
+		pfree(oldchunks);
+		oldchunks = nextchunk;
+
 	}
 
 #ifdef HJDEBUG
@@ -682,6 +787,93 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	HashChunk	chunk;
+
+	/* do nothing if not an increase (it's called increase for a reason) */
+	if (hashtable->nbuckets >= hashtable->nbuckets_optimal)
+		return;
+
+	/*
+	 * We already know the optimal number of buckets, so let's just
+	 * compute the log2_nbuckets for it.
+	 */
+	hashtable->nbuckets = hashtable->nbuckets_optimal;
+	hashtable->log2_nbuckets = my_log2(hashtable->nbuckets_optimal);
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets <= (INT_MAX/2));
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/* 
+	 * Just reallocate the proper number of buckets - we don't need to
+	 * walk through them - we can walk the dense-allocated chunks
+	 * (just like in ExecHashIncreaseNumBatches, but without all the
+	 * copying into new chunks)
+	 */
+
+	hashtable->buckets = (HashJoinTuple *) repalloc(hashtable->buckets,
+								hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
+
+	/* scan through all tuples in all chunks, rebuild the hash table */
+	chunk = hashtable->chunks_batch;
+	while (chunk != NULL)
+	{
+
+		/* position within the buffer (up to chunk->used) */
+		size_t idx = 0;
+
+		/* process all tuples stored in this chunk (and then free it) */
+		while (idx < chunk->used)
+		{
+
+			/* get the hashjoin tuple and mintuple stored right after it */
+			HashJoinTuple hashTuple = (HashJoinTuple)(chunk->data + idx);
+			MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+
+			int		hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+			int		bucketno;
+			int		batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* just add the tuple to the proper bucket */
+			hashTuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = hashTuple;
+
+			/* bytes occupied in memory HJ tuple overhead + actual tuple length */
+			idx += hashTupleSize;
+
+		}
+
+		/* proceed to the next chunk */
+		chunk = chunk->next;
+
+	}
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, batchTuples / hashtable->nbuckets);
+#endif
+
+}
+
+
+/*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
  *		it may just go to a temp file for later batches
@@ -717,8 +909,8 @@ ExecHashTableInsert(HashJoinTable hashtable,
 
 		/* Create the HashJoinTuple */
 		hashTupleSize = HJTUPLE_OVERHEAD + tuple->t_len;
-		hashTuple = (HashJoinTuple) MemoryContextAlloc(hashtable->batchCxt,
-													   hashTupleSize);
+		hashTuple = (HashJoinTuple) chunk_alloc(hashtable, hashTupleSize);
+
 		hashTuple->hashvalue = hashvalue;
 		memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
 
@@ -734,12 +926,34 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashTuple->next = hashtable->buckets[bucketno];
 		hashtable->buckets[bucketno] = hashTuple;
 
+		/*
+		 * Increase the (optimal) number of buckets if we just exceeded NTUP_PER_BUCKET,
+		 * but only when there's still a single batch at this moment.
+		 */
+		if ((hashtable->nbatch == 1) &&
+			((hashtable->totalTuples - hashtable->skewTuples)
+				>= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET)))
+		{
+			/* overflow protection */
+			if (hashtable->nbuckets_optimal <= (INT_MAX/2))
+			{
+				hashtable->nbuckets_optimal *= 2;
+				hashtable->log2_nbuckets_optimal += 1;
+			}
+		}
+
 		/* Account for space used, and back off if we've used too much */
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
-		if (hashtable->spaceUsed > hashtable->spaceAllowed)
+
+		/*
+		 * Do we need to increase number of batches? Account for space required
+		 * by buckets (optimal number).
+		 */
+		if (hashtable->spaceUsed + BUCKETS_SPACE(hashtable) > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
+
 	}
 	else
 	{
@@ -862,7 +1076,10 @@ ExecHashGetHashValue(HashJoinTable hashtable,
  * functions are good about randomizing all their output bits, else we are
  * likely to have very skewed bucket or batch occupancy.)
  *
- * nbuckets doesn't change over the course of the join.
+ * The nbuckets/log2_nbuckets may change while (nbatch==1) because of dynamic
+ * buckets growth. Once we start batching, the value is fixed and does not
+ * change over the course of join (making it possible to compute batch number
+ * the way we do here).
  *
  * nbatch is always a power of 2; we increase it only by doubling it.  This
  * effectively adds one more bit to the top of the batchno.
@@ -1068,6 +1285,13 @@ ExecHashTableReset(HashJoinTable hashtable)
 	hashtable->spaceUsed = 0;
 
 	MemoryContextSwitchTo(oldcxt);
+
+	/*
+	 * Reset the chunks too (the memory was allocated within batchCxt, so it's
+	 * already freed by resetting the batch context -just set it to NULL).
+	 */
+	hashtable->chunks_batch = NULL;
+
 }
 
 /*
@@ -1318,6 +1542,76 @@ ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue)
 	return INVALID_SKEW_BUCKET_NO;
 }
 
+static
+char * chunk_alloc(HashJoinTable hashtable, int size) {
+
+	/* XXX maybe we should use MAXALIGN(size) here ... */
+
+	/* if requested size is larger than of 1/8 of chunk size, allocate a separate chunk */
+	if (size > (HASH_CHUNK_SIZE/8))
+	{
+
+		/* allocate new chunk and put it at the beginning of the list */
+		HashChunk newChunk
+			= (HashChunk)MemoryContextAlloc(hashtable->batchCxt,
+											offsetof(HashChunkData, data) + size);
+
+		newChunk->maxlen = size;
+		newChunk->used = 0;
+		newChunk->ntuples = 0;
+
+		/*
+		 * If there already is a chunk, add the new one after it, so we can still use the
+		 * space in the existing one.
+		 */
+		if (hashtable->chunks_batch != NULL)
+		{
+			newChunk->next = hashtable->chunks_batch->next;
+			hashtable->chunks_batch->next = newChunk;
+		}
+		else
+		{
+			newChunk->next = hashtable->chunks_batch;
+			hashtable->chunks_batch = newChunk;
+		}
+
+		newChunk->used += size;
+		newChunk->ntuples += 1;
+
+		return newChunk->data;
+
+	}
+
+	/*
+	 * Requested size is less than 1/8 of a chunk, so place it in the current chunk if there
+	 * is enough free space. If not, allocate a new chunk and add it there.
+	 */
+	if ((hashtable->chunks_batch == NULL) ||
+		(hashtable->chunks_batch->maxlen - hashtable->chunks_batch->used) < size)
+	{
+
+		/* allocate new chunk and put it at the beginning of the list */
+		HashChunk newChunk
+			= (HashChunk)MemoryContextAlloc(hashtable->batchCxt,
+											offsetof(HashChunkData, data) + HASH_CHUNK_SIZE);
+
+		newChunk->maxlen = HASH_CHUNK_SIZE;
+		newChunk->used = 0;
+		newChunk->ntuples = 0;
+
+		newChunk->next = hashtable->chunks_batch;
+		hashtable->chunks_batch = newChunk;
+	}
+
+	/* OK, we have enough space in the chunk, let's add the tuple */
+	hashtable->chunks_batch->used += size;
+	hashtable->chunks_batch->ntuples += 1;
+
+	/* allocate pointer to the start of the tuple memory */
+	return hashtable->chunks_batch->data + (hashtable->chunks_batch->used - size);
+
+}
+
 /*
  * ExecHashSkewTableInsert
  *
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 3beae40..eda7502 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -102,12 +102,31 @@ typedef struct HashSkewBucket
 #define SKEW_WORK_MEM_PERCENT  2
 #define SKEW_MIN_OUTER_FRACTION  0.01
 
+#define HASH_CHUNK_SIZE			(16*1024L)	/* 16kB chunks by default */
+
+typedef struct HashChunkData
+{
+	int			ntuples;	/* number of tuples stored in this chunk */
+	size_t		maxlen;		/* length of the buffer */
+	size_t		used;		/* number of chunk bytes already used */
+
+	struct HashChunkData   *next;	/* pointer to the next chunk (linked list) */
+
+	char		data[1];	/* buffer allocated at the end */
+} HashChunkData;
+
+typedef struct HashChunkData* HashChunk;
+
 
 typedef struct HashJoinTableData
 {
 	int			nbuckets;		/* # buckets in the in-memory hash table */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
+	int			nbuckets_original;	/* # buckets when starting the first hash */
+	int			nbuckets_optimal;	/* optimal # buckets (per batch) */
+	int			log2_nbuckets_optimal;	/* same as log2_nbuckets optimal */
+
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
 	struct HashJoinTupleData **buckets;
 	/* buckets array is per-batch storage, as are all the tuples */
@@ -129,6 +148,7 @@ typedef struct HashJoinTableData
 	bool		growEnabled;	/* flag to shut off nbatch increases */
 
 	double		totalTuples;	/* # tuples obtained from inner plan */
+	double		skewTuples;		/* # tuples inserted into skew tuples */
 
 	/*
 	 * These arrays are allocated for the life of the hash join, but only if
@@ -157,6 +177,10 @@ typedef struct HashJoinTableData
 
 	MemoryContext hashCxt;		/* context for whole-hash-join storage */
 	MemoryContext batchCxt;		/* context for this-batch-only storage */
+
+	/* used for dense allocation of tuples (into linked chunks) */
+	HashChunk	chunks_batch;	/*  one list for the whole batch */
+
 }	HashJoinTableData;
 
 #endif   /* HASHJOIN_H */

#19

tv@fuzzy.cz

over 11 years ago

In reply to: Tomas Vondra (#18)

Re: bad estimation together with large work_mem generates terrible slow hash joins

Hi everyone,

as Heikki mentioned in his "commitfest status" message, this patch still
has no reviewers :-( Is there anyone willing to pick up that, together
with the 'dense allocation' patch (as those two are closely related)?

I'm looking in Robert's direction, as he's the one who came up with the
dense allocation idea, and also commented on the hasjoin bucket resizing
patch ...

I'd ask Pavel Stehule, but as he's sitting next to me in the office he's
not really independent ;-) I'll do some reviews on the other patches
over the weekend, to balance this of course.

regards
Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20

robertmhaas@gmail.com

over 11 years ago

In reply to: Tomas Vondra (#19)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On Fri, Sep 5, 2014 at 3:23 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

as Heikki mentioned in his "commitfest status" message, this patch still
has no reviewers :-( Is there anyone willing to pick up that, together
with the 'dense allocation' patch (as those two are closely related)?

I'm looking in Robert's direction, as he's the one who came up with the
dense allocation idea, and also commented on the hasjoin bucket resizing
patch ...

I'd ask Pavel Stehule, but as he's sitting next to me in the office he's
not really independent ;-) I'll do some reviews on the other patches
over the weekend, to balance this of course.

Will any of those patches to be, heh heh, mine?

I am a bit confused by the relationship between the two patches you
posted. The "combined" patch applies, but the other one does not. If
this is a sequence of two patches, it seems like it would be more
helpful to post A and B rather than B and A+B. Perhaps the other
patch is on a different thread but there's not an obvious reference to
said thread here.

In ExecChooseHashTableSize(), if buckets_bytes is independent of
nbatch, could we just compute it before working out dbatch, and just
deduct it from hash_table_bytes? That seems like it'd do the same
thing for less work. Either way, what happens if buckets_bytes is
already bigger than hash_table_bytes?

A few more stylistic issues that I see:

+               if ((hashtable->nbatch == 1) &&
+                       if (hashtable->nbuckets_optimal <= (INT_MAX/2))
+       if (size > (HASH_CHUNK_SIZE/8))

While I'm all in favor of using parentheses generously where it's
needed to add clarity, these and similar usages seem over the top to
me.

On a related note, the first of these reads like this if (stuff) { if
(more stuff) { do things } } which makes one wonder why we need two if
statements.

+
+       /* used for dense allocation of tuples (into linked chunks) */
+       HashChunk       chunks_batch;   /*  one list for the whole batch */
+
 }      HashJoinTableData;

If the original coding didn't have a blank line between the last
structure member and the closing brace, it's probably not right to
make it that way now. There are similar issues at the end of some
functions you modified, and a few other places (like the new code in
ExecChooseHashTableSize and chunk_alloc) where there are extra blank
lines at the starts and ends of blocks.

+char * chunk_alloc(HashJoinTable hashtable, int size) {

Eh... no.

+ /* XXX maybe we should use MAXALIGN(size) here ... */

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21

/messages/by-id/53CBEB8A.207@fuzzy.cz

tv@fuzzy.cz

over 11 years ago

In reply to: Robert Haas (#20)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 8.9.2014 22:44, Robert Haas wrote:

On Fri, Sep 5, 2014 at 3:23 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

as Heikki mentioned in his "commitfest status" message, this patch
still has no reviewers :-( Is there anyone willing to pick up that,
together with the 'dense allocation' patch (as those two are
closely related)?

I'm looking in Robert's direction, as he's the one who came up with
the dense allocation idea, and also commented on the hasjoin bucket
resizing patch ...

I'd ask Pavel Stehule, but as he's sitting next to me in the office
he's not really independent ;-) I'll do some reviews on the other
patches over the weekend, to balance this of course.

Will any of those patches to be, heh heh, mine?

I'll exchange some of the credit for +1. Deal? ;-)

I am a bit confused by the relationship between the two patches you
posted. The "combined" patch applies, but the other one does not. If
this is a sequence of two patches, it seems like it would be more
helpful to post A and B rather than B and A+B. Perhaps the other
patch is on a different thread but there's not an obvious reference
to said thread here.

Yeah, it's probably a bit confusing. The thing is the "dense allocation"
idea was discussed in a different thread, so I posted the patch there:

The patch tweaking hash join buckets builds on the dense allocation,
because it (a) makes up for the memory used for more buckets and (b)
it's actually easier to rebuild the buckets this way.

So I only posted the separate patch for those who want to do a review,
and then a "complete patch" with both parts combined. But it sure may be
a bit confusing.

In ExecChooseHashTableSize(), if buckets_bytes is independent of
nbatch, could we just compute it before working out dbatch, and just
deduct it from hash_table_bytes? That seems like it'd do the same
thing for less work.

Well, we can do that. But I really don't think that's going to make
measurable difference. And I think the code is more clear this way.

Either way, what happens if buckets_bytes is already bigger than
hash_table_bytes?

Hm, I don't see how that could happen.

FWIW, I think the first buckets_bytes formula is actually wrong:

buckets_bytes
= sizeof(HashJoinTuple) * my_log2(ntuples / NTUP_PER_BUCKET);

and should be

buckets_bytes =
sizeof(HashJoinTuple) * (1 << my_log2(ntuples / NTUP_PER_BUCKET));

instead. Also, we might consider that we never use less than 1024
buckets (but that's minor issue, I guess).

But once we get into the "need batching" branch, we do this:

lbuckets = 1 << my_log2(hash_table_bytes / (tupsize * NTUP_PER_BUCKET
+ sizeof(HashJoinTuple)));

which includes both the bucket (pointer) and tuple size, and IMHO
guarantees that bucket_bytes will never be over work_mem (which is what
hash_table_bytes is).

The only case where this might happen is (tupsize < 8), thanks to the
my_log2 (getting (50% work_mem + epsilon), doubled to >100% work_mem).

But tupsize is defined as this:

tupsize = HJTUPLE_OVERHEAD +
MAXALIGN(sizeof(MinimalTupleData)) +
MAXALIGN(tupwidth);

and HJTUPLE_OVERHEAD alone is 12B, MinimalTupleData is >11B (ignoring
alignment) etc.

So I believe this is safe ...

A few more stylistic issues that I see:
+               if ((hashtable->nbatch == 1) &&
+                       if (hashtable->nbuckets_optimal <= (INT_MAX/2))
+       if (size > (HASH_CHUNK_SIZE/8))
While I'm all in favor of using parentheses generously where it's
needed to add clarity, these and similar usages seem over the top to
me.

It seemed more readable for me at the time of coding it, and it still
seems better this way, but ...

https://www.youtube.com/watch?v=CxK_nA2iVXw

On a related note, the first of these reads like this if (stuff) { if
(more stuff) { do things } } which makes one wonder why we need two if
statements.

We probably don't ...

+
+       /* used for dense allocation of tuples (into linked chunks) */
+       HashChunk       chunks_batch;   /*  one list for the whole batch */
+
}      HashJoinTableData;
If the original coding didn't have a blank line between the last
structure member and the closing brace, it's probably not right to
make it that way now. There are similar issues at the end of some
functions you modified, and a few other places (like the new code in
ExecChooseHashTableSize and chunk_alloc) where there are extra blank
lines at the starts and ends of blocks.

I'll fix that. FWIW, those issues seem to be in the 'dense allocation'
patch.

+char * chunk_alloc(HashJoinTable hashtable, int size) {

Eh... no.

+ /* XXX maybe we should use MAXALIGN(size) here ... */

+1.

I'm not sure what's the 'no' pointing at? Maybe that the parenthesis
should be on the next line? And is the +1 about doing MAXALING?

regards
Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22

robertmhaas@gmail.com

over 11 years ago

In reply to: Tomas Vondra (#21)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

So I only posted the separate patch for those who want to do a review,
and then a "complete patch" with both parts combined. But it sure may be
a bit confusing.

Let's do this: post each new version of the patches only on this
thread, as a series of patches that can be applied one after another.

In ExecChooseHashTableSize(), if buckets_bytes is independent of
nbatch, could we just compute it before working out dbatch, and just
deduct it from hash_table_bytes? That seems like it'd do the same
thing for less work.

Well, we can do that. But I really don't think that's going to make
measurable difference. And I think the code is more clear this way.

Really? It seems to me that you're doing more or less the same
calculation that's already been done a second time. It took me 20
minutes of staring to figure out what it was really doing. Now
admittedly I was a little low on caffeine, but...

Either way, what happens if buckets_bytes is already bigger than
hash_table_bytes?

Hm, I don't see how that could happen.

How about an Assert() and a comment, then?

A few more stylistic issues that I see:
+               if ((hashtable->nbatch == 1) &&
+                       if (hashtable->nbuckets_optimal <= (INT_MAX/2))
+       if (size > (HASH_CHUNK_SIZE/8))
While I'm all in favor of using parentheses generously where it's
needed to add clarity, these and similar usages seem over the top to
me.
It seemed more readable for me at the time of coding it, and it still
seems better this way, but ...

https://www.youtube.com/watch?v=CxK_nA2iVXw

Heh. Well, at any rate, I think PostgreSQL style is to not use
parentheses in that situation.

+char * chunk_alloc(HashJoinTable hashtable, int size) {

Eh... no.

+ /* XXX maybe we should use MAXALIGN(size) here ... */

+1.

I'm not sure what's the 'no' pointing at? Maybe that the parenthesis
should be on the next line? And is the +1 about doing MAXALING?

The "no" is about the fact that you have the return type, the function
name, and the opening brace on one line instead of three lines as is
project style. The +1 is for applying MAXALIGN to the size.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#23

tv@fuzzy.cz

over 11 years ago

In reply to: Robert Haas (#22)

2 attachment(s)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 9.9.2014 16:09, Robert Haas wrote:

On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

So I only posted the separate patch for those who want to do a review,
and then a "complete patch" with both parts combined. But it sure may be
a bit confusing.

Let's do this: post each new version of the patches only on this
thread, as a series of patches that can be applied one after another.

OK, attached. Apply in this order

1) dense-alloc-v5.patch
2) hashjoin-nbuckets-v12.patch

In ExecChooseHashTableSize(), if buckets_bytes is independent of
nbatch, could we just compute it before working out dbatch, and just
deduct it from hash_table_bytes? That seems like it'd do the same
thing for less work.

Well, we can do that. But I really don't think that's going to make
measurable difference. And I think the code is more clear this way.

Really? It seems to me that you're doing more or less the same
calculation that's already been done a second time. It took me 20
minutes of staring to figure out what it was really doing. Now
admittedly I was a little low on caffeine, but...

I reworked this part a bit, to make it easier to understand. Mostly
following your recommendations.

Either way, what happens if buckets_bytes is already bigger than
hash_table_bytes?

Hm, I don't see how that could happen.

How about an Assert() and a comment, then?

Done. According to the comment, we should never really get over 25%,
except maybe for strange work_mem values, because we're keeping nbuckets
as 2^N. But even then we should not reach 50%, so I added this assert:

Assert(buckets_bytes <= hash_table_bytes/2);

And then we subtract the buckets_bytes, and continue with the loop.

A few more stylistic issues that I see:
+               if ((hashtable->nbatch == 1) &&
+                       if (hashtable->nbuckets_optimal <= (INT_MAX/2))
+       if (size > (HASH_CHUNK_SIZE/8))
While I'm all in favor of using parentheses generously where it's
needed to add clarity, these and similar usages seem over the top to
me.
It seemed more readable for me at the time of coding it, and it still
seems better this way, but ...

https://www.youtube.com/watch?v=CxK_nA2iVXw
Heh. Well, at any rate, I think PostgreSQL style is to not use
parentheses in that situation.

FWIW, I added HASH_CHUNK_THRESHOLD macro, specifying the threshold. I
also simplified the conditions a bit, and removed some of the parentheses.

+char * chunk_alloc(HashJoinTable hashtable, int size) {

Eh... no.

+ /* XXX maybe we should use MAXALIGN(size) here ... */

+1.

I'm not sure what's the 'no' pointing at? Maybe that the parenthesis
should be on the next line? And is the +1 about doing MAXALING?

The "no" is about the fact that you have the return type, the function
name, and the opening brace on one line instead of three lines as is
project style. The +1 is for applying MAXALIGN to the size.

OK, fixed.

I also did a few 'minor' changes to the dense allocation patch, most
notably:

* renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData
The original naming seemed a bit awkward.

* renamed the function to 'dense_alloc' (instead of 'chunk_alloc')
Seems like a better fit to what the function does.

* the chunks size is 32kB (instead of 16kB), and we're using 1/4
threshold for 'oversized' items

We need the threshold to be >=8kB, to trigger the special case
within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.

I also considered Heikki's suggestion that while rebatching, we can
reuse chunks with a single large tuple, instead of copying it
needlessly. My attempt however made the code much uglier (I almost used
a GOTO, but my hands rejected to type it ...). I'll look into that.

Tomas

Attachments:

dense-alloc-v5.patchtext/x-diff; name=dense-alloc-v5.patchDownload

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..d5acfb1 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -47,6 +47,7 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 						int bucketNumber);
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
+static char * dense_alloc(HashJoinTable hashtable, int tupleSize);
 
 /* ----------------------------------------------------------------
  *		ExecHash
@@ -294,6 +295,8 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->spaceAllowedSkew =
 		hashtable->spaceAllowed * SKEW_WORK_MEM_PERCENT / 100;
 
+	hashtable->chunks = NULL;
+
 	/*
 	 * Get info about the hash functions to be used for each hash key. Also
 	 * remember whether the join operators are strict.
@@ -556,10 +559,10 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	int			oldnbatch = hashtable->nbatch;
 	int			curbatch = hashtable->curbatch;
 	int			nbatch;
-	int			i;
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
+	MemoryChunk	oldchunks = hashtable->chunks;
 
 	/* do nothing if we've decided to shut off growth */
 	if (!hashtable->growEnabled)
@@ -612,51 +615,71 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 */
 	ninmemory = nfreed = 0;
 
-	for (i = 0; i < hashtable->nbuckets; i++)
+	/*
+	 * We will scan through the chunks directly, so we can reset the buckets
+	 * and copy/insert the tuples from scratch. So set all buckets to NULL.
+	 * And throw away the old chunks - we have a pointer in oldchunks, and
+	 * we'll free them once we've copied all the tuples.
+	 */
+	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
+	hashtable->chunks = NULL;
+
+	/* so, let's scan through the old chunks, and all tuples in each chunk */
+	while (oldchunks != NULL)
 	{
-		HashJoinTuple prevtuple;
-		HashJoinTuple tuple;
 
-		prevtuple = NULL;
-		tuple = hashtable->buckets[i];
+		/* pointer to the next memory chunk (may be NULL for the last chunk) */
+		MemoryChunk nextchunk = oldchunks->next;
+
+		/* position within the buffer (up to oldchunks->used) */
+		size_t idx = 0;
 
-		while (tuple != NULL)
+		/* process all tuples stored in this chunk (and then free it) */
+		while (idx < oldchunks->used)
 		{
-			/* save link in case we delete */
-			HashJoinTuple nexttuple = tuple->next;
-			int			bucketno;
-			int			batchno;
+
+			/* get the hashjoin tuple and mintuple stored right after it */
+			HashJoinTuple hashTuple = (HashJoinTuple)(oldchunks->data + idx);
+			MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+
+			int		hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+			int		bucketno;
+			int		batchno;
 
 			ninmemory++;
-			ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
 									  &bucketno, &batchno);
-			Assert(bucketno == i);
+
 			if (batchno == curbatch)
 			{
-				/* keep tuple */
-				prevtuple = tuple;
+				/* keep tuple - this means we need to copy it into the new chunks */
+				HashJoinTuple copyTuple = (HashJoinTuple) dense_alloc(hashtable, hashTupleSize);
+				memcpy(copyTuple, hashTuple, hashTupleSize);
+
+				/* course add it to the apropriate bucket */
+				copyTuple->next = hashtable->buckets[bucketno];
+				hashtable->buckets[bucketno] = copyTuple;
 			}
 			else
 			{
 				/* dump it out */
 				Assert(batchno > curbatch);
-				ExecHashJoinSaveTuple(HJTUPLE_MINTUPLE(tuple),
-									  tuple->hashvalue,
+				ExecHashJoinSaveTuple(HJTUPLE_MINTUPLE(hashTuple),
+									  hashTuple->hashvalue,
 									  &hashtable->innerBatchFile[batchno]);
-				/* and remove from hash table */
-				if (prevtuple)
-					prevtuple->next = nexttuple;
-				else
-					hashtable->buckets[i] = nexttuple;
-				/* prevtuple doesn't change */
-				hashtable->spaceUsed -=
-					HJTUPLE_OVERHEAD + HJTUPLE_MINTUPLE(tuple)->t_len;
-				pfree(tuple);
+
+				hashtable->spaceUsed -= hashTupleSize;
 				nfreed++;
 			}
 
-			tuple = nexttuple;
+			/* bytes occupied in memory HJ tuple overhead + actual tuple length */
+			idx += MAXALIGN(hashTupleSize);
 		}
+
+		/* so, we're done with this chunk - free it and proceed to the next one */
+		pfree(oldchunks);
+		oldchunks = nextchunk;
 	}
 
 #ifdef HJDEBUG
@@ -717,8 +740,8 @@ ExecHashTableInsert(HashJoinTable hashtable,
 
 		/* Create the HashJoinTuple */
 		hashTupleSize = HJTUPLE_OVERHEAD + tuple->t_len;
-		hashTuple = (HashJoinTuple) MemoryContextAlloc(hashtable->batchCxt,
-													   hashTupleSize);
+		hashTuple = (HashJoinTuple) dense_alloc(hashtable, hashTupleSize);
+
 		hashTuple->hashvalue = hashvalue;
 		memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
 
@@ -1068,6 +1091,9 @@ ExecHashTableReset(HashJoinTable hashtable)
 	hashtable->spaceUsed = 0;
 
 	MemoryContextSwitchTo(oldcxt);
+
+	/* Forget the chunks (the memory was freed by the context reset above). */
+	hashtable->chunks = NULL;
 }
 
 /*
@@ -1318,6 +1344,69 @@ ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue)
 	return INVALID_SKEW_BUCKET_NO;
 }
 
+static char *
+dense_alloc(HashJoinTable hashtable, int size)
+{
+
+	/* just in case the size is not already aligned properly */
+	size = MAXALIGN(size);
+
+	/* if tuple size is larger than of 1/8 of chunk size, allocate a separate chunk */
+	if (size > HASH_CHUNK_THRESHOLD) {
+
+		/* allocate new chunk and put it at the beginning of the list */
+		MemoryChunk newChunk
+			= (MemoryChunk)MemoryContextAlloc(hashtable->batchCxt,
+											  offsetof(MemoryChunkData, data) + size);
+
+		newChunk->maxlen = size;
+		newChunk->used = 0;
+		newChunk->ntuples = 0;
+
+		/* If there already is a chunk, add if after it so we can still use the space in it */
+		if (hashtable->chunks != NULL) {
+			newChunk->next = hashtable->chunks->next;
+			hashtable->chunks->next = newChunk;
+		} else {
+			newChunk->next = hashtable->chunks;
+			hashtable->chunks = newChunk;
+		}
+
+		newChunk->used += size;
+		newChunk->ntuples += 1;
+
+		return newChunk->data;
+
+	}
+
+	/*
+	 * If it's within the size limit, see if we have enough space for it in the
+	 * current chunk (if there's one). If not, allocate a fresh chunk.
+	 */
+	if ((hashtable->chunks == NULL) || (hashtable->chunks->maxlen - hashtable->chunks->used) < size) {
+
+		/* allocate new chunk and put it at the beginning of the list */
+		MemoryChunk newChunk
+			= (MemoryChunk)MemoryContextAlloc(hashtable->batchCxt,
+											  offsetof(MemoryChunkData, data) + HASH_CHUNK_SIZE);
+
+		newChunk->maxlen = HASH_CHUNK_SIZE;
+		newChunk->used = 0;
+		newChunk->ntuples = 0;
+
+		newChunk->next = hashtable->chunks;
+		hashtable->chunks = newChunk;
+	}
+
+	/* OK, we have enough space in the chunk, let's add the tuple */
+	hashtable->chunks->used += size;
+	hashtable->chunks->ntuples += 1;
+
+	/* allocate pointer to the start of the tuple memory */
+	return hashtable->chunks->data + (hashtable->chunks->used - size);
+}
+
+
 /*
  * ExecHashSkewTableInsert
  *
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 3beae40..166bb7a 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -102,6 +102,23 @@ typedef struct HashSkewBucket
 #define SKEW_WORK_MEM_PERCENT  2
 #define SKEW_MIN_OUTER_FRACTION  0.01
 
+/* used for dense allocation of tuples */
+#define HASH_CHUNK_SIZE			(32*1024L)
+#define HASH_CHUNK_THRESHOLD	(HASH_CHUNK_SIZE/4)
+
+typedef struct MemoryChunkData
+{
+	int			ntuples;	/* number of tuples stored in this chunk */
+	size_t		maxlen;		/* length of the buffer */
+	size_t		used;		/* number of chunk bytes already used */
+
+	struct MemoryChunkData   *next;	/* pointer to the next chunk (linked list) */
+
+	char		data[1];	/* buffer allocated at the end */
+} MemoryChunkData;
+
+typedef struct MemoryChunkData* MemoryChunk;
+
 
 typedef struct HashJoinTableData
 {
@@ -157,6 +174,9 @@ typedef struct HashJoinTableData
 
 	MemoryContext hashCxt;		/* context for whole-hash-join storage */
 	MemoryContext batchCxt;		/* context for this-batch-only storage */
+
+	/* used for dense allocation of tuples (into linked chunks) */
+	MemoryChunk	chunks;			/*  one list for the whole batch */
 }	HashJoinTableData;
 
 #endif   /* HASHJOIN_H */

hashjoin-nbuckets-v12.patchtext/x-diff; name=hashjoin-nbuckets-v12.patchDownload

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..d128e08 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,21 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) ||
+				 (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index d5acfb1..a76e878 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -39,6 +39,7 @@
 
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -49,6 +50,9 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 static char * dense_alloc(HashJoinTable hashtable, int tupleSize);
 
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(HashJoinTuple))
+
 /* ----------------------------------------------------------------
  *		ExecHash
  *
@@ -117,6 +121,7 @@ MultiExecHash(HashState *node)
 				/* It's a skew tuple, so put it into that hash table */
 				ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 										bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -127,6 +132,25 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
+	if (hashtable->nbuckets != hashtable->nbuckets_optimal)
+	{
+		/* We never decrease the number of buckets. */
+		Assert(hashtable->nbuckets_optimal > hashtable->nbuckets);
+
+#ifdef HJDEBUG
+		printf("Increasing nbuckets %d => %d\n",
+			   hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+
+		ExecHashIncreaseNumBuckets(hashtable);
+	}
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += BUCKETS_SPACE(hashtable);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+			hashtable->spacePeak = hashtable->spaceUsed;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -272,7 +296,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
+	hashtable->nbuckets_optimal = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
+	hashtable->log2_nbuckets_optimal = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
 	hashtable->skewEnabled = false;
@@ -286,6 +313,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -387,7 +415,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
  */
 
 /* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET			10
+#define NTUP_PER_BUCKET			1
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
@@ -397,6 +425,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 {
 	int			tupsize;
 	double		inner_rel_bytes;
+	long		buckets_bytes;
 	long		hash_table_bytes;
 	long		skew_table_bytes;
 	long		max_pointers;
@@ -419,6 +448,16 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	inner_rel_bytes = ntuples * tupsize;
 
 	/*
+	 * Compute memory occupied by buckets, assuming all tuples fit into
+	 * a single batch (consider NTUP_PER_BUCKET tuples per bucket) - buckets
+	 * are just 'HashJoinTuple' elements (pointers to HashJoinTupleData).
+	 * Also, we never use less than 1024 buckets.
+	 */
+	nbuckets = (1 << my_log2(ntuples / NTUP_PER_BUCKET));
+	nbuckets = Max(1024, nbuckets);
+	buckets_bytes = sizeof(HashJoinTuple) * nbuckets;
+
+	/*
 	 * Target in-memory hashtable size is work_mem kilobytes.
 	 */
 	hash_table_bytes = work_mem * 1024L;
@@ -469,16 +508,13 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/* also ensure we avoid integer overflow in nbatch and nbuckets */
 	max_pointers = Min(max_pointers, INT_MAX / 2);
 
-	if (inner_rel_bytes > hash_table_bytes)
+	if ((inner_rel_bytes + buckets_bytes) > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
 		long		lbuckets;
 		double		dbatch;
 		int			minbatch;
-
-		lbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
-		lbuckets = Min(lbuckets, max_pointers);
-		nbuckets = (int) lbuckets;
+		long		bucket_size;
 
 		dbatch = ceil(inner_rel_bytes / hash_table_bytes);
 		dbatch = Min(dbatch, max_pointers);
@@ -486,6 +522,54 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch = 2;
 		while (nbatch < minbatch)
 			nbatch <<= 1;
+
+		/* memory needed by a single "full" bucket (including tuples) */
+		bucket_size = (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple));
+
+		/*
+		 * When batching, we use buckets size for full work_mem. We simply
+		 * divide work_mem by memory needed per 'full bucket' (a pointer and
+		 * NTUP_PER_BUCKET tuples, each 'tupsize' bytes, including additional
+		 * overhead for hash, pointer to the next tuple etc.).
+		 */
+		lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
+
+		/* protect against nbucket overflow */
+		lbuckets = Min(lbuckets, max_pointers);
+		nbuckets = (int) lbuckets;
+
+		/* Compute memory needed for buckets. */
+		buckets_bytes = nbuckets * sizeof(HashJoinTuple);
+
+		/*
+		 * Buckets are simple pointers to hashjoin tuples, while tupsize is
+		 * always >= 24B, plus actual data. So buckets should never really
+		 * exceed 25% of work_mem (even for NTUP_PER_BUCKET=1). Except maybe
+		 * for work_mem values that are not 2^N bytes, where we might get more
+		 * because of doubling. So let's look for 50% here.
+		 */
+		Assert(buckets_bytes <= hash_table_bytes/2);
+
+		/*
+		 * Subtract the buckets from work_mem, so we know how much memory
+		 * remains for the actual tuples.
+		 */
+		hash_table_bytes -= buckets_bytes;
+
+		/*
+		 * Increase the nbatch until we get both tuples and buckets into work_mem.
+		 *
+		 * The loop should not execute more than once in most cases, becase tuples are
+		 * usually much wider than buckets (usually 8B pointers), so by using only
+		 * (batch_bytes/2) should get us below work_mem.
+		 *
+		 * The worst case is that (nbuckets == 2*ntuples-1), giving us about twice the
+		 * number of buckets, i.e. about 2*sizeof(void*) per tuple. But that's
+		 * the consequence of how NTUP_PER_BUCKET is chosen, and work_mem limit.
+		 */
+		while ((inner_rel_bytes / nbatch) > hash_table_bytes)
+			nbatch <<= 1;
+
 	}
 	else
 	{
@@ -620,7 +704,23 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 * and copy/insert the tuples from scratch. So set all buckets to NULL.
 	 * And throw away the old chunks - we have a pointer in oldchunks, and
 	 * we'll free them once we've copied all the tuples.
+	 *
+	 * If we need to do a resize of buckets, we can do it while rebatching.
 	 */
+
+	if (hashtable->nbuckets_optimal != hashtable->nbuckets)
+	{
+		/* we never decrease the number of buckets */
+		Assert(hashtable->nbuckets_optimal > hashtable->nbuckets);
+
+		hashtable->nbuckets = hashtable->nbuckets_optimal;
+		hashtable->log2_nbuckets = hashtable->log2_nbuckets_optimal;
+
+		hashtable->buckets = repalloc(hashtable->buckets,
+									  sizeof(HashJoinTuple) * hashtable->nbuckets);
+	}
+
+	/* zero the buckets, start allocating into new chunks */
 	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
 	hashtable->chunks = NULL;
 
@@ -705,6 +805,93 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	MemoryChunk	chunk;
+
+	/* do nothing if not an increase (it's called increase for a reason) */
+	if (hashtable->nbuckets >= hashtable->nbuckets_optimal)
+		return;
+
+	/*
+	 * We already know the optimal number of buckets, so let's just
+	 * compute the log2_nbuckets for it.
+	 */
+	hashtable->nbuckets = hashtable->nbuckets_optimal;
+	hashtable->log2_nbuckets = my_log2(hashtable->nbuckets_optimal);
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets <= (INT_MAX/2));
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/*
+	 * Just reallocate the proper number of buckets - we don't need to
+	 * walk through them - we can walk the dense-allocated chunks
+	 * (just like in ExecHashIncreaseNumBatches, but without all the
+	 * copying into new chunks)
+	 */
+
+	hashtable->buckets = (HashJoinTuple *) repalloc(hashtable->buckets,
+								hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
+
+	/* scan through all tuples in all chunks, rebuild the hash table */
+	chunk = hashtable->chunks;
+	while (chunk != NULL)
+	{
+
+		/* position within the buffer (up to chunk->used) */
+		size_t idx = 0;
+
+		/* process all tuples stored in this chunk (and then free it) */
+		while (idx < chunk->used)
+		{
+
+			/* get the hashjoin tuple and mintuple stored right after it */
+			HashJoinTuple hashTuple = (HashJoinTuple)(chunk->data + idx);
+			MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+
+			int		hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+			int		bucketno;
+			int		batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* just add the tuple to the proper bucket */
+			hashTuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = hashTuple;
+
+			/* bytes occupied in memory HJ tuple overhead + actual tuple length */
+			idx += MAXALIGN(hashTupleSize);
+
+		}
+
+		/* proceed to the next chunk */
+		chunk = chunk->next;
+
+	}
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, batchTuples / hashtable->nbuckets);
+#endif
+
+}
+
+
+/*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
  *		it may just go to a temp file for later batches
@@ -737,6 +924,7 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		 */
 		HashJoinTuple hashTuple;
 		int			hashTupleSize;
+		double		ntuples = (hashtable->totalTuples - hashtable->skewTuples);
 
 		/* Create the HashJoinTuple */
 		hashTupleSize = HJTUPLE_OVERHEAD + tuple->t_len;
@@ -757,12 +945,30 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashTuple->next = hashtable->buckets[bucketno];
 		hashtable->buckets[bucketno] = hashTuple;
 
+		/*
+		 * Increase the (optimal) number of buckets if we just exceeded NTUP_PER_BUCKET,
+		 * but only when there's still a single batch at this moment.
+		 */
+		if ((hashtable->nbatch == 1) &&
+			(hashtable->nbuckets_optimal <= INT_MAX/2) &&	/* overflow protection */
+			(ntuples >= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET)))
+		{
+			hashtable->nbuckets_optimal *= 2;
+			hashtable->log2_nbuckets_optimal += 1;
+		}
+
 		/* Account for space used, and back off if we've used too much */
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
-		if (hashtable->spaceUsed > hashtable->spaceAllowed)
+
+		/*
+		 * Do we need to increase number of batches? Account for space required
+		 * by buckets (optimal number).
+		 */
+		if (hashtable->spaceUsed + BUCKETS_SPACE(hashtable) > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
+
 	}
 	else
 	{
@@ -885,7 +1091,10 @@ ExecHashGetHashValue(HashJoinTable hashtable,
  * functions are good about randomizing all their output bits, else we are
  * likely to have very skewed bucket or batch occupancy.)
  *
- * nbuckets doesn't change over the course of the join.
+ * The nbuckets/log2_nbuckets may change while (nbatch==1) because of dynamic
+ * buckets growth. Once we start batching, the value is fixed and does not
+ * change over the course of join (making it possible to compute batch number
+ * the way we do here).
  *
  * nbatch is always a power of 2; we increase it only by doubling it.  This
  * effectively adds one more bit to the top of the batchno.
@@ -1351,7 +1560,7 @@ dense_alloc(HashJoinTable hashtable, int size)
 	/* just in case the size is not already aligned properly */
 	size = MAXALIGN(size);
 
-	/* if tuple size is larger than of 1/8 of chunk size, allocate a separate chunk */
+	/* if tuple size is larger than of 1/4 of chunk size, allocate a separate chunk */
 	if (size > HASH_CHUNK_THRESHOLD) {
 
 		/* allocate new chunk and put it at the beginning of the list */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 166bb7a..f0775ea 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -125,6 +125,10 @@ typedef struct HashJoinTableData
 	int			nbuckets;		/* # buckets in the in-memory hash table */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
+	int			nbuckets_original;	/* # buckets when starting the first hash */
+	int			nbuckets_optimal;	/* optimal # buckets (per batch) */
+	int			log2_nbuckets_optimal;	/* same as log2_nbuckets optimal */
+
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
 	struct HashJoinTupleData **buckets;
 	/* buckets array is per-batch storage, as are all the tuples */
@@ -146,6 +150,7 @@ typedef struct HashJoinTableData
 	bool		growEnabled;	/* flag to shut off nbatch increases */
 
 	double		totalTuples;	/* # tuples obtained from inner plan */
+	double		skewTuples;		/* # tuples inserted into skew tuples */
 
 	/*
 	 * These arrays are allocated for the life of the hash join, but only if

#24

Heikki Linnakangas

hlinnakangas@vmware.com

over 11 years ago

In reply to: Tomas Vondra (#23)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 09/10/2014 01:49 AM, Tomas Vondra wrote:

On 9.9.2014 16:09, Robert Haas wrote:

On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

So I only posted the separate patch for those who want to do a review,
and then a "complete patch" with both parts combined. But it sure may be
a bit confusing.

Let's do this: post each new version of the patches only on this
thread, as a series of patches that can be applied one after another.

OK, attached. Apply in this order

1) dense-alloc-v5.patch
2) hashjoin-nbuckets-v12.patch

The dense-alloc-v5.patch looks good to me. I have committed that with
minor cleanup (more comments below). I have not looked at the second patch.

I also did a few 'minor' changes to the dense allocation patch, most
notably:

* renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData
The original naming seemed a bit awkward.

That's too easy to confuse with regular MemoryContext and AllocChunk
stuff. I renamed it to HashMemoryChunk.

* the chunks size is 32kB (instead of 16kB), and we're using 1/4
threshold for 'oversized' items

We need the threshold to be >=8kB, to trigger the special case
within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.

Should we care about the fact that if there are only a few tuples, we
will nevertheless waste 32kB of memory for the chunk? I guess not, but I
thought I'd mention it. The smallest allowed value for work_mem is 64kB.

I also considered Heikki's suggestion that while rebatching, we can
reuse chunks with a single large tuple, instead of copying it
needlessly. My attempt however made the code much uglier (I almost used
a GOTO, but my hands rejected to type it ...). I'll look into that.

I tried constructing a test case where the rehashing time would be
significant enough for this to matter, but I couldn't find one. Even if
the original estimate on the number of batches is way too small, we ramp
up quickly to a reasonable size and the rehashing time is completely
insignificant compared to all the other work. So I think we can drop
that idea.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#25

robertmhaas@gmail.com

over 11 years ago

In reply to: Heikki Linnakangas (#24)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

The dense-alloc-v5.patch looks good to me. I have committed that with minor
cleanup (more comments below). I have not looked at the second patch.

Gah. I was in the middle of doing this. Sigh.

* the chunks size is 32kB (instead of 16kB), and we're using 1/4
threshold for 'oversized' items

We need the threshold to be >=8kB, to trigger the special case
within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.

Should we care about the fact that if there are only a few tuples, we will
nevertheless waste 32kB of memory for the chunk? I guess not, but I thought
I'd mention it. The smallest allowed value for work_mem is 64kB.

I think we should change the threshold here to 1/8th. The worst case
memory wastage as-is ~32k/5 > 6k.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#26

Heikki Linnakangas

hlinnakangas@vmware.com

over 11 years ago

In reply to: Robert Haas (#25)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 09/10/2014 09:31 PM, Robert Haas wrote:

On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

The dense-alloc-v5.patch looks good to me. I have committed that with minor
cleanup (more comments below). I have not looked at the second patch.

Gah. I was in the middle of doing this. Sigh.

Sorry.

* the chunks size is 32kB (instead of 16kB), and we're using 1/4
threshold for 'oversized' items

We need the threshold to be >=8kB, to trigger the special case
within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.

Should we care about the fact that if there are only a few tuples, we will
nevertheless waste 32kB of memory for the chunk? I guess not, but I thought
I'd mention it. The smallest allowed value for work_mem is 64kB.

I think we should change the threshold here to 1/8th. The worst case
memory wastage as-is ~32k/5 > 6k.

Not sure how you arrived at that number. The worst case is that each 32k
chunk is filled up to 24k+1 bytes, i.e. 8k-1 bytes is wasted in each
chunk. That's 25% wastage. That's not too bad IMHO - the worst case
waste of AllocSet is 50%.

But if you want to twiddle it, feel free. Note that there's some
interplay with allocset code, like Tomas mentioned. If the threshold is
< 8k, palloc() will round up request sizes smaller than 8k anyway. That
wastes some memory, nothing more serious than that, but it seems like a
good idea to avoid it.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#27

tv@fuzzy.cz

over 11 years ago

In reply to: Heikki Linnakangas (#24)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 10.9.2014 20:25, Heikki Linnakangas wrote:

On 09/10/2014 01:49 AM, Tomas Vondra wrote:

I also did a few 'minor' changes to the dense allocation patch, most
notably:

* renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData
The original naming seemed a bit awkward.

That's too easy to confuse with regular MemoryContext and AllocChunk
stuff. I renamed it to HashMemoryChunk.

OK.

* the chunks size is 32kB (instead of 16kB), and we're using 1/4
threshold for 'oversized' items

We need the threshold to be >=8kB, to trigger the special case
within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.

Should we care about the fact that if there are only a few tuples, we
will nevertheless waste 32kB of memory for the chunk? I guess not, but I
thought I'd mention it. The smallest allowed value for work_mem is 64kB.

I don't think that's a problem.

I also considered Heikki's suggestion that while rebatching, we can
reuse chunks with a single large tuple, instead of copying it
needlessly. My attempt however made the code much uglier (I almost used
a GOTO, but my hands rejected to type it ...). I'll look into that.

I tried constructing a test case where the rehashing time would be
significant enough for this to matter, but I couldn't find one. Even
if the original estimate on the number of batches is way too small,
we ramp up quickly to a reasonable size and the rehashing time is
completely insignificant compared to all the other work. So I think
we can drop that idea.

I don't think that had anything to do with rehashing. What you pointed
out is that when rebatching, we have to keep ~50% of the tuples, which
means 'copy into a new chunk, then throw away the old ones'.

For large tuples (you mentioned 100MB tuples, IIRC), we needlessly
allocate this amount of memory only to copy the single tuple and then
throw away the old tuple. So (a) that's an additional memcpy, and (b) it
allocates additional 100MB of memory for a short period of time.

Now, I'd guess when dealing with tuples this large, there will be more
serious trouble elsewhere, but I don't want to neglect it. Maybe it's
worth optimizing?

Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#28

tv@fuzzy.cz

over 11 years ago

In reply to: Robert Haas (#25)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 10.9.2014 20:31, Robert Haas wrote:

On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

The dense-alloc-v5.patch looks good to me. I have committed that with minor
cleanup (more comments below). I have not looked at the second patch.

Gah. I was in the middle of doing this. Sigh.

* the chunks size is 32kB (instead of 16kB), and we're using 1/4
threshold for 'oversized' items

We need the threshold to be >=8kB, to trigger the special case
within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.

Should we care about the fact that if there are only a few tuples, we will
nevertheless waste 32kB of memory for the chunk? I guess not, but I thought
I'd mention it. The smallest allowed value for work_mem is 64kB.

I think we should change the threshold here to 1/8th. The worst case
memory wastage as-is ~32k/5 > 6k.

So you'd lower the threshold to 4kB? That may lower the wastage in the
chunks, but palloc will actually allocate 8kB anyway, wasting up to
additional 4kB. So I don't see how lowering the threshold to 1/8th
improves the situation ...

Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#29

robertmhaas@gmail.com

over 11 years ago

In reply to: Tomas Vondra (#28)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On Wed, Sep 10, 2014 at 3:02 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

On 10.9.2014 20:31, Robert Haas wrote:

On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

The dense-alloc-v5.patch looks good to me. I have committed that with minor
cleanup (more comments below). I have not looked at the second patch.

Gah. I was in the middle of doing this. Sigh.

* the chunks size is 32kB (instead of 16kB), and we're using 1/4
threshold for 'oversized' items

We need the threshold to be >=8kB, to trigger the special case
within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.

Should we care about the fact that if there are only a few tuples, we will
nevertheless waste 32kB of memory for the chunk? I guess not, but I thought
I'd mention it. The smallest allowed value for work_mem is 64kB.

I think we should change the threshold here to 1/8th. The worst case
memory wastage as-is ~32k/5 > 6k.

So you'd lower the threshold to 4kB? That may lower the wastage in the
chunks, but palloc will actually allocate 8kB anyway, wasting up to
additional 4kB. So I don't see how lowering the threshold to 1/8th
improves the situation ...

Ah, OK. Well, never mind then. :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#30

tv@fuzzy.cz

over 11 years ago

In reply to: Heikki Linnakangas (#26)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 10.9.2014 20:55, Heikki Linnakangas wrote:

On 09/10/2014 09:31 PM, Robert Haas wrote:

* the chunks size is 32kB (instead of 16kB), and we're using 1/4
threshold for 'oversized' items

We need the threshold to be >=8kB, to trigger the special case
within AllocSet. The 1/4 rule is consistent with
ALLOC_CHUNK_FRACTION.

Should we care about the fact that if there are only a few tuples, we
will
nevertheless waste 32kB of memory for the chunk? I guess not, but I
thought
I'd mention it. The smallest allowed value for work_mem is 64kB.

I think we should change the threshold here to 1/8th. The worst case
memory wastage as-is ~32k/5 > 6k.

Not sure how you arrived at that number. The worst case is that each 32k
chunk is filled up to 24k+1 bytes, i.e. 8k-1 bytes is wasted in each
chunk. That's 25% wastage. That's not too bad IMHO - the worst case
waste of AllocSet is 50%.

But if you want to twiddle it, feel free. Note that there's some
interplay with allocset code, like Tomas mentioned. If the threshold is
< 8k, palloc() will round up request sizes smaller than 8k anyway. That
wastes some memory, nothing more serious than that, but it seems like a
good idea to avoid it.

Yes, and pfree then keeps these blocks, which means a chance of keeping
memory that won't be reused. That's why I chose the 8kB threshold. So
I'd keep the 32kB/8kB, as proposed in the patch.

But I don't see this as a big issue - my experience is that either we
have very much smaller than 4kB, or much larger tuples than 8kB. And
even for the tuples between 4kB and 8kB, the waste is not that bad.

regards
Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#31

tv@fuzzy.cz

over 11 years ago

In reply to: Heikki Linnakangas (#24)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 10.9.2014 20:25, Heikki Linnakangas wrote:

On 09/10/2014 01:49 AM, Tomas Vondra wrote:

I also did a few 'minor' changes to the dense allocation patch, most
notably:

* renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData
The original naming seemed a bit awkward.

That's too easy to confuse with regular MemoryContext and AllocChunk
stuff. I renamed it to HashMemoryChunk.

BTW this breaks the second patch, which is allocating new chunks when
resizing the hash table. Should I rebase the patch, or can the commiter
do s/MemoryChunk/HashMemoryChunk/ ?

Assuming there are no more fixes needed, of course.

Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#32

robertmhaas@gmail.com

over 11 years ago

In reply to: Tomas Vondra (#31)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On Wed, Sep 10, 2014 at 3:12 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

On 10.9.2014 20:25, Heikki Linnakangas wrote:

On 09/10/2014 01:49 AM, Tomas Vondra wrote:

I also did a few 'minor' changes to the dense allocation patch, most
notably:

* renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData
The original naming seemed a bit awkward.

That's too easy to confuse with regular MemoryContext and AllocChunk
stuff. I renamed it to HashMemoryChunk.

BTW this breaks the second patch, which is allocating new chunks when
resizing the hash table. Should I rebase the patch, or can the commiter
do s/MemoryChunk/HashMemoryChunk/ ?

Assuming there are no more fixes needed, of course.

Rebasing it will save the committer time, which will increase the
chances that one will look at your patch. So it's highly recommended.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#33

tv@fuzzy.cz

over 11 years ago

In reply to: Robert Haas (#32)

1 attachment(s)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 10.9.2014 21:34, Robert Haas wrote:

On Wed, Sep 10, 2014 at 3:12 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

On 10.9.2014 20:25, Heikki Linnakangas wrote:

On 09/10/2014 01:49 AM, Tomas Vondra wrote:

I also did a few 'minor' changes to the dense allocation patch, most
notably:

* renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData
The original naming seemed a bit awkward.

That's too easy to confuse with regular MemoryContext and AllocChunk
stuff. I renamed it to HashMemoryChunk.

BTW this breaks the second patch, which is allocating new chunks when
resizing the hash table. Should I rebase the patch, or can the commiter
do s/MemoryChunk/HashMemoryChunk/ ?

Assuming there are no more fixes needed, of course.

Rebasing it will save the committer time, which will increase the
chances that one will look at your patch. So it's highly recommended.

OK. So here's v13 of the patch, reflecting this change.

regards
Tomas

Attachments:

hashjoin-nbuckets-v13.patchtext/x-diff; name=hashjoin-nbuckets-v13.patchDownload

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..d128e08 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,21 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) ||
+				 (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 3ef7cfb..4f00426 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -39,6 +39,7 @@
 
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -49,6 +50,9 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 static void *dense_alloc(HashJoinTable hashtable, Size size);
 
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(HashJoinTuple))
+
 /* ----------------------------------------------------------------
  *		ExecHash
  *
@@ -117,6 +121,7 @@ MultiExecHash(HashState *node)
 				/* It's a skew tuple, so put it into that hash table */
 				ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 										bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -127,6 +132,25 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
+	if (hashtable->nbuckets != hashtable->nbuckets_optimal)
+	{
+		/* We never decrease the number of buckets. */
+		Assert(hashtable->nbuckets_optimal > hashtable->nbuckets);
+
+#ifdef HJDEBUG
+		printf("Increasing nbuckets %d => %d\n",
+			   hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+
+		ExecHashIncreaseNumBuckets(hashtable);
+	}
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += BUCKETS_SPACE(hashtable);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+			hashtable->spacePeak = hashtable->spaceUsed;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -272,7 +296,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
+	hashtable->nbuckets_optimal = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
+	hashtable->log2_nbuckets_optimal = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
 	hashtable->skewEnabled = false;
@@ -286,6 +313,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -386,7 +414,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
  */
 
 /* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET			10
+#define NTUP_PER_BUCKET			1
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
@@ -396,6 +424,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 {
 	int			tupsize;
 	double		inner_rel_bytes;
+	long		buckets_bytes;
 	long		hash_table_bytes;
 	long		skew_table_bytes;
 	long		max_pointers;
@@ -418,6 +447,16 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	inner_rel_bytes = ntuples * tupsize;
 
 	/*
+	 * Compute memory occupied by buckets, assuming all tuples fit into
+	 * a single batch (consider NTUP_PER_BUCKET tuples per bucket) - buckets
+	 * are just 'HashJoinTuple' elements (pointers to HashJoinTupleData).
+	 * Also, we never use less than 1024 buckets.
+	 */
+	nbuckets = (1 << my_log2(ntuples / NTUP_PER_BUCKET));
+	nbuckets = Max(1024, nbuckets);
+	buckets_bytes = sizeof(HashJoinTuple) * nbuckets;
+
+	/*
 	 * Target in-memory hashtable size is work_mem kilobytes.
 	 */
 	hash_table_bytes = work_mem * 1024L;
@@ -468,16 +507,13 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/* also ensure we avoid integer overflow in nbatch and nbuckets */
 	max_pointers = Min(max_pointers, INT_MAX / 2);
 
-	if (inner_rel_bytes > hash_table_bytes)
+	if ((inner_rel_bytes + buckets_bytes) > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
 		long		lbuckets;
 		double		dbatch;
 		int			minbatch;
-
-		lbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
-		lbuckets = Min(lbuckets, max_pointers);
-		nbuckets = (int) lbuckets;
+		long		bucket_size;
 
 		dbatch = ceil(inner_rel_bytes / hash_table_bytes);
 		dbatch = Min(dbatch, max_pointers);
@@ -485,6 +521,54 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch = 2;
 		while (nbatch < minbatch)
 			nbatch <<= 1;
+
+		/* memory needed by a single "full" bucket (including tuples) */
+		bucket_size = (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple));
+
+		/*
+		 * When batching, we use buckets size for full work_mem. We simply
+		 * divide work_mem by memory needed per 'full bucket' (a pointer and
+		 * NTUP_PER_BUCKET tuples, each 'tupsize' bytes, including additional
+		 * overhead for hash, pointer to the next tuple etc.).
+		 */
+		lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
+
+		/* protect against nbucket overflow */
+		lbuckets = Min(lbuckets, max_pointers);
+		nbuckets = (int) lbuckets;
+
+		/* Compute memory needed for buckets. */
+		buckets_bytes = nbuckets * sizeof(HashJoinTuple);
+
+		/*
+		 * Buckets are simple pointers to hashjoin tuples, while tupsize is
+		 * always >= 24B, plus actual data. So buckets should never really
+		 * exceed 25% of work_mem (even for NTUP_PER_BUCKET=1). Except maybe
+		 * for work_mem values that are not 2^N bytes, where we might get more
+		 * because of doubling. So let's look for 50% here.
+		 */
+		Assert(buckets_bytes <= hash_table_bytes/2);
+
+		/*
+		 * Subtract the buckets from work_mem, so we know how much memory
+		 * remains for the actual tuples.
+		 */
+		hash_table_bytes -= buckets_bytes;
+
+		/*
+		 * Increase the nbatch until we get both tuples and buckets into work_mem.
+		 *
+		 * The loop should not execute more than once in most cases, becase tuples are
+		 * usually much wider than buckets (usually 8B pointers), so by using only
+		 * (batch_bytes/2) should get us below work_mem.
+		 *
+		 * The worst case is that (nbuckets == 2*ntuples-1), giving us about twice the
+		 * number of buckets, i.e. about 2*sizeof(void*) per tuple. But that's
+		 * the consequence of how NTUP_PER_BUCKET is chosen, and work_mem limit.
+		 */
+		while ((inner_rel_bytes / nbatch) > hash_table_bytes)
+			nbatch <<= 1;
+
 	}
 	else
 	{
@@ -614,6 +698,19 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 */
 	ninmemory = nfreed = 0;
 
+	/* If know we need to resize nbuckets, we can do it while rebatching. */
+	if (hashtable->nbuckets_optimal != hashtable->nbuckets)
+	{
+		/* we never decrease the number of buckets */
+		Assert(hashtable->nbuckets_optimal > hashtable->nbuckets);
+
+		hashtable->nbuckets = hashtable->nbuckets_optimal;
+		hashtable->log2_nbuckets = hashtable->log2_nbuckets_optimal;
+
+		hashtable->buckets = repalloc(hashtable->buckets,
+									  sizeof(HashJoinTuple) * hashtable->nbuckets);
+	}
+
 	/*
 	 * We will scan through the chunks directly, so that we can reset the
 	 * buckets now and not have to keep track which tuples in the buckets have
@@ -698,6 +795,93 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	HashMemoryChunk	chunk;
+
+	/* do nothing if not an increase (it's called increase for a reason) */
+	if (hashtable->nbuckets >= hashtable->nbuckets_optimal)
+		return;
+
+	/*
+	 * We already know the optimal number of buckets, so let's just
+	 * compute the log2_nbuckets for it.
+	 */
+	hashtable->nbuckets = hashtable->nbuckets_optimal;
+	hashtable->log2_nbuckets = my_log2(hashtable->nbuckets_optimal);
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets <= (INT_MAX/2));
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/*
+	 * Just reallocate the proper number of buckets - we don't need to
+	 * walk through them - we can walk the dense-allocated chunks
+	 * (just like in ExecHashIncreaseNumBatches, but without all the
+	 * copying into new chunks)
+	 */
+
+	hashtable->buckets = (HashJoinTuple *) repalloc(hashtable->buckets,
+								hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
+
+	/* scan through all tuples in all chunks, rebuild the hash table */
+	chunk = hashtable->chunks;
+	while (chunk != NULL)
+	{
+
+		/* position within the buffer (up to chunk->used) */
+		size_t idx = 0;
+
+		/* process all tuples stored in this chunk (and then free it) */
+		while (idx < chunk->used)
+		{
+
+			/* get the hashjoin tuple and mintuple stored right after it */
+			HashJoinTuple hashTuple = (HashJoinTuple)(chunk->data + idx);
+			MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+
+			int		hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+			int		bucketno;
+			int		batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* just add the tuple to the proper bucket */
+			hashTuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = hashTuple;
+
+			/* bytes occupied in memory HJ tuple overhead + actual tuple length */
+			idx += MAXALIGN(hashTupleSize);
+
+		}
+
+		/* proceed to the next chunk */
+		chunk = chunk->next;
+
+	}
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, batchTuples / hashtable->nbuckets);
+#endif
+
+}
+
+
+/*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
  *		it may just go to a temp file for later batches
@@ -730,6 +914,7 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		 */
 		HashJoinTuple hashTuple;
 		int			hashTupleSize;
+		double		ntuples = (hashtable->totalTuples - hashtable->skewTuples);
 
 		/* Create the HashJoinTuple */
 		hashTupleSize = HJTUPLE_OVERHEAD + tuple->t_len;
@@ -750,12 +935,30 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashTuple->next = hashtable->buckets[bucketno];
 		hashtable->buckets[bucketno] = hashTuple;
 
+		/*
+		 * Increase the (optimal) number of buckets if we just exceeded NTUP_PER_BUCKET,
+		 * but only when there's still a single batch at this moment.
+		 */
+		if ((hashtable->nbatch == 1) &&
+			(hashtable->nbuckets_optimal <= INT_MAX/2) &&	/* overflow protection */
+			(ntuples >= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET)))
+		{
+			hashtable->nbuckets_optimal *= 2;
+			hashtable->log2_nbuckets_optimal += 1;
+		}
+
 		/* Account for space used, and back off if we've used too much */
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
-		if (hashtable->spaceUsed > hashtable->spaceAllowed)
+
+		/*
+		 * Do we need to increase number of batches? Account for space required
+		 * by buckets (optimal number).
+		 */
+		if (hashtable->spaceUsed + BUCKETS_SPACE(hashtable) > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
+
 	}
 	else
 	{
@@ -878,7 +1081,10 @@ ExecHashGetHashValue(HashJoinTable hashtable,
  * functions are good about randomizing all their output bits, else we are
  * likely to have very skewed bucket or batch occupancy.)
  *
- * nbuckets doesn't change over the course of the join.
+ * The nbuckets/log2_nbuckets may change while (nbatch==1) because of dynamic
+ * buckets growth. Once we start batching, the value is fixed and does not
+ * change over the course of join (making it possible to compute batch number
+ * the way we do here).
  *
  * nbatch is always a power of 2; we increase it only by doubling it.  This
  * effectively adds one more bit to the top of the batchno.
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index c9e61df..0e1e0cd 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -127,6 +127,10 @@ typedef struct HashJoinTableData
 	int			nbuckets;		/* # buckets in the in-memory hash table */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
+	int			nbuckets_original;	/* # buckets when starting the first hash */
+	int			nbuckets_optimal;	/* optimal # buckets (per batch) */
+	int			log2_nbuckets_optimal;	/* same as log2_nbuckets optimal */
+
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
 	struct HashJoinTupleData **buckets;
 	/* buckets array is per-batch storage, as are all the tuples */
@@ -148,6 +152,7 @@ typedef struct HashJoinTableData
 	bool		growEnabled;	/* flag to shut off nbatch increases */
 
 	double		totalTuples;	/* # tuples obtained from inner plan */
+	double		skewTuples;		/* # tuples inserted into skew tuples */
 
 	/*
 	 * These arrays are allocated for the life of the hash join, but only if

#34

robertmhaas@gmail.com

over 11 years ago

In reply to: Tomas Vondra (#33)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On Wed, Sep 10, 2014 at 5:09 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

OK. So here's v13 of the patch, reflecting this change.

With the exception of ExecChooseHashTableSize() and a lot of stylistic
issues along the lines of what I've already complained about, this
patch seems pretty good to me. It does three things:

(1) It changes NTUP_PER_BUCKET to 1. Although this increases memory
utilization, it also improves hash table performance, and the
additional memory we use is less than what we saved as a result of
commit 45f6240a8fa9d35548eb2ef23dba2c11540aa02a.

(2) It changes ExecChooseHashTableSize() to include the effect of the
memory consumed by the bucket array. If we care about properly
respecting work_mem, this is clearly right for any NTUP_PER_BUCKET
setting, but more important for a small setting like 1 than for the
current value of 10. I'm not sure the logic in this function is as
clean and simple as it can be just yet.

(3) It allows the number of batches to increase on the fly while the
hash join is in process. This case arises when we initially estimate
that we only need a small hash table, and then it turns out that there
are more tuples than we expect. Without this code, the hash table's
load factor gets too high and things start to suck.

I recommend separating this patch in two patches, the first covering
items (1) and (2) and the second covering item (3), which really seems
like a separate improvement.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#35

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Robert Haas (#34)

Re: bad estimation together with large work_mem generates terrible slow hash joins

Robert Haas <robertmhaas@gmail.com> writes:

With the exception of ExecChooseHashTableSize() and a lot of stylistic
issues along the lines of what I've already complained about, this
patch seems pretty good to me. It does three things:
...
(3) It allows the number of batches to increase on the fly while the
hash join is in process. This case arises when we initially estimate
that we only need a small hash table, and then it turns out that there
are more tuples than we expect. Without this code, the hash table's
load factor gets too high and things start to suck.

Pardon me for not having read the patch yet, but what part of (3)
wasn't there already?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#36

robertmhaas@gmail.com

over 11 years ago

In reply to: Tom Lane (#35)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

With the exception of ExecChooseHashTableSize() and a lot of stylistic
issues along the lines of what I've already complained about, this
patch seems pretty good to me. It does three things:
...
(3) It allows the number of batches to increase on the fly while the
hash join is in process. This case arises when we initially estimate
that we only need a small hash table, and then it turns out that there
are more tuples than we expect. Without this code, the hash table's
load factor gets too high and things start to suck.

Pardon me for not having read the patch yet, but what part of (3)
wasn't there already?

EINSUFFICIENTCAFFEINE.

It allows the number of BUCKETS to increase, not the number of
batches. As you say, the number of batches could already increase.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#37

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Robert Haas (#36)

Re: bad estimation together with large work_mem generates terrible slow hash joins

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

(3) It allows the number of batches to increase on the fly while the
hash join is in process.

Pardon me for not having read the patch yet, but what part of (3)
wasn't there already?

EINSUFFICIENTCAFFEINE.

It allows the number of BUCKETS to increase, not the number of
batches. As you say, the number of batches could already increase.

Ah. Well, that would mean that we need a heuristic for deciding when to
increase the number of buckets versus the number of batches ... seems
like a difficult decision.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#38

tv@fuzzy.cz

over 11 years ago

In reply to: Robert Haas (#34)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 11 Září 2014, 15:31, Robert Haas wrote:

On Wed, Sep 10, 2014 at 5:09 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

OK. So here's v13 of the patch, reflecting this change.

With the exception of ExecChooseHashTableSize() and a lot of stylistic
issues along the lines of what I've already complained about, this
patch seems pretty good to me. It does three things:

I believe I fixed the stylistic issues you pointed out, except maybe the
bracketing (which seems fine to me, though). So if you could point the
issues out, that'd be helpful.

(1) It changes NTUP_PER_BUCKET to 1. Although this increases memory
utilization, it also improves hash table performance, and the
additional memory we use is less than what we saved as a result of
commit 45f6240a8fa9d35548eb2ef23dba2c11540aa02a.

(2) It changes ExecChooseHashTableSize() to include the effect of the
memory consumed by the bucket array. If we care about properly
respecting work_mem, this is clearly right for any NTUP_PER_BUCKET
setting, but more important for a small setting like 1 than for the
current value of 10. I'm not sure the logic in this function is as
clean and simple as it can be just yet.

I made that as clear as I was able to (based on your feedback), but I'm
"stuck" as I'm familiar with the logic. That is not to say it can't be
improved, but this needs to be reviewed by someone else.

(3) It allows the number of batches to increase on the fly while the
hash join is in process. This case arises when we initially estimate
that we only need a small hash table, and then it turns out that there
are more tuples than we expect. Without this code, the hash table's
load factor gets too high and things start to suck.

I recommend separating this patch in two patches, the first covering
items (1) and (2) and the second covering item (3), which really seems
like a separate improvement.

That probably makes sense.

regards
Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#39

tv@fuzzy.cz

over 11 years ago

In reply to: Tom Lane (#37)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 11 Září 2014, 16:11, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

(3) It allows the number of batches to increase on the fly while the
hash join is in process.

Pardon me for not having read the patch yet, but what part of (3)
wasn't there already?

EINSUFFICIENTCAFFEINE.

It allows the number of BUCKETS to increase, not the number of
batches. As you say, the number of batches could already increase.

Ah. Well, that would mean that we need a heuristic for deciding when to
increase the number of buckets versus the number of batches ... seems
like a difficult decision.

That's true, but that's not the aim of this patch. The patch simply
increases the number of buckets if the load happens to get too high, and
does not try to decide between increasing nbuckets and nbatch.

It's true that we can often get better performance with more batches, but
the cases I was able to inspect were caused either by (a) underestimates
resulting in inappropriate nbucket count, or (b) caching effects. This
patch solves (a), not even trying to fix (b).

regards
Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#40

robertmhaas@gmail.com

over 11 years ago

In reply to: Tom Lane (#37)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On Thu, Sep 11, 2014 at 10:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

(3) It allows the number of batches to increase on the fly while the
hash join is in process.

Pardon me for not having read the patch yet, but what part of (3)
wasn't there already?

EINSUFFICIENTCAFFEINE.

It allows the number of BUCKETS to increase, not the number of
batches. As you say, the number of batches could already increase.

Ah. Well, that would mean that we need a heuristic for deciding when to
increase the number of buckets versus the number of batches ... seems
like a difficult decision.

Well, the patch takes the approach of increasing the number of buckets
when (1) there's only one batch and (2) the load factor exceeds
NTUP_PER_BUCKET. As Tomas points out, it's just increasing the number
of buckets to the exact same number which we would have allocated had
we known that this many tuples would show up.

Now, there is a possibility that we could lose. Let's suppose that
the tuples overflow work_mem, but just barely. If we've accurately
estimate how many tuples there are, the patch changes nothing: either
way we're going to need two batches. But let's say we've
underestimated the number of tuples by 3x. Then it could be that
without the patch we'd allocate fewer buckets, never increase the
number of buckets, and avoid batching; whereas with the patch, we'll
discover that our tuple estimate was wrong, increase the number of
buckets on the fly, and then be forced by the slightly-increased
memory consumption that results from increasing the number of buckets
to do two batches. That would suck.

But catering to that case is basically hoping that we'll fall into a
septic tank and come out smelling like a rose - that is, we're hoping
that our initial poor estimate will cause us to accidentally do the
right thing later. I don't think that's the way to go. It's much
more important to get the case where things are bigger than we
expected but still fit within work_mem right; and we're currently
taking a huge run-time penalty in that case, as Tomas' results show.
If we wanted to cater to the other case in the future, we could
consider the opposite approach: if work_mem is exhahusted, throw the
bucket headers away and keep reading tuples into the dense-packed
chunks added by yesterday's commit. If we again run out of work_mem,
then we *definitely* need to increase the batch count. If we manage
to make everything fit, then we know exactly how many bucket headers
we can afford, and can use some heuristic to decide between that and
using more batches.

I don't think that should be a requirement for this patch, though: I
think the cumulative effect of Tomas's patches is going to be a win
*much* more often than it's a loss. In 100% of cases, the
dense-packing stuff will make a hash table containing the same tuples
significantly smaller. Except when we've radically overestimated the
tuple count, the change to NTUP_PER_BUCKET will mean less time spent
chasing hash chains. It does use more memory, but that's more than
paid for by the first change. Both the dense-packing stuff and the
changes to include bucket headers in the work_mem calculation will
have the effect of making the work_mem bound on memory utilization far
more accurate than it's ever been previously. And the change to
increase the number of buckets at runtime will mean that we're
significantly more resilient against the planner underestimating the
number of tuples. That's clearly good. The fact that we can
construct borderline cases where it loses shouldn't deter us from
pushing forward.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#41

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Tomas Vondra (#39)

Re: bad estimation together with large work_mem generates terrible slow hash joins

"Tomas Vondra" <tv@fuzzy.cz> writes:

On 11 Září 2014, 16:11, Tom Lane wrote:

Ah. Well, that would mean that we need a heuristic for deciding when to
increase the number of buckets versus the number of batches ... seems
like a difficult decision.

That's true, but that's not the aim of this patch. The patch simply
increases the number of buckets if the load happens to get too high, and
does not try to decide between increasing nbuckets and nbatch.

On further thought, increasing nbuckets without changing the batch
boundaries would not get us out of an out-of-work_mem situation, in fact
it makes memory consumption worse not better (assuming you count the
bucket headers towards work_mem ;-)).

So in principle, the rule seems like it ought to go "if load (defined as
max bucket chain length, I imagine?) gets too high, but we are still
well below work_mem, increase nbuckets; else increase nbatch". And
perhaps we reset nbuckets again for the next batch, not sure. If we
are dealing with an out-of-work_mem situation then only increasing nbatch
would be a suitable response.

Because of the risk that increasing nbuckets would itself lead to a
work_mem violation, I don't think it's sane to ignore the interaction
entirely, even in a first patch.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#42

tv@fuzzy.cz

over 11 years ago

In reply to: Tom Lane (#41)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 11 Září 2014, 17:28, Tom Lane wrote:

"Tomas Vondra" <tv@fuzzy.cz> writes:

On 11 Z?????? 2014, 16:11, Tom Lane wrote:

Ah. Well, that would mean that we need a heuristic for deciding when
to
increase the number of buckets versus the number of batches ... seems
like a difficult decision.

That's true, but that's not the aim of this patch. The patch simply
increases the number of buckets if the load happens to get too high, and
does not try to decide between increasing nbuckets and nbatch.

On further thought, increasing nbuckets without changing the batch
boundaries would not get us out of an out-of-work_mem situation, in fact
it makes memory consumption worse not better (assuming you count the
bucket headers towards work_mem ;-)).

Yes, the patch counts the bucket headers towards work_mem, while the
original code didn't. The reasoning is that due to changing
NTUP_PER_BUCKET from 10 to 1, the memory occupied by buckets is not
negligible and should be accounted for.

So in principle, the rule seems like it ought to go "if load (defined as
max bucket chain length, I imagine?) gets too high, but we are still
well below work_mem, increase nbuckets; else increase nbatch". And
perhaps we reset nbuckets again for the next batch, not sure. If we
are dealing with an out-of-work_mem situation then only increasing nbatch
would be a suitable response.

Almost.

If we expect batching at the very beginning, we size nbuckets for "full
work_mem" (see how many tuples we can get into work_mem, while not
breaking NTUP_PER_BUCKET threshold).

If we expect to be fine without batching, we start with the 'right'
nbuckets and track the optimal nbuckets as we go (without actually
resizing the hash table). Once we hit work_mem (considering the optimal
nbuckets value), we keep the value.

Only at the end, we check whether (nbuckets != nbuckets_optimal) and
resize the hash table if needed. Also, we keep this value for all batches
(it's OK because it assumes full work_mem, and it makes the batchno
evaluation trivial). So the resize happens only once and only for the
first batch.

Because of the risk that increasing nbuckets would itself lead to a
work_mem violation, I don't think it's sane to ignore the interaction
entirely, even in a first patch.

This possibility of increasing the number of batches is the consequence of
counting the buckets towards work_mem. I believe that's the right thing to
do here, to make the work_mem guarantee stronger, but it's not something
this patch depends on. If this is the only concern, we can leave this out.

It's also worth mentioning that the dense allocation of tuples makes a
huge difference. palloc easily results in 100% overhead for narrow tuples
(that is, allocating 2x the amount of needed memory). For example over
here [1]/messages/by-id/53BEEA9E.2080009@fuzzy.cz is an example of a hashjoin consuming 1.4GB with work_mem=800MB.

And the dense allocation patch pretty much makes this go away (as
demonstrated in the [1]/messages/by-id/53BEEA9E.2080009@fuzzy.cz). For wider tuples, the difference is smaller, but
that when the buckets are less important.

What I'm trying to say is that it's only a matter of work_mem values.
Currently, because of the weak guarantee, people tend to set the value
lower than necessary. The dense allocation makes this unnecessary,
allowing higher work_mem values, and thus eliminating the issue.

If this is not enough, we can stop counting the buckets towards work_mem.
It'll still be more memory efficient than the old approach (as a pointer
is smaller than palloc overhead), and it won't cause additional batches.
But I don't like this - I think we should make work_mem a stronger
guarantee.

[1]: /messages/by-id/53BEEA9E.2080009@fuzzy.cz

regards
Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#43

tv@fuzzy.cz

over 11 years ago

In reply to: Tomas Vondra (#38)

2 attachment(s)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 11.9.2014 16:33, Tomas Vondra wrote:

On 11 Září 2014, 15:31, Robert Haas wrote:

On Wed, Sep 10, 2014 at 5:09 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

OK. So here's v13 of the patch, reflecting this change.

[...] It does three things:

(1) It changes NTUP_PER_BUCKET to 1. Although this increases memory
utilization, it also improves hash table performance, and the
additional memory we use is less than what we saved as a result of
commit 45f6240a8fa9d35548eb2ef23dba2c11540aa02a.

(2) It changes ExecChooseHashTableSize() to include the effect of the
memory consumed by the bucket array. If we care about properly
respecting work_mem, this is clearly right for any NTUP_PER_BUCKET
setting, but more important for a small setting like 1 than for the
current value of 10. I'm not sure the logic in this function is as
clean and simple as it can be just yet.
(3) It allows the number of batches to increase on the fly while the
hash join is in process. This case arises when we initially estimate
that we only need a small hash table, and then it turns out that there
are more tuples than we expect. Without this code, the hash table's
load factor gets too high and things start to suck.

I recommend separating this patch in two patches, the first covering
items (1) and (2) and the second covering item (3), which really seems
like a separate improvement.

That probably makes sense.

Attached is the patch split as suggested:

(a) hashjoin-nbuckets-v14a-size.patch

* NTUP_PER_BUCKET=1
* counting buckets towards work_mem
* changes in ExecChooseHashTableSize (due to the other changes)

(b) hashjoin-nbuckets-v14a-resize.patch

* rest of the patch, that is ...
* tracking optimal number of buckets
* dynamic resize of the hash table
* adding info the the EXPLAIN ANALYZE output

regards
Tomas

Attachments:

hashjoin-nbuckets-v14b-resize.patchtext/x-diff; name=hashjoin-nbuckets-v14b-resize.patchDownload

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..d128e08 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,21 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) ||
+				 (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index fa0e700a..4f00426 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -39,6 +39,7 @@
 
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -49,8 +50,8 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 static void *dense_alloc(HashJoinTable hashtable, Size size);
 
-/* Memory needed for buckets. */
-#define BUCKETS_SPACE(htab)	((htab)->nbuckets * sizeof(HashJoinTuple))
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(HashJoinTuple))
 
 /* ----------------------------------------------------------------
  *		ExecHash
@@ -120,6 +121,7 @@ MultiExecHash(HashState *node)
 				/* It's a skew tuple, so put it into that hash table */
 				ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 										bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -130,6 +132,25 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
+	if (hashtable->nbuckets != hashtable->nbuckets_optimal)
+	{
+		/* We never decrease the number of buckets. */
+		Assert(hashtable->nbuckets_optimal > hashtable->nbuckets);
+
+#ifdef HJDEBUG
+		printf("Increasing nbuckets %d => %d\n",
+			   hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+
+		ExecHashIncreaseNumBuckets(hashtable);
+	}
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += BUCKETS_SPACE(hashtable);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+			hashtable->spacePeak = hashtable->spaceUsed;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -275,7 +296,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
+	hashtable->nbuckets_optimal = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
+	hashtable->log2_nbuckets_optimal = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
 	hashtable->skewEnabled = false;
@@ -289,6 +313,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -543,6 +568,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		 */
 		while ((inner_rel_bytes / nbatch) > hash_table_bytes)
 			nbatch <<= 1;
+
 	}
 	else
 	{
@@ -672,6 +698,19 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 */
 	ninmemory = nfreed = 0;
 
+	/* If know we need to resize nbuckets, we can do it while rebatching. */
+	if (hashtable->nbuckets_optimal != hashtable->nbuckets)
+	{
+		/* we never decrease the number of buckets */
+		Assert(hashtable->nbuckets_optimal > hashtable->nbuckets);
+
+		hashtable->nbuckets = hashtable->nbuckets_optimal;
+		hashtable->log2_nbuckets = hashtable->log2_nbuckets_optimal;
+
+		hashtable->buckets = repalloc(hashtable->buckets,
+									  sizeof(HashJoinTuple) * hashtable->nbuckets);
+	}
+
 	/*
 	 * We will scan through the chunks directly, so that we can reset the
 	 * buckets now and not have to keep track which tuples in the buckets have
@@ -756,6 +795,93 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	HashMemoryChunk	chunk;
+
+	/* do nothing if not an increase (it's called increase for a reason) */
+	if (hashtable->nbuckets >= hashtable->nbuckets_optimal)
+		return;
+
+	/*
+	 * We already know the optimal number of buckets, so let's just
+	 * compute the log2_nbuckets for it.
+	 */
+	hashtable->nbuckets = hashtable->nbuckets_optimal;
+	hashtable->log2_nbuckets = my_log2(hashtable->nbuckets_optimal);
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets <= (INT_MAX/2));
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/*
+	 * Just reallocate the proper number of buckets - we don't need to
+	 * walk through them - we can walk the dense-allocated chunks
+	 * (just like in ExecHashIncreaseNumBatches, but without all the
+	 * copying into new chunks)
+	 */
+
+	hashtable->buckets = (HashJoinTuple *) repalloc(hashtable->buckets,
+								hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
+
+	/* scan through all tuples in all chunks, rebuild the hash table */
+	chunk = hashtable->chunks;
+	while (chunk != NULL)
+	{
+
+		/* position within the buffer (up to chunk->used) */
+		size_t idx = 0;
+
+		/* process all tuples stored in this chunk (and then free it) */
+		while (idx < chunk->used)
+		{
+
+			/* get the hashjoin tuple and mintuple stored right after it */
+			HashJoinTuple hashTuple = (HashJoinTuple)(chunk->data + idx);
+			MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+
+			int		hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+			int		bucketno;
+			int		batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* just add the tuple to the proper bucket */
+			hashTuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = hashTuple;
+
+			/* bytes occupied in memory HJ tuple overhead + actual tuple length */
+			idx += MAXALIGN(hashTupleSize);
+
+		}
+
+		/* proceed to the next chunk */
+		chunk = chunk->next;
+
+	}
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, batchTuples / hashtable->nbuckets);
+#endif
+
+}
+
+
+/*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
  *		it may just go to a temp file for later batches
@@ -788,6 +914,7 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		 */
 		HashJoinTuple hashTuple;
 		int			hashTupleSize;
+		double		ntuples = (hashtable->totalTuples - hashtable->skewTuples);
 
 		/* Create the HashJoinTuple */
 		hashTupleSize = HJTUPLE_OVERHEAD + tuple->t_len;
@@ -808,12 +935,30 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashTuple->next = hashtable->buckets[bucketno];
 		hashtable->buckets[bucketno] = hashTuple;
 
+		/*
+		 * Increase the (optimal) number of buckets if we just exceeded NTUP_PER_BUCKET,
+		 * but only when there's still a single batch at this moment.
+		 */
+		if ((hashtable->nbatch == 1) &&
+			(hashtable->nbuckets_optimal <= INT_MAX/2) &&	/* overflow protection */
+			(ntuples >= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET)))
+		{
+			hashtable->nbuckets_optimal *= 2;
+			hashtable->log2_nbuckets_optimal += 1;
+		}
+
 		/* Account for space used, and back off if we've used too much */
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
+
+		/*
+		 * Do we need to increase number of batches? Account for space required
+		 * by buckets (optimal number).
+		 */
 		if (hashtable->spaceUsed + BUCKETS_SPACE(hashtable) > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
+
 	}
 	else
 	{
@@ -936,7 +1081,10 @@ ExecHashGetHashValue(HashJoinTable hashtable,
  * functions are good about randomizing all their output bits, else we are
  * likely to have very skewed bucket or batch occupancy.)
  *
- * nbuckets doesn't change over the course of the join.
+ * The nbuckets/log2_nbuckets may change while (nbatch==1) because of dynamic
+ * buckets growth. Once we start batching, the value is fixed and does not
+ * change over the course of join (making it possible to compute batch number
+ * the way we do here).
  *
  * nbatch is always a power of 2; we increase it only by doubling it.  This
  * effectively adds one more bit to the top of the batchno.
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index c9e61df..0e1e0cd 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -127,6 +127,10 @@ typedef struct HashJoinTableData
 	int			nbuckets;		/* # buckets in the in-memory hash table */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
+	int			nbuckets_original;	/* # buckets when starting the first hash */
+	int			nbuckets_optimal;	/* optimal # buckets (per batch) */
+	int			log2_nbuckets_optimal;	/* same as log2_nbuckets optimal */
+
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
 	struct HashJoinTupleData **buckets;
 	/* buckets array is per-batch storage, as are all the tuples */
@@ -148,6 +152,7 @@ typedef struct HashJoinTableData
 	bool		growEnabled;	/* flag to shut off nbatch increases */
 
 	double		totalTuples;	/* # tuples obtained from inner plan */
+	double		skewTuples;		/* # tuples inserted into skew tuples */
 
 	/*
 	 * These arrays are allocated for the life of the hash join, but only if

hashjoin-nbuckets-v14a-size.patchtext/x-diff; name=hashjoin-nbuckets-v14a-size.patchDownload

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 3ef7cfb..fa0e700a 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -49,6 +49,9 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 static void *dense_alloc(HashJoinTable hashtable, Size size);
 
+/* Memory needed for buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets * sizeof(HashJoinTuple))
+
 /* ----------------------------------------------------------------
  *		ExecHash
  *
@@ -386,7 +389,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
  */
 
 /* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET			10
+#define NTUP_PER_BUCKET			1
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
@@ -396,6 +399,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 {
 	int			tupsize;
 	double		inner_rel_bytes;
+	long		buckets_bytes;
 	long		hash_table_bytes;
 	long		skew_table_bytes;
 	long		max_pointers;
@@ -418,6 +422,16 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	inner_rel_bytes = ntuples * tupsize;
 
 	/*
+	 * Compute memory occupied by buckets, assuming all tuples fit into
+	 * a single batch (consider NTUP_PER_BUCKET tuples per bucket) - buckets
+	 * are just 'HashJoinTuple' elements (pointers to HashJoinTupleData).
+	 * Also, we never use less than 1024 buckets.
+	 */
+	nbuckets = (1 << my_log2(ntuples / NTUP_PER_BUCKET));
+	nbuckets = Max(1024, nbuckets);
+	buckets_bytes = sizeof(HashJoinTuple) * nbuckets;
+
+	/*
 	 * Target in-memory hashtable size is work_mem kilobytes.
 	 */
 	hash_table_bytes = work_mem * 1024L;
@@ -468,16 +482,13 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/* also ensure we avoid integer overflow in nbatch and nbuckets */
 	max_pointers = Min(max_pointers, INT_MAX / 2);
 
-	if (inner_rel_bytes > hash_table_bytes)
+	if ((inner_rel_bytes + buckets_bytes) > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
 		long		lbuckets;
 		double		dbatch;
 		int			minbatch;
-
-		lbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
-		lbuckets = Min(lbuckets, max_pointers);
-		nbuckets = (int) lbuckets;
+		long		bucket_size;
 
 		dbatch = ceil(inner_rel_bytes / hash_table_bytes);
 		dbatch = Min(dbatch, max_pointers);
@@ -485,6 +496,53 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch = 2;
 		while (nbatch < minbatch)
 			nbatch <<= 1;
+
+		/* memory needed by a single "full" bucket (including tuples) */
+		bucket_size = (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple));
+
+		/*
+		 * When batching, we use buckets size for full work_mem. We simply
+		 * divide work_mem by memory needed per 'full bucket' (a pointer and
+		 * NTUP_PER_BUCKET tuples, each 'tupsize' bytes, including additional
+		 * overhead for hash, pointer to the next tuple etc.).
+		 */
+		lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
+
+		/* protect against nbucket overflow */
+		lbuckets = Min(lbuckets, max_pointers);
+		nbuckets = (int) lbuckets;
+
+		/* Compute memory needed for buckets. */
+		buckets_bytes = nbuckets * sizeof(HashJoinTuple);
+
+		/*
+		 * Buckets are simple pointers to hashjoin tuples, while tupsize is
+		 * always >= 24B, plus actual data. So buckets should never really
+		 * exceed 25% of work_mem (even for NTUP_PER_BUCKET=1). Except maybe
+		 * for work_mem values that are not 2^N bytes, where we might get more
+		 * because of doubling. So let's look for 50% here.
+		 */
+		Assert(buckets_bytes <= hash_table_bytes/2);
+
+		/*
+		 * Subtract the buckets from work_mem, so we know how much memory
+		 * remains for the actual tuples.
+		 */
+		hash_table_bytes -= buckets_bytes;
+
+		/*
+		 * Increase the nbatch until we get both tuples and buckets into work_mem.
+		 *
+		 * The loop should not execute more than once in most cases, becase tuples are
+		 * usually much wider than buckets (usually 8B pointers), so by using only
+		 * (batch_bytes/2) should get us below work_mem.
+		 *
+		 * The worst case is that (nbuckets == 2*ntuples-1), giving us about twice the
+		 * number of buckets, i.e. about 2*sizeof(void*) per tuple. But that's
+		 * the consequence of how NTUP_PER_BUCKET is chosen, and work_mem limit.
+		 */
+		while ((inner_rel_bytes / nbatch) > hash_table_bytes)
+			nbatch <<= 1;
 	}
 	else
 	{
@@ -754,7 +812,7 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
-		if (hashtable->spaceUsed > hashtable->spaceAllowed)
+		if (hashtable->spaceUsed + BUCKETS_SPACE(hashtable) > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
 	}
 	else

#44

robertmhaas@gmail.com

over 11 years ago

In reply to: Tomas Vondra (#43)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

Attached is the patch split as suggested:

(a) hashjoin-nbuckets-v14a-size.patch

* NTUP_PER_BUCKET=1
* counting buckets towards work_mem
* changes in ExecChooseHashTableSize (due to the other changes)

OK, I'm going to work on this one now. I have some ideas about how to
simplify this a bit and will post an update in a few hours once I see
how that pans out.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#45

robertmhaas@gmail.com

over 11 years ago

In reply to: Robert Haas (#44)

1 attachment(s)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On Fri, Sep 12, 2014 at 8:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

Attached is the patch split as suggested:

(a) hashjoin-nbuckets-v14a-size.patch

* NTUP_PER_BUCKET=1
* counting buckets towards work_mem
* changes in ExecChooseHashTableSize (due to the other changes)

OK, I'm going to work on this one now. I have some ideas about how to
simplify this a bit and will post an update in a few hours once I see
how that pans out.

Here's what I came up with. I believe this has the same semantics as
your patch for less code, but please check, because I might be wrong.
The main simplifications I made were:

(1) In master, we decide whether or not to batch based on the size of
the data, without knowing the number of buckets. If we want to
account for the bucket overhead, that doesn't work. But in your
patch, we basically computed the number of buckets we'd need for the
single-batch case, then decide whether to do a single batch, and if
yes, computed the same values over again with the same formula phrased
differently. I revised that so that the single-batch case is
calculated only once. If everything fits, we're all set, and don't
need to recalculate anything.

(2) In your patch, once we decide we need to batch, you set nbatch as
now, and then check whether the computed value is still adequate after
subtracting buckets_byte from hash_table_bytes. I revised that so
that we make the initial batch estimate of dbatch by dividing
inner_rel_bytes by hash_table_bytes - bucket_bytes rather than
hash_table_bytes. I think that avoids the need to maybe increase
nbatch again afterwards.

I'm comfortable with this version if you are, but (maybe as a
follow-on commit) I think we could make this even a bit smarter. If
inner_rel_bytes + bucket_bytes > hash_table_bytes but inner_rel_bytes
+ bucket_bytes / 4 < hash_table_bytes, then reduce the number of
buckets by 2x or 4x to make it fit. That would provide a bit of
additional safety margin against cases where this patch might
unluckily lose.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

hashjoin-lf1-rmh.patchtext/x-patch; charset=US-ASCII; name=hashjoin-lf1-rmh.patchDownload

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 3ef7cfb..b3203ba 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -386,7 +386,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
  */
 
 /* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET			10
+#define NTUP_PER_BUCKET			1
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
@@ -396,12 +396,13 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 {
 	int			tupsize;
 	double		inner_rel_bytes;
+	long		bucket_bytes;
 	long		hash_table_bytes;
 	long		skew_table_bytes;
 	long		max_pointers;
-	int			nbatch;
+	int			nbatch = 1;
 	int			nbuckets;
-	int			i;
+	double		dbuckets;
 
 	/* Force a plausible relation size if no info */
 	if (ntuples <= 0.0)
@@ -460,56 +461,61 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 
 	/*
 	 * Set nbuckets to achieve an average bucket load of NTUP_PER_BUCKET when
-	 * memory is filled.  Set nbatch to the smallest power of 2 that appears
-	 * sufficient.  The Min() steps limit the results so that the pointer
-	 * arrays we'll try to allocate do not exceed work_mem.
+	 * memory is filled, assuming a single batch.  The Min() step limits the
+	 * results so that the pointer arrays we'll try to allocate do not exceed
+	 * work_mem.
 	 */
 	max_pointers = (work_mem * 1024L) / sizeof(void *);
 	/* also ensure we avoid integer overflow in nbatch and nbuckets */
 	max_pointers = Min(max_pointers, INT_MAX / 2);
+	dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
+	dbuckets = Min(dbuckets, max_pointers);
+	nbuckets = Max((int) dbuckets, 1024);
+	nbuckets = 1 << my_log2(nbuckets);
+	bucket_bytes = sizeof(HashJoinTuple) * nbuckets;
 
-	if (inner_rel_bytes > hash_table_bytes)
+	/*
+	 * If there's not enough space to store the projected number of tuples
+	 * and the required bucket headers, we will need multiple batches.
+	 */
+	if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
 		long		lbuckets;
 		double		dbatch;
 		int			minbatch;
+		long		bucket_size;
 
-		lbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
+		/*
+		 * Estimate the number of buckets we'll want to have when work_mem
+		 * is entirely full.  Each bucket will contain a bucket pointer plus
+		 * NTUP_PER_BUCKET tuples, whose projected size already includes
+		 * overhead for the hash code, pointer to the next tuple, etc.
+		 */
+		bucket_size = (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple));
+		lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
 		lbuckets = Min(lbuckets, max_pointers);
 		nbuckets = (int) lbuckets;
+		bucket_bytes = nbuckets * sizeof(HashJoinTuple);
+
+		/*
+		 * Buckets are simple pointers to hashjoin tuples, while tupsize
+		 * includes the pointer, hash code, and MinimalTupleData.  So buckets
+		 * should never really exceed 25% of work_mem (even for
+		 * NTUP_PER_BUCKET=1); except maybe * for work_mem values that are
+		 * not 2^N bytes, where we might get more * because of doubling.
+		 * So let's look for 50% here.
+		 */
+		Assert(bucket_bytes <= hash_table_bytes / 2);
 
-		dbatch = ceil(inner_rel_bytes / hash_table_bytes);
+		/* Calculate required number of batches. */
+		dbatch = ceil(inner_rel_bytes / (hash_table_bytes - bucket_bytes));
 		dbatch = Min(dbatch, max_pointers);
 		minbatch = (int) dbatch;
 		nbatch = 2;
 		while (nbatch < minbatch)
 			nbatch <<= 1;
 	}
-	else
-	{
-		/* We expect the hashtable to fit in memory */
-		double		dbuckets;
-
-		dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
-		dbuckets = Min(dbuckets, max_pointers);
-		nbuckets = (int) dbuckets;
-
-		nbatch = 1;
-	}
-
-	/*
-	 * Both nbuckets and nbatch must be powers of 2 to make
-	 * ExecHashGetBucketAndBatch fast.  We already fixed nbatch; now inflate
-	 * nbuckets to the next larger power of 2.  We also force nbuckets to not
-	 * be real small, by starting the search at 2^10.  (Note: above we made
-	 * sure that nbuckets is not more than INT_MAX / 2, so this loop cannot
-	 * overflow, nor can the final shift to recalculate nbuckets.)
-	 */
-	i = 10;
-	while ((1 << i) < nbuckets)
-		i++;
-	nbuckets = (1 << i);
 
 	*numbuckets = nbuckets;
 	*numbatches = nbatch;
@@ -754,7 +760,8 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
-		if (hashtable->spaceUsed > hashtable->spaceAllowed)
+		if (hashtable->spaceUsed + hashtable->nbuckets * sizeof(HashJoinTuple)
+			> hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
 	}
 	else

#46

tv@fuzzy.cz

over 11 years ago

In reply to: Robert Haas (#45)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 12.9.2014 18:49, Robert Haas wrote:

On Fri, Sep 12, 2014 at 8:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

Attached is the patch split as suggested:

(a) hashjoin-nbuckets-v14a-size.patch

* NTUP_PER_BUCKET=1
* counting buckets towards work_mem
* changes in ExecChooseHashTableSize (due to the other changes)

OK, I'm going to work on this one now. I have some ideas about how to
simplify this a bit and will post an update in a few hours once I see
how that pans out.

Here's what I came up with. I believe this has the same semantics as
your patch for less code, but please check, because I might be
wrong. The main simplifications I made were:

(1) In master, we decide whether or not to batch based on the size
of the data, without knowing the number of buckets. If we want to
account for the bucket overhead, that doesn't work. But in your
patch, we basically computed the number of buckets we'd need for the
single-batch case, then decide whether to do a single batch, and if
yes, computed the same values over again with the same formula
phrased differently. I revised that so that the single-batch case is
calculated only once. If everything fits, we're all set, and don't
need to recalculate anything.

(2) In your patch, once we decide we need to batch, you set nbatch
as now, and then check whether the computed value is still adequate
after subtracting buckets_byte from hash_table_bytes. I revised that
so that we make the initial batch estimate of dbatch by dividing
inner_rel_bytes by hash_table_bytes - bucket_bytes rather than
hash_table_bytes. I think that avoids the need to maybe increase
nbatch again afterwards.

Yes, I like those changes and I think your reasoning is correct in both
cases. It certainly makes the method shorter and more readable - I was
too "stuck" in the original logic, so thanks for improving this.

So +1 from me to those changes.

I'm comfortable with this version if you are, but (maybe as a
follow-on commit) I think we could make this even a bit smarter. If
inner_rel_bytes + bucket_bytes > hash_table_bytes but
inner_rel_bytes + bucket_bytes / 4 < hash_table_bytes, then reduce
the number of buckets by 2x or 4x to make it fit. That would provide
a bit of additional safety margin against cases where this patch
might unluckily lose.

I don't think that's a good idea. That essentially says 'we're shooting
for NTUP_PER_BUCKET but not really'. Moreover, I often see cases where
batching (because of smaller work_mem) actually significantly improves
performance. If we want to make this reasoning properly, deciding
whether to add batches or reduce buckets, we need a better heuristics -
that's quite complex and I'd expect it to result ain a quite complex patch.

So -1 from me to this at this moment (it certainly needs much more
thought than a mere follow-on commit).

regards
Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#47

robertmhaas@gmail.com

over 11 years ago

In reply to: Tomas Vondra (#46)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On Fri, Sep 12, 2014 at 3:39 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

On 12.9.2014 18:49, Robert Haas wrote:

On Fri, Sep 12, 2014 at 8:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

Attached is the patch split as suggested:

(a) hashjoin-nbuckets-v14a-size.patch

* NTUP_PER_BUCKET=1
* counting buckets towards work_mem
* changes in ExecChooseHashTableSize (due to the other changes)

OK, I'm going to work on this one now. I have some ideas about how to
simplify this a bit and will post an update in a few hours once I see
how that pans out.

Here's what I came up with. I believe this has the same semantics as
your patch for less code, but please check, because I might be
wrong. The main simplifications I made were:

(1) In master, we decide whether or not to batch based on the size
of the data, without knowing the number of buckets. If we want to
account for the bucket overhead, that doesn't work. But in your
patch, we basically computed the number of buckets we'd need for the
single-batch case, then decide whether to do a single batch, and if
yes, computed the same values over again with the same formula
phrased differently. I revised that so that the single-batch case is
calculated only once. If everything fits, we're all set, and don't
need to recalculate anything.

(2) In your patch, once we decide we need to batch, you set nbatch
as now, and then check whether the computed value is still adequate
after subtracting buckets_byte from hash_table_bytes. I revised that
so that we make the initial batch estimate of dbatch by dividing
inner_rel_bytes by hash_table_bytes - bucket_bytes rather than
hash_table_bytes. I think that avoids the need to maybe increase
nbatch again afterwards.

Yes, I like those changes and I think your reasoning is correct in both
cases. It certainly makes the method shorter and more readable - I was
too "stuck" in the original logic, so thanks for improving this.

So +1 from me to those changes.

OK, committed. Please rebase the rest.

I'm comfortable with this version if you are, but (maybe as a
follow-on commit) I think we could make this even a bit smarter. If
inner_rel_bytes + bucket_bytes > hash_table_bytes but
inner_rel_bytes + bucket_bytes / 4 < hash_table_bytes, then reduce
the number of buckets by 2x or 4x to make it fit. That would provide
a bit of additional safety margin against cases where this patch
might unluckily lose.

I don't think that's a good idea. That essentially says 'we're shooting
for NTUP_PER_BUCKET but not really'. Moreover, I often see cases where
batching (because of smaller work_mem) actually significantly improves
performance. If we want to make this reasoning properly, deciding
whether to add batches or reduce buckets, we need a better heuristics -
that's quite complex and I'd expect it to result ain a quite complex patch.

So -1 from me to this at this moment (it certainly needs much more
thought than a mere follow-on commit).

OK, but let's discuss further. You make it sound like treating
NTUP_PER_BUCKET as a soft limit is a bad thing, but I'm not convinced.
I think it's perfectly reasonable to treat NTUP_PER_BUCKET as a range:
if we can get NTUP_PER_BUCKET=1, great, but if not, NTUP_PER_BUCKET=2
or NTUP_PER_BUCKET=4 may be perfectly reasonable.

I'm actually quite surprised that you find batching to be a better
strategy than skimping on buckets, because I would have expect the
opposite, almost categorically. Batching means having to write out
the tuples we can't process right away and read them back in. If that
involves disk I/O, I think the cost of that I/O is going to be FAR
more than the overhead we would have incurred by searching slightly
longer bucket chains. If it didn't, then you could've set work_mem
higher and avoided batching in the first place.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#48

tv@fuzzy.cz

over 11 years ago

In reply to: Robert Haas (#47)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 12.9.2014 22:24, Robert Haas wrote:

On Fri, Sep 12, 2014 at 3:39 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

On 12.9.2014 18:49, Robert Haas wrote:

I'm comfortable with this version if you are, but (maybe as a
follow-on commit) I think we could make this even a bit smarter. If
inner_rel_bytes + bucket_bytes > hash_table_bytes but
inner_rel_bytes + bucket_bytes / 4 < hash_table_bytes, then reduce
the number of buckets by 2x or 4x to make it fit. That would provide
a bit of additional safety margin against cases where this patch
might unluckily lose.

I don't think that's a good idea. That essentially says 'we're shooting
for NTUP_PER_BUCKET but not really'. Moreover, I often see cases where
batching (because of smaller work_mem) actually significantly improves
performance. If we want to make this reasoning properly, deciding
whether to add batches or reduce buckets, we need a better heuristics -
that's quite complex and I'd expect it to result ain a quite complex patch.

So -1 from me to this at this moment (it certainly needs much more
thought than a mere follow-on commit).

OK, but let's discuss further. You make it sound like treating
NTUP_PER_BUCKET as a soft limit is a bad thing, but I'm not convinced.
I think it's perfectly reasonable to treat NTUP_PER_BUCKET as a range:
if we can get NTUP_PER_BUCKET=1, great, but if not, NTUP_PER_BUCKET=2
or NTUP_PER_BUCKET=4 may be perfectly reasonable.

I think this really depends on various factors. For example what may
work for small work_mem (fitting into L2 cache) may not work for large
values, etc.

I'm actually quite surprised that you find batching to be a better
strategy than skimping on buckets, because I would have expect the
opposite, almost categorically. Batching means having to write out
the tuples we can't process right away and read them back in. If
that involves disk I/O, I think the cost of that I/O is going to be
FAR more than the overhead we would have incurred by searching
slightly longer bucket chains. If it didn't, then you could've set
work_mem higher and avoided batching in the first place.

No, I don't find batching to be a better strategy. I just think this
really needs more discussion than a simple "let's use NTUP_PER_BUCKET=4
to avoid batching" follow-up patch.

For example, let's say we switch to NTUP_PER_BUCKET=4 to avoid batching,
and then discover we need to start batching anyway. Should we keep the
settings, or should we revert NTUP_PER_BUCKET=1? Or maybe not doing that
for nbatch=2, but for nbatch=16?

I'm not saying it's wrong, but I think it's more complicated than it
seems from your description.

regards
Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#49

tv@fuzzy.cz

over 11 years ago

In reply to: Robert Haas (#47)

1 attachment(s)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 12.9.2014 22:24, Robert Haas wrote:

On Fri, Sep 12, 2014 at 3:39 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

Yes, I like those changes and I think your reasoning is correct in both
cases. It certainly makes the method shorter and more readable - I was
too "stuck" in the original logic, so thanks for improving this.

So +1 from me to those changes.

OK, committed. Please rebase the rest.

Attached is v15 of the patch, rebased to current HEAD.

Tomas

Attachments:

hashjoin-nbuckets-v15-resize.patchtext/x-diff; name=hashjoin-nbuckets-v15-resize.patchDownload

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..d128e08 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,21 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) ||
+				 (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index b3203ba..33994c5 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -39,6 +39,7 @@
 
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -117,6 +118,7 @@ MultiExecHash(HashState *node)
 				/* It's a skew tuple, so put it into that hash table */
 				ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 										bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -127,6 +129,25 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
+	if (hashtable->nbuckets != hashtable->nbuckets_optimal)
+	{
+		/* We never decrease the number of buckets. */
+		Assert(hashtable->nbuckets_optimal > hashtable->nbuckets);
+
+#ifdef HJDEBUG
+		printf("Increasing nbuckets %d => %d\n",
+			   hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+
+		ExecHashIncreaseNumBuckets(hashtable);
+	}
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += hashtable->nbuckets * sizeof(HashJoinTuple);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+			hashtable->spacePeak = hashtable->spaceUsed;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -272,7 +293,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
+	hashtable->nbuckets_optimal = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
+	hashtable->log2_nbuckets_optimal = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
 	hashtable->skewEnabled = false;
@@ -286,6 +310,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -515,6 +540,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch = 2;
 		while (nbatch < minbatch)
 			nbatch <<= 1;
+
 	}
 
 	*numbuckets = nbuckets;
@@ -620,6 +646,19 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 */
 	ninmemory = nfreed = 0;
 
+	/* If know we need to resize nbuckets, we can do it while rebatching. */
+	if (hashtable->nbuckets_optimal != hashtable->nbuckets)
+	{
+		/* we never decrease the number of buckets */
+		Assert(hashtable->nbuckets_optimal > hashtable->nbuckets);
+
+		hashtable->nbuckets = hashtable->nbuckets_optimal;
+		hashtable->log2_nbuckets = hashtable->log2_nbuckets_optimal;
+
+		hashtable->buckets = repalloc(hashtable->buckets,
+									  sizeof(HashJoinTuple) * hashtable->nbuckets);
+	}
+
 	/*
 	 * We will scan through the chunks directly, so that we can reset the
 	 * buckets now and not have to keep track which tuples in the buckets have
@@ -704,6 +743,93 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	HashMemoryChunk	chunk;
+
+	/* do nothing if not an increase (it's called increase for a reason) */
+	if (hashtable->nbuckets >= hashtable->nbuckets_optimal)
+		return;
+
+	/*
+	 * We already know the optimal number of buckets, so let's just
+	 * compute the log2_nbuckets for it.
+	 */
+	hashtable->nbuckets = hashtable->nbuckets_optimal;
+	hashtable->log2_nbuckets = my_log2(hashtable->nbuckets_optimal);
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets <= (INT_MAX/2));
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/*
+	 * Just reallocate the proper number of buckets - we don't need to
+	 * walk through them - we can walk the dense-allocated chunks
+	 * (just like in ExecHashIncreaseNumBatches, but without all the
+	 * copying into new chunks)
+	 */
+
+	hashtable->buckets = (HashJoinTuple *) repalloc(hashtable->buckets,
+								hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
+
+	/* scan through all tuples in all chunks, rebuild the hash table */
+	chunk = hashtable->chunks;
+	while (chunk != NULL)
+	{
+
+		/* position within the buffer (up to chunk->used) */
+		size_t idx = 0;
+
+		/* process all tuples stored in this chunk (and then free it) */
+		while (idx < chunk->used)
+		{
+
+			/* get the hashjoin tuple and mintuple stored right after it */
+			HashJoinTuple hashTuple = (HashJoinTuple)(chunk->data + idx);
+			MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+
+			int		hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+			int		bucketno;
+			int		batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* just add the tuple to the proper bucket */
+			hashTuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = hashTuple;
+
+			/* bytes occupied in memory HJ tuple overhead + actual tuple length */
+			idx += MAXALIGN(hashTupleSize);
+
+		}
+
+		/* proceed to the next chunk */
+		chunk = chunk->next;
+
+	}
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, batchTuples / hashtable->nbuckets);
+#endif
+
+}
+
+
+/*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
  *		it may just go to a temp file for later batches
@@ -736,6 +862,7 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		 */
 		HashJoinTuple hashTuple;
 		int			hashTupleSize;
+		double		ntuples = (hashtable->totalTuples - hashtable->skewTuples);
 
 		/* Create the HashJoinTuple */
 		hashTupleSize = HJTUPLE_OVERHEAD + tuple->t_len;
@@ -756,11 +883,23 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashTuple->next = hashtable->buckets[bucketno];
 		hashtable->buckets[bucketno] = hashTuple;
 
+		/*
+		 * Increase the (optimal) number of buckets if we just exceeded the
+		 * NTUP_PER_BUCKET threshold, but only when there's still a single batch.
+		 */
+		if ((hashtable->nbatch == 1) &&
+			(hashtable->nbuckets_optimal <= INT_MAX/2) &&	/* overflow protection */
+			(ntuples >= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET)))
+		{
+			hashtable->nbuckets_optimal *= 2;
+			hashtable->log2_nbuckets_optimal += 1;
+		}
+
 		/* Account for space used, and back off if we've used too much */
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
-		if (hashtable->spaceUsed + hashtable->nbuckets * sizeof(HashJoinTuple)
+		if (hashtable->spaceUsed + hashtable->nbuckets_optimal * sizeof(HashJoinTuple)
 			> hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
 	}
@@ -885,7 +1024,10 @@ ExecHashGetHashValue(HashJoinTable hashtable,
  * functions are good about randomizing all their output bits, else we are
  * likely to have very skewed bucket or batch occupancy.)
  *
- * nbuckets doesn't change over the course of the join.
+ * The nbuckets/log2_nbuckets may change while (nbatch==1) because of dynamic
+ * buckets growth. Once we start batching, the value is fixed and does not
+ * change over the course of join (making it possible to compute batch number
+ * the way we do here).
  *
  * nbatch is always a power of 2; we increase it only by doubling it.  This
  * effectively adds one more bit to the top of the batchno.
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index c9e61df..0e1e0cd 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -127,6 +127,10 @@ typedef struct HashJoinTableData
 	int			nbuckets;		/* # buckets in the in-memory hash table */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
+	int			nbuckets_original;	/* # buckets when starting the first hash */
+	int			nbuckets_optimal;	/* optimal # buckets (per batch) */
+	int			log2_nbuckets_optimal;	/* same as log2_nbuckets optimal */
+
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
 	struct HashJoinTupleData **buckets;
 	/* buckets array is per-batch storage, as are all the tuples */
@@ -148,6 +152,7 @@ typedef struct HashJoinTableData
 	bool		growEnabled;	/* flag to shut off nbatch increases */
 
 	double		totalTuples;	/* # tuples obtained from inner plan */
+	double		skewTuples;		/* # tuples inserted into skew tuples */
 
 	/*
 	 * These arrays are allocated for the life of the hash join, but only if

#50

robertmhaas@gmail.com

over 11 years ago

In reply to: Tomas Vondra (#48)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On Fri, Sep 12, 2014 at 4:55 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

I'm actually quite surprised that you find batching to be a better
strategy than skimping on buckets, because I would have expect the
opposite, almost categorically. Batching means having to write out
the tuples we can't process right away and read them back in. If
that involves disk I/O, I think the cost of that I/O is going to be
FAR more than the overhead we would have incurred by searching
slightly longer bucket chains. If it didn't, then you could've set
work_mem higher and avoided batching in the first place.

No, I don't find batching to be a better strategy. I just think this
really needs more discussion than a simple "let's use NTUP_PER_BUCKET=4
to avoid batching" follow-up patch.

For example, let's say we switch to NTUP_PER_BUCKET=4 to avoid batching,
and then discover we need to start batching anyway. Should we keep the
settings, or should we revert NTUP_PER_BUCKET=1? Or maybe not doing that
for nbatch=2, but for nbatch=16?

My first thought is to revert to NTUP_PER_BUCKET=1, but it's certainly
arguable. Either method, though, figures to be better than doing
nothing, so let's do something.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#51

tv@fuzzy.cz

over 11 years ago

In reply to: Robert Haas (#50)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 12.9.2014 23:22, Robert Haas wrote:

On Fri, Sep 12, 2014 at 4:55 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

I'm actually quite surprised that you find batching to be a
better strategy than skimping on buckets, because I would have
expect the opposite, almost categorically. Batching means having
to write out the tuples we can't process right away and read them
back in. If that involves disk I/O, I think the cost of that I/O
is going to be FAR more than the overhead we would have incurred
by searching slightly longer bucket chains. If it didn't, then
you could've set work_mem higher and avoided batching in the
first place.

No, I don't find batching to be a better strategy. I just think
this really needs more discussion than a simple "let's use
NTUP_PER_BUCKET=4 to avoid batching" follow-up patch.

For example, let's say we switch to NTUP_PER_BUCKET=4 to avoid
batching, and then discover we need to start batching anyway.
Should we keep the settings, or should we revert NTUP_PER_BUCKET=1?
Or maybe not doing that for nbatch=2, but for nbatch=16?

My first thought is to revert to NTUP_PER_BUCKET=1, but it's
certainly arguable. Either method, though, figures to be better than
doing nothing, so let's do something.

OK, but can we commit the remaining part first? Because it'll certainly
interact with this proposed part, and it's easier to tweak when the code
is already there. Instead of rebasing over and over.

Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#52

kgrittn@ymail.com

over 11 years ago

In reply to: Tomas Vondra (#51)

Re: bad estimation together with large work_mem generates terrible slow hash joins

Tomas Vondra <tv@fuzzy.cz> wrote:

On 12.9.2014 23:22, Robert Haas wrote:

My first thought is to revert to NTUP_PER_BUCKET=1, but it's
certainly arguable. Either method, though, figures to be better than
doing nothing, so let's do something.

OK, but can we commit the remaining part first? Because it'll certainly
interact with this proposed part, and it's easier to tweak when the code
is already there. Instead of rebasing over and over.

The patch applied and built without problem, and pass `make
check-world`. The only thing that caught my eye was the addition
of debug code using printf() instead of logging at a DEBUG level.
Is there any reason for that?

I still need to work through the (rather voluminous) email threads
and the code to be totally comfortable with all the issues, but
if Robert and Heikki are comfortable with it, I'm not gonna argue.

Preliminary benchmarks look outstanding! I tried to control
carefully to ensure consistent results (by dropping, recreating,
vacuuming, analyzing, and checkpointing before each test), but
there were surprising outliers in the unpatched version. It turned
out that it was because slight differences in the random samples
caused different numbers of buckets for both unpatched and patched,
but the patched version dealt with the difference gracefully while
the unpatched version's performance fluctuated randomly.

My thinking is that we should get this much committed and then
discuss further optimizations. I tried to throw something at it
that would be something of a "worst case" because with the new code
it would start with one batch and go to two.

Five runs per test, alternating between unpatched and patched.

First I tried with the default work_mem size of 4MB:

Planning time / Execution time (ms)

unpatched, work_mem = 4MB

0.694 / 10392.930
0.327 / 10388.787
0.412 / 10533.036
0.650 / 17186.719
0.338 / 10707.423

Fast: Buckets: 2048 Batches: 1 Memory Usage: 3516kB
Slow: Buckets: 1024 Batches: 1 Memory Usage: 3516kB

patched, work_mem = 4MB

0.764 / 5021.792
0.655 / 4991.887
0.436 / 5068.777
0.410 / 5057.902
0.444 / 5021.543

1st & 5th: Buckets: 131072 (originally 1024) Batches: 2 (originally 1) Memory Usage: 3073kB
all others: Buckets: 131072 (originally 2048) Batches: 2 (originally 1) Memory Usage: 3073kB

Then, just to see how both dealt with extra memory I did it again with 1GB:

unpatched, work_mem = 1GB

0.328 / 10407.836
0.319 / 10465.348
0.324 / 16606.948
0.569 / 10408.671
0.542 / 16996.209

Memory usage same as before. Guess which runs started with 1024 buckets. ;-)

0.556 / 3974.741
0.325 / 4012.613
0.334 / 3941.734
0.834 / 3894.391
0.603 / 3959.440

2nd & 4th: Buckets: 131072 (originally 2048) Batches: 1 (originally 1) Memory Usage: 4540kB
all others: Buckets: 131072 (originally 1024) Batches: 1 (originally 1) Memory Usage: 4540kB

My only concern from the benchmarks is that it seemed like there
was a statistically significant increase in planning time:

unpatched plan time average: 0.450 ms
patched plan time average: 0.536 ms

That *might* just be noise, but it seems likely to be real. For
the improvement in run time, I'd put up with an extra 86us in
planning time per hash join; but if there's any way to shave some
of that off, all the better.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#53

Heikki Linnakangas

hlinnakangas@vmware.com

over 11 years ago

In reply to: Kevin Grittner (#52)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 10/02/2014 03:20 AM, Kevin Grittner wrote:

My only concern from the benchmarks is that it seemed like there
was a statistically significant increase in planning time:

unpatched plan time average: 0.450 ms
patched plan time average: 0.536 ms

That *might* just be noise, but it seems likely to be real. For
the improvement in run time, I'd put up with an extra 86us in
planning time per hash join; but if there's any way to shave some
of that off, all the better.

The patch doesn't modify the planner at all, so it would be rather
surprising if it increased planning time. I'm willing to just write that
off as noise.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#54

tv@fuzzy.cz

over 11 years ago

In reply to: Kevin Grittner (#52)

Re: bad estimation together with large work_mem generates terrible slow hash joins

Dne 2 Říjen 2014, 2:20, Kevin Grittner napsal(a):

Tomas Vondra <tv@fuzzy.cz> wrote:

On 12.9.2014 23:22, Robert Haas wrote:

My first thought is to revert to NTUP_PER_BUCKET=1, but it's
certainly arguable. Either method, though, figures to be better than
doing nothing, so let's do something.

OK, but can we commit the remaining part first? Because it'll certainly
interact with this proposed part, and it's easier to tweak when the code
is already there. Instead of rebasing over and over.

The patch applied and built without problem, and pass `make
check-world`. The only thing that caught my eye was the addition
of debug code using printf() instead of logging at a DEBUG level.
Is there any reason for that?

Not really. IIRC the main reason it that the other code in nodeHash.c uses
the same approach.

I still need to work through the (rather voluminous) email threads
and the code to be totally comfortable with all the issues, but
if Robert and Heikki are comfortable with it, I'm not gonna argue.

;-)

Preliminary benchmarks look outstanding! I tried to control
carefully to ensure consistent results (by dropping, recreating,
vacuuming, analyzing, and checkpointing before each test), but
there were surprising outliers in the unpatched version. It turned
out that it was because slight differences in the random samples
caused different numbers of buckets for both unpatched and patched,
but the patched version dealt with the difference gracefully while
the unpatched version's performance fluctuated randomly.

Can we get a bit more details on the test cases? I did a number of tests
while working on the patch, and I generally tested two cases:

(a) well-estimated queries (i.e. with nbucket matching NTUP_PER_BUCKET)

(b) mis-estimated queries, with various levels of accuracy (say, 10x -
1000x misestimates)

From the description, it seems you only tested (b) - is that correct?

The thing is that while the resize is very fast and happens only once,
it's not perfectly free. In my tests, this was always more than
compensated by the speedups (even in the weird cases reported by Stephen
Frost), so I think we're safe here.

Also, I definitely recommend testing with different hash table sizes (say,
work_mem=256MB and the hash table just enough to fit in without batching).
The thing is the effect of CPU caches is very different for small and
large hash tables. (This is not about work_mem alone, but about how much
memory is used by the hash table - according to the results you posted it
never gets over ~4.5MB)

You tested against current HEAD, right? This patch was split into three
parts, two of which were already commited (45f6240a and 8cce08f1). The
logic of the patch was "this might take a of time/memory, but it's
compensated by these other changes". Maybe running the tests on the
original code would be interesting?

Although, if this last part of the patch actually improves the performance
on it's own, we're fine - it'll improve it even more compared to the old
code (especially before lowering NTUP_PER_BUCKET 10 to 1).

My only concern from the benchmarks is that it seemed like there
was a statistically significant increase in planning time:

unpatched plan time average: 0.450 ms
patched plan time average: 0.536 ms

That *might* just be noise, but it seems likely to be real. For
the improvement in run time, I'd put up with an extra 86us in
planning time per hash join; but if there's any way to shave some
of that off, all the better.

I agree with Heikki that this is probably noise, because the patch does
not mess with planner at all.

The only thing I can think of is adding a few fields into
HashJoinTableData. Maybe this makes the structure too large to fit into a
cacheline, or something?

Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#55

kgrittn@ymail.com

over 11 years ago

In reply to: Heikki Linnakangas (#53)

Re: bad estimation together with large work_mem generates terrible slow hash joins

Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 10/02/2014 03:20 AM, Kevin Grittner wrote:

My only concern from the benchmarks is that it seemed like there
was a statistically significant increase in planning time:

unpatched plan time average: 0.450 ms
patched plan time average: 0.536 ms

That *might* just be noise, but it seems likely to be real. For
the improvement in run time, I'd put up with an extra 86us in
planning time per hash join; but if there's any way to shave some
of that off, all the better.

The patch doesn't modify the planner at all, so it would be rather
surprising if it increased planning time. I'm willing to just write that
off as noise.

Fair enough. I have seen much larger variations caused by how
the executable code happened to line up relative to cacheline
boundaries; and we've never made any effort to manage that.

I've tried various other tests using \timing rather than EXPLAIN,
and the patched version looks even better in those cases. I have
seen up to 4x the performance for a query using the patched
version, higher variability in run time without the patch, and have
yet to devise a benchmark where the patched version came out slower
(although I admit to not being as good at creating such cases as
some here).

When given a generous work_mem setting the patched version often
uses more of what it is allowed than the unpatched version (which
is probably one of the reasons it tends to do better). If someone
has set a high work_mem and has gotten by only because the
configured amount is not all being used when it would benefit
performance, they may need to adjust work_mem down to avoid memory
problems. That doesn't seem to me to be a reason to reject the
patch.

This is in "Waiting on Author" status only because I never got an
answer about why the debug code used printf() rather the elog() at
a DEBUG level. Other than that, I would say this patch is Ready
for Committer. Tomas? You there?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#56

tv@fuzzy.cz

over 11 years ago

In reply to: Kevin Grittner (#55)

Re: bad estimation together with large work_mem generates terrible slow hash joins

On 9.10.2014 16:55, Kevin Grittner wrote:

I've tried various other tests using \timing rather than EXPLAIN, and
the patched version looks even better in those cases. I have seen up
to 4x the performance for a query using the patched version, higher
variability in run time without the patch, and have yet to devise a
benchmark where the patched version came out slower (although I admit
to not being as good at creating such cases as some here).

Nice. Thanks for the testing!

The only case I've been able to come up with is when the hash table fits
into work_mem only thanks to not counting the buckets. The new code will
start batching in this case.

That is mostly luck, however, because it depends on the cardinality
estimates, and the best way to fix it is increasing work_mem (which is
safer thanks to reducing the overhead).

Also, Robert proposed a way to mitigate this, if we realize we'd have to
do batching during the initial sizing, we can peek whether reducing the
number of buckets (to 1/2 or maybe 1/4) would help. I believe this is a
good approach, and will look into that after pgconf.eu (i.e. early
November), unless someone else is interested.

When given a generous work_mem setting the patched version often uses
more of what it is allowed than the unpatched version (which is
probably one of the reasons it tends to do better). If someone has
set a high work_mem and has gotten by only because the configured
amount is not all being used when it would benefit performance, they
may need to adjust work_mem down to avoid memory problems. That
doesn't seem to me to be a reason to reject the patch.

I'm not entirely sure I understand this paragraph. What do you mean by
"configured amount is not all being used when it would benefit
performance"? Can you give an example?

The only thing I can think of is the batching behavior described above.

This is in "Waiting on Author" status only because I never got an
answer about why the debug code used printf() rather the elog() at
a DEBUG level. Other than that, I would say this patch is Ready
for Committer. Tomas? You there?

I think I responded to that on October 2, quoting:

===================================================================
On 2.10.2014 09:50, Tomas Vondra wrote:

On 2.10.2014, 2:20, Kevin Grittner wrote:

The patch applied and built without problem, and pass `make
check-world`. The only thing that caught my eye was the addition of
debug code using printf() instead of logging at a DEBUG level. Is
there any reason for that?

Not really. IIRC the main reason it that the other code in
nodeHash.c uses the same approach.

===================================================================

I believe it's safe to switch the logging to elog(). IMHO the printf
logging is there from some very early version of the code, before elog
was introduced. Or something like that.

Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#57

kgrittn@ymail.com

over 11 years ago

In reply to: Tomas Vondra (#56)

Re: bad estimation together with large work_mem generates terrible slow hash joins

Tomas Vondra <tv@fuzzy.cz> wrote:

On 9.10.2014 16:55, Kevin Grittner wrote:

I've tried various other tests using \timing rather than EXPLAIN, and
the patched version looks even better in those cases. I have seen up
to 4x the performance for a query using the patched version, higher
variability in run time without the patch, and have yet to devise a
benchmark where the patched version came out slower (although I admit
to not being as good at creating such cases as some here).

Nice. Thanks for the testing!

The only case I've been able to come up with is when the hash table fits
into work_mem only thanks to not counting the buckets. The new code will
start batching in this case.

Hmm. If you look at the timings in my posting from 2014-10-01 I
had cases where the patched version started with one batch and went
to two, while the unpatched used just one batch, and the patched
version was still more than twice as fast. I'm sure the "on disk"
batch was fully cached, however; it might work out differently if
disk speed actually came into the picture.

That is mostly luck, however, because it depends on the cardinality
estimates, and the best way to fix it is increasing work_mem (which is
safer thanks to reducing the overhead).

Also, Robert proposed a way to mitigate this, if we realize we'd have to
do batching during the initial sizing, we can peek whether reducing the
number of buckets (to 1/2 or maybe 1/4) would help. I believe this is a
good approach, and will look into that after pgconf.eu (i.e. early
November), unless someone else is interested.

Sure, but it would be good to confirm that it's actually needed first.

When given a generous work_mem setting the patched version often uses
more of what it is allowed than the unpatched version (which is
probably one of the reasons it tends to do better). If someone has
set a high work_mem and has gotten by only because the configured
amount is not all being used when it would benefit performance, they
may need to adjust work_mem down to avoid memory problems. That
doesn't seem to me to be a reason to reject the patch.

I'm not entirely sure I understand this paragraph. What do you mean by
"configured amount is not all being used when it would benefit
performance"? Can you give an example?

Again, the earlier post showed that while the unpatched used 3516kB
whether it had work_mem set to 4MB or 1GB, the patched version used
3073kB when work_mem was set to 4MB and 4540kB when work_mem was
set to 1GB. The extra memory allowed the patched version to stay
at 1 batch, improving performance over the other setting.

The only thing I can think of is the batching behavior described above.

This is in "Waiting on Author" status only because I never got an
answer about why the debug code used printf() rather the elog() at
a DEBUG level. Other than that, I would say this patch is Ready
for Committer. Tomas? You there?

I think I responded to that on October 2, quoting:

===================================================================
On 2.10.2014 09:50, Tomas Vondra wrote:

On 2.10.2014, 2:20, Kevin Grittner wrote:

The patch applied and built without problem, and pass `make
check-world`. The only thing that caught my eye was the addition of
debug code using printf() instead of logging at a DEBUG level. Is
there any reason for that?

Not really. IIRC the main reason it that the other code in
nodeHash.c uses the same approach.

===================================================================

Ah, I never received that email. That tends to happen every now
and then. :-(

I believe it's safe to switch the logging to elog(). IMHO the printf
logging is there from some very early version of the code, before elog
was introduced. Or something like that.

I guess it might be best to remain consistent in this patch and
change that in a separate patch. I just wanted to make sure you
didn't see any reason not to do so.

With that addressed I will move this to Ready for Committer. Since
both Heikki and Robert spent time on this patch earlier, I'll give
either of them a shot at committing it if they want; otherwise I'll
do it.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#58

[1]: /messages/by-id/c84680e818f6a0f4a26223cd750ff252.squirrel@2.emaily.eu
/messages/by-id/c84680e818f6a0f4a26223cd750ff252.squirrel@2.emaily.eu

tv@fuzzy.cz

over 11 years ago

In reply to: Kevin Grittner (#57)

Re: bad estimation together with large work_mem generates terrible slow hash joins

Hi!

On 9.10.2014 22:28, Kevin Grittner wrote:

Tomas Vondra <tv@fuzzy.cz> wrote:

The only case I've been able to come up with is when the hash table
fits into work_mem only thanks to not counting the buckets. The new
code will start batching in this case.

Hmm. If you look at the timings in my posting from 2014-10-01 I had
cases where the patched version started with one batch and went to
two, while the unpatched used just one batch, and the patched version
was still more than twice as fast. I'm sure the "on disk" batch was
fully cached, however; it might work out differently if disk speed
actually came into the picture.

I think the case with large outer table and memory pressure, forcing the
batches to be actually written to disk (instead of just being in the
page cache) is the main concern here.

That is mostly luck, however, because it depends on the cardinality
estimates, and the best way to fix it is increasing work_mem (which is
safer thanks to reducing the overhead).

Also, Robert proposed a way to mitigate this, if we realize we'd
have to do batching during the initial sizing, we can peek whether
reducing the number of buckets (to 1/2 or maybe 1/4) would help. I
believe this is a good approach, and will look into that after
pgconf.eu (i.e. early November), unless someone else is interested.

Sure, but it would be good to confirm that it's actually needed first.

Yeah. But with cleverly crafted test case it's possible to confirm
almost everything ;-)

When given a generous work_mem setting the patched version often uses
more of what it is allowed than the unpatched version (which is
probably one of the reasons it tends to do better). If someone has
set a high work_mem and has gotten by only because the configured
amount is not all being used when it would benefit performance, they
may need to adjust work_mem down to avoid memory problems. That
doesn't seem to me to be a reason to reject the patch.

I'm not entirely sure I understand this paragraph. What do you mean by
"configured amount is not all being used when it would benefit
performance"? Can you give an example?

Again, the earlier post showed that while the unpatched used 3516kB
whether it had work_mem set to 4MB or 1GB, the patched version used
3073kB when work_mem was set to 4MB and 4540kB when work_mem was
set to 1GB. The extra memory allowed the patched version to stay
at 1 batch, improving performance over the other setting.

OK, so with 4MB the new version was batching, while the original code
keeps a single batch (likely due to not counting buckets into work_mem).
I believe that's expected behavior.

Also, in the e-mail from Oct 2 that got lost [1]/messages/by-id/c84680e818f6a0f4a26223cd750ff252.squirrel@2.emaily.eu, I pointed out that by
testing only with small hash tables the results are likely influenced by
high CPU cache hit ratio. I think benchmarking with larger inner
relation (thus increasing the memory occupied by the hash table) might
be interesting.

The only thing I can think of is the batching behavior described above.

This is in "Waiting on Author" status only because I never got an
answer about why the debug code used printf() rather the elog() at
a DEBUG level. Other than that, I would say this patch is Ready
for Committer. Tomas? You there?

I think I responded to that on October 2, quoting:

...

Ah, I never received that email. That tends to happen every now
and then. :-(

I believe it's safe to switch the logging to elog(). IMHO the printf
logging is there from some very early version of the code, before elog
was introduced. Or something like that.

I guess it might be best to remain consistent in this patch and
change that in a separate patch. I just wanted to make sure you
didn't see any reason not to do so.

OK, I'll put that on my TODO list for the next CF.

With that addressed I will move this to Ready for Committer. Since
both Heikki and Robert spent time on this patch earlier, I'll give
either of them a shot at committing it if they want; otherwise I'll
do it.

Great, thanks!
Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#59