Replace hashtable growEnable flag

Started by Hubert Zhang over 6 years ago · 3 messages
#1 Hubert Zhang
hzhang@pivotal.io
1 attachment(s)

Hi all,

When we build the hash table for a hash join node etc., we split tuples into
different hash buckets. Since the tuples cannot all be held in memory,
Postgres splits each bucket into batches: only the current batch of a bucket
is kept in memory, while the other batches are written to disk.

During ExecHashTableInsert(), if the memory cost exceeds the operator's
allowed limit (hashtable->spaceAllowed), batches are split on the fly by
calling ExecHashIncreaseNumBatches().
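
Concretely, the check currently looks roughly like this (from
ExecHashTableInsert()):

    if (hashtable->spaceUsed +
        hashtable->nbuckets_optimal * sizeof(HashJoinTuple)
        > hashtable->spaceAllowed)
        ExecHashIncreaseNumBatches(hashtable);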

In the past, if the data was distributed unevenly, the batch split could
fail (all the tuples fall into one of the two split batches and the other
stays empty). Postgres would then set hashtable->growEnabled to false and
never expand the batch number again.

If later tuples turn out to be more diverse, splitting batches is still
valuable and could prevent the current batch from growing too big and
finally causing OOM.
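
The reason a split can fail this way is that a tuple's batch is derived
purely from its hash value, roughly as in ExecHashGetBucketAndBatch():

    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);

Tuples with identical hash values therefore always land in the same batch,
and no amount of nbatch doubling can separate them.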

To fix this, we introduce a penalty on hashtable->spaceAllowed, the
threshold that determines whether to increase the batch number.
If a batch split fails, we increase the penalty instead of just turning off
the growEnabled flag.
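
For illustration (the numbers are made up), with work_mem = 4MB and two
failed splits (growPenalty = 2), the insert-time check becomes

    spaceUsed > spaceAllowed * (growPenalty + 1)    /* 4MB * 3 = 12MB */

so the next split attempt is deferred until the batch has grown by another
spaceAllowed, rather than being abandoned forever.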

Any comments?

--
Thanks

Hubert Zhang

Attachments:

0001-Using-growPenalty-to-replace-growEnable-in-hashtable.patch (application/octet-stream)
From 203988f5e788482b6445f03b3cbe4d3b47a9dc0d Mon Sep 17 00:00:00 2001
From: Hubert Zhang <hzhang@pivotal.io>
Date: Wed, 15 May 2019 10:04:00 +0000
Subject: [PATCH] Using growPenalty to replace growEnable in hashtable

When building the hash table, we need to increase the batch number
when spaceAllowed is reached. If the split fails (all the tuples fall
into one batch and the other stays empty), the growEnabled flag used
to be turned off permanently. Since later tuples may become more
diverse, we had better add a penalty instead of disabling splits
altogether.
---
 src/backend/executor/nodeHash.c | 28 ++++++++++++----------------
 src/include/executor/hashjoin.h |  2 +-
 2 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 64eec91..f8cc8e9 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -493,7 +493,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 	hashtable->curbatch = 0;
 	hashtable->nbatch_original = nbatch;
 	hashtable->nbatch_outstart = nbatch;
-	hashtable->growEnabled = true;
+	hashtable->growPenalty = 0;
 	hashtable->totalTuples = 0;
 	hashtable->partialTuples = 0;
 	hashtable->skewTuples = 0;
@@ -892,10 +892,6 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	long		nfreed;
 	HashMemoryChunk oldchunks;
 
-	/* do nothing if we've decided to shut off growth */
-	if (!hashtable->growEnabled)
-		return;
-
 	/* safety check to avoid overflow */
 	if (oldnbatch > Min(INT_MAX / 2, MaxAllocSize / (sizeof(void *) * 2)))
 		return;
@@ -1030,19 +1026,19 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 #endif
 
 	/*
-	 * If we dumped out either all or none of the tuples in the table, disable
-	 * further expansion of nbatch.  This situation implies that we have
-	 * enough tuples of identical hashvalues to overflow spaceAllowed.
-	 * Increasing nbatch will not fix it since there's no way to subdivide the
-	 * group any more finely. We have to just gut it out and hope the server
-	 * has enough RAM.
+	 * If we dumped out either all or none of the tuples in the table, penalize
+	 * further expansion of nbatch.  The penalty raises the threshold at which
+	 * we will next consider increasing nbatch.
+	 * Previously, Postgres used a simple flag, growEnabled, to disable nbatch
+	 * expansion entirely; that was too strict, since data may have identical
+	 * hashvalues at the beginning but become more diverse later.
 	 */
 	if (nfreed == 0 || nfreed == ninmemory)
 	{
-		hashtable->growEnabled = false;
+		hashtable->growPenalty++;
 #ifdef HJDEBUG
-		printf("Hashjoin %p: disabling further increase of nbatch\n",
-			   hashtable);
+		printf("Hashjoin %p: further increase of nbatch with penalty: %d\n",
+			   hashtable, hashtable->growPenalty);
 #endif
 	}
 }
@@ -1655,7 +1651,7 @@ ExecHashTableInsert(HashJoinTable hashtable,
 			hashtable->spacePeak = hashtable->spaceUsed;
 		if (hashtable->spaceUsed +
 			hashtable->nbuckets_optimal * sizeof(HashJoinTuple)
-			> hashtable->spaceAllowed)
+			> hashtable->spaceAllowed * (hashtable->growPenalty + 1))
 			ExecHashIncreaseNumBatches(hashtable);
 	}
 	else
@@ -2434,7 +2430,7 @@ ExecHashSkewTableInsert(HashJoinTable hashtable,
 		ExecHashRemoveNextSkewBucket(hashtable);
 
 	/* Check we are not over the total spaceAllowed, either */
-	if (hashtable->spaceUsed > hashtable->spaceAllowed)
+	if (hashtable->spaceUsed > hashtable->spaceAllowed * (hashtable->growPenalty + 1))
 		ExecHashIncreaseNumBatches(hashtable);
 
 	if (shouldFree)
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 2c94b92..fbc721d 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -313,7 +313,7 @@ typedef struct HashJoinTableData
 	int			nbatch_original;	/* nbatch when we started inner scan */
 	int			nbatch_outstart;	/* nbatch when we started outer scan */
 
-	bool		growEnabled;	/* flag to shut off nbatch increases */
+	int			growPenalty;	/* penalty on nbatch increases; 0 means none */
 
 	double		totalTuples;	/* # tuples obtained from inner plan */
 	double		partialTuples;	/* # tuples obtained from inner plan by me */
-- 
1.8.3.1

#2 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Hubert Zhang (#1)
Re: Replace hashtable growEnable flag

There's already another thread discussing various issues with how hashjoin
increases the number of batches, including various issues with how/when we
disable adding more batches.

https://commitfest.postgresql.org/23/2109/

In general I think you're right that something like this is necessary, but
I think we may need to rethink growEnabled a bit more.

For example, the way you implemented it, after reaching the increased
limit we just increase the number of batches exactly as we do today, and
only then decide whether it actually helped. But that means we double the
number of BufFile entries, which uses more and more memory (each one is
8kB, and we need one per batch). I think in this case (after increasing
the limit) we should first check whether increasing the number of batches
makes sense, and only do it if it helps. Otherwise we double the amount of
memory used for the BufFiles as well as the work_mem, and that's not a
good idea.
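
For a rough sense of scale (my numbers, purely illustrative): at, say,
16384 batches, the BufFiles alone need

    16384 batches * 8kB per BufFile = 128MB

per side of the join, far more than a typical work_mem.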

But as I said, there are other issues discussed on the other thread. For
example, we only disable the growth when all rows fall into the same batch,
which is overly strict: a split that moves even a single row counts as a
success, even though it helps hardly at all.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3 Hubert Zhang
hzhang@pivotal.io
In reply to: Tomas Vondra (#2)
Re: Replace hashtable growEnable flag

Thanks Tomas.
I will follow up on this problem in your thread; this one can be closed.

--
Thanks

Hubert Zhang