PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching

Started by Tomas Vondra · about 11 years ago · 16 messages
#1 Tomas Vondra
tv@fuzzy.cz
1 attachment(s)

Hi,

back when we were discussing the hashjoin patches (now committed),
Robert proposed that maybe it'd be a good idea to sometimes increase the
number of tuples per bucket instead of batching.

That is, while initially sizing the hash table - if the hash table with
enough buckets to satisfy NTUP_PER_BUCKET load factor does not fit into
work_mem, try a bit higher load factor before starting to batch.

Attached patch is an initial attempt to implement this - it's a bit
rough on the edges, but hopefully enough to judge the benefit of this.

The default load factor is 1. The patch tries to degrade this to 2, 4 or
8 in an attempt to fit the hash table into work_mem. If that doesn't work, it
starts batching with the default load factor. If the batching is
required while the hashjoin is running, the load factor is switched back
to the default one (if we're batching, there's no point in keeping the
slower hash table).
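For illustration, the sizing decision can be sketched as a small standalone function (a simplified sketch of the approach described above, not the actual patch code; the function name and parameters are made up for the example):

```c
#include <assert.h>
#include <stdbool.h>

#define NTUP_PER_BUCKET     1   /* default load factor */
#define MAX_NTUP_PER_BUCKET 8   /* degrade at most to 8 tuples per bucket */

/*
 * Double the load factor (halving the bucket array) until the hash table
 * fits into the memory limit, or until the maximum load factor is reached.
 * Returns the chosen load factor; *fits is set to false when even the
 * maximum load factor does not help and batching is required.
 */
static int
choose_load_factor(long inner_rel_bytes, long bucket_bytes,
                   long hash_table_bytes, bool *fits)
{
    int ntup = NTUP_PER_BUCKET;

    while (inner_rel_bytes + bucket_bytes > hash_table_bytes &&
           ntup * 2 <= MAX_NTUP_PER_BUCKET)
    {
        ntup *= 2;          /* allow twice as many tuples per bucket ... */
        bucket_bytes /= 2;  /* ... which means half as many buckets */
    }

    *fits = (inner_rel_bytes + bucket_bytes <= hash_table_bytes);

    /* when batching anyway, keep the default (faster) load factor */
    return *fits ? ntup : NTUP_PER_BUCKET;
}
```

Note that the fallback mirrors the behavior described above: if even 8 tuples per bucket do not make the table fit, batching starts with the default load factor.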

The patch also modifies explain output, to show the load factor.

The testing I've done so far is rather disappointing, though.

create table a as select i from generate_series(1,1000000) s(i);
create table b as select i from generate_series(1,1000000) s(i);

analyze a;
analyze b;

select a.i, b.i from a join b on (a.i = b.i);

work_mem   batches   tuples per bucket   duration
-------------------------------------------------
64 MB      1         1                    585 ms
46 MB      1         2                    639 ms
43 MB      1         4                    794 ms
40 MB      1         8                   1075 ms
39 MB      2         1                    623 ms

So, even increasing the load factor to 2 is slower than batching.

Of course, with other examples the results may be different. For example
with a much larger outer table:

create table a as select mod(i,1000000) i
from generate_series(1,10000000) s(i);
analyze a;

work_mem   batches   tuples per bucket   duration
-------------------------------------------------
64 MB      1         1                   3904 ms
46 MB      1         2                   4692 ms
43 MB      1         4                   6499 ms
40 MB      1         8                   9264 ms
39 MB      2         1                   4054 ms

Same results. Of course, for "huge" outer tables it will make a
difference. For example on a ~8GB outer table (on a machine with 8GB of
RAM), the batching causes enough I/O to make it slower than ntup=2
(e.g. ~50s with ntup=2 vs. ~80s with batching).

I'm not sure what's the best way forward - clearly, for cases that fit
into RAM (temp files into page cache), batching is faster. For tables
large enough to cause a lot of I/O, it may make a difference - but I'm
not sure how to identify these cases.

So ISTM implementing this requires a reliable way to identify which case
we're dealing with - if the outer table is large enough (prefer
increasing load factor) or not (prefer batching). Until then, keeping the
current simple/predictable approach is probably better.

Of course, this also depends on additional variables (e.g. is the system
memory-stressed?).

regards
Tomas

Attachments:

hashjoin-graceful-v1.patch (text/x-diff)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 332f04a..06e612a 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1936,26 +1936,33 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
+			ExplainPropertyLong("Hash Tuples Per Bucket",
+								hashtable->ntup_per_bucket, es);
+			ExplainPropertyLong("Original Tuples Per Bucket",
+								hashtable->ntup_per_bucket_original, es);
 		}
 		else if (hashtable->nbatch_original != hashtable->nbatch ||
-				 hashtable->nbuckets_original != hashtable->nbuckets)
+				 hashtable->nbuckets_original != hashtable->nbuckets ||
+				 hashtable->ntup_per_bucket_original != hashtable->ntup_per_bucket)
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Tuples: %d (originally %d)  Memory Usage: %ldkB\n",
 							 hashtable->nbuckets,
 							 hashtable->nbuckets_original,
 							 hashtable->nbatch,
 							 hashtable->nbatch_original,
+							 hashtable->ntup_per_bucket,
+							 hashtable->ntup_per_bucket_original,
 							 spacePeakKb);
 		}
 		else
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-						   "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
+						   "Buckets: %d  Batches: %d  Tuples: %d  Memory Usage: %ldkB\n",
 							 hashtable->nbuckets, hashtable->nbatch,
-							 spacePeakKb);
+							 hashtable->ntup_per_bucket, spacePeakKb);
 		}
 	}
 }
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 7c5bb77..fd4f56d 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -259,6 +259,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	Plan	   *outerNode;
 	int			nbuckets;
 	int			nbatch;
+	int			ntup;
 	int			num_skew_mcvs;
 	int			log2_nbuckets;
 	int			nkeys;
@@ -275,7 +276,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 
 	ExecChooseHashTableSize(outerNode->plan_rows, outerNode->plan_width,
 							OidIsValid(node->skewTable),
-							&nbuckets, &nbatch, &num_skew_mcvs);
+							&nbuckets, &nbatch, &ntup, &num_skew_mcvs);
 
 #ifdef HJDEBUG
 	printf("nbatch = %d, nbuckets = %d\n", nbatch, nbuckets);
@@ -320,6 +321,8 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->spaceAllowedSkew =
 		hashtable->spaceAllowed * SKEW_WORK_MEM_PERCENT / 100;
 	hashtable->chunks = NULL;
+	hashtable->ntup_per_bucket = ntup;
+	hashtable->ntup_per_bucket_original = ntup;
 
 	/*
 	 * Get info about the hash functions to be used for each hash key. Also
@@ -410,13 +413,15 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
  * This is exported so that the planner's costsize.c can use it.
  */
 
-/* Target bucket loading (tuples per bucket) */
+/* Target bucket loading (tuples per bucket) - keep the values 2^N */
 #define NTUP_PER_BUCKET			1
+#define MAX_NTUP_PER_BUCKET		8
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						int *numbuckets,
 						int *numbatches,
+						int *ntup,
 						int *num_skew_mcvs)
 {
 	int			tupsize;
@@ -428,6 +433,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	int			nbatch = 1;
 	int			nbuckets;
 	double		dbuckets;
+	int			ntup_per_bucket;
 
 	/* Force a plausible relation size if no info */
 	if (ntuples <= 0.0)
@@ -498,6 +504,51 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	nbuckets = Max((int) dbuckets, 1024);
 	nbuckets = 1 << my_log2(nbuckets);
 	bucket_bytes = sizeof(HashJoinTuple) * nbuckets;
+	ntup_per_bucket = NTUP_PER_BUCKET;
+
+	/*
+	 * If there's not enough space to store the projected number of tuples,
+	 * we can try to increase the number of tuples per bucket first, up to
+	 * MAX_NTUP_PER_BUCKET, in the hope that the smaller hash table will
+	 * fit into memory. This is likely still faster than batching,
+	 * especially with large outer tables that cause a lot of I/O.
+	 */
+	if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
+	{
+		int		shrink_factor = 1;
+		int		tmp_per_bucket = ntup_per_bucket;
+		bool	fits_into_memory = false;
+
+		/*
+		 * Cut the number of buckets in half (doubling the number of
+		 * tuples per bucket) until the table fits into memory, or until
+		 * we reach the maximum load factor (MAX_NTUP_PER_BUCKET).
+		 */
+		while (tmp_per_bucket * 2 <= MAX_NTUP_PER_BUCKET)
+		{
+			shrink_factor *= 2;
+			tmp_per_bucket *= 2;
+			if (inner_rel_bytes + bucket_bytes / shrink_factor <= hash_table_bytes)
+			{
+				fits_into_memory = true;
+				break;
+			}
+		}
+
+		/*
+		 * If resizing the hash table helps, we need to keep track of the
+		 * number of buckets, limit, etc.
+		 */
+		if (fits_into_memory)
+		{
+			elog(WARNING, "hash table fits into memory with shrink factor = %d",
+				shrink_factor);
+
+			nbuckets     /= shrink_factor;
+			bucket_bytes /= shrink_factor;
+			ntup_per_bucket = tmp_per_bucket;
+		}
+	}
 
 	/*
 	 * If there's not enough space to store the projected number of tuples
@@ -544,6 +595,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 
 	*numbuckets = nbuckets;
 	*numbatches = nbatch;
+	*ntup = ntup_per_bucket;
 }
 
 
@@ -610,6 +662,78 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 		   nbatch, (unsigned long) hashtable->spaceUsed);
 #endif
 
+	/*
+	 * In batching mode, always shoot for NTUP_PER_BUCKET, without any
+	 * graceful degradation (allowing more tuples instead of batching).
+	 * If that's the case (can only happen once, on the first split),
+	 * then we need to compute the number of buckets for the batching
+	 * mode, using the same logic as in ExecChooseHashTableSize().
+	 *
+	 * At this point we don't know the estimated tuple width etc.
+	 * But we do know how much space we're using and how many tuples
+	 * there are, so we can compute the tuple size (maybe even more
+	 * accurately).
+	 *
+	 * The question is whether this increase of bucket space (we're
+	 * lowering NTUP_PER_BUCKET, thus requiring more space for the
+	 * buckets) can result in more than doubling the number of batches
+	 * in a single step. And the answer is no, because even if we
+	 * increase the number of buckets to the optimum value (with a load
+	 * factor 1), it still can't occupy more than 50% of work_mem, just
+	 * like the tuples after increasing the number of batches. So we're
+	 * still just doubling the number of batches.
+	 *
+	 * XXX Could this reasoning be broken by computing a different
+	 *     tuple width estimate?
+	 *
+	 * XXX We have already increased nbatch.
+	 */
+	if (hashtable->ntup_per_bucket != NTUP_PER_BUCKET)
+	{
+		long		lbuckets;
+		long		bucket_size;
+		long		tupsize;
+
+		long		hash_table_bytes = work_mem * 1024L;
+		long		max_pointers = (work_mem * 1024L) / sizeof(void *);
+
+		long		bucket_bytes;
+		long		nbuckets;
+
+		Assert(oldnbatch == 1);
+
+		/* use the optimal number of tuples per bucket */
+		hashtable->ntup_per_bucket = NTUP_PER_BUCKET;
+
+		/* compute size of a single tuple (including overhead) */
+		tupsize = ceil((double)hashtable->spaceUsed / hashtable->totalTuples);
+
+		/*
+		 * Estimate the number of buckets we'll want to have when work_mem
+		 * is entirely full.  Each bucket will contain a bucket pointer plus
+		 * NTUP_PER_BUCKET tuples, whose projected size already includes
+		 * overhead for the hash code, pointer to the next tuple, etc.
+		 */
+		bucket_size = (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple));
+		lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
+		lbuckets = Min(lbuckets, max_pointers);
+		nbuckets = (int) lbuckets;
+		bucket_bytes = nbuckets * sizeof(HashJoinTuple);
+
+		/*
+		 * Buckets are simple pointers to hashjoin tuples, while tupsize
+		 * includes the pointer, hash code, and MinimalTupleData.  So buckets
+		 * should never really exceed 25% of work_mem (even for
+		 * NTUP_PER_BUCKET=1); except maybe for work_mem values that are
+		 * not 2^N bytes, where we might get more because of doubling.
+		 * So let's look for 50% here.
+		 */
+		Assert(bucket_bytes <= hash_table_bytes / 2);
+
+		hashtable->nbuckets_optimal = nbuckets;
+		hashtable->log2_nbuckets_optimal = my_log2(nbuckets);
+	}
+
 	oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
 
 	if (hashtable->innerBatchFile == NULL)
@@ -873,7 +997,7 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		 */
 		if ((hashtable->nbatch == 1) &&
 			(hashtable->nbuckets_optimal <= INT_MAX/2) &&	/* overflow protection */
-			(ntuples >= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET)))
+			(ntuples >= (hashtable->nbuckets_optimal * hashtable->ntup_per_bucket)))
 		{
 			hashtable->nbuckets_optimal *= 2;
 			hashtable->log2_nbuckets_optimal += 1;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 659daa2..ef6e0f7 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2420,6 +2420,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	int			num_hashclauses = list_length(hashclauses);
 	int			numbuckets;
 	int			numbatches;
+	int			ntup;
 	int			num_skew_mcvs;
 
 	/* cost of source data */
@@ -2456,6 +2457,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 							true,		/* useskew */
 							&numbuckets,
 							&numbatches,
+							&ntup,
 							&num_skew_mcvs);
 
 	/*
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 0e1e0cd..b7eefac 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -131,6 +131,9 @@ typedef struct HashJoinTableData
 	int			nbuckets_optimal;	/* optimal # buckets (per batch) */
 	int			log2_nbuckets_optimal;	/* same as log2_nbuckets optimal */
 
+	int			ntup_per_bucket;	/* number of tuples per bucket */
+	int			ntup_per_bucket_original;	/* original number of tuples per bucket */
+
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
 	struct HashJoinTupleData **buckets;
 	/* buckets array is per-batch storage, as are all the tuples */
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 75be5bd..a96ad90 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -47,6 +47,7 @@ extern void ExecHashTableResetMatchFlags(HashJoinTable hashtable);
 extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						int *numbuckets,
 						int *numbatches,
+						int *ntup,
 						int *num_skew_mcvs);
 extern int	ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue);

#2 Kevin Grittner
kgrittn@ymail.com
In reply to: Tomas Vondra (#1)
Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching

Tomas Vondra <tv@fuzzy.cz> wrote:

back when we were discussing the hashjoin patches (now committed),
Robert proposed that maybe it'd be a good idea to sometimes increase the
number of tuples per bucket instead of batching.

That is, while initially sizing the hash table - if the hash table with
enough buckets to satisfy NTUP_PER_BUCKET load factor does not fit into
work_mem, try a bit higher load factor before starting to batch.

Attached patch is an initial attempt to implement this - it's a bit
rough on the edges, but hopefully enough to judge the benefit of this.

The default load factor is 1. The patch tries to degrade this to 2, 4 or
8 in an attempt to fit the hash table into work_mem. If that doesn't work, it
starts batching with the default load factor. If the batching is
required while the hashjoin is running, the load factor is switched back
to the default one (if we're batching, there's no point in keeping the
slower hash table).

The patch also modifies explain output, to show the load factor.

The testing I've done so far is rather disappointing, though.

create table a as select i from generate_series(1,1000000) s(i);
create table b as select i from generate_series(1,1000000) s(i);

analyze a;
analyze b;

select a.i, b.i from a join b on (a.i = b.i);

work_mem   batches   tuples per bucket   duration
-------------------------------------------------
64 MB      1         1                    585 ms
46 MB      1         2                    639 ms
43 MB      1         4                    794 ms
40 MB      1         8                   1075 ms
39 MB      2         1                    623 ms

So, even increasing the load factor to 2 is slower than batching.

Right, I saw the same thing.

I tried pretty hard to create a case where increasing the load
factor from 1 to 2 was faster than going to a second batch, and was
unable to do so. I hypothesized that the second batch was cached
by the OS and flipping the data in and out of the OS cache was
faster than chasing through the links. I expect that if you have a
large enough hash table to barely exceed your machine's ability to
cache, you *might* see a benefit in the hash table access by
increasing the load factor. I think it would be incredibly hard to
accurately identify those cases, and every time a hash table was
used there would be a cost to trying to figure it out. I just
can't see this being a win.

I'm not sure what's the best way forward - clearly, for cases that fit
into RAM (temp files into page cache), batching is faster. For tables
large enough to cause a lot of I/O, it may make a difference - but I'm
not sure how to identify these cases.

So ISTM implementing this requires a reliable way to identify which case
we're dealing with - if the outer table is large enough (prefer
increasing load factor) or not (prefer batching). Until then, keeping the
current simple/predictable approach is probably better.

Of course, this also depends on additional variables (e.g. is the system
memory-stressed?).

All that is on-target, but my take-away is that increasing load
factor to avoid a second batch was an interesting idea that turns
out to not really be a good one. If someone can craft a
reproducible test case that demonstrates a win for increasing the
load factor that doesn't have significant cost for cases where it
isn't a win, I might change my opinion; but count me as a skeptic.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#1)
Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching

On Sat, Dec 6, 2014 at 10:08 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

select a.i, b.i from a join b on (a.i = b.i);

I think the concern is that the inner side might be something more
elaborate than a plain table scan, like an aggregate or join. I might
be all wet, but my impression is that you can make rescanning
arbitrarily expensive if you work at it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#4 Kevin Grittner
kgrittn@ymail.com
In reply to: Robert Haas (#3)
Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching

Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Dec 6, 2014 at 10:08 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

select a.i, b.i from a join b on (a.i = b.i);

I think the concern is that the inner side might be something more
elaborate than a plain table scan, like an aggregate or join. I might
be all wet, but my impression is that you can make rescanning
arbitrarily expensive if you work at it.

I'm not sure I'm following. Let's use a function to select from b:

create or replace function fb()
returns setof b
language plpgsql
rows 1
as $$
begin
return query select i from b;
end;
$$;

explain (analyze, buffers, verbose)
select a.i, b.i from a join fb() b on (a.i = b.i);

I used the low row estimate to cause the planner to put this on the inner side.

16 batches
Execution time: 1638.582 ms

Now let's make it slow.

create or replace function fb()
returns setof b
language plpgsql
rows 1
as $$
begin
perform pg_sleep(2.0);
return query select i from b;
end;
$$;
explain (analyze, buffers, verbose)
select a.i, b.i from a join fb() b on (a.i = b.i);

16 batches
Execution time: 3633.859 ms

Under what conditions do you see the inner side get loaded into the
hash table multiple times?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#5 Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#4)
Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching

On Thu, Dec 11, 2014 at 12:29 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Dec 6, 2014 at 10:08 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

select a.i, b.i from a join b on (a.i = b.i);

I think the concern is that the inner side might be something more
elaborate than a plain table scan, like an aggregate or join. I might
be all wet, but my impression is that you can make rescanning
arbitrarily expensive if you work at it.

I'm not sure I'm following. Let's use a function to select from b:

create or replace function fb()
returns setof b
language plpgsql
rows 1
as $$
begin
return query select i from b;
end;
$$;

explain (analyze, buffers, verbose)
select a.i, b.i from a join fb() b on (a.i = b.i);

I used the low row estimate to cause the planner to put this on the inner side.

16 batches
Execution time: 1638.582 ms

Now let's make it slow.

create or replace function fb()
returns setof b
language plpgsql
rows 1
as $$
begin
perform pg_sleep(2.0);
return query select i from b;
end;
$$;
explain (analyze, buffers, verbose)
select a.i, b.i from a join fb() b on (a.i = b.i);

16 batches
Execution time: 3633.859 ms

Under what conditions do you see the inner side get loaded into the
hash table multiple times?

Huh, interesting. I guess I was thinking that the inner side got
rescanned for each new batch, but I guess that's not what happens.

Maybe there's no real problem here, and we just win.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#6 Tomas Vondra
tv@fuzzy.cz
In reply to: Robert Haas (#5)
Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching

On 11.12.2014 20:00, Robert Haas wrote:

On Thu, Dec 11, 2014 at 12:29 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

Under what conditions do you see the inner side get loaded into the
hash table multiple times?

Huh, interesting. I guess I was thinking that the inner side got
rescanned for each new batch, but I guess that's not what happens.

No, it's not rescanned. It's scanned only once (for the batch #0), and
tuples belonging to the other batches are stored in files. If the number
of batches needs to be increased (e.g. because of an incorrect estimate of
the inner table), the tuples are moved later.
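For context, the batch number is derived from the tuple's hash value; this sketch is loosely modeled on PostgreSQL's ExecHashGetBucketAndBatch (with most details omitted), and shows why doubling nbatch moves a tuple either nowhere or into exactly one new batch:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified batch-number computation: the low hash bits pick the bucket,
 * the next bits pick the batch (nbatch is always a power of two).
 */
static int
batch_for(uint32_t hashvalue, int log2_nbuckets, int nbatch)
{
    if (nbatch == 1)
        return 0;
    return (int) ((hashvalue >> log2_nbuckets) & (uint32_t) (nbatch - 1));
}
```

Because nbatch always doubles, a tuple in batch b either stays in b or moves to b + old_nbatch; masking the new batch number with (old_nbatch - 1) recovers the old one, which is why a split never needs to re-read earlier batches.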

Maybe there's no real problem here, and we just win.

I'm a bit confused by this discussion, because the inner relation has
nothing to do with this patch. It gets scanned exactly once, no matter
what the load factor is. If batching is necessary, only the already
written files are read (without re-executing the inner part). However,
in that case this patch makes no difference, because it explicitly
reverts to load factor = NTUP_PER_BUCKET (which is 1).

The only point of this patch was to prevent batching because of the
outer table. Usually, the outer table is much larger than the inner one
(e.g. in a star schema, outer = fact table, inner = dimension). Batching
the outer table means you have to write >= 50% into a temporary file.
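The ">= 50%" figure follows from the batch split: only batch 0 of the outer table is probed directly, while the remaining nbatch-1 batches are written to temp files first. A back-of-the-envelope sketch (assuming a uniform hash distribution; the helper is hypothetical):

```c
#include <assert.h>

/*
 * Fraction of the outer table written to temporary files when the join
 * runs in nbatch batches: everything except batch 0, i.e.
 * (nbatch - 1) / nbatch, under a uniform hash distribution.
 */
static double
outer_spill_fraction(int nbatch)
{
    assert(nbatch >= 1);
    return (double) (nbatch - 1) / (double) nbatch;
}
```

So even the smallest split (nbatch = 2) spills half of the outer table, and the fraction only grows from there.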

The idea was that if we could increase the load a bit (e.g. using 2
tuples per bucket instead of 1), we will still use a single batch in
some cases (when we miss the work_mem threshold by just a bit). The
lookups will be slower, but we'll save the I/O.

regards
Tomas


#7 Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#6)
Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching

On Thu, Dec 11, 2014 at 2:51 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

No, it's not rescanned. It's scanned only once (for the batch #0), and
tuples belonging to the other batches are stored in files. If the number
of batches needs to be increased (e.g. because of an incorrect estimate of
the inner table), the tuples are moved later.

Yeah, I think I sort of knew that, but I got confused. Thanks for clarifying.

The idea was that if we could increase the load a bit (e.g. using 2
tuples per bucket instead of 1), we will still use a single batch in
some cases (when we miss the work_mem threshold by just a bit). The
lookups will be slower, but we'll save the I/O.

Yeah. That seems like a valid theory, but your test results so far
seem to indicate that it's not working out like that - which I find
quite surprising, but, I mean, it is what it is, right?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#8 Tomas Vondra
tv@fuzzy.cz
In reply to: Robert Haas (#7)
Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching

On 11.12.2014 22:16, Robert Haas wrote:

On Thu, Dec 11, 2014 at 2:51 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

No, it's not rescanned. It's scanned only once (for the batch #0), and
tuples belonging to the other batches are stored in files. If the number
of batches needs to be increased (e.g. because of an incorrect estimate of
the inner table), the tuples are moved later.

Yeah, I think I sort of knew that, but I got confused. Thanks for clarifying.

The idea was that if we could increase the load a bit (e.g. using 2
tuples per bucket instead of 1), we will still use a single batch in
some cases (when we miss the work_mem threshold by just a bit). The
lookups will be slower, but we'll save the I/O.

Yeah. That seems like a valid theory, but your test results so far
seem to indicate that it's not working out like that - which I find
quite surprising, but, I mean, it is what it is, right?

Not exactly. My tests show that as long as the outer table batches fit
into page cache, increasing the load factor results in worse performance
than batching.

When the outer table is "sufficiently small", the batching is faster.

Regarding the "sufficiently small" - considering today's hardware, we're
probably talking about gigabytes. On machines with significant memory
pressure (forcing the temporary files to disk), it might be much lower,
of course. Of course, it also depends on kernel settings (e.g.
dirty_bytes/dirty_background_bytes).

If we could identify those cases (at least the "temp files > RAM") then
maybe we could do this. Otherwise we're going to penalize all the other
queries ...

Maybe the best solution for now is an "increase work_mem a bit"
recommendation.

regards
Tomas


#9 Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#8)
Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching

On Thu, Dec 11, 2014 at 5:46 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

The idea was that if we could increase the load a bit (e.g. using 2
tuples per bucket instead of 1), we will still use a single batch in
some cases (when we miss the work_mem threshold by just a bit). The
lookups will be slower, but we'll save the I/O.

Yeah. That seems like a valid theory, but your test results so far
seem to indicate that it's not working out like that - which I find
quite surprising, but, I mean, it is what it is, right?

Not exactly. My tests show that as long as the outer table batches fit
into page cache, increasing the load factor results in worse performance
than batching.

When the outer table is "sufficiently small", the batching is faster.

Regarding the "sufficiently small" - considering today's hardware, we're
probably talking about gigabytes. On machines with significant memory
pressure (forcing the temporary files to disk), it might be much lower,
of course. Of course, it also depends on kernel settings (e.g.
dirty_bytes/dirty_background_bytes).

Well, this is sort of one of the problems with work_mem. When we
switch to a tape sort, or a tape-based materialize, we're probably far
from out of memory. But trying to set work_mem to the amount of
memory we have can easily result in a memory overrun if a load spike
causes lots of people to do it all at the same time. So we have to
set work_mem conservatively, but then the costing doesn't really come
out right. We could add some more costing parameters to try to model
this, but it's not obvious how to get it right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#10 Tomas Vondra
tv@fuzzy.cz
In reply to: Robert Haas (#9)
Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching

On 12.12.2014 14:19, Robert Haas wrote:

On Thu, Dec 11, 2014 at 5:46 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

Regarding the "sufficiently small" - considering today's hardware, we're
probably talking about gigabytes. On machines with significant memory
pressure (forcing the temporary files to disk), it might be much lower,
of course. Of course, it also depends on kernel settings (e.g.
dirty_bytes/dirty_background_bytes).

Well, this is sort of one of the problems with work_mem. When we
switch to a tape sort, or a tape-based materialize, we're probably far
from out of memory. But trying to set work_mem to the amount of
memory we have can easily result in a memory overrun if a load spike
causes lots of people to do it all at the same time. So we have to
set work_mem conservatively, but then the costing doesn't really come
out right. We could add some more costing parameters to try to model
this, but it's not obvious how to get it right.

Ummm, I don't think that's what I proposed. What I had in mind was a
flag "the batches are likely to stay in page cache". Because when it is
likely, batching is probably faster (compared to increased load factor).

Tomas


#11 Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#10)
Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching

On Fri, Dec 12, 2014 at 11:50 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

On 12.12.2014 14:19, Robert Haas wrote:

On Thu, Dec 11, 2014 at 5:46 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

Regarding the "sufficiently small" - considering today's hardware, we're
probably talking about gigabytes. On machines with significant memory
pressure (forcing the temporary files to disk), it might be much lower,
of course. Of course, it also depends on kernel settings (e.g.
dirty_bytes/dirty_background_bytes).

Well, this is sort of one of the problems with work_mem. When we
switch to a tape sort, or a tape-based materialize, we're probably far
from out of memory. But trying to set work_mem to the amount of
memory we have can easily result in a memory overrun if a load spike
causes lots of people to do it all at the same time. So we have to
set work_mem conservatively, but then the costing doesn't really come
out right. We could add some more costing parameters to try to model
this, but it's not obvious how to get it right.

Ummm, I don't think that's what I proposed. What I had in mind was a
flag "the batches are likely to stay in page cache". Because when it is
likely, batching is probably faster (compared to increased load factor).

How will you know whether to set the flag?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#12Tomas Vondra
tv@fuzzy.cz
In reply to: Robert Haas (#11)
Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching

On 12.12.2014 22:13, Robert Haas wrote:

On Fri, Dec 12, 2014 at 11:50 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

On 12.12.2014 14:19, Robert Haas wrote:

On Thu, Dec 11, 2014 at 5:46 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

Regarding the "sufficiently small" - considering today's hardware, we're
probably talking about gigabytes. On machines with significant memory
pressure (forcing the temporary files to disk), it might be much lower,
of course. Of course, it also depends on kernel settings (e.g.
dirty_bytes/dirty_background_bytes).

Well, this is sort of one of the problems with work_mem. When we
switch to a tape sort, or a tape-based materialize, we're probably far
from out of memory. But trying to set work_mem to the amount of
memory we have can easily result in a memory overrun if a load spike
causes lots of people to do it all at the same time. So we have to
set work_mem conservatively, but then the costing doesn't really come
out right. We could add some more costing parameters to try to model
this, but it's not obvious how to get it right.

Ummm, I don't think that's what I proposed. What I had in mind was a
flag "the batches are likely to stay in page cache". Because when it is
likely, batching is probably faster (compared to increased load factor).

How will you know whether to set the flag?

I don't know. I just wanted to make it clear that I'm not suggesting
messing with work_mem (increasing it or whatever). Or maybe I got your
comments about memory overrun etc. wrong - now that I read it again,
maybe it's meant just as an example of how difficult a problem it is?

regards
Tomas


#13Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#12)
Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching

On Fri, Dec 12, 2014 at 4:54 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

Well, this is sort of one of the problems with work_mem. When we
switch to a tape sort, or a tape-based materialize, we're probably far
from out of memory. But trying to set work_mem to the amount of
memory we have can easily result in a memory overrun if a load spike
causes lots of people to do it all at the same time. So we have to
set work_mem conservatively, but then the costing doesn't really come
out right. We could add some more costing parameters to try to model
this, but it's not obvious how to get it right.

Ummm, I don't think that's what I proposed. What I had in mind was a
flag "the batches are likely to stay in page cache". Because when it is
likely, batching is probably faster (compared to increased load factor).

How will you know whether to set the flag?

I don't know. I just wanted to make it clear that I'm not suggesting
messing with work_mem (increasing it or whatever). Or maybe I got your
comments about memory overrun etc. wrong - now that I read it again,
maybe it's meant just as an example of how difficult a problem it is?

More or less, yeah.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#14Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#9)
Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching

On Fri, Dec 12, 2014 at 5:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Well, this is sort of one of the problems with work_mem. When we
switch to a tape sort, or a tape-based materialize, we're probably far
from out of memory. But trying to set work_mem to the amount of
memory we have can easily result in a memory overrun if a load spike
causes lots of people to do it all at the same time. So we have to
set work_mem conservatively, but then the costing doesn't really come
out right. We could add some more costing parameters to try to model
this, but it's not obvious how to get it right.

I've heard of using "set work_mem = *" with advisory locks plenty of
times. There might be a better way to set it dynamically than a full
admission control implementation.
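
A hedged sketch of the advisory-lock pattern Peter mentions (the lock
key 42 and the 256MB setting are arbitrary placeholders, not values
from this thread): serialize the "big work_mem" queries behind a single
transaction-level advisory lock, so that at most one backend at a time
runs with the raised setting.

```sql
BEGIN;
-- Block until no other backend holds the "big sort/join" lock.
-- The key (42) is an application-chosen constant.
SELECT pg_advisory_xact_lock(42);

-- SET LOCAL scopes the override to this transaction only.
SET LOCAL work_mem = '256MB';

SELECT a.i, b.i FROM a JOIN b ON (a.i = b.i);

COMMIT;  -- releases both the advisory lock and the SET LOCAL override
```

This caps the number of backends that can claim the large allocation at
once, which is a crude form of the admission control being discussed.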

--
Peter Geoghegan


#15Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#8)
Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching

On 11.12.2014 23:46, Tomas Vondra wrote:

On 11.12.2014 22:16, Robert Haas wrote:

On Thu, Dec 11, 2014 at 2:51 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

The idea was that if we could increase the load a bit (e.g. using 2
tuples per bucket instead of 1), we will still use a single batch in
some cases (when we miss the work_mem threshold by just a bit). The
lookups will be slower, but we'll save the I/O.

Yeah. That seems like a valid theory, but your test results so far
seem to indicate that it's not working out like that - which I find
quite surprising, but, I mean, it is what it is, right?

Not exactly. My tests show that as long as the outer table batches fit
into page cache, increasing the load factor results in worse performance
than batching.

When the outer table is "sufficiently small", the batching is faster.

Regarding the "sufficiently small" - considering today's hardware, we're
probably talking about gigabytes. On machines with significant memory
pressure (forcing the temporary files to disk), it might be much lower,
of course. Of course, it also depends on kernel settings (e.g.
dirty_bytes/dirty_background_bytes).

If we could identify those cases (at least the "temp files > RAM" case)
then maybe we could do this. Otherwise we're going to penalize all the
other queries ...

Maybe the best solution for now is the "increase work_mem a bit"
recommendation.
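
As a hedged illustration of that recommendation (table names and
work_mem values taken from the test case earlier in this thread; the
plan lines shown are the general shape of the output, not exact
figures), EXPLAIN ANALYZE makes it easy to see whether a given setting
avoids batching:

```sql
-- Just below the threshold: the Hash node reports multiple batches.
SET work_mem = '39MB';
EXPLAIN (ANALYZE) SELECT a.i, b.i FROM a JOIN b ON (a.i = b.i);
-- look for a line like:  Buckets: ...  Batches: 2  Memory Usage: ...

-- With enough memory the whole hash table fits in a single batch.
SET work_mem = '64MB';
EXPLAIN (ANALYZE) SELECT a.i, b.i FROM a JOIN b ON (a.i = b.i);
-- the same Hash node now reports:  Batches: 1
```

So a user hitting the batching cliff can verify directly how much extra
work_mem is needed to stay at a single batch.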

I think it's time to mark this patch as rejected (or maybe returned with
feedback). The patch was meant as an attempt to implement Robert's idea
from the hashjoin patch, but apparently we have no clear idea how to do
it without hurting performance for many existing users.

Maybe we can try again later, but there's no point in keeping this in
the current CF.

Any objections?

regards
Tomas


#16Michael Paquier
michael.paquier@gmail.com
In reply to: Tomas Vondra (#15)
Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching

On Thu, Jan 15, 2015 at 5:02 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

Maybe we can try again later, but there's no point in keeping this in
the current CF.

Any objections?

None, marked as rejected.
--
Michael