Planner chose a much slower plan in hashjoin, using a large table as the inner table.

Started by Jinbao Chen about 6 years ago, 7 messages
#1 Jinbao Chen
jinchen@pivotal.io

Hi Hackers,

The planner will use the big table as the inner table in a hash join if the
small table has fewer unique values. But this plan is much slower than using
the small table as the inner table. This problem occurs on the master branch
without parallel scan.

For example:

create table t_small(a int);
create table t_big(b int);
insert into t_small select i%100 from generate_series(0, 3000)i;
insert into t_big select i%100000 from generate_series(1, 100000000)i ;
analyze t_small;
analyze t_big;
set max_parallel_workers_per_gather = 0;

and the plan made by the planner is:
demo2=# explain select * from t_small, t_big where a = b;
                                   QUERY PLAN
-------------------------------------------------------------------------------
 Hash Join  (cost=3083104.72..3508073.65 rows=3045990 width=8)
   Hash Cond: (t_small.a = t_big.b)
   ->  Seq Scan on t_small  (cost=0.00..44.01 rows=3001 width=4)
   ->  Hash  (cost=1442478.32..1442478.32 rows=100000032 width=4)
         ->  Seq Scan on t_big  (cost=0.00..1442478.32 rows=100000032 width=4)

and it runs in nearly 58 seconds:
demo2=# select * from t_small, t_big where a = b;
Time: 58544.525 ms (00:58.545)

But if we do some hacking and use the small table as the inner table, it runs in about 19 seconds:
demo2=# explain select * from t_small, t_big where a = b;
                                QUERY PLAN
-------------------------------------------------------------------------
 Hash Join  (cost=81.52..1723019.82 rows=3045990 width=8)
   Hash Cond: (t_big.b = t_small.a)
   ->  Seq Scan on t_big  (cost=0.00..1442478.32 rows=100000032 width=4)
   ->  Hash  (cost=44.01..44.01 rows=3001 width=4)
         ->  Seq Scan on t_small  (cost=0.00..44.01 rows=3001 width=4)

demo2=# select * from t_small, t_big where a = b;
Time: 18751.588 ms (00:18.752)

Root cause analysis (RCA):

The cost of the inner side mainly comes from building the hash table:
startup_cost += (cpu_operator_cost * num_hashclauses + cpu_tuple_cost)
    * inner_path_rows;

The cost of the outer side mainly comes from probing the hash table.
Calculating the hash value:
run_cost += cpu_operator_cost * num_hashclauses * outer_path_rows;

Traversing the linked list in the bucket and comparing:
run_cost += hash_qual_cost.per_tuple * outer_path_rows *
    clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;

In general, the cost of building a hash table is higher than the cost of
probing it, so we tend to use the small table as the inner table. But if the
average chain length of the buckets is large, the situation is just the
opposite.

In the test case above, the small table has 3000 tuples and 100 distinct
values on column ‘a’. If we use the small table as the inner table, the chain
length of each used bucket is 30, and we have to search the whole chain on
every probe of the hash table. So the cost of probing is bigger than the cost
of building the hash table, and the planner makes the big table the inner
table.
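
To see roughly how the model arrives at that choice, here is a minimal
standalone sketch that plugs the example's numbers into the CPU-cost terms
quoted above. The constants are the default cost GUC values and the
bucket-size fractions are rough stand-ins (about 1/ndistinct) for what the
planner estimates, so this only illustrates the shape of the comparison, not
final_cost_hashjoin()'s exact arithmetic, and it ignores the scan costs.

#include <stdio.h>

/* Approximate hash-join CPU cost terms for one inner/outer orientation. */
static double
hash_cpu_cost(double inner_rows, double outer_rows, double innerbucketsize)
{
	const double cpu_operator_cost = 0.0025;	/* default GUC value */
	const double cpu_tuple_cost = 0.01;			/* default GUC value */
	const double qual_per_tuple = cpu_operator_cost;	/* one equality op */

	/* build: hash and insert every inner tuple */
	double cost = (cpu_operator_cost + cpu_tuple_cost) * inner_rows;

	/* probe: hash every outer tuple ... */
	cost += cpu_operator_cost * outer_rows;
	/* ... and walk half of an average bucket chain for each of them */
	cost += qual_per_tuple * outer_rows * (inner_rows * innerbucketsize) * 0.5;

	return cost;
}

int
main(void)
{
	/* t_small as inner: 3001 rows, ~100 distinct values -> chains of ~30 */
	printf("t_small inner: %.0f\n", hash_cpu_cost(3001, 100000000, 1.0 / 100));
	/* t_big as inner: 1e8 rows, ~100000 distinct values -> chains of ~1000 */
	printf("t_big inner:   %.0f\n", hash_cpu_cost(100000000, 3001, 1.0 / 100000));
	return 0;
}

With these numbers the model charges roughly 4.0 million cost units for
hashing t_small but only about 1.25 million for hashing t_big, so the plan
above is the expected choice even though the t_small-inner plan is actually
much faster.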

But in fact this is not true. We initialized 620,000 buckets in the hash
table, but only 100 of them have chains, each of length 30; all the other
buckets are empty, so for most probes only hash values need to be compared,
which costs very little. The outer table has 100,000 distinct keys and
100,000,000 tuples. Only (100/100000)*tuple_num tuples will search a whole
chain; the other tuples (number = (99900/100000)*tuple_num) just compare hash
values. So the actual cost is much smaller than the planner calculated. This
is the reason why using the small table as the inner table is faster.
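
Putting numbers on that reasoning (a back-of-the-envelope sketch using the
figures quoted above, which are the thread's estimates rather than measured
values):

#include <stdio.h>

int
main(void)
{
	double		outer_rows = 100000000.0;	/* tuples in t_big */
	double		outer_ndistinct = 100000.0;
	double		inner_ndistinct = 100.0;	/* == number of non-empty buckets */
	double		chain_len = 30.0;			/* ~3000 inner rows / 100 distinct values */

	/* only probes whose key exists in t_small walk a chain and run the qual */
	double		quals_actual = outer_rows * (inner_ndistinct / outer_ndistinct) * chain_len;

	/* the current model charges half an average chain for every probe */
	double		quals_model = outer_rows * chain_len * 0.5;

	printf("qual evaluations: ~%.2e actual vs ~%.2e charged\n",
		   quals_actual, quals_model);
	return 0;
}

That is about 3 million comparisons actually needed (which also matches the
plan's estimate of roughly 3 million output rows) versus 1.5 billion charged
by the model, a factor of about 500, which is why the estimate for the
t_small-inner plan is so far off.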

#2 Thomas Munro
thomas.munro@gmail.com
In reply to: Jinbao Chen (#1)
Re: Planner chose a much slower plan in hashjoin, using a large table as the inner table.

On Mon, Nov 18, 2019 at 7:48 PM Jinbao Chen <jinchen@pivotal.io> wrote:

In the test case above, the small table has 3000 tuples and 100 distinct values on column ‘a’.
If we use the small table as the inner table, the chain length of each used bucket is 30, and we
have to search the whole chain on every probe of the hash table. So the cost of probing is bigger
than the cost of building the hash table, and the planner makes the big table the inner table.

But in fact this is not true. We initialized 620,000 buckets in the hash table, but only 100 of
them have chains, each of length 30; all the other buckets are empty, so for most probes only
hash values need to be compared, which costs very little. The outer table has 100,000 distinct
keys and 100,000,000 tuples. Only (100/100000)*tuple_num tuples will search a whole chain;
the other tuples (number = (99900/100000)*tuple_num) just compare hash values. So the actual
cost is much smaller than the planner calculated. This is the reason why using the small table
as the inner table is faster.

So basically we think that if t_big is on the outer side, we'll do
100,000,000 probes and each one is going to scan a t_small bucket with
chain length 30, so that looks really expensive. Actually only a
small percentage of its probes find tuples with the right hash value,
but final_cost_hashjoin() doesn't know that. So we hash t_big
instead, which we estimated pretty well and it finishes up with
buckets of length 1,000 (which is actually fine in this case, they're
not unwanted hash collisions, they're duplicate keys that we need to
emit) and we probe them 3,000 times (which is also fine in this case),
but we had to do a bunch of memory allocation and/or batch file IO and
that turns out to be slower.

I am not at all sure about this but I wonder if it would be better to
use something like:

run_cost += outer_path_rows * some_small_probe_cost;
run_cost += hash_qual_cost.per_tuple * approximate_tuple_count();

If we can estimate how many tuples will actually match accurately,
that should also be the number of times we have to run the quals,
since we don't usually expect hash collisions (bucket collisions, yes,
but hash collisions where the key doesn't turn out to be equal, no*).

* ... but also yes as you approach various limits, so you could also
factor in bucket chain length that is due to being prevented from
expanding the number of buckets by arbitrary constraints, and perhaps
also birthday_problem(hash size, key space) to factor in unwanted hash
collisions that start to matter once you get to billions of keys and
expect collisions with short hashes.
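
For illustration, here is a sketch of how that might look inside
final_cost_hashjoin(); some_small_probe_cost is a placeholder constant, and
the matched-row estimate is written as a call to the existing
approx_tuple_count() helper, which is only an assumption about how one might
obtain it:

	/*
	 * Hypothetical replacement for the bucket-scan term (sketch only).
	 * Charge a small per-probe cost for hashing and landing in a bucket,
	 * plus the hash quals only for the rows we actually expect to match.
	 */
	double		some_small_probe_cost = cpu_operator_cost;	/* placeholder */
	double		matched_tuples = approx_tuple_count(root, &path->jpath,
													hashclauses);	/* assumed helper */

	run_cost += outer_path_rows * some_small_probe_cost;
	run_cost += hash_qual_cost.per_tuple * matched_tuples;

In the example above this would charge the hash quals for roughly 3 million
matching rows instead of 1.5 billion chain entries.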

#3 Jinbao Chen
jinchen@pivotal.io
In reply to: Thomas Munro (#2)
Re: Planner chose a much slower plan in hashjoin, using a large table as the inner table.

I think we have the same understanding of this issue.

Sometimes using a smaller cost for scanning the chain in a bucket, like the
lines below, would be better:
run_cost += outer_path_rows * some_small_probe_cost;
run_cost += hash_qual_cost.per_tuple * approximate_tuple_count();
In some versions of Greenplum (a database based on Postgres), we simply
disabled the cost of scanning the bucket chain. In most cases this gives a
better query plan, but I am worried that it will be worse in some cases.

Currently the planner only gets the wrong plan when the small table's number
of distinct values is much smaller than the number of buckets, and also much
smaller than the number of distinct values of the large table.

For example, suppose the inner table has 100 distinct values and 3000 rows,
the hash table has 1000 buckets, and the outer table has 10000 distinct
values. We can assume that all 100 distinct values of the inner table are
included in the 10000 distinct values of the outer table. So
(100/10000)*outer_rows tuples will probe buckets that have a chain, and
(9900/10000)*outer_rows tuples will probe the 1000 buckets at random, so
(9900/10000)*outer_rows*(900/1000) tuples will probe empty buckets. So the
cost of scanning bucket chains is

hash_qual_cost.per_tuple*innerbucketsize*outer_rows*
(1 - ((outer_distinct - inner_distinct)/outer_distinct)*((buckets_num -
inner_distinct)/buckets_num))
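
As a concrete sketch of the formula above (names mirror the variables used
here; the applicability test is an added assumption):

/*
 * Fraction of outer-side probes expected to land on a non-empty bucket,
 * assuming the inner table's distinct values are a subset of the outer
 * table's and keys hash uniformly over the buckets.  Sketch only.
 */
static double
outer_match_nonempty_frac(double outer_ndistinct, double inner_ndistinct,
						  double buckets_num)
{
	double		miss_frac;

	if (inner_ndistinct >= outer_ndistinct || inner_ndistinct >= buckets_num)
		return 1.0;				/* correction not useful here */

	/* probe key is absent from the inner side AND hashes to an empty bucket */
	miss_frac = ((outer_ndistinct - inner_ndistinct) / outer_ndistinct) *
				((buckets_num - inner_ndistinct) / buckets_num);

	return 1.0 - miss_frac;
}

With the example above (100 inner distinct values, 10000 outer distinct
values, 1000 buckets) this gives 1 - 0.99 * 0.9 = 0.109, i.e. only about 11%
of the outer probes would be charged for walking a chain; multiplying the
existing bucket-scan term by this fraction is essentially what the patch
posted later in the thread does.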

Do you think this assumption is reasonable?

#4 Jinbao Chen
jinchen@pivotal.io
In reply to: Jinbao Chen (#3)
1 attachment(s)
Re: Planner chose a much slower plan in hashjoin, using a large table as the inner table.

Hi hackers,

I have made a patch to fix the problem.

Added the selection rate of the inner table non-empty bucket

The planner will use the big table as the inner table in a hash join
if the small table has fewer unique values, but this plan is much
slower than using the small table as the inner table.

In general, the cost of building a hash table is higher than the cost
of probing it, so we tend to use the small table as the inner table.
But if the average chain length of the buckets is large, the situation
is just the opposite.

If virtualbuckets is much larger than innerndistinct, and outerndistinct
is much larger than innerndistinct, then most tuples of the outer table
will hit an empty bucket. So when we calculate the cost of traversing
the buckets, we need to ignore the tuples that hit empty buckets.

So we add the selectivity of the inner table's non-empty buckets. The
formula is:
(1 - ((outerndistinct - innerndistinct)/outerndistinct)*
((virtualbuckets - innerndistinct)/virtualbuckets))

Attachments:

hash_with_small.patch (application/octet-stream)
From 755b6383aa8abaeff2e040841e80a2a1cf893e12 Mon Sep 17 00:00:00 2001
From: Jinbao Chen <jinchen@pivotal.io>
Date: Fri, 22 Nov 2019 16:55:09 +0800
Subject: [PATCH] Added the selection rate of the inner table non-empty bucket

The planner will use big table as inner table in hash join
if small table have fewer unique values. But this plan is
much slower than using small table as inner table.

In general, the cost of creating a hash table is higher
than the cost of querying a hash table. So we tend to use
small tables as internal tables. But if the average chain
length of the bucket is large, the situation is just the
opposite.

If virtualbuckets is much larger than innerndistinct, and
outerndistinct is much larger than innerndistinct. Then most
tuples of the outer table will match the empty bucket. So when
we calculate the cost of traversing the bucket, we need to
ignore the tuple matching empty bucket.

So we add the selection rate of the inner table non-empty
bucket. The formula is:
(1 - ((outerndistinct - innerndistinct)/outerndistinct)*
((virtualbuckets - innerndistinct)/virtualbuckets))
---
 src/backend/optimizer/path/costsize.c   | 73 ++++++++++++++++++++++++++++++++-
 src/test/regress/expected/join_hash.out | 24 +++++++++++
 src/test/regress/sql/join_hash.sql      | 19 +++++++++
 3 files changed, 115 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c5f6593485..2633b020ed 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -3382,6 +3382,9 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 	double		virtualbuckets;
 	Selectivity innerbucketsize;
 	Selectivity innermcvfreq;
+	double		outerndistinct;
+	double		innerndistinct;
+	Selectivity outer_match_nonempty_frac;
 	ListCell   *hcl;
 
 	/* Mark the path with the correct row estimate */
@@ -3426,20 +3429,30 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 	 * because we avoid contaminating the cache with a value that's wrong for
 	 * non-unique-ified paths.
 	 */
+	outerndistinct = 1.0;
+
 	if (IsA(inner_path, UniquePath))
 	{
 		innerbucketsize = 1.0 / virtualbuckets;
 		innermcvfreq = 0.0;
+		innerndistinct = inner_path_rows;
 	}
 	else
 	{
 		innerbucketsize = 1.0;
 		innermcvfreq = 1.0;
+		innerndistinct = 1.0;
+
 		foreach(hcl, hashclauses)
 		{
 			RestrictInfo *restrictinfo = lfirst_node(RestrictInfo, hcl);
 			Selectivity thisbucketsize;
 			Selectivity thismcvfreq;
+			double thisinnerndistinct;
+			double thisouterndistinct;
+			VariableStatData vardatainner;
+			VariableStatData vardataouter;
+			bool isdefault;
 
 			/*
 			 * First we have to figure out which side of the hashjoin clause
@@ -3465,6 +3478,25 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 					thisbucketsize = restrictinfo->right_bucketsize;
 				}
 				thismcvfreq = restrictinfo->right_mcvfreq;
+
+				examine_variable(root, get_rightop(restrictinfo->clause), 0, &vardatainner);
+				thisinnerndistinct = get_variable_numdistinct(&vardatainner, &isdefault);
+				if (vardatainner.rel && vardatainner.rel->tuples > 0)
+				{
+					thisinnerndistinct *= vardatainner.rel->rows / vardatainner.rel->tuples;
+					thisinnerndistinct = clamp_row_est(thisinnerndistinct);
+				}
+				ReleaseVariableStats(vardatainner);
+
+				/* lefthand side is outer */
+				examine_variable(root, get_leftop(restrictinfo->clause), 0, &vardataouter);
+				thisouterndistinct = get_variable_numdistinct(&vardataouter, &isdefault);
+				if (vardataouter.rel && vardataouter.rel->tuples > 0)
+				{
+					thisinnerndistinct *= vardataouter.rel->rows / vardataouter.rel->tuples;
+					thisinnerndistinct = clamp_row_est(thisinnerndistinct);
+				}
+				ReleaseVariableStats(vardataouter);
 			}
 			else
 			{
@@ -3483,12 +3515,35 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 					thisbucketsize = restrictinfo->left_bucketsize;
 				}
 				thismcvfreq = restrictinfo->left_mcvfreq;
+
+				examine_variable(root, get_leftop(restrictinfo->clause), 0, &vardatainner);
+				thisinnerndistinct = get_variable_numdistinct(&vardatainner, &isdefault);
+				if (vardatainner.rel && vardatainner.rel->tuples > 0)
+				{
+					thisinnerndistinct *= vardatainner.rel->rows / vardatainner.rel->tuples;
+					thisinnerndistinct = clamp_row_est(thisinnerndistinct);
+				}
+				ReleaseVariableStats(vardatainner);
+
+				/* righthand side is outers */
+				examine_variable(root, get_rightop(restrictinfo->clause), 0, &vardataouter);
+				thisouterndistinct = get_variable_numdistinct(&vardataouter, &isdefault);
+				if (vardataouter.rel && vardataouter.rel->tuples > 0)
+				{
+					thisinnerndistinct *= vardataouter.rel->rows / vardataouter.rel->tuples;
+					thisinnerndistinct = clamp_row_est(thisinnerndistinct);
+				}
+				ReleaseVariableStats(vardataouter);
 			}
 
 			if (innerbucketsize > thisbucketsize)
 				innerbucketsize = thisbucketsize;
 			if (innermcvfreq > thismcvfreq)
 				innermcvfreq = thismcvfreq;
+			if (outerndistinct < thisouterndistinct)
+				outerndistinct = thisouterndistinct;
+			if (innerndistinct < thisinnerndistinct)
+				innerndistinct =  thisinnerndistinct;
 		}
 	}
 
@@ -3516,6 +3571,21 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 
 	/* CPU costs */
 
+	/*
+	 * If virtualbuckets is much larger than innerndistinct, and
+	 * outerndistinct is much larger than innerndistinct. Then most
+	 * tuples of the outer table will match the empty bucket. So when
+	 * we calculate the cost of traversing the bucket, we need to ignore
+	 * the tuple matching empty bucket.
+	 */
+	outer_match_nonempty_frac = 1.0;
+	if (virtualbuckets > innerndistinct * 2 && outerndistinct > innerndistinct * 2)
+	{
+		outer_match_nonempty_frac = (1 -
+				((outerndistinct - innerndistinct)/outerndistinct)*
+				((virtualbuckets - innerndistinct)/virtualbuckets));
+	}
+
 	if (path->jpath.jointype == JOIN_SEMI ||
 		path->jpath.jointype == JOIN_ANTI ||
 		extra->inner_unique)
@@ -3539,7 +3609,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 		inner_scan_frac = 2.0 / (extra->semifactors.match_count + 1.0);
 
 		startup_cost += hash_qual_cost.startup;
-		run_cost += hash_qual_cost.per_tuple * outer_matched_rows *
+		run_cost += hash_qual_cost.per_tuple * outer_matched_rows * outer_match_nonempty_frac *
 			clamp_row_est(inner_path_rows * innerbucketsize * inner_scan_frac) * 0.5;
 
 		/*
@@ -3579,6 +3649,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 		 */
 		startup_cost += hash_qual_cost.startup;
 		run_cost += hash_qual_cost.per_tuple * outer_path_rows *
+			outer_match_nonempty_frac *
 			clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;
 
 		/*
diff --git a/src/test/regress/expected/join_hash.out b/src/test/regress/expected/join_hash.out
index 3a91c144a2..548f9724ad 100644
--- a/src/test/regress/expected/join_hash.out
+++ b/src/test/regress/expected/join_hash.out
@@ -1012,4 +1012,28 @@ WHERE
  text | t  | hjtest_1 | hjtest_2
 (1 row)
 
+-- If virtualbuckets is much larger than innerndistinct, and
+-- outerndistinct is much larger than innerndistinct. Then most
+-- tuples of the outer table will match the empty bucket. So when
+-- we calculate the cost of traversing the bucket, we need to ignore
+-- the tuple matching empty bucket.
+savepoint settings;
+set max_parallel_workers_per_gather = 0;
+create table join_hash_t_small(a int);
+create table join_hash_t_big(b int);
+insert into join_hash_t_small select i%100 from generate_series(0, 3000)i;
+insert into join_hash_t_big select i%100000 from generate_series(1, 100000)i ;
+analyze join_hash_t_small;
+analyze join_hash_t_big;
+explain (costs off) select * from join_hash_t_small, join_hash_t_big where a = b;
+                       QUERY PLAN                       
+--------------------------------------------------------
+ Hash Join
+   Hash Cond: (join_hash_t_big.b = join_hash_t_small.a)
+   ->  Seq Scan on join_hash_t_big
+   ->  Hash
+         ->  Seq Scan on join_hash_t_small
+(5 rows)
+
+rollback to settings;
 ROLLBACK;
diff --git a/src/test/regress/sql/join_hash.sql b/src/test/regress/sql/join_hash.sql
index 68c1a8c7b6..154e9e0085 100644
--- a/src/test/regress/sql/join_hash.sql
+++ b/src/test/regress/sql/join_hash.sql
@@ -537,4 +537,23 @@ WHERE
     AND (SELECT hjtest_2.c * 5) < 55
     AND hjtest_1.a <> hjtest_2.b;
 
+-- If virtualbuckets is much larger than innerndistinct, and
+-- outerndistinct is much larger than innerndistinct. Then most
+-- tuples of the outer table will match the empty bucket. So when
+-- we calculate the cost of traversing the bucket, we need to ignore
+-- the tuple matching empty bucket.
+savepoint settings;
+set max_parallel_workers_per_gather = 0;
+create table join_hash_t_small(a int);
+create table join_hash_t_big(b int);
+
+insert into join_hash_t_small select i%100 from generate_series(0, 3000)i;
+insert into join_hash_t_big select i%100000 from generate_series(1, 100000)i ;
+
+analyze join_hash_t_small;
+analyze join_hash_t_big;
+
+explain (costs off) select * from join_hash_t_small, join_hash_t_big where a = b;
+rollback to settings;
+
 ROLLBACK;
-- 
2.14.2

#5 Andy Fan
zhihui.fan1213@gmail.com
In reply to: Jinbao Chen (#4)
Re: Planner chose a much slower plan in hashjoin, using a large table as the inner table.

FYI: I tried this on 12.1 and found that it uses the small table as the inner
table already. I didn't look into the details so far.

postgres=# explain (costs off) select * from join_hash_t_small, join_hash_t_big where a = b;
                       QUERY PLAN
--------------------------------------------------------
 Hash Join
   Hash Cond: (join_hash_t_big.b = join_hash_t_small.a)
   ->  Seq Scan on join_hash_t_big
   ->  Hash
         ->  Seq Scan on join_hash_t_small
(5 rows)

postgres=# select version();
version
-----------------------------------------------------------------------------------------------------------------
PostgreSQL 12.1 on x86_64-apple-darwin18.7.0, compiled by Apple LLVM
version 10.0.1 (clang-1001.0.46.4), 64-bit
(1 row)

#6 Jinbao Chen
jinchen@pivotal.io
In reply to: Andy Fan (#5)
Re: Planner chose a much slower plan in hashjoin, using a large table as the inner table.

Hi Andy,

I just tested the query on 12.1, but PG uses the big table as the inner table.

demo=# explain (costs off) select * from t_small, t_big where a = b;
             QUERY PLAN
------------------------------------
 Hash Join
   Hash Cond: (t_small.a = t_big.b)
   ->  Seq Scan on t_small
   ->  Hash
         ->  Seq Scan on t_big

Did you insert the data and set max_parallel_workers_per_gather to 0 as above?

create table t_small(a int);
create table t_big(b int);
insert into t_small select i%100 from generate_series(0, 3000)i;
insert into t_big select i%100000 from generate_series(1, 100000000)i ;
analyze t_small;
analyze t_big;
set max_parallel_workers_per_gather = 0;

#7 Andy Fan
zhihui.fan1213@gmail.com
In reply to: Jinbao Chen (#6)
Re: Planner chose a much slower plan in hashjoin, using a large table as the inner table.

On Thu, Nov 28, 2019 at 7:19 PM Jinbao Chen <jinchen@pivotal.io> wrote:

Hi Andy,

I just tested the query on 12.1, but PG uses the big table as the inner table.

demo=# explain (costs off) select * from t_small, t_big where a = b;
             QUERY PLAN
------------------------------------
 Hash Join
   Hash Cond: (t_small.a = t_big.b)
   ->  Seq Scan on t_small
   ->  Hash
         ->  Seq Scan on t_big

Did you insert the data and set max_parallel_workers_per_gather to 0 as above?

Sorry for the noise, you are right. I loaded the data but ran the query
immediately without running ANALYZE.

Now it is using the big table as the inner table.
