bad estimation together with large work_mem generates terribly slow hash joins
Hello all,
today I had to work with one slow query - when I checked different
scenarios I found an atypical dependency on work_mem size - a bigger
work_mem increased query execution time: 31 minutes with 600MB work_mem
versus 1 minute with 1MB.
db_kost07e2d9cdmg20b1takpqntobo6ghj=# set work_mem to '600MB';
SET
db_kost07e2d9cdmg20b1takpqntobo6ghj=# explain analyze SELECT
"f_users_aax8qg0e23asx5h"."firtsclickutmsource_id" AS "a_5078", COUNT( (
CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN
"f_emails_aaekn159p5aw3t8"."id" ELSE CAST( NULL AS INT ) END ) ) AS
"c_7a2a545bf904a48a", COUNT( "f_emails_aaekn159p5aw3t8"."id" ) AS
"c_9e301e13350cc0dc", ( MAX( ( CASE WHEN ( 89388 =
"f_emails_aaekn159p5aw3t8"."read_id" ) THEN 0 END ) ) IS NOT NULL ) AS
"def_7a2a545bf904a48a", TRUE AS "def_9e301e13350cc0dc" FROM ( (
"f_emails_aaekn159p5aw3t8" INNER JOIN "f_users_aax8qg0e23asx5h" ON (
"f_emails_aaekn159p5aw3t8"."userid_id" = "f_users_aax8qg0e23asx5h"."id" )
) INNER JOIN "dwh_dm_abnblk9j5oagorw" ON (
"f_users_aax8qg0e23asx5h"."dt_registrationdatetime_id" =
"dwh_dm_abnblk9j5oagorw"."id" ) ) WHERE ( 2014 =
"dwh_dm_abnblk9j5oagorw"."id_year" ) GROUP BY 1;
                                                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=2145957.37..2145957.96 rows=59 width=12) (actual time=1864088.145..1864088.155 rows=31 loops=1)
   ->  Hash Join  (cost=882573.27..2142753.08 rows=213619 width=12) (actual time=9083.326..1859069.171 rows=7070141 loops=1)
         Hash Cond: (f_emails_aaekn159p5aw3t8.userid_id = f_users_aax8qg0e23asx5h.id)
         ->  Seq Scan on f_emails_aaekn159p5aw3t8  (cost=0.00..854559.95 rows=32278695 width=12) (actual time=0.006..14271.239 rows=32211565 loops=1)
         ->  Hash  (cost=881801.58..881801.58 rows=61735 width=8) (actual time=9076.153..9076.153 rows=3310768 loops=1)
               Buckets: 8192  Batches: 1  Memory Usage: 129327kB
               ->  Hash Join  (cost=782.15..881801.58 rows=61735 width=8) (actual time=1.423..8086.228 rows=3310768 loops=1)
                     Hash Cond: (f_users_aax8qg0e23asx5h.dt_registrationdatetime_id = dwh_dm_abnblk9j5oagorw.id)
                     ->  Seq Scan on f_users_aax8qg0e23asx5h  (cost=0.00..845420.42 rows=9328442 width=12) (actual time=0.015..4098.915 rows=9322096 loops=1)
                     ->  Hash  (cost=777.59..777.59 rows=365 width=4) (actual time=1.400..1.400 rows=365 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 13kB
                           ->  Bitmap Heap Scan on dwh_dm_abnblk9j5oagorw  (cost=11.08..777.59 rows=365 width=4) (actual time=0.111..1.218 rows=365 loops=1)
                                 Recheck Cond: (2014 = id_year)
                                 ->  Bitmap Index Scan on dwh_dm_abnblk9j5oagorw_id_year_idx  (cost=0.00..10.99 rows=365 width=0) (actual time=0.068..0.068 rows=365 loops=1)
                                       Index Cond: (2014 = id_year)
 Total runtime: 1864107.373 ms
(16 rows)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# set work_mem to '1MB';
SET
db_kost07e2d9cdmg20b1takpqntobo6ghj=# explain analyze SELECT
"f_users_aax8qg0e23asx5h"."firtsclickutmsource_id" AS "a_5078", COUNT( (
CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN
"f_emails_aaekn159p5aw3t8"."id" ELSE CAST( NULL AS INT ) END ) ) AS
"c_7a2a545bf904a48a", COUNT( "f_emails_aaekn159p5aw3t8"."id" ) AS
"c_9e301e13350cc0dc", ( MAX( ( CASE WHEN ( 89388 =
"f_emails_aaekn159p5aw3t8"."read_id" ) THEN 0 END ) ) IS NOT NULL ) AS
"def_7a2a545bf904a48a", TRUE AS "def_9e301e13350cc0dc" FROM ( (
"f_emails_aaekn159p5aw3t8" INNER JOIN "f_users_aax8qg0e23asx5h" ON (
"f_emails_aaekn159p5aw3t8"."userid_id" = "f_users_aax8qg0e23asx5h"."id" )
) INNER JOIN "dwh_dm_abnblk9j5oagorw" ON (
"f_users_aax8qg0e23asx5h"."dt_registrationdatetime_id" =
"dwh_dm_abnblk9j5oagorw"."id" ) ) WHERE ( 2014 =
"dwh_dm_abnblk9j5oagorw"."id_year" ) GROUP BY 1;
                                                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=2275675.45..2275675.95 rows=50 width=12) (actual time=48438.407..48438.416 rows=31 loops=1)
   ->  Hash Join  (cost=882772.88..2272526.68 rows=209918 width=12) (actual time=9384.610..45211.963 rows=6988804 loops=1)
         Hash Cond: (f_emails_aaekn159p5aw3t8.userid_id = f_users_aax8qg0e23asx5h.id)
         ->  Seq Scan on f_emails_aaekn159p5aw3t8  (cost=0.00..839754.64 rows=31719464 width=12) (actual time=0.034..14299.338 rows=31653392 loops=1)
         ->  Hash  (cost=881759.44..881759.44 rows=61715 width=8) (actual time=9368.371..9368.371 rows=3310768 loops=1)
               Buckets: 4096  Batches: 256 (originally 4)  Memory Usage: 1025kB
               ->  Hash Join  (cost=782.15..881759.44 rows=61715 width=8) (actual time=3.839..8223.097 rows=3310768 loops=1)
                     Hash Cond: (f_users_aax8qg0e23asx5h.dt_registrationdatetime_id = dwh_dm_abnblk9j5oagorw.id)
                     ->  Seq Scan on f_users_aax8qg0e23asx5h  (cost=0.00..845389.92 rows=9325392 width=12) (actual time=0.017..4118.598 rows=9322096 loops=1)
                     ->  Hash  (cost=777.59..777.59 rows=365 width=4) (actual time=3.804..3.804 rows=365 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 13kB
                           ->  Bitmap Heap Scan on dwh_dm_abnblk9j5oagorw  (cost=11.08..777.59 rows=365 width=4) (actual time=0.188..3.641 rows=365 loops=1)
                                 Recheck Cond: (2014 = id_year)
                                 ->  Bitmap Index Scan on dwh_dm_abnblk9j5oagorw_id_year_idx  (cost=0.00..10.99 rows=365 width=0) (actual time=0.115..0.115 rows=365 loops=1)
                                       Index Cond: (2014 = id_year)
 Total runtime: 48439.958 ms
(16 rows)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# set work_mem to '10MB';
SET
db_kost07e2d9cdmg20b1takpqntobo6ghj=# explain analyze SELECT
"f_users_aax8qg0e23asx5h"."firtsclickutmsource_id" AS "a_5078", COUNT( (
CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN
"f_emails_aaekn159p5aw3t8"."id" ELSE CAST( NULL AS INT ) END ) ) AS
"c_7a2a545bf904a48a", COUNT( "f_emails_aaekn159p5aw3t8"."id" ) AS
"c_9e301e13350cc0dc", ( MAX( ( CASE WHEN ( 89388 =
"f_emails_aaekn159p5aw3t8"."read_id" ) THEN 0 END ) ) IS NOT NULL ) AS
"def_7a2a545bf904a48a", TRUE AS "def_9e301e13350cc0dc" FROM ( (
"f_emails_aaekn159p5aw3t8" INNER JOIN "f_users_aax8qg0e23asx5h" ON (
"f_emails_aaekn159p5aw3t8"."userid_id" = "f_users_aax8qg0e23asx5h"."id" )
) INNER JOIN "dwh_dm_abnblk9j5oagorw" ON (
"f_users_aax8qg0e23asx5h"."dt_registrationdatetime_id" =
"dwh_dm_abnblk9j5oagorw"."id" ) ) WHERE ( 2014 =
"dwh_dm_abnblk9j5oagorw"."id_year" ) GROUP BY 1;
                                                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=2124026.77..2124027.27 rows=50 width=12) (actual time=100739.984..100739.998 rows=31 loops=1)
   ->  Hash Join  (cost=882530.88..2120878.00 rows=209918 width=12) (actual time=9467.702..97238.959 rows=6988804 loops=1)
         Hash Cond: (f_emails_aaekn159p5aw3t8.userid_id = f_users_aax8qg0e23asx5h.id)
         ->  Seq Scan on f_emails_aaekn159p5aw3t8  (cost=0.00..839754.64 rows=31719464 width=12) (actual time=0.015..9185.978 rows=31653392 loops=1)
         ->  Hash  (cost=881759.44..881759.44 rows=61715 width=8) (actual time=9466.440..9466.440 rows=3310768 loops=1)
               Buckets: 8192  Batches: 16 (originally 1)  Memory Usage: 10241kB
               ->  Hash Join  (cost=782.15..881759.44 rows=61715 width=8) (actual time=3.681..8183.423 rows=3310768 loops=1)
                     Hash Cond: (f_users_aax8qg0e23asx5h.dt_registrationdatetime_id = dwh_dm_abnblk9j5oagorw.id)
                     ->  Seq Scan on f_users_aax8qg0e23asx5h  (cost=0.00..845389.92 rows=9325392 width=12) (actual time=0.012..4064.044 rows=9322096 loops=1)
                     ->  Hash  (cost=777.59..777.59 rows=365 width=4) (actual time=3.654..3.654 rows=365 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 13kB
                           ->  Bitmap Heap Scan on dwh_dm_abnblk9j5oagorw  (cost=11.08..777.59 rows=365 width=4) (actual time=0.129..3.449 rows=365 loops=1)
                                 Recheck Cond: (2014 = id_year)
                                 ->  Bitmap Index Scan on dwh_dm_abnblk9j5oagorw_id_year_idx  (cost=0.00..10.99 rows=365 width=0) (actual time=0.077..0.077 rows=365 loops=1)
                                       Index Cond: (2014 = id_year)
 Total runtime: 100740.552 ms
(16 rows)
The basic problem is the wrong estimation, but I am not able to fix it - I
raised the statistics target to 10000 without any change. I cannot change
the SQL in the application, but for debugging I can test it with a fixed
estimation:
db_kost07e2d9cdmg20b1takpqntobo6ghj=# explain analyze SELECT
"stehule"."firtsclickutmsource_id" AS "a_5078", COUNT( ( CASE WHEN ( 89388
= "f_emails_aaekn159p5aw3t8"."read_id" ) THEN
"f_emails_aaekn159p5aw3t8"."id" ELSE CAST( NULL AS INT ) END ) ) AS
"c_7a2a545bf904a48a", COUNT( "f_emails_aaekn159p5aw3t8"."id" ) AS
"c_9e301e13350cc0dc", ( MAX( ( CASE WHEN ( 89388 =
"f_emails_aaekn159p5aw3t8"."read_id" ) THEN 0 END ) ) IS NOT NULL ) AS
"def_7a2a545bf904a48a", TRUE AS "def_9e301e13350cc0dc" FROM ( (
"f_emails_aaekn159p5aw3t8" INNER JOIN stehule on
"f_emails_aaekn159p5aw3t8"."userid_id" = stehule."id" ) ) GROUP BY 1 ;
                                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=2073456.55..2073457.23 rows=68 width=12) (actual time=58724.094..58724.106 rows=31 loops=1)
   ->  Hash Join  (cost=89142.28..1589276.13 rows=32278695 width=12) (actual time=2191.019..55499.331 rows=7070141 loops=1)
         Hash Cond: (f_emails_aaekn159p5aw3t8.userid_id = stehule.id)
         ->  Seq Scan on f_emails_aaekn159p5aw3t8  (cost=0.00..854559.95 rows=32278695 width=12) (actual time=0.018..8364.554 rows=32211565 loops=1)
         ->  Hash  (cost=47757.68..47757.68 rows=3310768 width=8) (actual time=2188.858..2188.858 rows=3310768 loops=1)
               Buckets: 524288  Batches: 1  Memory Usage: 129327kB
               ->  Seq Scan on stehule  (cost=0.00..47757.68 rows=3310768 width=8) (actual time=0.026..927.235 rows=3310768 loops=1)
 Total runtime: 58741.224 ms
(8 rows)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# set work_mem to '1MB';
SET
db_kost07e2d9cdmg20b1takpqntobo6ghj=# explain analyze SELECT
"stehule"."firtsclickutmsource_id" AS "a_5078", COUNT( ( CASE WHEN ( 89388
= "f_emails_aaekn159p5aw3t8"."read_id" ) THEN
"f_emails_aaekn159p5aw3t8"."id" ELSE CAST( NULL AS INT ) END ) ) AS
"c_7a2a545bf904a48a", COUNT( "f_emails_aaekn159p5aw3t8"."id" ) AS
"c_9e301e13350cc0dc", ( MAX( ( CASE WHEN ( 89388 =
"f_emails_aaekn159p5aw3t8"."read_id" ) THEN 0 END ) ) IS NOT NULL ) AS
"def_7a2a545bf904a48a", TRUE AS "def_9e301e13350cc0dc" FROM ( (
"f_emails_aaekn159p5aw3t8" INNER JOIN stehule on
"f_emails_aaekn159p5aw3t8"."userid_id" = stehule."id" ) ) GROUP BY 1 ;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=2293499.45..2293500.13 rows=68 width=12) (actual time=37357.967..37357.976 rows=31 loops=1)
   ->  Hash Join  (cost=102075.28..1809319.02 rows=32278695 width=12) (actual time=1814.232..34290.818 rows=7070141 loops=1)
         Hash Cond: (f_emails_aaekn159p5aw3t8.userid_id = stehule.id)
         ->  Seq Scan on f_emails_aaekn159p5aw3t8  (cost=0.00..854559.95 rows=32278695 width=12) (actual time=0.007..14066.754 rows=32211565 loops=1)
         ->  Hash  (cost=47757.68..47757.68 rows=3310768 width=8) (actual time=1806.453..1806.453 rows=3310768 loops=1)
               Buckets: 4096  Batches: 256  Memory Usage: 518kB
               ->  Seq Scan on stehule  (cost=0.00..47757.68 rows=3310768 width=8) (actual time=0.007..781.672 rows=3310768 loops=1)
 Total runtime: 37359.264 ms
(8 rows)
A smaller work_mem still performs better here - though with our default of
600MB it performs relatively well.
db_kost07e2d9cdmg20b1takpqntobo6ghj=# select version();
version
---------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.2.8 on x86_64-gdc-linux-gnu, with GoodData patches, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4), 64-bit
(1 row)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# \dt+ stehule
List of relations
Schema | Name | Type | Owner | Size | Description
--------+---------+-------+----------+--------+-------------
public | stehule | table | postgres | 114 MB |
(1 row)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# \dt+ f_users_aax8qg0e23asx5h
List of relations
Schema | Name | Type | Owner | Size | Description
--------+-------------------------+-------+-------+---------+-------------
public | f_users_aax8qg0e23asx5h | table | beard | 5878 MB |
(1 row)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# \dt+ f_emails_aaekn159p5aw3t8
List of relations
Schema | Name | Type | Owner | Size | Description
--------+--------------------------+-------+-------+---------+-------------
public | f_emails_aaekn159p5aw3t8 | table | beard | 4156 MB |
(1 row)
All data are in the FS cache; the machine has 60GB RAM.
It is interesting that 31M nested loop lookups are significantly faster
than the bad hash join:
db_kost07e2d9cdmg20b1takpqntobo6ghj=# explain analyze SELECT
"f_users_aax8qg0e23asx5h"."firtsclickutmsource_id" AS "a_5078", COUNT( (
CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN
"f_emails_aaekn159p5aw3t8"."id" ELSE CAST( NULL AS INT ) END ) ) AS
"c_7a2a545bf904a48a", COUNT( "f_emails_aaekn159p5aw3t8"."id" ) AS
"c_9e301e13350cc0dc", ( MAX( ( CASE WHEN ( 89388 =
"f_emails_aaekn159p5aw3t8"."read_id" ) THEN 0 END ) ) IS NOT NULL ) AS
"def_7a2a545bf904a48a", TRUE AS "def_9e301e13350cc0dc" FROM ( (
"f_emails_aaekn159p5aw3t8" INNER JOIN "f_users_aax8qg0e23asx5h" ON (
"f_emails_aaekn159p5aw3t8"."userid_id" = "f_users_aax8qg0e23asx5h"."id" )
) INNER JOIN "dwh_dm_abnblk9j5oagorw" ON (
"f_users_aax8qg0e23asx5h"."dt_registrationdatetime_id" =
"dwh_dm_abnblk9j5oagorw"."id" ) ) WHERE ( 2014 =
"dwh_dm_abnblk9j5oagorw"."id_year" ) GROUP BY 1;
                                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=40076389.70..40076448.70 rows=59 width=12) (actual time=31124.643..31124.658 rows=31 loops=1)
   ->  Nested Loop  (cost=2773.92..40073185.42 rows=213619 width=12) (actual time=37.945..27670.473 rows=7070141 loops=1)
         ->  Hash Join  (cost=2773.92..10180068.58 rows=61735 width=8) (actual time=0.951..8934.170 rows=3310768 loops=1)
               Hash Cond: (f_users_aax8qg0e23asx5h.dt_registrationdatetime_id = dwh_dm_abnblk9j5oagorw.id)
               ->  Seq Scan on f_users_aax8qg0e23asx5h  (cost=0.00..10080578.00 rows=9328442 width=12) (actual time=0.004..4297.579 rows=9322096 loops=1)
               ->  Hash  (cost=2408.01..2408.01 rows=365 width=4) (actual time=0.919..0.919 rows=365 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 13kB
                     ->  Bitmap Heap Scan on dwh_dm_abnblk9j5oagorw  (cost=386.27..2408.01 rows=365 width=4) (actual time=0.104..0.733 rows=365 loops=1)
                           Recheck Cond: (2014 = id_year)
                           ->  Bitmap Index Scan on dwh_dm_abnblk9j5oagorw_id_year_idx  (cost=0.00..386.18 rows=365 width=0) (actual time=0.059..0.059 rows=365 loops=1)
                                 Index Cond: (2014 = id_year)
         ->  Index Scan using f_emails_aaekn159p5aw3t8_userid_id_idx on f_emails_aaekn159p5aw3t8  (cost=0.00..396.22 rows=88 width=12) (actual time=0.002..0.004 rows=2 loops=3310768)
               Index Cond: (userid_id = f_users_aax8qg0e23asx5h.id)
 Total runtime: 31125.975 ms
Regards
Pavel
Hi,
On 2014-06-26 14:10, Pavel Stehule wrote:
Hello all,
today I had to work with one slow query - when I checked different
scenarios I found a dependency on work_mem size - but it is atypical -
bigger work_mem increased query execution 31 minutes (600MB work mem)
and 1 minute (1MB).
The problem described in Pavel's emails (illustrated by the awful
explain plans) is in this part:

-> Hash (cost=881801.58..881801.58 rows=61735 width=8) (actual
time=9076.153..9076.153 rows=3310768 loops=1)

That is, the estimated number of rows is ~60k, but in reality it's ~3.3M.
This leads to a hash table with a small number of buckets (8192), each
holding a large number of tuples (~400 in this case) in a linked list,
which significantly slows down the lookups during the hash join.
This issue is actually quite common - all it takes is a significant
underestimation of the hashed relation, either because it's a complex
join
(thus inherently difficult to estimate) or because the WHERE conditions
are
not independent (see the example below).
The attached patch (early WIP, after ~30 minutes of hacking) addresses
this by increasing the number of buckets whenever the average number of
tuples per bucket exceeds 1.5x NTUP_PER_BUCKET.
======= Example ========
create table table_a (id int, fk_id int);
create table table_b (fk_id int, val_a int, val_b int);
insert into table_a select i, i from generate_series(1,10000000) s(i);
insert into table_b select i, mod(i,1000), mod(i,1000) from
generate_series(1,10000000) s(i);
analyze table_a;
analyze table_b;
explain analyze select count(*) from table_a join table_b on (table_a.id
= table_b.fk_id) where val_a < 10 and val_b < 10;
without the patch:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=385834.56..385834.57 rows=1 width=0) (actual time=49543.263..49543.264 rows=1 loops=1)
   ->  Hash Join  (cost=204069.89..385831.16 rows=1359 width=0) (actual time=923.751..49531.554 rows=100000 loops=1)
         Hash Cond: (table_a.id = table_b.fk_id)
         ->  Seq Scan on table_a  (cost=0.00..144247.77 rows=9999977 width=4) (actual time=0.104..967.090 rows=10000000 loops=1)
         ->  Hash  (cost=204052.90..204052.90 rows=1359 width=4) (actual time=923.615..923.615 rows=100000 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 3516kB
               ->  Seq Scan on table_b  (cost=0.00..204052.90 rows=1359 width=4) (actual time=0.086..910.656 rows=100000 loops=1)
                     Filter: ((val_a < 10) AND (val_b < 10))
                     Rows Removed by Filter: 9900000
 Planning time: 0.271 ms
 Execution time: 49545.053 ms
(11 rows)
with the patch:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=385834.56..385834.57 rows=1 width=0) (actual time=9780.346..9780.347 rows=1 loops=1)
   ->  Hash Join  (cost=204069.89..385831.16 rows=1359 width=0) (actual time=939.297..9772.256 rows=100000 loops=1)
         Hash Cond: (table_a.id = table_b.fk_id)
         ->  Seq Scan on table_a  (cost=0.00..144247.77 rows=9999977 width=4) (actual time=0.103..962.446 rows=10000000 loops=1)
         ->  Hash  (cost=204052.90..204052.90 rows=1359 width=4) (actual time=939.172..939.172 rows=100000 loops=1)
               Buckets: 8192  Batches: 1  Memory Usage: 3516kB
               ->  Seq Scan on table_b  (cost=0.00..204052.90 rows=1359 width=4) (actual time=0.064..909.191 rows=100000 loops=1)
                     Filter: ((val_a < 10) AND (val_b < 10))
                     Rows Removed by Filter: 9900000
 Planning time: 0.276 ms
 Execution time: 9782.295 ms
(11 rows)
Time: 9784.392 ms
So the duration improved significantly - from ~52 seconds to ~10
seconds.
The patch is not perfect, it needs a bit more love, as illustrated by
the FIXME/TODO items. Feel free to comment.
regards
Tomas
Attachment: hashjoin-nbuckets-growth-v1.patch
Attached is v2 of the patch, with some cleanups / minor improvements:
* improved comments, whitespace fixes / TODOs etc.
* tracking the initial # of buckets (similar to the initial # of batches)
* adding info about buckets to EXPLAIN ANALYZE, similar to batches - I
didn't want to make it overly complex, so the info about the initial
bucket/batch count is added only if at least one of them is modified
* modified the threshold triggering the growth, to get NTUP_PER_BUCKET on
average (see the NTUP_GROW_THRESHOLD comment in nodeHash.c)
* there's a single FIXME, related to counting tuples in the
One thing that's important to note is the difference between the # of
batches and the # of buckets. While the # of batches is "global", the #
of buckets is 'within a batch'. So theoretically each batch can use a
different number of buckets.
However the value is reused between batches, so it only grows. That
means this is possible:
initial: 1024 buckets (before 1st batch)
batch 1: 1024 buckets
batch 2: 1024 buckets
batch 3: 4096 buckets
batch 4: 8192 buckets
while this is not:
initial: 1024 buckets (before 1st batch)
batch 1: 1024 buckets
batch 2: 4096 buckets
batch 3: 1024 buckets
batch 4: 8192 buckets
However in practice I expect the first batch to do all the work, and the
following batches to just reuse the same number of buckets. This of
course assumes the batches have similar tuple sizes etc.
So the first batch will do all the reshuffling, and the following
batches will reuse the 'right' number of buckets from the start.
regards
Tomas
Attachment: hashjoin-nbuckets-growth-v2.patch
On 26.6.2014 20:43, Tomas Vondra wrote:
Attached is v2 of the patch, with some cleanups / minor improvements:
* there's a single FIXME, related to counting tuples in the
Meh, I couldn't resist resolving this FIXME, so attached is v3 of the
patch. This just adds a proper 'batch tuples' counter to the hash table.
All comments, measurements on different queries etc. welcome. We'll
certainly do a lot of testing, because this was a big issue for us.
regards
Tomas
Attachment: hashjoin-nbuckets-growth-v3.patch
On 26.6.2014 23:48, Tomas Vondra wrote:
On 26.6.2014 20:43, Tomas Vondra wrote:
Attached is v2 of the patch, with some cleanups / minor improvements:
* there's a single FIXME, related to counting tuples in the
Meh, I couldn't resist resolving this FIXME, so attached is v3 of the
patch. This just adds a proper 'batch tuples' counter to the hash table.

All comments, measurements on different queries etc. welcome. We'll
certainly do a lot of testing, because this was a big issue for us.
Attached is v4 of the patch, with a few minor improvements. The only
thing worth mentioning is overflow protection, similar to what's done in
the ExecChooseHashTableSize() function. Otherwise it's mostly about
improving comments.
Also attached is a v4 with a GUC, making it easier to compare the effect
of the patch by simply setting "enable_hashjoin_bucket" to "off" (original
behaviour) or "on" (new behaviour).
And finally there's an SQL script demonstrating the effect of the patch
with various work_mem settings. For example what I see on my desktop is
this (averages from 3 runs):
===== SMALL WORK MEM (2MB) =====
            no dynamic buckets   dynamic buckets
  query A         5945 ms            5969 ms
  query B         6080 ms            5943 ms
  query C         6531 ms            6822 ms
  query D         6962 ms            6618 ms

===== MEDIUM WORK MEM (16MB) =====
            no dynamic buckets   dynamic buckets
  query A         7955 ms            7944 ms
  query B         9970 ms            7569 ms
  query C         8643 ms            8560 ms
  query D        33908 ms            7700 ms

===== LARGE WORK MEM (64MB) =====
            no dynamic buckets   dynamic buckets
  query A        10235 ms           10233 ms
  query B        32229 ms            9318 ms
  query C        14500 ms           10554 ms
  query D       213344 ms            9145 ms
Where "A" is "exactly estimated" and the other queries suffer by various
underestimates. My observations from this are:
(1) For small work_mem values it does not really matter, thanks to the
caching effects (the whole hash table fits into L2 CPU cache).
(2) For medium work_mem values (not really huge, but exceeding CPU
caches), the differences are negligible, except for the last query
with most severe underestimate. In that case the new behaviour is
much faster.
(3) For large work_mem values, the speedup is pretty obvious and
dependent on the underestimate.
The question is why choose large work_mem values when smaller values
actually perform better. Well, the example tables are not perfectly
representative. In case the outer table is much larger and does not fit
into RAM that easily (which is the case with large fact tables or joins),
the rescans (because of more batches) are more expensive and outweigh the
caching benefits.
Also, the work_mem is shared with other nodes, e.g. aggregates, and
decreasing it because of hash joins would hurt them.
regards
Tomas
Hi,
attached is v5 of the patch. The main change is that scaling the number
of buckets is done only once, after the initial hash table is built. The
main advantage of this is lower cost. This also allowed some cleanup of
unnecessary code.
However, this new patch causes warnings like this:
WARNING: temporary file leak: File 231 still referenced
I've been eyeballing the code for a while now, but I still fail to see
where this comes from :-( Any ideas?
regards
Tomas
Attachment: hashjoin-nbuckets-growth-v5-with-guc.patch
On 30.6.2014 23:12, Tomas Vondra wrote:
Hi,
attached is v5 of the patch. The main change is that scaling the number
of buckets is done only once, after the initial hash table is built. The
main advantage of this is lower cost. This also allowed some cleanup of
unnecessary code.

However, this new patch causes warnings like this:
WARNING: temporary file leak: File 231 still referenced
I've been eyeballing the code for a while now, but I still fail to see
where this comes from :-( Any ideas?
Meh, the patches are wrong - I hadn't realized the tight coupling
between buckets/batches in ExecHashGetBucketAndBatch:
*bucketno = hashvalue & (nbuckets - 1);
*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
The previous patches worked mostly by pure luck (the nbuckets was set
only in the first batch), but once I moved the code to the end, it
started failing. And by "worked" I mean "didn't throw an error, but
probably returned bogus results with (nbatch>1)".
However, ISTM this nbuckets-nbatch coupling is not really necessary;
it's only constructed this way to assign independent batch/bucket
numbers. So I went and changed the code like this:
*bucketno = hashvalue & (nbuckets - 1);
*batchno = (hashvalue >> (32 - hashtable->log2_nbatch));
I.e. using the top bits for batchno, low bits for bucketno (as before).
Hopefully I got it right this time. At least it seems to be working for
cases that failed before (no file leaks, proper rowcounts so far).
regards
Tomas
Attachment: hashjoin-nbuckets-growth-v6-with-guc.patch
On Tue, Jul 1, 2014 at 4:54 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 30.6.2014 23:12, Tomas Vondra wrote:

Hi,

attached is v5 of the patch. The main change is that scaling the number
of buckets is done only once, after the initial hash table is built. The
main advantage of this is lower cost. This also allowed some cleanup of
unnecessary code.

However, this new patch causes warnings like this:

WARNING: temporary file leak: File 231 still referenced

I've been eyeballing the code for a while now, but I still fail to see
where this comes from :-( Any ideas?

Meh, the patches are wrong - I hadn't realized the tight coupling
between buckets/batches in ExecHashGetBucketAndBatch:

*bucketno = hashvalue & (nbuckets - 1);
*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);

The previous patches worked mostly by pure luck (the nbuckets was set
only in the first batch), but once I moved the code to the end, it
started failing. And by "worked" I mean "didn't throw an error, but
probably returned bogus results with (nbatch>1)".

However, ISTM this nbuckets-nbatch coupling is not really necessary;
it's only constructed this way to assign independent batch/bucket
numbers. So I went and changed the code like this:

*bucketno = hashvalue & (nbuckets - 1);
*batchno = (hashvalue >> (32 - hashtable->log2_nbatch));

I.e. using the top bits for batchno, low bits for bucketno (as before).
Hopefully I got it right this time. At least it seems to be working for
cases that failed before (no file leaks, proper rowcounts so far).
Hi,
I had a look at the patch and I was wondering if there is a way we can
set a cap on the resized size and instead increase the number of batches.
Since this patch evaluates the growth of tuples vs. the increase of
space, we could look at increasing the number of batches instead of the
number of buckets if the resized number of buckets exceeds a threshold.
It shouldn't be too hard; AIUI it would involve a call in MultiExecHash
right after the new cost evaluation the patch adds.

It isn't a target of the current patch, but I think it is a logical
extension to the same.
Thoughts/Comments?
Regards,
Atri
--
Regards,
Atri
*l'apprenant*
On 1.7.2014 01:24, Tomas Vondra wrote:
On 30.6.2014 23:12, Tomas Vondra wrote:
Hi,
Hopefully I got it right this time. At least it seems to be working for
cases that failed before (no file leaks, proper rowcounts so far).
Attached v7, fixing nbatch/ntuples in an assert.
regards
Tomas
Attachment: hashjoin-nbuckets-growth-v7.patch
On 3.7.2014 15:42, Atri Sharma wrote:

On Tue, Jul 1, 2014 at 4:54 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

On 30.6.2014 23:12, Tomas Vondra wrote:

Hi,

attached is v5 of the patch. The main change is that scaling the number
of buckets is done only once, after the initial hash table is built. The
main advantage of this is lower cost. This also allowed some cleanup of
unnecessary code.

However, this new patch causes warnings like this:

WARNING: temporary file leak: File 231 still referenced

I've been eyeballing the code for a while now, but I still fail to see
where this comes from :-( Any ideas?

Meh, the patches are wrong - I hadn't realized the tight coupling
between buckets/batches in ExecHashGetBucketAndBatch:

*bucketno = hashvalue & (nbuckets - 1);
*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);

The previous patches worked mostly by pure luck (the nbuckets was set
only in the first batch), but once I moved the code to the end, it
started failing. And by "worked" I mean "didn't throw an error, but
probably returned bogus results with (nbatch>1)".

However, ISTM this nbuckets-nbatch coupling is not really necessary;
it's only constructed this way to assign independent batch/bucket
numbers. So I went and changed the code like this:

*bucketno = hashvalue & (nbuckets - 1);
*batchno = (hashvalue >> (32 - hashtable->log2_nbatch));

I.e. using the top bits for batchno, low bits for bucketno (as before).
Hopefully I got it right this time. At least it seems to be working for
cases that failed before (no file leaks, proper rowcounts so far).

Hi,

I had a look at the patch and I was wondering if there is a way we can
set a cap on the resized size and instead increase the number of batches.
Since this patch evaluates the growth of tuples vs. the increase of
space, we could look at increasing the number of batches instead of the
number of buckets if the resized number of buckets exceeds a threshold.
It shouldn't be too hard; AIUI it would involve a call in MultiExecHash
right after the new cost evaluation the patch adds.
So you propose to fill the hash table, and when hitting NTUP_PER_BUCKET
(i.e. after adding 'nbuckets * NTUP_PER_BUCKET' tuples) increase the
number of batches?
Hmmm, that'd be easy to implement. I don't think it's possible to do
that in MultiExecHash (that's too late). But it's a matter of changing
this condition in ExecHashTableInsert:
    if (hashtable->spaceUsed > hashtable->spaceAllowed)
        ExecHashIncreaseNumBatches(hashtable);

to something like this

    if ((hashtable->spaceUsed > hashtable->spaceAllowed) ||
        (hashtable->totalTuples / hashtable->nbatch
            == NTUP_PER_BUCKET * hashtable->nbuckets))
        ExecHashIncreaseNumBatches(hashtable);
But I don't think increasing the number of batches is a good approach, as
it means more rescans. I think using memory is more efficient (after all,
that's what work_mem is for, right?).
regards
Tomas
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 3.7.2014 19:36, Tomas Vondra wrote:
On 1.7.2014 01:24, Tomas Vondra wrote:
On 30.6.2014 23:12, Tomas Vondra wrote:
Hi,
Hopefully I got it right this time. At least it seems to be working
for cases that failed before (no file leaks, proper rowcounts so far).

Attached v7, fixing nbatch/ntuples in an assert.
regards
Tomas
Attached is v8 of this patch, significantly reworked based on the
related discussion (mostly in 'tweaking NTUP_PER_BUCKET').
With the patch applied, the hash table works like this:
initial sizing (ExecChooseHashTableSize)
----------------------------------------
- size nbuckets for NTUP_PER_BUCKET=1
- account for memory allocated for buckets
building the hash table
-----------------------
- while adding tuples, keep track of optimal number of buckets (for the
current number of tuples per batch)
- don't resize until all the tuples are added (by increasing nbatch the
number of buckets may decrease)
- after adding all tuples, consider resizing (optimal nbuckets !=
initial nbuckets)
- do the resize
This means that for well estimated plans (reasonably exact tuple count
for the inner relation), the hash table is properly sized and no resize
is needed.
For badly estimated plans, this works about the same as the previous
patches (we're accounting for memory needed for buckets etc.).
More batches or less buckets?
-----------------------------
In the related threads, I repeatedly mentioned that I don't know a good
answer to "Should I add more batches or decrease the number of buckets?"
while sizing the hash table. The current code (as in HEAD) does not face
this dilemma because (a) it does not count buckets into work_mem and (b)
does not allow changing nbuckets.
I still don't have a good answer to this question, but I think the patch
can stand even without it. If you want more/less batches, use
smaller/larger work_mem - same as with the current code. Also, with the
'dense allocation' patch [1] (/messages/by-id/53C2DECC.2010408@fuzzy.cz),
it's possible to use larger work_mem
values (and thus compensate for counting buckets into work_mem).
Changes since v7
----------------
(a) remove NTUP_GROW_THRESHOLD (just use NTUP_PER_BUCKET directly)
(b) set NTUP_PER_BUCKET=1
(c) include buckets (optimal) when considering nbatch increase
(d) track optimal number of buckets (for NTUP_PER_BUCKET)
(e) after loading all the tuples, resize the hash table to get
nbuckets_optimal (if needed)
(f) renamed GUC to enable_hash_resize (makes a bit more sense)
Comments
--------
(a) NTUP_GROW_THRESHOLD was overcomplicated and mostly a result of
misunderstanding how NTUP_PER_BUCKET works (it's upper threshold,
not average) and is not really needed.
(b) The value "1" gives the best performance, but also requires more
memory for the buckets. This should be more than compensated by
densely allocating the tuples (see hashjoin-alloc-v3.patch
in the "tweaking NTUP_PER_BUCKET" thread [1]
(/messages/by-id/53C2DECC.2010408@fuzzy.cz)).
(c,d) Originally, the memory needed for buckets was not included in
'spaceUsed', but because the NTUP_PER_BUCKET=1 change this is
not really acceptable (needs much more memory than before).
So now it's included both in the initial sizing of the hash
table, and when adding more tuples (nbuckets_optimal).
Still, spaceUsed does not include palloc overhead (which may be
a significant amount of memory). This is tackled by [1]
(/messages/by-id/53C2DECC.2010408@fuzzy.cz).
(e) No change here.
(f) A bit better GUC name. Anyway, this is for easier development
only, and won't be included in the final patch.
regards
Tomas
Attachments:
hashjoin-nbuckets-growth-v8.patch
Attached v9 of the patch. Aside from a few minor fixes, the main change
is that this is assumed to be combined with the "dense allocation" patch.
It also rewrites the ExecHashIncreaseNumBuckets to follow the same
pattern as ExecHashIncreaseNumBatches (i.e. scanning chunks directly,
instead of buckets). It's cleaner and more consistent.
hashjoin-nbuckets-growth-v9.patch contains only this patch (so you need
to grab the hashjoin-alloc-v4.patch from a different thread and apply it
first).
hashjoin-nbuckets-growth-v9-combined.patch contains both patches combined
regards
Tomas Vondra
On 20.7.2014 18:29, Tomas Vondra wrote:
Attached v9 of the patch. Aside from a few minor fixes, the main change
is that this is assumed to be combined with the "dense allocation" patch.
It also rewrites the ExecHashIncreaseNumBuckets to follow the same
pattern as ExecHashIncreaseNumBatches (i.e. scanning chunks directly,
instead of buckets). It's cleaner and more consistent.
hashjoin-nbuckets-growth-v9.patch contains only this patch (so you need
to grab the hashjoin-alloc-v4.patch from a different thread and apply it
first).
hashjoin-nbuckets-growth-v9-combined.patch contains both patches combined
I just noticed that the changes made to ExecHashGetBucketAndBatch may
not be perfectly fine - it does not "break" the hash join (i.e. the
results are still correct), but I believe it may cause more batches. But
I'm not entirely sure about it, or how to fix that.
First, an intro into ExecHashGetBucketAndBatch internals, and how the
patch changes that. If not interested, skip to the "problem" section below.
The "old" ExecHashGetBucketAndBatch did this:
*bucketno = hashvalue & (nbuckets - 1);
*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
i.e. given the 32-bit hash value, the lowest log2_nbuckets bits are used
to determine bucket number. The rest of the hash (after removing the
bits used for buckets) is used for batch. I.e. something like this
31 23 15 7 0
|----------------|----------------|----------------|----------------|
| | <- batches | buckets |
This worked fine, because the number of bits for buckets was fixed, and
only the number of batches was growing. This was done by adding
most-significant bits (as illustrated by the tiny arrow). So when there
were 4 bits for batch number, after adding another bit (doubling
nbatches) batch '0000' would be split either into '00000' or '10000'.
With dynamic bucket count this does not work, because the batches and
buckets need to be independent (i.e. derived from non-overlapping parts
of the hash). The simplest approach of 'moving batches around' does not
work, because that would break this assert:
Assert(batchno > curbatch);
In other words "don't move tuples to already-processed batches". So the
batch number for a tuple needs to be sufficiently stable, and only ever
increase (never decrease).
So what I did is this:
31 23 15 7 0
|----------------|----------------|----------------|----------------|
| batches -> | | <- buckets |
and this is what happens in ExecHashGetBucketAndBatch:
*bucketno = hashvalue & (nbuckets - 1);
*batchno = (hashvalue >> (32 - hashtable->log2_nbatch));
So the "bucketno" expression is exactly the same (but now the nbuckets
may change dynamically), but "batchno" is now computed from the other
end of the hash (highest bits), and grows by adding a least-significant bit.
Problem
-------
The original implementation had the nice feature, that each increase of
number of batches (nbatches *= 2) split each batch in half. Half the
tuples stayed in the batch, half the tuples moved to one of the newly
created batches, thanks to adding a most-significant bit. The tuples
getting 0 bit stayed in the same batch, tuples getting 1 moved to the
new one (and it was guaranteed to be one of the new ones).
It's slightly more complicated because of the lazy behavior when
repeatedly incrementing the number of batches, but this is the
principle. Keep 1/2 the tuples, move 1/2 to another bucket.
The new implementation changes this, because the batch number grows in
the opposite direction - adds a least-significant bit, so for example
batch '1111' gets split into '11111' and '11110'.
It's roughly equal to
(batchno << 1) + K
where K is either 0 or 1, which is always >= than the old batchno. So it
does not violate the Assert (which is why I haven't noticed this earlier).
But it breaks the desirable behavior that 1/2 the tuples remain in the
current batch, and instead moves a lot of them to batches further down
the road, and needlessly increases the number of batches.
At least that's how I understand it ...
Possible solutions
------------------
Adding least-significant bit does not work, we need get back to adding
the most-significant one. Not sure what's the least complex way to do
that, though.
I'm thinking about computing the nbuckets limit (how many buckets may
fit into work_mem):
tupsize = HJTUPLE_OVERHEAD +
MAXALIGN(sizeof(MinimalTupleData)) +
MAXALIGN(tupwidth);
nbuckets_bits_max = my_log2(work_mem / tupsize)
and then start with nbatches from this bit, like this:
31 23 max 15 7 0
|----------------|----------------|----------------|----------------|
| | <- batches | free | <- buckets |
That might work, I guess. But maybe there's something better (using some
secret bitwise-operator-fu ...).
Further concerns
----------------
I'm wondering whether we should deploy some safe-guards against
exceeding the 32 bits in the hash. It probably was not a big issue
before, but with large work_mem values and NTUP_PER_BUCKET=1 this might
become a problem.
With work_mem=1GB, and 50B tuples (including overhead):
buckets = 21474836
which means ~25 bits reserved for nbucketno. That leaves only 7 bits for
batch number. Yeah, that's ~128 batches (so ~128GB hash table), but
maybe it's not that far.
Also, what if the tupwidth estimate is too low. In that case the
nbuckets_bits_max would be artificially high (and we'd have to do more
batches).
regards
Tomas
On Sat, Aug 9, 2014 at 9:13 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
Adding least-significant bit does not work, we need get back to adding
the most-significant one. Not sure what's the least complex way to do
that, though.
I'm thinking about computing the nbuckets limit (how many buckets may
fit into work_mem):
tupsize = HJTUPLE_OVERHEAD +
MAXALIGN(sizeof(MinimalTupleData)) +
MAXALIGN(tupwidth);
nbuckets_bits_max = my_log2(work_mem / tupsize)
and then start with nbatches from this bit, like this:
31 23 max 15 7 0
|----------------|----------------|----------------|----------------|
| | <- batches | free | <- buckets |
That doesn't seem unreasonable provided there's enough bit space, but,
as you point out, the day might not be far off when there isn't.
It also strikes me that when there's only 1 batch, the set of bits
that map onto the batch number is zero-width, and one zero-width bit
range is as good as another. In other words, if you're only planning
to do one batch, you can easily grow the number of buckets on the fly.
Growing the number of buckets only becomes difficult once you have
more than one batch.
And I mention that because, in the scenario mentioned in your original
post ("a hash table with small number of buckets (8192) containing
large number of tuples [~3.3M]"), presumably what happens is you start
out thinking you are going to have one batch with 8K buckets. Then,
as more tuples continue to pour in, you see the load factor rising and
responding by bumping up the size of the hash table. Now eventually
you realize, gosh, this isn't even going to fit into work_mem, so you
need to start batching it. But by that point you've presumably done
as much as you're going to do in terms of increasing the number of
buckets; after that, you're just going to add more batches as
necessary. So not being able to further increase the number of
buckets once the number of batches is no longer 1 wouldn't be a
problem in that case.
Now if you start out planning for multiple batches, and then you
discover you'd like to keep the same number of batches but have more
buckets per batch, that's gonna be hard. It's not impossible, because
you could steal some of the unused high-order bits above the number of
batches, and then concatenate them with the low-order bits that you
already had, but that seems like kind of an ugly kludge, and AFAICS it
only helps if the new tuples are narrower by enough to justify the
cost of splitting all the buckets. I won't say that couldn't happen,
but I think it would be better to keep that complexity out of the
patch unless we're absolutely certain it's justified.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 11.8.2014 20:25, Robert Haas wrote:
On Sat, Aug 9, 2014 at 9:13 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
Adding least-significant bit does not work, we need get back to adding
the most-significant one. Not sure what's the least complex way to do
that, though.
I'm thinking about computing the nbuckets limit (how many buckets may
fit into work_mem):
tupsize = HJTUPLE_OVERHEAD +
MAXALIGN(sizeof(MinimalTupleData)) +
MAXALIGN(tupwidth);
nbuckets_bits_max = my_log2(work_mem / tupsize)
and then start with nbatches from this bit, like this:
31 23 max 15 7 0
|----------------|----------------|----------------|----------------|
| | <- batches | free | <- buckets |
That doesn't seem unreasonable provided there's enough bit space, but,
as you point out, the day might not be far off when there isn't.
Thinking about this a bit more, I believe the danger is not that
imminent. 32 bits mean the Hash node should handle 2^32 tuples in total
(possibly split into multiple batches).
It used to be "2^32 buckets" thanks to NTUP_PER_BUCKET=10, but the patch
lowers this to 1 so it's "tuples" now.
But while we're a bit closer to exhausting the 32 bits, I think
it's not really that big issue. Not only hashing a table with ~4e9 rows
is going to be a hellish experience, but if we really want to support
it, we should probably switch to 64-bit hashes.
Because adding some checks is not going to help - it may detect an
issue, but it won't fix it. And actually, even if the two bit ranges
overlap, it does not mean the hashjoin will stop working. There'll be
dependency between batches and buckets, causing non-uniform usage of the
buckets, but it won't blow up.
So I don't think we need to worry about exhausting the 32 bits for now.
It also strikes me that when there's only 1 batch, the set of bits
that map onto the batch number is zero-width, and one zero-width bit
range is as good as another. In other words, if you're only planning
to do one batch, you can easily grow the number of buckets on the fly.
Growing the number of buckets only becomes difficult once you have
more than one batch.
Yes, that's true. It's also roughly what the patch does IMHO.
If you know you'll need more batches, you can start with the maximum
number of buckets right away (because you expect there's more data than
work_mem), so shoot for
nbuckets = my_log2(work_mem / tuple_size)
or something like that. It's also true that if you're starting with a
single batch, you can do whatever you want with nbuckets until you need
to do (nbatches *= 2).
It's also true that once you're done with the first batch, all the other
batches will be just fine with the number of buckets, because the
batches will be about equally sized.
But I think this is pretty much what the patch does, in the end.
And I mention that because, in the scenario mentioned in your original
post ("a hash table with small number of buckets (8192) containing
large number of tuples [~3.3M]"), presumably what happens is you start
out thinking you are going to have one batch with 8K buckets. Then,
as more tuples continue to pour in, you see the load factor rising and
responding by bumping up the size of the hash table. Now eventually
you realize, gosh, this isn't even going to fit into work_mem, so you
need to start batching it. But by that point you've presumably done
as much as you're going to do in terms of increasing the number of
buckets; after that, you're just going to add more batches as
necessary. So not being able to further increase the number of
buckets once the number of batches is no longer 1 wouldn't be a
problem in that case.Now if you start out planning for multiple batches, and then you
discover you'd like to keep the same number of batches but have more
buckets per batch, that's gonna be hard. It's not impossible, because
you could steal some of the unused high-order bits above the number of
batches, and then concatenate them with the low-order bits that you
already had, but that seems like kind of an ugly kludge, and AFAICS it
only helps if the new tuples are narrower by enough to justify the
cost of splitting all the buckets. I won't say that couldn't happen,
but I think it would be better to keep that complexity out of the
patch unless we're absolutely certain it's justified.
I'm definitely in favor of keeping the patch as simple as possible.
I was considering using reversing the bits of the hash, because that's
pretty much the simplest solution. But I think you're right it might
actually work like this:
* Are more batches needed?
(yes) => just use nbuckets = my_log2(work_mem / tuple_size)
(no) => go ahead, until processing all tuples or hitting work_mem
(work_mem) => meh, use the same nbuckets above
(all tuples) => compute optimal nbuckets / resize
But I need to think about this a bit. So far it seems to me there's no
way additional batches might benefit from increasing nbuckets further.
Each batch should contain about the same number of tuples, or more
correctly, the same number of distinct hash values. But if there are
multiple tuples with the same hash value, increasing the number of
buckets is not going to help (it's still going to end up in the same
bucket).
And the number of tuples in a batch may only get lower, while adding
batches (e.g. if a batch somewhere in the middle gets a lot of tuples
with the same hash value, exceeding work_mem).
I need to think about this a bit more, but so far it seems like a good
clean solution to me.
regards
Tomas
On 12.8.2014 00:30, Tomas Vondra wrote:
On 11.8.2014 20:25, Robert Haas wrote:
It also strikes me that when there's only 1 batch, the set of bits
that map onto the batch number is zero-width, and one zero-width bit
range is as good as another. In other words, if you're only planning
to do one batch, you can easily grow the number of buckets on the fly.
Growing the number of buckets only becomes difficult once you have
more than one batch.
...
I was considering using reversing the bits of the hash, because that's
pretty much the simplest solution. But I think you're right it might
actually work like this:
* Are more batches needed?
(yes) => just use nbuckets = my_log2(work_mem / tuple_size)
(no) => go ahead, until processing all tuples or hitting work_mem
(work_mem) => meh, use the same nbuckets above
(all tuples) => compute optimal nbuckets / resize
But I need to think about this a bit. So far it seems to me there's no
way additional batches might benefit from increasing nbuckets further.
I think this is a simple and solid solution, solving the batchno
computation issues quite nicely. Attached is v10 patch (bare and
combined with the dense allocation), that does this:
1) when we know we'll need batching, buckets are sized for full work_mem
(using the estimated tuple width, etc.)
2) without the batching, we estimate the 'right number of buckets' for
the estimated number of tuples, and keep track of the optimal number
as tuples are added to the hash table
- if we discover we need to start batching, we keep the current
optimal value (which should be the same as the max number of
buckets) and don't mess with it anymore (making it possible to
compute batch IDs just like before)
- also, on the first rebatch (nbatch=1 => nbatch=2) the hash table
is resized as part of the rebatch
- if the hash build completes without batching, we do the resize
I believe the patch is pretty much perfect. I plan to do more thorough
testing on a wide range of queries in the next few days.
I also removed the 'enable_hash_resize' GUC, because it would be more
complex to implement this properly after doing the resize as part of
rebatch etc. So either it would make the patch more complex, or it
wouldn't do what the name promises.
regards
Tomas
On Sat, Aug 16, 2014 at 9:31 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 12.8.2014 00:30, Tomas Vondra wrote:
On 11.8.2014 20:25, Robert Haas wrote:
It also strikes me that when there's only 1 batch, the set of bits
that map onto the batch number is zero-width, and one zero-width bit
range is as good as another. In other words, if you're only planning
to do one batch, you can easily grow the number of buckets on the fly.
Growing the number of buckets only becomes difficult once you have
more than one batch....
I was considering using reversing the bits of the hash, because that's
pretty much the simplest solution. But I think you're right it might
actually work like this:
* Are more batches needed?
(yes) => just use nbuckets = my_log2(work_mem / tuple_size)
(no) => go ahead, until processing all tuples or hitting work_mem
(work_mem) => meh, use the same nbuckets above
(all tuples) => compute optimal nbuckets / resize
But I need to think about this a bit. So far it seems to me there's no
way additional batches might benefit from increasing nbuckets further.
I think this is a simple and solid solution, solving the batchno
computation issues quite nicely. Attached is v10 patch (bare and
combined with the dense allocation), that does this:
1) when we know we'll need batching, buckets are sized for full work_mem
(using the estimated tuple width, etc.)
2) without the batching, we estimate the 'right number of buckets' for
the estimated number of tuples, and keep track of the optimal number
as tuples are added to the hash table
- if we discover we need to start batching, we keep the current
optimal value (which should be the same as the max number of
buckets) and don't mess with it anymore (making it possible to
compute batch IDs just like before)
- also, on the first rebatch (nbatch=1 => nbatch=2) the hash table
is resized as part of the rebatch
- if the hash build completes without batching, we do the resize
I believe the patch is pretty much perfect. I plan to do more thorough
testing on a wide range of queries in the next few days.
I also removed the 'enable_hash_resize' GUC, because it would be more
complex to implement this properly after doing the resize as part of
rebatch etc. So either it would make the patch more complex, or it
wouldn't do what the name promises.
A variety of trivial comments on this:
PostgreSQL style is un-cuddled curly braces. Also, multi-line
comments need to start with a line containing only /* and end with a
line containing only */. In one place you've added curly braces
around a single-line block that is otherwise unmodified; please don't
do that. In one place, you have "becase" instead of "because". In
another place, you write "add if after it" but it should say "add it
after it" or maybe better "add the new one after it". Avoid using
punctuation like "=>" in comments to illustrate the connection between
sentences; instead, use a connecting word like "then" or "therefore"
or whatever is appropriate; in this instance, a period followed by the
start of a new sentence seems sufficient. Revert the removal of a
single line of whitespace near the top of nodeHash.c.
There are too many things marked XXX in this patch. They should
either be fixed, if they are real problems, or they should be
commented in a way that doesn't give rise to the idea that they're
problems if they aren't.
OK, now on to some more substantive stuff:
1. It's not clear to me what the overall effect of this patch on
memory utilization is. Reducing NTUP_PER_BUCKET from 10 to 1 is going
to use, on the average, 10x as much bucket-header memory per tuple.
Specifically, I think it means we'll use about 8 bytes of
bucket-header memory per tuple instead of 0.8 bytes per tuple. If the
tuples are narrow, that could be significant; concerns have been
expressed about that here in the past. Increasing the number of
buckets could also increase memory usage. On the other hand, the
dense allocation stuff probably saves a ton of memory, so maybe we end
up ahead overall, but I'm not sure. Your thoughts, and maybe some test
results with narrow and wide tuples, would be appreciated.
2. But, on the positive side, modulo the memory utilization questions
mentioned above, I would expect the impact on hash join performance to
be positive. Going from 10 tuples per bucket to just 1 should help,
and on cases where the actual load factor would have ended up much
higher because of poor estimation, increasing the number of buckets on
the fly should help even more. I haven't tested this, though.
I haven't had a chance to completely go through this yet, so these are
just some initial thoughts.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 19.8.2014 19:05, Robert Haas wrote:
On Sat, Aug 16, 2014 at 9:31 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 12.8.2014 00:30, Tomas Vondra wrote:
On 11.8.2014 20:25, Robert Haas wrote:
It also strikes me that when there's only 1 batch, the set of bits
that map onto the batch number is zero-width, and one zero-width bit
range is as good as another. In other words, if you're only planning
to do one batch, you can easily grow the number of buckets on the fly.
Growing the number of buckets only becomes difficult once you have
more than one batch....
I was considering using reversing the bits of the hash, because that's
pretty much the simplest solution. But I think you're right it might
actually work like this:
* Are more batches needed?
(yes) => just use nbuckets = my_log2(work_mem / tuple_size)
(no) => go ahead, until processing all tuples or hitting work_mem
(work_mem) => meh, use the same nbuckets above
(all tuples) => compute optimal nbuckets / resize
But I need to think about this a bit. So far it seems to me there's no
way additional batches might benefit from increasing nbuckets further.
I think this is a simple and solid solution, solving the batchno
computation issues quite nicely. Attached is v10 patch (bare and
combined with the dense allocation), that does this:
1) when we know we'll need batching, buckets are sized for full work_mem
(using the estimated tuple width, etc.)
2) without the batching, we estimate the 'right number of buckets' for
the estimated number of tuples, and keep track of the optimal number
as tuples are added to the hash table
- if we discover we need to start batching, we keep the current
optimal value (which should be the same as the max number of
buckets) and don't mess with it anymore (making it possible to
compute batch IDs just like before)
- also, on the first rebatch (nbatch=1 => nbatch=2) the hash table
is resized as part of the rebatch
- if the hash build completes without batching, we do the resize
I believe the patch is pretty much perfect. I plan to do more thorough
testing on a wide range of queries in the next few days.
I also removed the 'enable_hash_resize' GUC, because it would be more
complex to implement this properly after doing the resize as part of
rebatch etc. So either it would make the patch more complex, or it
wouldn't do what the name promises.
A variety of trivial comments on this:
PostgreSQL style is un-cuddled curly braces. Also, multi-line
comments need to start with a line containing only /* and end with a
line containing only */. In one place you've added curly braces
around a single-line block that is otherwise unmodified; please don't
do that. In one place, you have "becase" instead of "because". In
another place, you write "add if after it" but it should say "add it
after it" or maybe better "add the new one after it". Avoid using
punctuation like "=>" in comments to illustrate the connection between
sentences; instead, use a connecting word like "then" or "therefore"
or whatever is appropriate; in this instance, a period followed by the
start of a new sentence seems sufficient. Revert the removal of a
single line of whitespace near the top of nodeHash.c.
There are too many things marked XXX in this patch. They should
either be fixed, if they are real problems, or they should be
commented in a way that doesn't give rise to the idea that they're
problems if they aren't.
OK, thanks for pointing this out. Attached is v11 of the patch (both
separate and combined with the dense allocation, as before).
I fixed as many of those issues as possible. All the XXX items were
obsolete, except for one in the chunk_alloc function.
I have also removed one constant
OK, now on to some more substantive stuff:
1. It's not clear to me what the overall effect of this patch on
memory utilization is. Reducing NTUP_PER_BUCKET from 10 to 1 is going
to use, on the average, 10x as much bucket-header memory per tuple.
Specifically, I think it means we'll use about 8 bytes of
bucket-header memory per tuple instead of 0.8 bytes per tuple. If the
tuples are narrow, that could be significant; concerns have been
expressed about that here in the past. Increasing the number of
buckets could also increase memory usage. On the other hand, the
dense allocation stuff probably saves a ton of memory, so maybe we end
up ahead overall, but I'm not sure. Your thoughts, and maybe some test
results with narrow and wide tuples, would be appreciated.
The effect of the dense allocation was briefly discussed in this thread,
along with some quick measurements:
/messages/by-id/53BEEA9E.2080009@fuzzy.cz
The dense allocation removes pretty much all the palloc overhead. For a
40B tuple, I did get this before the dense allocation
HashBatchContext: 1451221040 total in 182 blocks; 2826592 free (11
chunks); 1448394448 used
and this with the dense allocation patch applied
HashBatchContext: 729651520 total in 21980 blocks; 0 free (0 chunks);
729651520 used
So it's pretty much 50% reduction in memory consumption.
Now, the palloc header is >=16B, and removing it more than compensates
for the increased number of buckets (in the worst case, there may be ~2x
buckets per tuple, which is 2x8B pointers).
I did a number of tests, and the results are quite consistent with this.
2. But, on the positive side, modulo the memory utilization questions
mentioned above, I would expect the impact on hash join performance to
be positive. Going from 10 tuples per bucket to just 1 should help,
and on cases where the actual load factor would have ended up much
higher because of poor estimation, increasing the number of buckets on
the fly should help even more. I haven't tested this, though.
Yes, that's the results I was getting.
I've done a number of tests, and not a single one showed a slowdown so
far. Most well-estimated queries were 2-3x faster, and the poorly
estimated ones were orders of magnitude faster.
We're working on getting this tested on a range of workloads, I'll post
some results once I have that.
I haven't had a chance to completely go through this yet, so these are
just some initial thoughts.
Thanks!
Tomas
Hi everyone,
as Heikki mentioned in his "commitfest status" message, this patch still
has no reviewers :-( Is there anyone willing to pick up that, together
with the 'dense allocation' patch (as those two are closely related)?
I'm looking in Robert's direction, as he's the one who came up with the
dense allocation idea, and also commented on the hashjoin bucket resizing
patch ...
I'd ask Pavel Stehule, but as he's sitting next to me in the office he's
not really independent ;-) I'll do some reviews on the other patches
over the weekend, to balance this of course.
regards
Tomas
On Fri, Sep 5, 2014 at 3:23 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
as Heikki mentioned in his "commitfest status" message, this patch still
has no reviewers :-( Is there anyone willing to pick up that, together
with the 'dense allocation' patch (as those two are closely related)?
I'm looking in Robert's direction, as he's the one who came up with the
dense allocation idea, and also commented on the hashjoin bucket resizing
patch ...
I'd ask Pavel Stehule, but as he's sitting next to me in the office he's
not really independent ;-) I'll do some reviews on the other patches
over the weekend, to balance this of course.
Will any of those patches be, heh heh, mine?
I am a bit confused by the relationship between the two patches you
posted. The "combined" patch applies, but the other one does not. If
this is a sequence of two patches, it seems like it would be more
helpful to post A and B rather than B and A+B. Perhaps the other
patch is on a different thread but there's not an obvious reference to
said thread here.
In ExecChooseHashTableSize(), if buckets_bytes is independent of
nbatch, could we just compute it before working out nbatch, and just
deduct it from hash_table_bytes? That seems like it'd do the same
thing for less work. Either way, what happens if buckets_bytes is
already bigger than hash_table_bytes?
A few more stylistic issues that I see:
+ if ((hashtable->nbatch == 1) &&
+ if (hashtable->nbuckets_optimal <= (INT_MAX/2))
+ if (size > (HASH_CHUNK_SIZE/8))
While I'm all in favor of using parentheses generously where it's
needed to add clarity, these and similar usages seem over the top to
me.
On a related note, the first of these reads like this:

    if (stuff)
    {
        if (more stuff)
            { do things }
    }

which makes one wonder why we need two if statements.
+
+ /* used for dense allocation of tuples (into linked chunks) */
+ HashChunk chunks_batch; /* one list for the whole batch */
+
} HashJoinTableData;
If the original coding didn't have a blank line between the last
structure member and the closing brace, it's probably not right to
make it that way now. There are similar issues at the end of some
functions you modified, and a few other places (like the new code in
ExecChooseHashTableSize and chunk_alloc) where there are extra blank
lines at the starts and ends of blocks.
+char * chunk_alloc(HashJoinTable hashtable, int size) {
Eh... no.
+ /* XXX maybe we should use MAXALIGN(size) here ... */
+1.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company