DBT-3 with SF=20 got failed

Started by Kouhei Kaigai over 10 years ago · 51 messages
#1 Kouhei Kaigai
kaigai@ak.jp.nec.com

Hello,

I got the following error during DBT-3 benchmark with SF=20.

psql:query21.sql:50: ERROR: invalid memory alloc request size 1073741824
psql:query21.sql:50: ERROR: invalid memory alloc request size 1073741824

It looks to me like the Hash node tries to allocate a 1GB area using palloc0(),
but that exceeds the limit of the non-huge allocation interface.

(gdb) bt
#0 0x00007f669d29a989 in raise () from /lib64/libc.so.6
#1 0x00007f669d29c098 in abort () from /lib64/libc.so.6
#2 0x000000000090ccfd in ExceptionalCondition (conditionName=0xb18130 "!(((Size) (size) <= ((Size) 0x3fffffff)))",
errorType=0xb17efd "FailedAssertion", fileName=0xb17e40 "mcxt.c", lineNumber=856) at assert.c:54
#3 0x000000000093ad53 in palloc0 (size=1073741824) at mcxt.c:856
#4 0x0000000000673045 in ExecHashTableCreate (node=0x7f669de951f0, hashOperators=0x24dbf90, keepNulls=0 '\000') at nodeHash.c:391
#5 0x00000000006752e1 in ExecHashJoin (node=0x24d74e0) at nodeHashjoin.c:169
#6 0x000000000065abf4 in ExecProcNode (node=0x24d74e0) at execProcnode.c:477
#7 0x0000000000681026 in ExecNestLoop (node=0x24d6668) at nodeNestloop.c:123
#8 0x000000000065abca in ExecProcNode (node=0x24d6668) at execProcnode.c:469
#9 0x0000000000681026 in ExecNestLoop (node=0x24d61f8) at nodeNestloop.c:123
#10 0x000000000065abca in ExecProcNode (node=0x24d61f8) at execProcnode.c:469
#11 0x0000000000681026 in ExecNestLoop (node=0x24d5478) at nodeNestloop.c:123
#12 0x000000000065abca in ExecProcNode (node=0x24d5478) at execProcnode.c:469
#13 0x0000000000681026 in ExecNestLoop (node=0x24d51d0) at nodeNestloop.c:123
#14 0x000000000065abca in ExecProcNode (node=0x24d51d0) at execProcnode.c:469

The attached patch replaces this palloc0() with MemoryContextAllocHuge() + memset().
This hash table is built for a relation with nrows=119994544, so it is not
surprising that the bucket array itself grows larger than 1GB.

The other allocation requests that reset or expand the bucket array may also
need to use the "Huge" version of memory allocation, I think.
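
For reference, a standalone sketch of the arithmetic behind the failing request
size (the constants come from the EXPLAIN output below; the real sizing is done
in ExecChooseHashTableSize, so this is only an approximation):

    #include <stdio.h>

    int main(void)
    {
        long long ntuples   = 119994544;     /* estimated inner rows (see plan)  */
        long long nbuckets  = 134217728;     /* next power of two >= ntuples     */
        long long request   = nbuckets * 8;  /* one 8-byte bucket pointer each   */
        long long max_alloc = 0x3fffffff;    /* MaxAllocSize = 1GB - 1           */

        printf("request = %lld bytes, MaxAllocSize = %lld, too big = %d\n",
               request, max_alloc, request > max_alloc);
        return 0;
    }

It prints request = 1073741824, which is exactly the size in the error message.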

Thanks,

Below is the query itself and EXPLAIN result.
--------------------------------------------------------------------
dbt3c=# EXPLAIN VERBOSE
dbt3c-# select
dbt3c-# s_name,
dbt3c-# count(*) as numwait
dbt3c-# from
dbt3c-# supplier,
dbt3c-# lineitem l1,
dbt3c-# orders,
dbt3c-# nation
dbt3c-# where
dbt3c-# s_suppkey = l1.l_suppkey
dbt3c-# and o_orderkey = l1.l_orderkey
dbt3c-# and o_orderstatus = 'F'
dbt3c-# and l1.l_receiptdate > l1.l_commitdate
dbt3c-# and exists (
dbt3c(# select
dbt3c(# *
dbt3c(# from
dbt3c(# lineitem l2
dbt3c(# where
dbt3c(# l2.l_orderkey = l1.l_orderkey
dbt3c(# and l2.l_suppkey <> l1.l_suppkey
dbt3c(# )
dbt3c-# and not exists (
dbt3c(# select
dbt3c(# *
dbt3c(# from
dbt3c(# lineitem l3
dbt3c(# where
dbt3c(# l3.l_orderkey = l1.l_orderkey
dbt3c(# and l3.l_suppkey <> l1.l_suppkey
dbt3c(# and l3.l_receiptdate > l3.l_commitdate
dbt3c(# )
dbt3c-# and s_nationkey = n_nationkey
dbt3c-# and n_name = 'UNITED KINGDOM'
dbt3c-# group by
dbt3c-# s_name
dbt3c-# order by
dbt3c-# numwait desc,
dbt3c-# s_name
dbt3c-# LIMIT 100;

QUERY PLAN

----------------------------------------------------------------------------------------------------
 Limit  (cost=6792765.24..6792765.24 rows=1 width=26)
   Output: supplier.s_name, (count(*))
   ->  Sort  (cost=6792765.24..6792765.24 rows=1 width=26)
         Output: supplier.s_name, (count(*))
         Sort Key: (count(*)) DESC, supplier.s_name
         ->  HashAggregate  (cost=6792765.22..6792765.23 rows=1 width=26)
               Output: supplier.s_name, count(*)
               Group Key: supplier.s_name
               ->  Nested Loop Anti Join  (cost=4831094.94..6792765.21 rows=1 width=26)
                     Output: supplier.s_name
                     ->  Nested Loop  (cost=4831094.37..6792737.52 rows=1 width=34)
                           Output: supplier.s_name, l1.l_suppkey, l1.l_orderkey
                           Join Filter: (supplier.s_nationkey = nation.n_nationkey)
                           ->  Nested Loop  (cost=4831094.37..6792736.19 rows=1 width=38)
                                 Output: supplier.s_name, supplier.s_nationkey, l1.l_suppkey, l1.l_orderkey
                                 ->  Nested Loop  (cost=4831093.81..6792728.20 rows=1 width=42)
                                       Output: supplier.s_name, supplier.s_nationkey, l1.l_suppkey, l1.l_orderkey, l2.l_orderkey
                                       Join Filter: (l1.l_suppkey = supplier.s_suppkey)
                                       ->  Hash Semi Join  (cost=4831093.81..6783870.20 rows=1 width=12)
                                             Output: l1.l_suppkey, l1.l_orderkey, l2.l_orderkey
                                             Hash Cond: (l1.l_orderkey = l2.l_orderkey)
                                             Join Filter: (l2.l_suppkey <> l1.l_suppkey)
                                             ->  Index Scan using lineitem_l_orderkey_idx_part1 on public.lineitem l1  (cost=0.57..1847781.73 rows=39998181 width=8)
                                                   Output: l1.l_orderkey, l1.l_partkey, l1.l_suppkey, l1.l_linenumber, l1.l_quantity, l1.l_extendedprice, l1.l_discount, l1.l_tax, l1.l_returnflag, l1.l_linestatus, l1.l_shipdate, l1.l_commitdate, l1.l_receiptdate, l1.l_shipinstruct, l1.l_shipmode, l1.l_comment
                                             ->  Hash  (cost=3331161.44..3331161.44 rows=119994544 width=8)
                                                   Output: l2.l_orderkey, l2.l_suppkey
                                                   ->  Seq Scan on public.lineitem l2  (cost=0.00..3331161.44 rows=119994544 width=8)
                                                         Output: l2.l_orderkey, l2.l_suppkey
                                       ->  Seq Scan on public.supplier  (cost=0.00..6358.00 rows=200000 width=34)
                                             Output: supplier.s_suppkey, supplier.s_name, supplier.s_address, supplier.s_nationkey, supplier.s_phone, supplier.s_acctbal, supplier.s_comment
                                 ->  Index Scan using orders_o_orderkey_o_orderdate_idx on public.orders  (cost=0.56..7.98 rows=1 width=4)
                                       Output: orders.o_orderkey, orders.o_custkey, orders.o_orderstatus, orders.o_totalprice, orders.o_orderdate, orders.o_orderpriority, orders.o_clerk, orders.o_shippriority, orders.o_comment
                                       Index Cond: (orders.o_orderkey = l1.l_orderkey)
                                       Filter: (orders.o_orderstatus = 'F'::bpchar)
                           ->  Seq Scan on public.nation  (cost=0.00..1.31 rows=1 width=4)
                                 Output: nation.n_nationkey, nation.n_name, nation.n_regionkey, nation.n_comment
                                 Filter: (nation.n_name = 'UNITED KINGDOM'::bpchar)
                     ->  Index Scan using lineitem_l_orderkey_idx_part1 on public.lineitem l3  (cost=0.57..13.69 rows=89 width=8)
                           Output: l3.l_orderkey, l3.l_partkey, l3.l_suppkey, l3.l_linenumber, l3.l_quantity, l3.l_extendedprice, l3.l_discount, l3.l_tax, l3.l_returnflag, l3.l_linestatus, l3.l_shipdate, l3.l_commitdate, l3.l_receiptdate, l3.l_shipinstruct, l3.l_shipmode, l3.l_comment
                           Index Cond: (l3.l_orderkey = l1.l_orderkey)
                           Filter: (l3.l_suppkey <> l1.l_suppkey)
(41 rows)

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#2 Merlin Moncure
mmoncure@gmail.com
In reply to: Kouhei Kaigai (#1)
Re: DBT-3 with SF=20 got failed

On Wed, Jun 10, 2015 at 9:57 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

Hello,

I got the following error during DBT-3 benchmark with SF=20.

psql:query21.sql:50: ERROR: invalid memory alloc request size 1073741824
psql:query21.sql:50: ERROR: invalid memory alloc request size 1073741824

curious: what was work_mem set to?

merlin


#3 Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Merlin Moncure (#2)
Re: DBT-3 with SF=20 got failed

curious: what was work_mem set to?

work_mem=48GB

My machine mounts 256GB physical RAM.
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

#4 Jan Wieck
jan@wi3ck.info
In reply to: Kouhei Kaigai (#3)
Re: DBT-3 with SF=20 got failed

On 06/11/2015 09:53 AM, Kouhei Kaigai wrote:

curious: what was work_mem set to?

work_mem=48GB

My machine mounts 256GB physical RAM.

work_mem can be allocated several times per backend. Nodes like sort and
hash_aggregate may each allocate that much. You should set work_mem to a
fraction of physical-RAM / concurrent-connections depending on the
complexity of your queries. 48GB does not sound reasonable.
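
For example (illustrative numbers only, not taken from this report): with 256GB
of RAM, ~100 allowed connections, and queries that may each run 2-3
work_mem-sized sort/hash nodes, a conservative ceiling would be roughly
256GB / (100 * 3) ~= 850MB per work_mem, not 48GB.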

Regards, Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info


#5 Robert Haas
robertmhaas@gmail.com
In reply to: Kouhei Kaigai (#1)
Re: DBT-3 with SF=20 got failed

On Wed, Jun 10, 2015 at 10:57 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

The attached patch replaces this palloc0() with MemoryContextAllocHuge() + memset().
This hash table is built for a relation with nrows=119994544, so it is not
surprising that the bucket array itself grows larger than 1GB.

You forgot to attach the patch, I think. It looks to me like the size
of a HashJoinTuple is going to be 16 bytes, so 1GB/16 = ~64 million.
That's a lot of buckets, but maybe not unreasonably many if you've got
enough memory.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#6 Kohei KaiGai
kaigai@kaigai.gr.jp
In reply to: Jan Wieck (#4)
Re: DBT-3 with SF=20 got failed

2015-06-11 23:20 GMT+09:00 Jan Wieck <jan@wi3ck.info>:

On 06/11/2015 09:53 AM, Kouhei Kaigai wrote:

curious: what was work_mem set to?

work_mem=48GB

My machine mounts 256GB physical RAM.

work_mem can be allocated several times per backend. Nodes like sort and
hash_aggregate may each allocate that much. You should set work_mem to a
fraction of physical-RAM / concurrent-connections depending on the
complexity of your queries. 48GB does not sound reasonable.

A small max_connections combined with a large work_mem is a usual
configuration for typical OLAP workloads.

Even if the configuration is not reasonable, this is not the right error
message; people cannot tell from it how to fix the problem.

psql:query21.sql:50: ERROR: invalid memory alloc request size 1073741824


#7 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jan Wieck (#4)
Re: DBT-3 with SF=20 got failed

Hi,

On 06/11/15 16:20, Jan Wieck wrote:

On 06/11/2015 09:53 AM, Kouhei Kaigai wrote:

curious: what was work_mem set to?

work_mem=48GB

My machine mounts 256GB physical RAM.

work_mem can be allocated several times per backend. Nodes like sort
and hash_aggregate may each allocate that much. You should set
work_mem to a fraction of physical-RAM / concurrent-connections
depending on the complexity of your queries. 48GB does not sound
reasonable.

That's true, but there are cases where values like this may be useful
(e.g. for a particular query). We do allow such work_mem values, so I
consider this failure to be a bug.

The problem probably existed in the past, but it was amplified by the hash join
improvements I did for 9.5, because that uses NTUP_PER_BUCKET=1 instead
of NTUP_PER_BUCKET=10. So the arrays of buckets are much larger, and we
also use much more memory than we did in the past.

Interestingly, the hash code checks for INT_MAX overflows on a number of
places, but does not check for this ...

regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#8 Kohei KaiGai
kaigai@kaigai.gr.jp
In reply to: Robert Haas (#5)
1 attachment(s)
Re: DBT-3 with SF=20 got failed

2015-06-11 23:28 GMT+09:00 Robert Haas <robertmhaas@gmail.com>:

On Wed, Jun 10, 2015 at 10:57 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

The attached patch replaces this palloc0() with MemoryContextAllocHuge() + memset().
This hash table is built for a relation with nrows=119994544, so it is not
surprising that the bucket array itself grows larger than 1GB.

You forgot to attach the patch, I think.

Oops, I forgot to attach indeed.

It looks to me like the size
of a HashJoinTuple is going to be 16 bytes, so 1GB/16 = ~64 million.
That's a lot of buckets, but maybe not unreasonably many if you've got
enough memory.

EXPLAIN says this Hash node is fed by an underlying SeqScan with
119994544 (~119 million) rows, but the hash table is still much smaller
than my work_mem setting.

--
KaiGai Kohei <kaigai@kaigai.gr.jp>

Attachments:

hashslot-allocation-by-huge-alloc.patch (application/octet-stream)
 src/backend/executor/nodeHash.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 906cb46..63d484c 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -388,7 +388,9 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	MemoryContextSwitchTo(hashtable->batchCxt);
 
 	hashtable->buckets = (HashJoinTuple *)
-		palloc0(nbuckets * sizeof(HashJoinTuple));
+		MemoryContextAllocHuge(hashtable->batchCxt,
+							   nbuckets * sizeof(HashJoinTuple));
+	memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));
 
 	/*
 	 * Set up for skew optimization, if possible and there's a need for more
@@ -654,7 +656,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 		hashtable->nbuckets = hashtable->nbuckets_optimal;
 		hashtable->log2_nbuckets = hashtable->log2_nbuckets_optimal;
 
-		hashtable->buckets = repalloc(hashtable->buckets,
+		hashtable->buckets = repalloc_huge(hashtable->buckets,
 								sizeof(HashJoinTuple) * hashtable->nbuckets);
 	}
 
@@ -779,7 +781,7 @@ ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
 	 * chunks)
 	 */
 	hashtable->buckets =
-		(HashJoinTuple *) repalloc(hashtable->buckets,
+		(HashJoinTuple *) repalloc_huge(hashtable->buckets,
 								hashtable->nbuckets * sizeof(HashJoinTuple));
 
 	memset(hashtable->buckets, 0, sizeof(void *) * hashtable->nbuckets);
@@ -1217,7 +1219,9 @@ ExecHashTableReset(HashJoinTable hashtable)
 
 	/* Reallocate and reinitialize the hash bucket headers. */
 	hashtable->buckets = (HashJoinTuple *)
-		palloc0(nbuckets * sizeof(HashJoinTuple));
+		MemoryContextAllocHuge(hashtable->batchCxt,
+							   nbuckets * sizeof(HashJoinTuple));
+	memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));
 
 	hashtable->spaceUsed = 0;
 
#9 Kohei KaiGai
kaigai@kaigai.gr.jp
In reply to: Tomas Vondra (#7)
Re: DBT-3 with SF=20 got failed

2015-06-11 23:33 GMT+09:00 Tomas Vondra <tomas.vondra@2ndquadrant.com>:

Hi,

On 06/11/15 16:20, Jan Wieck wrote:

On 06/11/2015 09:53 AM, Kouhei Kaigai wrote:

curious: what was work_mem set to?

work_mem=48GB

My machine mounts 256GB physical RAM.

work_mem can be allocated several times per backend. Nodes like sort
and hash_aggregate may each allocate that much. You should set
work_mem to a fraction of physical-RAM / concurrent-connections
depending on the complexity of your queries. 48GB does not sound
reasonable.

That's true, but there are cases where values like this may be useful (e.g.
for a particular query). We do allow such work_mem values, so I consider
this failure to be a bug.

The problem probably existed in the past, but it was amplified by the hash join
improvements I did for 9.5, because that uses NTUP_PER_BUCKET=1 instead of
NTUP_PER_BUCKET=10. So the arrays of buckets are much larger, and we also
use much more memory than we did in the past.

Interestingly, the hash code checks for INT_MAX overflows on a number of
places, but does not check for this ...

Which limit should be changed in this case?

Indeed, nbuckets is declared as int, so INT_MAX is a hard limit on the number
of hash slots. However, some extreme usage can easily create a situation where
we hit this restriction.

Should we make nbuckets a long int?

--
KaiGai Kohei <kaigai@kaigai.gr.jp>


#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#7)
Re: DBT-3 with SF=20 got failed

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

Interestingly, the hash code checks for INT_MAX overflows on a number of
places, but does not check for this ...

Yeah, and at least at one time there were checks to prevent the hash table
request from exceeding MaxAllocSize. Did those get removed by somebody?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#10)
Re: DBT-3 with SF=20 got failed

On 06/11/15 16:54, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

Interestingly, the hash code checks for INT_MAX overflows on a number of
places, but does not check for this ...

Yeah, and at least at one time there were checks to prevent the hash
table request from exceeding MaxAllocSize. Did those get removed by
somebody?

I think the problem is in this piece of code:

    if ((hashtable->nbatch == 1) &&
        (hashtable->nbuckets_optimal <= INT_MAX / 2) &&  /* overflow protection */
        (ntuples >= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET)))
    {
        hashtable->nbuckets_optimal *= 2;
        hashtable->log2_nbuckets_optimal += 1;
    }

ISTM it does not check against the max_pointers (that's only done in
ExecChooseHashTableSize).

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#12 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#5)
Re: DBT-3 with SF=20 got failed

Hi,

On 06/11/15 16:28, Robert Haas wrote:

On Wed, Jun 10, 2015 at 10:57 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

The attached patch replaces this palloc0() with MemoryContextAllocHuge() + memset().
This hash table is built for a relation with nrows=119994544, so it is not
surprising that the bucket array itself grows larger than 1GB.

You forgot to attach the patch, I think. It looks to me like the size
of a HashJoinTuple is going to be 16 bytes, so 1GB/16 = ~64 million.
That's a lot of buckets, but maybe not unreasonably many if you've got
enough memory.

Actually, HashJoinTuple is just a pointer, so it's 8 bytes, and 1GB is
enough for 134217728 (~134 million) buckets, which is more than the 119994544
rows from the plan.

Also, looking at the error message again:

ERROR: invalid memory alloc request size 1073741824

but this time with beer goggles, I noticed that the amount reported is
exactly 1GB. The backtrace also shows the error happens right inside
ExecHashTableCreate (and not in the resize which may happen later),
which means it gets the nbuckets from ExecChooseHashTableSize directly.

The resize is probably still broken, as I mentioned before, but this crashes
before reaching that code, because the estimates are high enough to trigger
the issue in the initial allocation. ExecChooseHashTableSize is supposed to
keep all the checks from previous versions, and I think it actually does.

But I don't see any checks there regarding the 1GB boundary. What I see
is this:

    max_pointers = (work_mem * 1024L) / sizeof(void *);
    max_pointers = Min(max_pointers, INT_MAX / 2);

    ...

    dbuckets = Min(dbuckets, max_pointers);

That has nothing to do with 1GB, and it has been in the code since the time
work_mem was limited to 2GB, so perhaps there was some reasoning that
it's sufficient (because the tuples stored in the hash table will need
more than 1/2 of the memory, or something like that).

But today this issue is more likely, because people have more RAM and
use higher work_mem values, so the max_pointers value gets much higher.
In the extreme it may get to INT_MAX/2, i.e. ~1 billion pointers, so the
buckets would take ~8GB on 64-bit machines (on 32-bit machines we'd also get
twice the number of pointers, compared to 64 bits, but that's mostly
irrelevant, because of the memory size limits).

It's also true that the hash-join improvements in 9.5 - namely the
decrease of NTUP_PER_BUCKET from 10 to 1 - made this error more likely.
With 9.4 we'd use only 16777216 buckets (128MB), because that gets us
below 10 tuples per bucket. But now we're shooting for 1 tuple per
bucket, so we end up with 2^27 (~134M) buckets, and that's exactly 1GB.
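
The following standalone sketch (not PostgreSQL code; it only mimics the
power-of-two sizing described above) reproduces those two numbers:

    #include <stdio.h>
    #include <math.h>

    /* smallest power of two (>= 1024) giving at most ntup_per_bucket
     * tuples per bucket, roughly as ExecChooseHashTableSize does it */
    static long buckets_for(double ntuples, int ntup_per_bucket)
    {
        long nbuckets = 1024;

        while (nbuckets < ceil(ntuples / ntup_per_bucket))
            nbuckets *= 2;
        return nbuckets;
    }

    int main(void)
    {
        double ntuples = 119994544.0;   /* inner rows from the plan */

        printf("9.4 (NTUP_PER_BUCKET=10): %ld buckets = %ld MB\n",
               buckets_for(ntuples, 10), buckets_for(ntuples, 10) * 8 >> 20);
        printf("9.5 (NTUP_PER_BUCKET=1):  %ld buckets = %ld MB\n",
               buckets_for(ntuples, 1), buckets_for(ntuples, 1) * 8 >> 20);
        return 0;
    }

It prints 16777216 buckets (128 MB) for the 9.4 behaviour and 134217728 buckets
(1024 MB) for the 9.5 behaviour, i.e. just over MaxAllocSize.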

I see two ways to fix this:

(1) enforce the 1GB limit (probably better for back-patching, if that's
necessary)

(2) make it work with hash tables over 1GB

I'm in favor of (2) if there's a good way to do that. It seems a bit
stupid not to be able to use fast hash table because there's some
artificial limit. Are there any fundamental reasons not to use the
MemoryContextAllocHuge fix, proposed by KaiGai-san?

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#13 David Rowley
david.rowley@2ndquadrant.com
In reply to: Kohei KaiGai (#8)
Re: DBT-3 with SF=20 got failed

On 12 June 2015 at 02:40, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:

2015-06-11 23:28 GMT+09:00 Robert Haas <robertmhaas@gmail.com>:

On Wed, Jun 10, 2015 at 10:57 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

The attached patch replaces this palloc0() with MemoryContextAllocHuge() + memset().
This hash table is built for a relation with nrows=119994544, so it is not
surprising that the bucket array itself grows larger than 1GB.

You forgot to attach the patch, I think.

Oops, I forgot to attach indeed.

It looks to me like the size
of a HashJoinTuple is going to be 16 bytes, so 1GB/16 = ~64 million.
That's a lot of buckets, but maybe not unreasonably many if you've got
enough memory.

EXPLAIN says this Hash node is fed by an underlying SeqScan with
119994544 (~119 million) rows, but the hash table is still much smaller
than my work_mem setting.

I've just run into this problem while running a TPC-H benchmark of 100GB,
on a machine with 64GB of RAM.
When attempting to run Q21 with a work_mem of 10GB I'm getting:
ERROR: invalid memory alloc request size 1073741824

Setting work_mem to 1GB or less gets the query running.

I've patched the code with your patch, Kohei, and it's now working.

Thought I'd better post this just in case this gets forgotten about.

Thanks

David

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#14 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tomas Vondra (#12)
Re: DBT-3 with SF=20 got failed

On 12 June 2015 at 00:29, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

I see two ways to fix this:

(1) enforce the 1GB limit (probably better for back-patching, if that's
necessary)

(2) make it work with hash tables over 1GB

I'm in favor of (2) if there's a good way to do that. It seems a bit
stupid not to be able to use fast hash table because there's some
artificial limit. Are there any fundamental reasons not to use the
MemoryContextAllocHuge fix, proposed by KaiGai-san?

If there are no objections, I will apply the patch for 2) to HEAD and
backpatch to 9.5.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#15 Kohei KaiGai
kaigai@kaigai.gr.jp
In reply to: Simon Riggs (#14)
Re: DBT-3 with SF=20 got failed

2015-08-19 20:12 GMT+09:00 Simon Riggs <simon@2ndquadrant.com>:

On 12 June 2015 at 00:29, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

I see two ways to fix this:

(1) enforce the 1GB limit (probably better for back-patching, if that's
necessary)

(2) make it work with hash tables over 1GB

I'm in favor of (2) if there's a good way to do that. It seems a bit
stupid not to be able to use fast hash table because there's some artificial
limit. Are there any fundamental reasons not to use the
MemoryContextAllocHuge fix, proposed by KaiGai-san?

If there are no objections, I will apply the patch for 2) to HEAD and
backpatch to 9.5.

Please don't rush. :-)

It is not difficult to replace palloc() with palloc_huge(); however, it may lead
to another problem once the planner gives us a crazy estimate.
Below is my comment from the other thread.

==========
Also, we may need to pay attention to the reliability of the planner's
scale estimates.
Even though the plan below says that the join generates 60521928028 rows,
it actually generates 776157676 rows (about 1.3% of the estimate).

tpcds100=# EXPLAIN ANALYZE select
               ws1.ws_order_number, ws1.ws_warehouse_sk wh1, ws2.ws_warehouse_sk wh2
           from web_sales ws1, web_sales ws2
           where ws1.ws_order_number = ws2.ws_order_number
             and ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk;
                                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=25374644.08..1160509591.61 rows=60521928028 width=24) (actual time=138347.979..491889.343 rows=776157676 loops=1)
   Merge Cond: (ws1.ws_order_number = ws2.ws_order_number)
   Join Filter: (ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
   Rows Removed by Join Filter: 127853313
   ->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16) (actual time=73252.300..79017.420 rows=72001237 loops=1)
         Sort Key: ws1.ws_order_number
         Sort Method: quicksort  Memory: 7083296kB
         ->  Seq Scan on web_sales ws1  (cost=0.00..3290612.48 rows=72001248 width=16) (actual time=0.023..39951.201 rows=72001237 loops=1)
   ->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16) (actual time=65095.655..128885.811 rows=904010978 loops=1)
         Sort Key: ws2.ws_order_number
         Sort Method: quicksort  Memory: 7083296kB
         ->  Seq Scan on web_sales ws2  (cost=0.00..3290612.48 rows=72001248 width=16) (actual time=0.014..31046.888 rows=72001237 loops=1)
 Planning time: 0.232 ms
 Execution time: 530176.521 ms
(14 rows)

So, even if we allow nodeHash.c to allocate hash buckets larger than 1GB,
the initial size should be determined carefully.
Probably, 1GB is a good starting point, even if the table is expanded later.

--
KaiGai Kohei <kaigai@kaigai.gr.jp>


#16 Simon Riggs
simon@2ndQuadrant.com
In reply to: Kohei KaiGai (#15)
Re: DBT-3 with SF=20 got failed

On 19 August 2015 at 12:55, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:

2015-08-19 20:12 GMT+09:00 Simon Riggs <simon@2ndquadrant.com>:

On 12 June 2015 at 00:29, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

I see two ways to fix this:

(1) enforce the 1GB limit (probably better for back-patching, if that's
necessary)

(2) make it work with hash tables over 1GB

I'm in favor of (2) if there's a good way to do that. It seems a bit
stupid not to be able to use fast hash table because there's some artificial
limit. Are there any fundamental reasons not to use the
MemoryContextAllocHuge fix, proposed by KaiGai-san?

If there are no objections, I will apply the patch for 2) to HEAD and
backpatch to 9.5.

Please don't rush. :-)

Please explain what rush you see?

It is not difficult to replace palloc() with palloc_huge(); however, it may lead
to another problem once the planner gives us a crazy estimate.
Below is my comment from the other thread.

Yes, I can read both threads and would apply patches for each problem.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#17 Kohei KaiGai
kaigai@kaigai.gr.jp
In reply to: Simon Riggs (#16)
Re: DBT-3 with SF=20 got failed

2015-08-19 21:29 GMT+09:00 Simon Riggs <simon@2ndquadrant.com>:

On 19 August 2015 at 12:55, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:

2015-08-19 20:12 GMT+09:00 Simon Riggs <simon@2ndquadrant.com>:

On 12 June 2015 at 00:29, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

I see two ways to fix this:

(1) enforce the 1GB limit (probably better for back-patching, if that's
necessary)

(2) make it work with hash tables over 1GB

I'm in favor of (2) if there's a good way to do that. It seems a bit
stupid not to be able to use fast hash table because there's some artificial
limit. Are there any fundamental reasons not to use the
MemoryContextAllocHuge fix, proposed by KaiGai-san?

If there are no objections, I will apply the patch for 2) to HEAD and
backpatch to 9.5.

Please don't rush. :-)

Please explain what rush you see?

Unless we have a fail-safe mechanism for the case where the planner
estimates a much larger number of tuples than are actually needed, a bad
estimate will consume a massive amount of RAM. That's a bad side effect.
My previous patch didn't pay attention to that scenario, so it needs to be
revised.

Thanks,

It is not difficult to replace palloc() with palloc_huge(); however, it may lead
to another problem once the planner gives us a crazy estimate.
Below is my comment from the other thread.

Yes, I can read both threads and would apply patches for each problem.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
KaiGai Kohei <kaigai@kaigai.gr.jp>


#18 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#16)
Re: DBT-3 with SF=20 got failed

Simon Riggs <simon@2ndQuadrant.com> writes:

On 19 August 2015 at 12:55, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:

Please don't rush. :-)

Please explain what rush you see?

Yours. You appear to be in a hurry to apply patches that there's no
consensus on.

It is not difficult to replace palloc() with palloc_huge(); however, it may
lead to another problem once the planner gives us a crazy estimate.
Below is my comment from the other thread.

Yes, I can read both threads and would apply patches for each problem.

I don't see anything very wrong with constraining the initial allocation
to 1GB, or even less. That will prevent consuming insane amounts of
work_mem when the planner's rows estimate is too high rather than too low.
And we do have the ability to increase the hash table size on the fly.

The real problem here is the potential integer overflow in
ExecChooseHashTableSize. Now that we know there is one, that should be
fixed (and not only in HEAD/9.5). But I believe Kaigai-san withdrew his
initial proposed patch, and we don't have a replacement as far as I saw.

regards, tom lane


#19 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#18)
Re: DBT-3 with SF=20 got failed

On 19 August 2015 at 14:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

On 19 August 2015 at 12:55, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:

Please don't rush. :-)

Please explain what rush you see?

Yours. You appear to be in a hurry to apply patches that there's no
consensus on.

I think that comment is unreasonable.

The problem was reported 2 months ago; following independent confirmation
of the proposed patch, I suggested committing it, with these words:

"If there are no objections, I will apply the patch for 2) to HEAD and
backpatch to 9.5."

I was clearly waiting for objections before acting, to test whether there
was consensus or not.

Please explain what procedure you would like committers to follow?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#20 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Kohei KaiGai (#15)
Re: DBT-3 with SF=20 got failed

Hi,

On 08/19/2015 01:55 PM, Kohei KaiGai wrote:

So, even if we allow nodeHash.c to allocate hash buckets larger than 1GB,
the initial size should be determined carefully.
Probably, 1GB is a good starting point, even if the table is expanded later.

I'm not sure I understand what the problem is here. Could you elaborate?

The initial size of the hash table is determined using the estimate, and
if we overestimate it will create more buckets (i.e. consuming more
memory) and/or start batching (which might be unnecessary).

But I don't really see any "more careful" way to do this, without
penalizing the cases where the estimate is actually correct - e.g. by
starting with much smaller buckets (and then resizing the hash table,
which is not free). Or by starting without batching, betting that we
won't actually need it.

I think it'll be very difficult to get those working without causing
real trouble for cases where we actually do have good estimates (and
those are the vast majority of queries).

But both of those are features, and we're dealing with a bug fix here.

kind regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#21 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Kohei KaiGai (#17)
Re: DBT-3 with SF=20 got failed

Hello KaiGain-san,

On 08/19/2015 03:19 PM, Kohei KaiGai wrote:

Unless we have a fail-safe mechanism for the case where the planner
estimates a much larger number of tuples than are actually needed, a bad
estimate will consume a massive amount of RAM. That's a bad side effect.
My previous patch didn't pay attention to that scenario, so it needs to be
revised.

I agree we need to put a few more safeguards there (e.g. make sure we
don't overflow INT when counting the buckets, which may happen with the
amounts of work_mem we'll see in the wild soon).

But I think we should not do any extensive changes to how we size the
hashtable - that's not something we should do in a bugfix I think.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#22 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#21)
2 attachment(s)
Re: DBT-3 with SF=20 got failed

Hi,

On 08/20/2015 04:15 AM, Tomas Vondra wrote:

Hello KaiGain-san,

On 08/19/2015 03:19 PM, Kohei KaiGai wrote:

Unless we have a fail-safe mechanism for the case where the planner
estimates a much larger number of tuples than are actually needed, a bad
estimate will consume a massive amount of RAM. That's a bad side effect.
My previous patch didn't pay attention to that scenario, so it needs to be
revised.

I agree we need to put a few more safeguards there (e.g. make sure we
don't overflow INT when counting the buckets, which may happen with the
amounts of work_mem we'll see in the wild soon).

But I think we should not do any extensive changes to how we size the
hashtable - that's not something we should do in a bugfix I think.

Attached are two alternative versions of a backpatch. Both limit the
nbuckets so that it never exceeds MaxAllocSize - effectively 512MB due
to the doubling (so ~67M buckets on 64-bit architectures).

The only difference is that the 'alternative' patch limits max_pointers

+       /* ensure we don't exceed the maximum allocation size */
+       max_pointers = Min(max_pointers, MaxAllocSize / sizeof(void*));

so it affects both nbuckets and nbatches. That seems a bit more correct,
but I guess whoever gets this many batches would be grateful even for
the quick crash.
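
As a sanity check of where that cap lands, a standalone sketch (not PostgreSQL
code; it just mimics the power-of-two bucket growth with a MaxAllocSize clamp
added, using the row count from the plan reported earlier in this thread):

    #include <stdio.h>

    #define MAX_ALLOC_SIZE  ((size_t) 0x3fffffff)   /* MaxAllocSize: 1GB - 1 */
    #define NTUP_PER_BUCKET 1

    int main(void)
    {
        long ntuples  = 119994544;   /* inner rows from the DBT-3 plan */
        long nbuckets = 1024;        /* sizing starts at 2^10          */

        /* double the bucket count only while the bucket array would still
         * fit into a single non-huge allocation (8-byte pointers)        */
        while (ntuples >= nbuckets * NTUP_PER_BUCKET &&
               (size_t) nbuckets * 2 * sizeof(void *) <= MAX_ALLOC_SIZE)
            nbuckets *= 2;

        printf("nbuckets capped at %ld (%zu MB of buckets)\n",
               nbuckets, ((size_t) nbuckets * sizeof(void *)) >> 20);
        return 0;
    }

This prints 67108864 buckets (512 MB), matching the limit described above.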

For master, I think the right patch is what KaiGai-san posted in June. I
don't think we should really try making it smarter about handling
overestimates at this point - that's 9.6 stuff IMNSHO.

I find it a bit awkward that we only have MemoryContextAllocHuge and
repalloc_huge, especially as nodeHash.c needs MemoryContextAllocHuge +
memset to zero the chunk.

So I think we should extend the memutils API by adding palloc0_huge (and
possibly palloc_huge, although that's not needed by nodeHash.c).
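
A sketch of what such a wrapper could look like (hypothetical - palloc0_huge
does not exist in memutils today; it just wraps the existing
MemoryContextAllocHuge() the same way nodeHash.c currently has to):

    /* Allocate a zero-filled chunk in CurrentMemoryContext, permitting
     * "huge" (larger than MaxAllocSize) request sizes. */
    void *
    palloc0_huge(Size size)
    {
        void   *ptr = MemoryContextAllocHuge(CurrentMemoryContext, size);

        memset(ptr, 0, size);
        return ptr;
    }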

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

hash-nbuckets-backpatch.patch (text/x-diff)
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6917d3f..e72ac8b 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -475,6 +475,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 
 		lbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
 		lbuckets = Min(lbuckets, max_pointers);
+		lbuckets = Min(lbuckets, MaxAllocSize / sizeof(void*));
 		nbuckets = (int) lbuckets;
 
 		dbatch = ceil(inner_rel_bytes / hash_table_bytes);
@@ -491,6 +492,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 
 		dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
 		dbuckets = Min(dbuckets, max_pointers);
+		dbuckets = Min(dbuckets, MaxAllocSize / sizeof(void*));
 		nbuckets = (int) dbuckets;
 
 		nbatch = 1;
@@ -500,7 +502,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	 * Both nbuckets and nbatch must be powers of 2 to make
 	 * ExecHashGetBucketAndBatch fast.  We already fixed nbatch; now inflate
 	 * nbuckets to the next larger power of 2.  We also force nbuckets to not
-	 * be real small, by starting the search at 2^10.  (Note: above we made
+	 * be real small, by starting the search at 2^10, or too large - we allocate
+	 * them as a single chunk, so must fit in MaxAllocSize. (Note: above we made
 	 * sure that nbuckets is not more than INT_MAX / 2, so this loop cannot
 	 * overflow, nor can the final shift to recalculate nbuckets.)
 	 */
hash-nbuckets-backpatch-alt.patch (text/x-diff)
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6917d3f..ca90aed 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -465,6 +465,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	max_pointers = (work_mem * 1024L) / sizeof(void *);
 	/* also ensure we avoid integer overflow in nbatch and nbuckets */
 	max_pointers = Min(max_pointers, INT_MAX / 2);
+	/* ensure we don't exceed the maximum allocation size */
+	max_pointers = Min(max_pointers, MaxAllocSize / sizeof(void*));
 
 	if (inner_rel_bytes > hash_table_bytes)
 	{
@@ -500,7 +502,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	 * Both nbuckets and nbatch must be powers of 2 to make
 	 * ExecHashGetBucketAndBatch fast.  We already fixed nbatch; now inflate
 	 * nbuckets to the next larger power of 2.  We also force nbuckets to not
-	 * be real small, by starting the search at 2^10.  (Note: above we made
+	 * be real small, by starting the search at 2^10, or too large - we allocate
+	 * them as a single chunk, so must fit in MaxAllocSize. (Note: above we made
 	 * sure that nbuckets is not more than INT_MAX / 2, so this loop cannot
 	 * overflow, nor can the final shift to recalculate nbuckets.)
 	 */
#23 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#18)
Re: DBT-3 with SF=20 got failed

Hi,

On 08/19/2015 03:53 PM, Tom Lane wrote:

I don't see anything very wrong with constraining the initial
allocation to 1GB, or even less. That will prevent consuming insane
amounts of work_mem when the planner's rows estimate is too high
rather than too low. And we do have the ability to increase the hash
table size on the fly.

Perhaps. Limiting the initial allocation to 1GB might help with lowering
memory consumed in case of over-estimation, and it's high enough so that
regular queries don't run into that.

My initial thought was "if the memory consumption is a problem, lower
the work_mem" but after thinking about it for a while I don't see a
reason to limit the memory consumption this way, if possible.

But what is the impact on queries that actually need more than 1GB of
buckets? I assume we'd only limit the initial allocation and still allow
the resize based on the actual data (i.e. the 9.5 improvement), so the
queries would start with 1GB and then resize once they find out the
optimal size (as done in 9.5). The resize is not very expensive, but
it's not free either, and with so many tuples (requiring more than 1GB
of buckets, i.e. ~130M tuples) it's probably just noise in the total
query runtime. But it'd be nice to see some proof of that ...

Also, why 1GB and not 512MB (which is effectively the current limit on
9.4, because we can't allocate 1GB there so we end up with 1/2 of that)?
Why not some small percentage of work_mem, e.g. 5%?

Most importantly, this is mostly orthogonal to the original bug report.
Even if we do this, we still have to fix the palloc after the resize.

So I'd say let's do a minimal bugfix of the actual issue, and then
perhaps improve the behavior in case of significant overestimates. The
bugfix should happen ASAP so that it gets into 9.5.0 (and backported).
We already have patches for that.

Limiting the initial allocation deserves more thorough discussion and
testing, and I'd argue it's 9.6 material at this point.

The real problem here is the potential integer overflow in
ExecChooseHashTableSize. Now that we know there is one, that should
be fixed (and not only in HEAD/9.5).

I don't think there's any integer overflow. The problem is that we end
up with nbuckets so high that (nbuckets * 8) exceeds the 1GB-1 allocation limit.

There's protection against integer overflow in place (it has been there for
a long time), but there never was any protection against the 1GB limit,
because with the much smaller work_mem values and NTUP_PER_BUCKET=10 we
were using before, we could not actually reach it.

But I believe Kaigai-san withdrew his initial proposed patch, and we
don't have a replacement as far as I saw.

I believe the patch proposed by KaiGai-san is the right one to fix the
bug discussed in this thread. My understanding is KaiGai-san withdrew
the patch as he wants to extend it to address the over-estimation issue.

I don't think we should do that - IMHO that's an unrelated improvement
and should be addressed in a separate patch.

kind regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#24Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Tomas Vondra (#23)
Re: DBT-3 with SF=20 got failed

On 08/19/2015 03:53 PM, Tom Lane wrote:

I don't see anything very wrong with constraining the initial
allocation to 1GB, or even less. That will prevent consuming insane
amounts of work_mem when the planner's rows estimate is too high
rather than too low. And we do have the ability to increase the hash
table size on the fly.

Perhaps. Limiting the initial allocation to 1GB might help with lowering
memory consumed in case of over-estimation, and it's high enough so that
regular queries don't run into that.

My initial thought was "if the memory consumption is a problem, lower
the work_mem" but after thinking about it for a while I don't see a
reason to limit the memory consumption this way, if possible.

But what is the impact on queries that actually need more than 1GB of
buckets? I assume we'd only limit the initial allocation and still allow
the resize based on the actual data (i.e. the 9.5 improvement), so the
queries would start with 1GB and then resize once finding out the
optimal size (as done in 9.5). The resize is not very expensive, but
it's not free either, and with so many tuples (requiring more than 1GB
of buckets, i.e. ~130M tuples) it's probably just noise in the total
query runtime. But it'd be nice to see some proofs of that ...

The problem here is that we cannot know the exact size unless the Hash node
reads the entire inner relation. All we can do is rely on the planner's
estimation; however, it often computes a crazy number of rows.
I think resizing of hash buckets is a reasonable compromise.

Also, why 1GB and not 512MB (which is effectively the current limit on
9.4, because we can't allocate 1GB there so we end up with 1/2 of that)?
Why not some small percentage of work_mem, e.g. 5%?

No clear reason at this moment.

Most importantly, this is mostly orthogonal to the original bug report.
Even if we do this, we still have to fix the palloc after the resize.

So I'd say let's do a minimal bugfix of the actual issue, and then
perhaps improve the behavior in case of significant overestimates. The
bugfix should happen ASAP so that it gets into 9.5.0 (and backported).
We already have patches for that.

Limiting the initial allocation deserves more thorough discussion and
testing, and I'd argue it's 9.6 material at this point.

Indeed, the original bug was caused by the use of the normal (non-huge)
MemoryContextAlloc(). The memory over-consumption issue is a problem newly
raised on top of that fix; I didn't notice it two months ago.

The real problem here is the potential integer overflow in
ExecChooseHashTableSize. Now that we know there is one, that should
be fixed (and not only in HEAD/9.5).

I don't think there's any integer overflow. The problem is that we end
up with nbuckets so high that (nbuckets*8) overflows 1GB-1.

There's a protection against integer overflow in place (it was there for
a long time), but there never was any protection against the 1GB limit.
Because we've been using much smaller work_mem values and
NTUP_PER_BUCKET=10, so we could not actually reach it.

But I believe Kaigai-san withdrew his initial proposed patch, and we
don't have a replacement as far as I saw.

I believe the patch proposed by KaiGai-san is the right one to fix the
bug discussed in this thread. My understanding is KaiGai-san withdrew
the patch as he wants to extend it to address the over-estimation issue.

I don't think we should do that - IMHO that's an unrelated improvement
and should be addressed in a separate patch.

OK, it might not be a problem we should try to settle within a few days,
just before the beta release.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#25Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Kouhei Kaigai (#24)
Re: DBT-3 with SF=20 got failed

Hello KaiGai-san,

On 08/21/2015 02:28 AM, Kouhei Kaigai wrote:
...

But what is the impact on queries that actually need more than 1GB
of buckets? I assume we'd only limit the initial allocation and
still allow the resize based on the actual data (i.e. the 9.5
improvement), so the queries would start with 1GB and then resize
once finding out the optimal size (as done in 9.5). The resize is
not very expensive, but it's not free either, and with so many
tuples (requiring more than 1GB of buckets, i.e. ~130M tuples) it's
probably just noise in the total query runtime. But it'd be nice
to see some proofs of that ...

The problem here is that we cannot know the exact size unless the
Hash node reads the entire inner relation. All we can do is rely on
the planner's estimation; however, it often computes a crazy number
of rows. I think resizing of hash buckets is a reasonable compromise.

I understand the estimation problem. The question I think we need to
answer is how to balance the behavior for well- and poorly-estimated
cases. It'd be unfortunate if we lower the memory consumption in the
over-estimated case while significantly slowing down the well-estimated
ones.

I don't think we have a clear answer at this point - maybe it's not a
problem at all and it'll be a win no matter what threshold we choose.
But it's a separate problem from the bugfix.

I believe the patch proposed by KaiGai-san is the right one to fix
the bug discussed in this thread. My understanding is KaiGai-san
withdrew the patch as he wants to extend it to address the
over-estimation issue.

I don't think we should do that - IMHO that's an unrelated
improvement and should be addressed in a separate patch.

OK, it might not be a problem we should try to settle within a few days,
just before the beta release.

I don't quite see a reason to wait for the over-estimation patch. We
probably should backpatch the bugfix anyway (although it's much less
likely to run into that before 9.5), and we can't really backpatch the
behavior change there (as there's no hash resize).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#26Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Tomas Vondra (#25)
Re: DBT-3 with SF=20 got failed

Hello KaiGai-san,

On 08/21/2015 02:28 AM, Kouhei Kaigai wrote:
...

But what is the impact on queries that actually need more than 1GB
of buckets? I assume we'd only limit the initial allocation and
still allow the resize based on the actual data (i.e. the 9.5
improvement), so the queries would start with 1GB and then resize
once finding out the optimal size (as done in 9.5). The resize is
not very expensive, but it's not free either, and with so many
tuples (requiring more than 1GB of buckets, i.e. ~130M tuples) it's
probably just noise in the total query runtime. But it'd be nice
to see some proofs of that ...

The problem here is that we cannot know the exact size unless the
Hash node reads the entire inner relation. All we can do is rely on
the planner's estimation; however, it often computes a crazy number
of rows. I think resizing of hash buckets is a reasonable compromise.

I understand the estimation problem. The question I think we need to
answer is how to balance the behavior for well- and poorly-estimated
cases. It'd be unfortunate if we lower the memory consumption in the
over-estimated case while significantly slowing down the well-estimated
ones.

I don't think we have a clear answer at this point - maybe it's not a
problem at all and it'll be a win no matter what threshold we choose.
But it's a separate problem from the bugfix.

I agree that this is a separate (and maybe not easy) problem.

If somebody knows of previous research in the academic area, please share it with us.

I believe the patch proposed by KaiGai-san is the right one to fix
the bug discussed in this thread. My understanding is KaiGai-san
withdrew the patch as he wants to extend it to address the
over-estimation issue.

I don't think we should do that - IMHO that's an unrelated
improvement and should be addressed in a separate patch.

OK, it might not be a problem we should try to settle within a few days,
just before the beta release.

I don't quite see a reason to wait for the over-estimation patch. We
probably should backpatch the bugfix anyway (although it's much less
likely to run into that before 9.5), and we can't really backpatch the
behavior change there (as there's no hash resize).

I won't argue about this bugfix any further.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#27Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Kohei KaiGai (#8)
1 attachment(s)
Re: DBT-3 with SF=20 got failed

Hello KaiGai-san,

I've discovered a bug in the proposed patch - when resetting the hash
table after the first batch, it does this:

memset(hashtable->buckets, 0, sizeof(nbuckets * sizeof(HashJoinTuple)));

The first 'sizeof' is bogus, so this only zeroes the first 8 bytes of
the array (usually resulting in crashes due to invalid pointers).

I fixed it to

memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));

Fixed patch attached (marked as v2).
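
As an aside, the reason the buggy form zeroes only 8 bytes is that the
outer sizeof() is applied to the whole expression, whose type is size_t,
so it evaluates to sizeof(size_t) regardless of nbuckets. A tiny
standalone check (illustration only, not part of the patch):

#include <stdio.h>

int
main(void)
{
	long	nbuckets = 1024 * 1024;

	/* sizeof() of the expression is sizeof(size_t), i.e. 8 on 64-bit builds */
	printf("%zu\n", sizeof(nbuckets * sizeof(void *)));

	/* the length that was actually intended for the memset */
	printf("%zu\n", nbuckets * sizeof(void *));	/* prints 8388608 */

	return 0;
}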

kind regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

hashslot-allocation-by-huge-alloc-v2.patch (text/x-diff)
 src/backend/executor/nodeHash.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 906cb46..63d484c 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -388,7 +388,9 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	MemoryContextSwitchTo(hashtable->batchCxt);
 
 	hashtable->buckets = (HashJoinTuple *)
-		palloc0(nbuckets * sizeof(HashJoinTuple));
+		MemoryContextAllocHuge(hashtable->batchCxt,
+							   nbuckets * sizeof(HashJoinTuple));
+	memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));
 
 	/*
 	 * Set up for skew optimization, if possible and there's a need for more
@@ -654,7 +656,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 		hashtable->nbuckets = hashtable->nbuckets_optimal;
 		hashtable->log2_nbuckets = hashtable->log2_nbuckets_optimal;
 
-		hashtable->buckets = repalloc(hashtable->buckets,
+		hashtable->buckets = repalloc_huge(hashtable->buckets,
 								sizeof(HashJoinTuple) * hashtable->nbuckets);
 	}
 
@@ -779,7 +781,7 @@ ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
 	 * chunks)
 	 */
 	hashtable->buckets =
-		(HashJoinTuple *) repalloc(hashtable->buckets,
+		(HashJoinTuple *) repalloc_huge(hashtable->buckets,
 								hashtable->nbuckets * sizeof(HashJoinTuple));
 
 	memset(hashtable->buckets, 0, sizeof(void *) * hashtable->nbuckets);
@@ -1217,7 +1219,9 @@ ExecHashTableReset(HashJoinTable hashtable)
 
 	/* Reallocate and reinitialize the hash bucket headers. */
 	hashtable->buckets = (HashJoinTuple *)
-		palloc0(nbuckets * sizeof(HashJoinTuple));
+		MemoryContextAllocHuge(hashtable->batchCxt,
+							   nbuckets * sizeof(HashJoinTuple));
+	memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));
 
 	hashtable->spaceUsed = 0;
 
#28Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Tomas Vondra (#27)
Re: DBT-3 with SF=20 got failed

Hello KaiGai-san,

I've discovered a bug in the proposed patch - when resetting the hash
table after the first batch, it does this:

memset(hashtable->buckets, 0, sizeof(nbuckets * sizeof(HashJoinTuple)));

The first 'sizeof' is bogus, so this only zeroes the first 8 bytes of
the array (usually resulting in crashes due to invalid pointers).

I fixed it to

memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));

Fixed patch attached (marked as v2).

Thanks, it was my bug - an oversight.

I want a committer to push this fix.
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#29Robert Haas
robertmhaas@gmail.com
In reply to: Kouhei Kaigai (#28)
Re: DBT-3 with SF=20 got failed

On Tue, Sep 8, 2015 at 8:28 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

Hello KaiGai-san,

I've discovered a bug in the proposed patch - when resetting the hash
table after the first batch, it does this:

memset(hashtable->buckets, 0, sizeof(nbuckets * sizeof(HashJoinTuple)));

The first 'sizeof' is bogus, so this only zeroes the first 8 bytes of
the array (usually resulting in crashes due to invalid pointers).

I fixed it to

memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));

Fixed patch attached (marked as v2).

Thanks, it was my bug - an oversight.

I want a committer to push this fix.

I'm not in agreement with this fix, and would prefer to instead
proceed by limiting the initial allocation to 1GB. Otherwise, as
KaiGai has mentioned, we might end up trying to allocate completely
unreasonable amounts of memory if the planner gives a bad estimate.
Of course, it's true (as Tomas points out) that this issue already
exists today to some degree, and it's also true (as he also points
out) that 1GB is an arbitrary limit. But it's also true that we use
that same arbitrary 1GB limit in a lot of places, so it's hardly
without precedent.

More importantly, removing the cap on the allocation size makes the
problem a lot worse. You might be sad if a bad planner estimate
causes the planner to allocate 1GB when 64MB would have been enough,
but on modern systems it is not likely to be an enormous problem. If
a similar mis-estimation causes the planner to allocate 16GB rather
than 1GB, the opportunity for you to be sad is magnified pretty
considerably. Therefore, I don't really see the over-estimation bug
fix as being separate from this one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#30Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#29)
Re: DBT-3 with SF=20 got failed

On 09/08/2015 08:44 PM, Robert Haas wrote:

On Tue, Sep 8, 2015 at 8:28 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

Hello KaiGai-san,

I've discovered a bug in the proposed patch - when resetting the hash
table after the first batch, it does this:

memset(hashtable->buckets, 0, sizeof(nbuckets * sizeof(HashJoinTuple)));

The first 'sizeof' is bogus, so this only zeroes the first 8 bytes of
the array (usually resulting in crashes due to invalid pointers).

I fixed it to

memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));

Fixed patch attached (marked as v2).

Thanks, it was my bug - an oversight.

I want a committer to push this fix.

I'm not in agreement with this fix, and would prefer to instead
proceed by limiting the initial allocation to 1GB. Otherwise, as
KaiGai has mentioned, we might end up trying to allocate completely
unreasonable amounts of memory if the planner gives a bad estimate.
Of course, it's true (as Tomas points out) that this issue already
exists today to some degree, and it's also true (as he also points
out) that 1GB is an arbitrary limit. But it's also true that we use
that same arbitrary 1GB limit in a lot of places, so it's hardly
without precedent.

AFAIK the limit is not 1GB, but 512MB (because we use 2^N and the actual
limit is 1GB-1B). But that's a minor detail.

Also, I'm not sure what other places you have in mind (could you list
some examples?), but I'd bet we limit the allocation to 1GB because of
the palloc() limit and not because of fear of over-estimates.

More importantly, removing the cap on the allocation size makes the
problem a lot worse. You might be sad if a bad planner estimate
causes the planner to allocate 1GB when 64MB would have been enough,
but on modern systems it is not likely to be an enormous problem. If
a similar mis-estimation causes the planner to allocate 16GB rather
than 1GB, the opportunity for you to be sad is magnified pretty
considerably. Therefore, I don't really see the over-estimation bug
fix as being separate from this one.

Perhaps. But if you want to absolutely prevent such sadness then maybe
you should not set work_mem that high?

Anyway, I do see this as a rather orthogonal problem - an independent
improvement, mostly unrelated to the bugfix. Even if we decide to
redesign it like this (and I'm not particularly opposed to that,
assuming someone takes the time to measure how expensive the additional
resize actually is), we'll still have to fix the repalloc().

So I still fail to see why we shouldn't apply this fix.

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#31Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#30)
Re: DBT-3 with SF=20 got failed

On Tue, Sep 8, 2015 at 5:02 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Also, I'm not sure what other places you have in mind (could you list
some examples?), but I'd bet we limit the allocation to 1GB because of the
palloc() limit and not because of fear of over-estimates.

I don't really think those two things are different from each other.
The palloc() limit is a means of enforcing a general policy of
limiting all allocations to 1GB except in places where we've made a
very conscious decision to allow a specific exception. This limit
happens to dovetail nicely with the varlena size limit, so in many
cases it is the exactly correct limit just for that reason. But even
when, as here, that's not at issue, it's still a useful limit, because
there are many ways that some garbage value can get passed to palloc
-- bad planner estimates, corrupted tuples, bugs in other parts of our
code. And at least on my old MacBook Pro (I haven't tested the
current one), passing a sufficiently-large value to malloc() causes a
kernel panic. That's probably a particularly bad bug, but there are
lots of systems where "accidentally" allocating an unreasonable amount
of space will have all kinds of unpleasant consequences. So, I
believe that palloc()'s limit improves the overall stability of the
system considerably even if it causes some occasional annoyance.

Most of the time, you can just palloc() and not worry too much about
whether you're going to blow up the machine: you won't, because you
aren't going to allocate more than 1GB. Any place that wants to
allocate more than that needs to be someplace where we can be pretty
sure that we're not going to accidentally allocate some completely
unreasonable amount of memory, like say 1TB. Nothing in this
discussion convinces me that this is such a place. Note that
tuplesort.c and tuplestore.c, the only existing callers of
repalloc_huge, only allocate such large amounts of memory when they
actually have enough tuples to justify it - it is always based on the
actual number of tuples, never an estimate. I think that would be a
sound principle here, too. Resizing the hash table to such a large
size based on the actual load factor is very reasonable; starting with
such a large size seems less so. Admittedly, 512MB is an arbitrary
point: and if it so happened that the limit was 256MB or 1GB or 128MB
or even 2GB I wouldn't advocate for changing it just for fun. But
you're saying we should just remove that limit altogether, and I think
that's clearly unreasonable. Do you really want to start out with a
TB or even PB-sized hash table when the actual number of tuples is,
say, one? That may sound crazy, but I've seen enough bad query plans
to know that, yes, we are sometimes off by nine orders of magnitude.
This is not a hypothetical problem.

More importantly, removing the cap on the allocation size makes the
problem a lot worse. You might be sad if a bad planner estimate
causes the planner to allocate 1GB when 64MB would have been enough,
but on modern systems it is not likely to be an enormous problem. If
a similar mis-estimation causes the planner to allocate 16GB rather
than 1GB, the opportunity for you to be sad is magnified pretty
considerably. Therefore, I don't really see the over-estimation bug
fix as being separate from this one.

Perhaps. But if you want to absolutely prevent such sadness then maybe you
should not set work_mem that high?

I think that's a red herring for a number of reasons. One, the
allocation for the hash buckets is only a small portion of the total
memory. Two, the fact that you are OK with the hash table growing to
a certain size does not mean that you want it to start out that big on
the strength of a frequently-flawed planner estimate.

Anyway, I do see this as a rather orthogonal problem - an independent
improvement, mostly unrelated to the bugfix. Even if we decide to redesign
it like this (and I'm not particularly opposed to that, assuming someone
takes the time to measure how expensive the additional resize actually is),
we'll still have to fix the repalloc().

So I still fail to see why we shouldn't apply this fix.

In all seriousness, that is fine. I respect your opinion; I'm just
telling you mine, which happens to be different.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#32Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#31)
Re: DBT-3 with SF=20 got failed

On 09/09/2015 03:55 PM, Robert Haas wrote:

On Tue, Sep 8, 2015 at 5:02 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Also, I'm not sure what other places you have in mind (could you list
some examples?), but I'd bet we limit the allocation to 1GB because of
palloc() limit and not because of fear of over-estimates.

I don't really think those two things are different from each other.
The palloc() limit is a means of enforcing a general policy of
limiting all allocations to 1GB except in places where we've made a
very conscious decision to allow a specific exception. This limit
happens to dovetail nicely with the varlena size limit, so in many
cases it is the exactly correct limit just for that reason. But even
when, as here, that's not at issue, it's still a useful limit, because
there are many ways that some garbage value can get passed to palloc
-- bad planner estimates, corrupted tuples, bugs in other parts of our
code. And at least on my old MacBook Pro (I haven't tested the
current one), passing a sufficiently-large value to malloc() causes a
kernel panic. That's probably a particularly bad bug, but there are
lots of systems where "accidentally" allocating an unreasonable amount
of space will have all kinds of unpleasant consequences. So, I
believe that palloc()'s limit improves the overall stability of the
system considerably even if it causes some occasional annoyance.

I'm not really buying this. The 1GB limit has nothing to do with platform
limits; it's there exactly to make allocations varlena-like (which has
exactly the same limit), and because it allows using a 32-bit int to track
all the bytes. Neither of these is relevant here.

It has nothing to do with malloc() limits on various platforms, and if
there really are such limits that we think we should worry about, we
should probably address those properly, not by band-aiding all the
various places independently.

And most importantly, these platform limits would apply to both the
initial allocation and to the subsequent resize. It's utterly useless to
just "fix" the initial allocation and then allow failure when we try to
resize the hash table.

Most of the time, you can just palloc() and not worry too much about
whether you're going to blow up the machine: you won't, because you
aren't going to allocate more than 1GB. Any place that wants to
allocate more than that needs to be someplace where we can be pretty
sure that we're not going to accidentally allocate some completely
unreasonable amount of memory, like say 1TB. Nothing in this
discussion convinces me that this is such a place. Note that

We're not going to allocate a completely unreasonable amount of memory,
because there already are some checks in place.

Firstly, you can't really get buckets larger than ~25% of work_mem,
because each bucket pointer is only 8B, while the HJ tuple has a 16B
header plus the data (IIRC). For wider tuples the share taken by the
buckets further decreases.

Secondly, we limit the number of buckets to INT_MAX, so about 16GB
(because buckets are just pointers). No matter how awful an estimate you
get (or how insanely high you set work_mem) you can't exceed this.
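
Just to make those two ceilings concrete, here is a quick standalone
back-of-the-envelope check (a sketch with assumed sizes: an 8B bucket
pointer, and a 16B tuple header plus 8B of data as in the example above):

#include <stdio.h>
#include <limits.h>

int
main(void)
{
	double	bucket_sz = 8;		/* one bucket = one pointer */
	double	tuple_sz = 16 + 8;	/* header + 8B of data */

	/* share of work_mem the buckets can take with NTUP_PER_BUCKET = 1 */
	printf("bucket share: %.0f%%\n",
		   100.0 * bucket_sz / (bucket_sz + tuple_sz));

	/* hard ceiling: nbuckets is an int, so at most INT_MAX pointers */
	printf("max bucket array: %.0f GB\n",
		   (double) INT_MAX * 8 / (1024.0 * 1024 * 1024));

	return 0;
}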

tuplesort.c and tuplestore.c, the only existing callers of
repalloc_huge, only allocate such large amounts of memory when they
actually have enough tuples to justify it - it is always based on the
actual number of tuples, never an estimate. I think that would be a
sound principle here, too. Resizing the hash table to such a large
size based on the actual load factor is very reasonable; starting with
such a large size seems less so. Admittedly, 512MB is an arbitrary
point: and if it so happened that the limit was 256MB or 1GB or 128MB
or even 2GB I wouldn't advocate for changing it just for fun. But
you're saying we should just remove that limit altogether, and I think
that's clearly unreasonable. Do you really want to start out with a
TB or even PB-sized hash table when the actual number of tuples is,
say, one? That may sound crazy, but I've seen enough bad query plans
to know that, yes, we are sometimes off by nine orders of magnitude.
This is not a hypothetical problem.

No, I'm not saying anything like that - I actually explicitly stated
that I'm not against such a change (further restricting the initial hash
table size), if someone takes the time to do a bit of testing and
provide some numbers.

Moreover as I explained there already are limits in place (25% of
work_mem or 16GB, whichever is lower), so I don't really see the bugfix
as unreasonable.

Maybe if we decide to lift this restriction (using int64 to address the
buckets, which removes the 16GB limit) this issue will get much more
pressing. But I guess hash tables handling 2B buckets will be enough for
the near future.

More importantly, removing the cap on the allocation size makes the
problem a lot worse. You might be sad if a bad planner estimate
causes the planner to allocate 1GB when 64MB would have been enough,
but on modern systems it is not likely to be an enormous problem. If
a similar mis-estimation causes the planner to allocate 16GB rather
than 1GB, the opportunity for you to be sad is magnified pretty
considerably. Therefore, I don't really see the over-estimation bug
fix as being separate from this one.

Perhaps. But if you want to absolutely prevent such sadness then maybe you
should not set work_mem that high?

I think that's a red herring for a number of reasons. One, the
allocation for the hash buckets is only a small portion of the total
memory. Two, the fact that you are OK with the hash table growing to
a certain size does not mean that you want it to start out that big on
the strength of a frequently-flawed planner estimate.

True, although I don't think the herring is entirely red. It might just
as well be a mackerel ;-)

Anyway, I do see this as a rather orthogonal problem - an independent
improvement, mostly unrelated to the bugfix. Even if we decide to redesign
it like this (and I'm not particularly opposed to that, assuming someone
takes the time to measure how expensive the additional resize actually is),
we'll still have to fix the repalloc().

So I still fail to see why we shouldn't apply this fix.

In all seriousness, that is fine. I respect your opinion; I'm just
telling you mine, which happens to be different.

Likewise.

Let me repeat my previous proposal:

1) Let's apply the proposed bugfix (and also backpatch it), because
the current code *is* broken.

2) Do a bunch of experiments with limiting the initial hash size,
decide whether the impact on well-estimated cases is acceptable.

I'm strongly opposed to just limiting the initial size without actually
measuring how expensive the resize is, as that simply adds cost to the
well-estimated cases (which may easily be the vast majority of all the
plans, as we tend to notice just the poorly estimated ones).

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#33Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#32)
Re: DBT-3 with SF=20 got failed

On Wed, Sep 9, 2015 at 11:54 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Secondly, we limit the number of buckets to INT_MAX, so about 16GB (because
buckets are just pointers). No matter how awful an estimate you get (or how
insanely high you set work_mem) you can't exceed this.

OK, so this is an interesting point, and I think it clarifies things.
Essentially, we're arguing about whether a 16GB limit is as good as a
512MB limit. Right now, if we would have allocated more than 512MB,
we instead fail. There are two possible solutions:

1. I'm arguing for maintaining the 512MB limit, but by clamping the
allocation to 512MB (and the number of buckets accordingly) so that it
works with fewer buckets instead of failing.

2. You're arguing for removing the 512MB limit, allowing an initial
allocation of up to 16GB.
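
A minimal sketch of what the clamping in option 1 could look like
(illustration only - the function name clamp_initial_nbuckets is made up,
and this is not the actual ExecChooseHashTableSize code):

#include <stddef.h>

#define MaxAllocSize	((size_t) 0x3fffffff)	/* 1GB - 1, as in palloc */

typedef struct HashJoinTupleData *HashJoinTuple;	/* bucket = one pointer */

int
clamp_initial_nbuckets(double dbuckets)
{
	/* never let the initial bucket array exceed MaxAllocSize */
	size_t	max_pointers = MaxAllocSize / sizeof(HashJoinTuple);
	int		nbuckets;

	if (dbuckets > (double) max_pointers)
		dbuckets = (double) max_pointers;

	/* round down to a power of two, with a floor of 1024 buckets */
	nbuckets = 1024;
	while ((double) nbuckets * 2 <= dbuckets)
		nbuckets *= 2;

	return nbuckets;
}

With a wildly overestimated row count this caps the initial array at 2^26
buckets (512MB of pointers), which is where the "64 million buckets" figure
mentioned just below comes from; the 9.5 resize can still grow the table
later based on the rows actually loaded.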

My judgement is that #2 could give some people a nasty surprise, in
that such a large initial allocation might cause problems, especially
if driven by a bad estimate. Your judgement is that this is unlikely
to be a problem, and that the performance consequences of limiting a
hash join to an initial allocation of 64 million buckets rather than 2
billion buckets are the thing to worry about.

I guess we'll need to wait for some other opinions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#34Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#33)
Re: DBT-3 with SF=20 got failed

On 09/11/2015 06:55 PM, Robert Haas wrote:

On Wed, Sep 9, 2015 at 11:54 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Secondly, we limit the number of buckets to INT_MAX, so about 16GB (because
buckets are just pointers). No matter how awful an estimate you get (or how
insanely high you set work_mem) you can't exceed this.

OK, so this is an interesting point, and I think it clarifies things.
Essentially, we're arguing about whether a 16GB limit is as good as a
512MB limit. Right now, if we would have allocated more than 512MB,
we instead fail. There are two possible solutions:

1. I'm arguing for maintaining the 512MB limit, but by clamping the
allocation to 512MB (and the number of buckets accordingly) so that it
works with fewer buckets instead of failing.

2. You're arguing for removing the 512MB limit, allowing an initial
allocation of up to 16GB.

I'm arguing for fixing the existing bug, and then addressing the case of
over-estimation separately, with proper analysis.

My judgement is that #2 could give some people a nasty surprise, in
that such a large initial allocation might cause problems, especially
if driven by a bad estimate. Your judgement is that this is unlikely
to be a problem, and that the performance consequences of limiting a
hash join to an initial allocation of 64 million buckets rather than 2
billion buckets are the thing to worry about.

Not quite, my judgment is that

- We shouldn't address this in this particular bugfix, because it's a
separate problem (even if we limit the initial allocation, we still
have to fix the repalloc after we build the Hash).

- I assume the "might cause problems" refers to malloc() issues on some
platforms. In that case we still have to apply it to both places, not
just to the initial allocation. I don't know if this is a problem (I
haven't heard any such reports until now), but if it is we better
address this consistently everywhere, not just this one place.

- I'm not really sure about the impact of the additional resize. I
surely don't want to significantly penalize the well-estimated cases,
so I'd like to see some numbers first.

I guess we'll need to wait for some other opinions.

OK

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#35Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#34)
Re: DBT-3 with SF=20 got failed

On Fri, Sep 11, 2015 at 1:12 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I'm arguing for fixing the existing bug, and then addressing the case of
over-estimation separately, with proper analysis.

Well, this is part of how we're looking at it differently. I think the
bug is "we're passing a value to palloc that is too large, so
sometimes it fails" and the way to fix that is to properly limit the
value. You are clearly defining the bug a bit differently.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#36Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#35)
Re: DBT-3 with SF=20 got failed

On 09/11/2015 07:16 PM, Robert Haas wrote:

On Fri, Sep 11, 2015 at 1:12 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I'm arguing for fixing the existing bug, and then addressing the case of
over-estimation separately, with proper analysis.

Well, this is part of how we're looking at it differently. I think the
bug is "we're passing a value to palloc that is too large, so
sometimes it fails" and the way to fix that is to properly limit the
value. You are clearly defining the bug a bit differently.

Yes, I see it differently.

I don't quite understand why limiting the value is more "proper" than
using a function that can handle the actual value.

The proposed bugfix addresses the issue in the most straightforward way,
without introducing additional considerations about possible
over-estimations (which the current code completely ignores, so this is
a new thing). I think bugfixes should not introduce such changes to
behavior (albeit internal), especially not without any numbers.

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#36)
Re: DBT-3 with SF=20 got failed

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On 09/11/2015 07:16 PM, Robert Haas wrote:

On Fri, Sep 11, 2015 at 1:12 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I'm arguing for fixing the existing bug, and then addressing the case of
over-estimation separately, with proper analysis.

Well, this is part of how we're looking at it differently. I think the
bug is "we're passing a value to palloc that is too large, so
sometimes it fails" and the way to fix that is to properly limit the
value. You are clearly defining the bug a bit differently.

Yes, I see it differently.

I don't quite understand why limiting the value is more "proper" than
using a function that can handle the actual value.

The proposed bugfix addresses the issue in the most straightforward way,
without introducing additional considerations about possible
over-estimations (which the current code completely ignores, so this is
a new thing). I think bugfixes should not introduce such changes to
behavior (albeit internal), especially not without any numbers.

This thread seems to have stalled out...

After re-reading it, I'm going to agree with Robert that we should clamp
the initial pointer-array size to work with palloc (ie, 512MB worth of
pointers, which please recall is going to represent at least 10X that much
in hashtable entries, probably more). The argument that allowing it to be
larger would be a performance win seems entirely unbased on any evidence,
while the risks of allowing arbitrarily large allocations based on planner
estimates seem pretty obvious to me. And the argument that such
overestimates are a bug that we can easily fix is based on even less
evidence; in fact, I would dismiss that out of hand as hubris.

Now there is a separate issue about whether we should allow hashtable
resizes to exceed that limit. There I would vote yes, as long as the
resize is based on arrival of actual data and not estimates (following
Robert's point that the existing uses of repalloc_huge are driven by
actual need).

So I don't like any of the proposed patches exactly as-is.

BTW, just looking at the code in question, it seems to be desperately
in need of editorial review. A couple of examples:

* Some of the code goes to the trouble of maintaining a
log2_nbuckets_optimal value, but ExecHashIncreaseNumBuckets doesn't
know about that and recomputes the value.

* ExecHashIncreaseNumBuckets starts out with a run-time test on something
that its sole caller has just Assert()'d to not be the case, and which
really ought to be impossible with or without that Assert.

* ExecHashTableInsert believes it can only increase nbuckets_optimal
if we're still in the first batch, but MultiExecHash() will call
ExecHashIncreaseNumBuckets at the end of input-acquisition whether we've
created more than one batch or not. Meanwhile, ExecHashIncreaseNumBatches
thinks it can change the number of buckets in any case. Maybe that's all
okay, but at the very least those tests ought to look like they'd heard of
each other. And of those three places, having the safety check where it
is seems like the least reasonable place. Tracking an "optimal" number
of buckets seems like an activity that should not be constrained by
whether we have any hope of being able to apply the number.

So I'm not having a lot of faith that there aren't other bugs in the
immediate area.

regards, tom lane


#38Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#37)
Re: DBT-3 with SF=20 got failed

Hi,

On 09/23/2015 11:25 PM, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On 09/11/2015 07:16 PM, Robert Haas wrote:

On Fri, Sep 11, 2015 at 1:12 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I'm arguing for fixing the existing bug, and then addressing the case of
over-estimation separately, with proper analysis.

Well, this is part of how we're looking at it differently. I think the
bug is "we're passing a value to palloc that is too large, so
sometimes it fails" and the way to fix that is to properly limit the
value. You are clearly defining the bug a bit differently.

Yes, I see it differently.

I don't quite understand why limiting the value is more "proper" than
using a function that can handle the actual value.

The proposed bugfix addresses the issue in the most straightforward way,
without introducing additional considerations about possible
over-estimations (which the current code completely ignores, so this is
a new thing). I think bugfixes should not introduce such changes to
behavior (albeit internal), especially not without any numbers.

This thread seems to have stalled out...

After re-reading it, I'm going to agree with Robert that we should
clamp the initial pointer-array size to work with palloc (ie, 512MB
worth of pointers, which please recall is going to represent at
least 10X that much in hashtable entries, probably more).

10x that much in entries? Do you mean NTUP_PER_BUCKET? Because that was
reduced to 1 last year as part of the hashjoin improvements. So we do
have more buckets than tuples (to keep load factor below 1.0). It's
still true the entries will need more space than buckets (because of
headers and such), but it may easily get well below 10x.

In the example reported by KaiGai-san, the entries are 8B wide (~24B
with header), while buckets are 8B. That's 1:3 ratio. It is a bit
extreme example because in other cases the tuples will be wider.

It also seems to me that the higher the ratio, the lower the need to
actually impose such a limit because it increases the natural pressure
keeping buckets down (because both buckets and entries need to share
work_mem of limited size).

The argument that allowing it to be larger would be a performance win
seems entirely unbased on any evidence, while the risks of allowing
arbitrarily large allocations based on planner estimates seem pretty
obvious to me.

Do we agree that resizing the hash table is not free? Because my
argument was that we're forcing the well-estimated cases to do
additional resize, so maybe we should measure the impact first.

Now, maybe it does not really matter in this case - we probably get
slightly inaccurate estimates all the time. Not terribly wrong, but
enough to make the initial number of buckets too low, so we may actually
end up doing the resize anyway. Also, if we're dealing with hash tables of
this size, we're probably dealing with much larger outer relation and
the additional resize is going to be just noise ...

I however quite dislike the dismissal of the possible impact. It should
be the responsibility of the person introducing the change to show that
no such impact actually exists, not just waving it off as "unbased on
any evidence" when there's no evidence presented.

Had I been adding such a limit, I'd have done at least some testing and
presented the results here. Perhaps I'm a bit over-protective of this piece of
code as I've spent quite a bit of time getting it faster, but I believe
the principle that the person proposing a change should demonstrate the
(lack of) performance impact is sound.

And the argument that such overestimates are a bug that we can easily
fix is based on even less evidence; in fact, I would dismiss that out
of hand as hubris.

I don't think anyone was suggesting the overestimates are easy to fix
(or even possible to fix in general). If I ever claimed that, I was
clearly wrong.

Now there is a separate issue about whether we should allow hashtable
resizes to exceed that limit. There I would vote yes, as long as the
resize is based on arrival of actual data and not estimates (following
Robert's point that the existing uses of repalloc_huge are driven by
actual need).

So I don't like any of the proposed patches exactly as-is.

BTW, just looking at the code in question, it seems to be desperately
in need of editorial review. A couple of examples:

* Some of the code goes to the trouble of maintaining a
log2_nbuckets_optimal value, but ExecHashIncreaseNumBuckets doesn't
know about that and recomputes the value.

Yeah, that's a stupid bug.

* ExecHashIncreaseNumBuckets starts out with a run-time test on
something that its sole caller has just Assert()'d to not be the
case, and which really ought to be impossible with or without that
Assert.

Yeah, right. Excessively defensive coding on my side (I think I added
the Assert later, or something).

* ExecHashTableInsert believes it can only increase nbuckets_optimal
if we're still in the first batch, but MultiExecHash() will call
ExecHashIncreaseNumBuckets at the end of input-acquisition whether
we've created more than one batch or not. Meanwhile,
ExecHashIncreaseNumBatches thinks it can change the number of buckets
in any case. Maybe that's all okay, but at the very least those
tests ought to look like they'd heard of each other. And of those
three places, having the safety check where it is seems like the
least reasonable place.

I don't quite understand where you see the problem. I believe the logic
is sound, although it could use a comment explaining it - let me explain.

The only place where we actually mess with the number of buckets is
ExecHashTableInsert() around line 880. All the other places are resizes
of the hash table, rebuilding it to the optimal number of buckets.

The rebuild can happen in two situations - either at the end of the
input (which is what happens in MultiExecHash), or when we start
batching (ExecHashIncreaseNumBatches).

Once we're batching, there's no point in further messing with buckets
because we assume using more buckets would overflow work_mem.

The fact that we're doing the resize within ExecHashIncreaseNumBatches
is merely because of efficiency - we are going to walk the hash table
anyway (because we need to get rid of ~1/2 the tuples), so we simply
rebuild the buckets too.
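
To illustrate the overall pattern described above (track the optimal
bucket count while loading, rebuild the bucket array once at the end),
here is a simplified standalone sketch - it is not the actual nodeHash.c
code and it ignores batching entirely:

#include <stdio.h>

int
main(void)
{
	long	ntuples = 0;
	long	nbuckets = 1024;			/* initial, possibly undersized */
	long	nbuckets_optimal = 1024;	/* only tracked during the load */

	/* pretend we load ~130M tuples, as in the query from this thread */
	for (long i = 0; i < 130000000L; i++)
	{
		ntuples++;
		/* keep the load factor <= 1 by doubling the tracked count only */
		while (ntuples > nbuckets_optimal)
			nbuckets_optimal *= 2;
	}

	/* a single rebuild at the end of loading, as MultiExecHash does in 9.5 */
	if (nbuckets_optimal != nbuckets)
		nbuckets = nbuckets_optimal;

	printf("final nbuckets = %ld (%.1f GB of bucket pointers)\n",
		   nbuckets, nbuckets * 8.0 / (1024 * 1024 * 1024));
	return 0;
}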

Tracking an "optimal" number of buckets seems like an activity that
should not be constrained by whether we have any hope of being able
to apply the number.

Not sure I understand?

So I'm not having a lot of faith that there aren't other bugs in the
immediate area.

Well, I surely am not infallible, but so far I haven't seen any proof of
an actual bug, except for needlessly recomputing a value (which can only
happen once), and an excessive check on an already asserted value.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#39Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#38)
Re: DBT-3 with SF=20 got failed

On Thu, Sep 24, 2015 at 5:50 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I however quite dislike the dismissal of the possible impact. It should be
the responsibility of the person introducing the change to show that no such
impact actually exists, not just waving it off as "unbased on any evidence"
when there's no evidence presented.

So, we're talking about determining the behavior in a case that
currently fails. Making it behave like a case that currently works
can't but be an improvement. Making it do something that currently
never happens might be better still, or it might be equivalent, or it
might be worse. I just don't buy the argument that somebody's got to
justify on performance grounds a decision not to allocate more memory
than we currently ever allocate. That seems 100% backwards to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#40Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#39)
Re: DBT-3 with SF=20 got failed

On 09/24/2015 01:51 PM, Robert Haas wrote:

On Thu, Sep 24, 2015 at 5:50 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I however quite dislike the dismissal of the possible impact. It should be
the responsibility of the person introducing the change to show that no such
impact actually exists, not just waving it off as "unbased on any evidence"
when there's no evidence presented.

So, we're talking about determining the behavior in a case that
currently fails. Making it behave like a case that currently works
can't but be an improvement. Making it do something that currently
never happens might be better still, or it might be equivalent, or
it might be worse. I just don't buy the argument that somebody's got
to justify on performance grounds a decision not to allocate more
memory than we currently ever allocate. That seems 100% backwards to
me.

Yes, it's true that if you hit the issue it fails, so I understand your
view that it's a win to fix this by introducing the (arbitrary) limit. I
disagree with this view because the behavior changes abruptly at the limit
- if you get a good estimate just below the limit you get no resize, while
a slightly higher estimate triggers a resize.

So while it does not introduce behavior change in this particular case
(because it fails, as you point out), it introduces a behavior change in
general - it simply triggers behavior that does not happen below the
limit. Would we accept the change if the proposed limit was 256MB, for
example?

It also seems to me that we don't really need the hash table until after
MultiExecHash(), so maybe building the hash table incrementally is just
unnecessary and we could simply track the optimal number of buckets and
build the buckets at the end of MultiExecHash (essentially at the place
where we do the resize now). We'd have to walk the tuples and insert
them into the buckets, but that seems more efficient than the
incremental build (no data to support that at this point).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#41Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#40)
Re: DBT-3 with SF=20 got failed

On Thu, Sep 24, 2015 at 9:49 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

So while it does not introduce behavior change in this particular case
(because it fails, as you point out), it introduces a behavior change in
general - it simply triggers behavior that does not happen below the limit.
Would we accept the change if the proposed limit was 256MB, for example?

So, I'm a huge fan of arbitrary limits.

That's probably the single thing I'll say this year that sounds most
like a troll, but it isn't. I really, honestly believe that.
Doubling things is very sensible when they are small, but at some
point it ceases to be sensible. The fact that we can't set a
black-and-white threshold as to when we've crossed over that line
doesn't mean that there is no line. We can't imagine that the
occasional 32GB allocation when 4GB would have been optimal is no more
problematic than the occasional 32MB allocation when 4MB would have
been optimal. Where exactly to put the divider is subjective, but
"what palloc will take" is not an obviously unreasonable barometer.

Of course, if we can postpone sizing the hash table until after the
input size is known, as you suggest, then that would be better still
(but not back-patch material).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#41)
Re: DBT-3 with SF=20 got failed

Robert Haas <robertmhaas@gmail.com> writes:

Of course, if we can postpone sizing the hash table until after the
input size is known, as you suggest, then that would be better still
(but not back-patch material).

AFAICS, it works that way today as long as the hash fits in memory
(ie, single-batch). We load into a possibly seriously undersized hash
table, but that won't matter for performance until we start probing it.
At the conclusion of loading, MultiExecHash will call
ExecHashIncreaseNumBuckets which will re-hash into a better-sized hash
table. I doubt this can be improved on much.

It would be good if we could adjust the numbuckets choice at the
conclusion of the input phase for the multi-batch case as well.
The code appears to believe that wouldn't work, but I'm not sure if
it's right about that, or how hard it'd be to fix if so.

regards, tom lane


#43Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#41)
Re: DBT-3 with SF=20 got failed

On 09/24/2015 05:09 PM, Robert Haas wrote:

On Thu, Sep 24, 2015 at 9:49 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

So while it does not introduce behavior change in this particular
case (because it fails, as you point out), it introduces a behavior
change in general - it simply triggers behavior that does not
happen below the limit. Would we accept the change if the proposed
limit was 256MB, for example?

So, I'm a huge fan of arbitrary limits.

That's probably the single thing I'll say this year that sounds most
like a troll, but it isn't. I really, honestly believe that.
Doubling things is very sensible when they are small, but at some
point it ceases to be sensible. The fact that we can't set a
black-and-white threshold as to when we've crossed over that line
doesn't mean that there is no line. We can't imagine that the
occasional 32GB allocation when 4GB would have been optimal is no
more problematic than the occasional 32MB allocation when 4MB would
have been optimal. Where exactly to put the divider is subjective,
but "what palloc will take" is not an obviously unreasonable
barometer.

There are two machines - one with 32GB of RAM and work_mem=2GB, the
other one with 256GB of RAM and work_mem=16GB. The machines are hosting
about the same data, just scaled accordingly (~8x more data on the large
machine).

Let's assume there's a significant over-estimate - we expect to get
about 10x the actual number of tuples, and the hash table is expected to
almost exactly fill work_mem. Using the 1:3 ratio (as in the query at
the beginning of this thread) we'll use ~512MB and ~4GB for the buckets,
and the rest is for entries.

Thanks to the 10x over-estimate, ~64MB and 512MB would be enough for the
buckets, so we're wasting ~448MB (13% of RAM) on the small machine and
~3.5GB (~1.3%) on the large machine.

How does it make any sense to address the 1.3% and not the 13%?

Of course, if we can postpone sizing the hash table until after the
input size is known, as you suggest, then that would be better still
(but not back-patch material).

This dynamic resize is 9.5-only anyway.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#44Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#42)
Re: DBT-3 with SF=20 got failed

On 09/24/2015 05:18 PM, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Of course, if we can postpone sizing the hash table until after the
input size is known, as you suggest, then that would be better still
(but not back-patch material).

AFAICS, it works that way today as long as the hash fits in memory
(ie, single-batch). We load into a possibly seriously undersized hash
table, but that won't matter for performance until we start probing it.
At the conclusion of loading, MultiExecHash will call
ExecHashIncreaseNumBuckets which will re-hash into a better-sized hash
table. I doubt this can be improved on much.

It would be good if we could adjust the numbuckets choice at the
conclusion of the input phase for the multi-batch case as well.
The code appears to believe that wouldn't work, but I'm not sure if
it's right about that, or how hard it'd be to fix if so.

So you suggest using a small hash table even when we expect batching?

That would be rather difficult to do because of the way we derive
buckets and batches from the hash value - they must not overlap. The
current code simply assumes that once we start batching the number of
bits needed for buckets does not change anymore.
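
For reference, this is roughly how the two numbers are carved out of the
hash value (a paraphrase of ExecHashGetBucketAndBatch, sketch only): the
low log2_nbuckets bits pick the bucket and the next bits pick the batch,
so growing nbuckets after batching has started would make the two bit
ranges overlap.

#include <stdint.h>

void
get_bucket_and_batch(uint32_t hashvalue,
					 int nbuckets, int log2_nbuckets, int nbatch,
					 int *bucketno, int *batchno)
{
	if (nbatch > 1)
	{
		/* nbuckets and nbatch are powers of two, so masking works */
		*bucketno = hashvalue & (nbuckets - 1);
		*batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);
	}
	else
	{
		*bucketno = hashvalue & (nbuckets - 1);
		*batchno = 0;
	}
}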

It's possible to rework of course - the initial version of the patch
actually did just that (although it was broken in other ways).

But I think the real problem here is the batching itself - if we
overestimate and start batching (while we could actually run with a
single batch), we've already lost.

But what about computing the expected number of batches, but always
starting execution assuming no batching? Only if we actually fill
work_mem would we start batching, using the expected number of batches.

I.e.

1) estimate nbatches, but use nbatches=1

2) run until exhausting work_mem

3) start batching, with the initially estimated number of batches (a rough sketch follows)
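
To make the control flow concrete, here is a toy, self-contained illustration of
steps 1-3 - nothing below is backend code, and all names and numbers are made up:

#include <stdio.h>

int
main(void)
{
    const long  work_mem = 64L * 1024 * 1024;   /* pretend 64MB limit */
    const long  tuple_size = 64;                /* pretend 64B per tuple */
    const long  ntuples = 2L * 1000 * 1000;     /* actual inner input size */
    const int   nbatch_estimated = 8;           /* planner's (over)estimate */

    int         nbatch = 1;     /* step 1: remember the estimate, start unbatched */
    long        used = 0;

    for (long i = 0; i < ntuples; i++)
    {
        /* step 2: run until work_mem is exhausted */
        if (nbatch == 1 && used + tuple_size > work_mem)
        {
            /* step 3: only now fall back to the estimated batch count */
            nbatch = nbatch_estimated;
            printf("switching to %d batches after %ld tuples\n", nbatch, i);
        }
        used += tuple_size;     /* kept in memory, or spilled once batching */
    }

    if (nbatch == 1)
        printf("finished in a single batch, %ld bytes used\n", used);
    else
        printf("finished with %d batches\n", nbatch);
    return 0;
}

With these numbers the switch happens after ~1M tuples; shrink ntuples below
that and it never batches at all, which is exactly the over-estimation case.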

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#44)
Re: DBT-3 with SF=20 got failed

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

But what about computing the expected number of batches, but always
starting execution assuming no batching? Only if we actually fill
work_mem would we start batching, using the expected number of batches.

Hmm. You would likely be doing the initial data load with a "too small"
numbuckets for single-batch behavior, but if you successfully loaded all
the data then you could resize the table at little penalty. So yeah,
that sounds like a promising approach for cases where the initial rowcount
estimate is far above reality.

But I kinda thought we did this already, actually.

regards, tom lane

#46Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#45)
Re: DBT-3 with SF=20 got failed

On 09/24/2015 07:04 PM, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

But what about computing the expected number of batches, but always
starting execution assuming no batching? Only if we actually fill
work_mem would we start batching, using the expected number of batches.

Hmm. You would likely be doing the initial data load with a "too
small" numbuckets for single-batch behavior, but if you successfully
loaded all the data then you could resize the table at little
penalty. So yeah, that sounds like a promising approach for cases
where the initial rowcount estimate is far above reality.

I don't understand the comment about "too small" numbuckets - isn't
doing that the whole point of using the proposed limit? The batching is
merely a consequence of how bad the over-estimate is.

But I kinda thought we did this already, actually.

I don't think so - I believe we haven't modified this aspect at all. It
may not have been as pressing thanks to NTUP_PER_BUCKET=10 in the past.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#47Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#43)
Re: DBT-3 with SF=20 got failed

On Thu, Sep 24, 2015 at 12:40 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

There are two machines - one with 32GB of RAM and work_mem=2GB, the other
one with 256GB of RAM and work_mem=16GB. The machines are hosting about the
same data, just scaled accordingly (~8x more data on the large machine).

Let's assume there's a significant over-estimate - we expect to get about
10x the actual number of tuples, and the hash table is expected to almost
exactly fill work_mem. Using the 1:3 ratio (as in the query at the beginning
of this thread) we'll use ~512MB and ~4GB for the buckets, and the rest is
for entries.

Thanks to the 10x over-estimate, ~64MB and 512MB would be enough for the
buckets, so we're wasting ~448MB (13% of RAM) on the small machine and
~3.5GB (~1.3%) on the large machine.

How does it make any sense to address the 1.3% and not the 13%?

One of us is confused, because from here it seems like 448MB is 1.3%
of 32GB, not 13%.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#48Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#47)
Re: DBT-3 with SF=20 got failed

On 09/24/2015 07:42 PM, Robert Haas wrote:

On Thu, Sep 24, 2015 at 12:40 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

There are two machines - one with 32GB of RAM and work_mem=2GB, the other
one with 256GB of RAM and work_mem=16GB. The machines are hosting about the
same data, just scaled accordingly (~8x more data on the large machine).

Let's assume there's a significant over-estimate - we expect to get about
10x the actual number of tuples, and the hash table is expected to almost
exactly fill work_mem. Using the 1:3 ratio (as in the query at the beginning
of this thread) we'll use ~512MB and ~4GB for the buckets, and the rest is
for entries.

Thanks to the 10x over-estimate, ~64MB and 512MB would be enough for the
buckets, so we're wasting ~448MB (13% of RAM) on the small machine and
~3.5GB (~1.3%) on the large machine.

How does it make any sense to address the 1.3% and not the 13%?

One of us is confused, because from here it seems like 448MB is 1.3%
of 32GB, not 13%.

Meh, you're right - I got the math wrong. It's 1.3% in both cases.

However the question still stands - why should we handle the
over-estimate in one case and not the other? We're wasting the same
fraction of memory in both cases.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#49Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#48)
Re: DBT-3 with SF=20 got failed

On Thu, Sep 24, 2015 at 1:58 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Meh, you're right - I got the math wrong. It's 1.3% in both cases.

However the question still stands - why should we handle the over-estimate
in one case and not the other? We're wasting the same fraction of memory in
both cases.

Well, I think we're going around in circles here. It doesn't seem
likely that either of us will convince the other.

But for the record, I agree with you that in the scenario you lay out,
it's about the same problem in both cases. I could argue that
it's slightly different because of [ tedious and somewhat tenuous
argument omitted ], but I'll spare you that. However, consider the
alternative scenario where, on the same machine, perhaps even in the
same query, we perform two hash joins, one of which involves hashing a
small table (say, 2MB) and one of which involves hashing a big table
(say, 2GB). If the small query uses twice the intended amount of
memory, probably nothing bad will happen. If the big query does the
same thing, a bad outcome is much more likely. Say the machine has
16GB of RAM. Well, a 2MB over-allocation is not going to break the
world. A 2GB over-allocation very well might.

I really don't see why this is a controversial proposition. It seems
as clear as daylight from here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#50Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#49)
Re: DBT-3 with SF=20 got failed

On 09/25/2015 02:54 AM, Robert Haas wrote:

On Thu, Sep 24, 2015 at 1:58 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Meh, you're right - I got the math wrong. It's 1.3% in both cases.

However the question still stands - why should we handle the
over-estimate in one case and not the other? We're wasting the
same fraction of memory in both cases.

Well, I think we're going around in circles here. It doesn't seem
likely that either of us will convince the other.

Let's agree to disagree ;-) That's perfectly OK, no hard feelings.

But for the record, I agree with you that in the scenario you lay
out, it's about the same problem in both cases. I could argue
that it's slightly different because of [ tedious and somewhat
tenuous argument omitted ], but I'll spare you that.

OK, although that kinda prevents further discussion.

However, consider the alternative scenario where, on the same
machine, perhaps even in the same query, we perform two hash joins,
one of which involves hashing a small table (say, 2MB) and one of
which involves hashing a big table (say, 2GB). If the small query
uses twice the intended amount of memory, probably nothing bad will
happen. If the big query does the same thing, a bad outcome is much
more likely. Say the machine has 16GB of RAM. Well, a 2MB
over-allocation is not going to break the world. A 2GB
over-allocation very well might.

I've asked about case A. You've presented case B and shown that indeed,
the limit seems to help there. I don't see how that makes any difference
for case A, which is what I asked about.

I really don't see why this is a controversial proposition. It seems
as clear as daylight from here.

I wouldn't say controversial, but I do see the proposed solution as
misguided, because we're fixing A and claiming to also fix B. Not only
are we not really fixing B, we may actually make it needlessly slower
for people who don't have problems with B at all.

We've run into a problem with allocating more than MaxAllocSize. The
proposed fix (imposing an arbitrary limit) is also supposedly fixing the
over-estimation problems, but it actually does not (IMNSHO).

And I think my view is supported by the fact that solutions that seem to
actually fix the over-estimation properly have emerged - I mean the "let's
not build the buckets at all, until the very end" and "let's start with
nbatches=0" ideas discussed yesterday. (And I'm not saying that just
because I proposed those two things.)

Anyway, I think you're right we're going in circles here. I think we
both presented all the arguments we had and we still disagree. I'm not
going to continue with this - I'm unlikely to win an argument against
two committers if that hasn't happened by now. Thanks for the
discussion though.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#51Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#50)
Re: DBT-3 with SF=20 got failed

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

Anyway, I think you're right we're going in circles here. I think we
both presented all the arguments we had and we still disagree. I'm not
going to continue with this - I'm unlikely to win an argument against
two committers if that hasn't happened by now. Thanks for the
discussion though.

Since we're hard up against the 9.5beta1 deadline, I've made an executive
decision to commit just the minimal change, which I view as being to
constrain the array size to MaxAllocSize where it has been all along.
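
In code terms the minimal change amounts to clamping the pointer-array size,
along the lines of (a sketch, not the committed diff):

/* never plan a bucket array larger than palloc() can deliver */
max_pointers = (work_mem * 1024L) / sizeof(HashJoinTuple);
max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));
nbuckets = Min(nbuckets, max_pointers);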

I found a second rather serious bug in the new hash code while doing that
--- it failed to ensure nbuckets was actually a power of 2 --- which did
not improve my opinion of it one bit.
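
To spell out why that matters: the hash join picks a bucket by masking the hash
value with (nbuckets - 1), so a non-power-of-2 nbuckets leaves some buckets
permanently empty. A toy demonstration (not backend code):

#include <stdio.h>

int
main(void)
{
    int counts_bad[6] = {0};
    int counts_good[8] = {0};

    for (unsigned h = 0; h < 1024; h++)
    {
        counts_bad[h & (6 - 1)]++;   /* nbuckets = 6: mask is 5 = 0b101 */
        counts_good[h & (8 - 1)]++;  /* nbuckets = 8: mask is 7 = 0b111 */
    }

    for (int i = 0; i < 6; i++)
        printf("nbuckets=6  bucket %d: %d tuples\n", i, counts_bad[i]);
    for (int i = 0; i < 8; i++)
        printf("nbuckets=8  bucket %d: %d tuples\n", i, counts_good[i]);
    return 0;
}

With nbuckets = 6 the mask is binary 101, so buckets 2 and 3 never receive a
tuple; with nbuckets = 8 all eight buckets are hit evenly.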

It's clear from this discussion that there's room for further improvement
in the hashtable sizing behavior, but I think that's 9.6 material at
this point.

regards, tom lane
