Our trial to TPC-DS but optimizer made unreasonable plan

Started by Kouhei Kaigai over 10 years ago (24 messages)
#1 Kouhei Kaigai
kaigai@ak.jp.nec.com

(Please read this message on a wide display)

Our team has recently been running the TPC-DS benchmark to gauge PostgreSQL's
capability on typical analytic queries.
TPC-DS defines about 100 complex queries. We noticed that the optimizer
produces unreasonable execution plans for some of them.

Here is an example (query 04), shown at the bottom of this message.

Its execution plan from EXPLAIN is below. The most time-consuming portion
is the multilevel nested loop at the bottom of the EXPLAIN output, because
it repeatedly scans the CTE result sequentially.
The query did not complete within 30 minutes at SF=1 on a Xeon E5-2670v3
with 384GB of RAM.

----------------------------------------------------------------------
Limit (cost=1248808.70..1248808.71 rows=1 width=132)
CTE year_total
-> Append (cost=193769.66..496076.44 rows=4778919 width=220)
-> HashAggregate (cost=193769.66..226692.26 rows=2633808 width=178)
Group Key: customer.c_customer_id, customer.c_first_name, customer.c_last_name, customer.c_preferred_cust_flag, customer.c_birth_country, customer.c_login, customer.c_email_address, date_dim.d_year
-> Custom Scan (GpuJoin) (cost=14554.84..108170.90 rows=2633808 width=178)
Bulkload: On (density: 100.00%)
Depth 1: Logic: GpuHashJoin, HashKeys: (ss_sold_date_sk), JoinQual: (ss_sold_date_sk = d_date_sk), nrows_ratio: 0.95623338
Depth 2: Logic: GpuHashJoin, HashKeys: (ss_customer_sk), JoinQual: (ss_customer_sk = c_customer_sk), nrows_ratio: 0.91441411
-> Custom Scan (BulkScan) on store_sales (cost=0.00..96501.23 rows=2880323 width=38)
-> Seq Scan on date_dim (cost=0.00..2705.49 rows=73049 width=16)
-> Seq Scan on customer (cost=0.00..4358.00 rows=100000 width=156)
-> HashAggregate (cost=125474.72..143301.10 rows=1426111 width=181)
Group Key: customer_1.c_customer_id, customer_1.c_first_name, customer_1.c_last_name, customer_1.c_preferred_cust_flag, customer_1.c_birth_country, customer_1.c_login, customer_1.c_email_address, date_dim_1.d_year
-> Custom Scan (GpuJoin) (cost=14610.07..79126.11 rows=1426111 width=181)
Bulkload: On (density: 100.00%)
Depth 1: Logic: GpuHashJoin, HashKeys: (cs_bill_customer_sk), JoinQual: (c_customer_sk = cs_bill_customer_sk), nrows_ratio: 0.99446636
Depth 2: Logic: GpuHashJoin, HashKeys: (cs_sold_date_sk), JoinQual: (cs_sold_date_sk = d_date_sk), nrows_ratio: 0.98929483
-> Custom Scan (BulkScan) on catalog_sales (cost=0.00..65628.43 rows=1441543 width=41)
-> Seq Scan on customer customer_1 (cost=0.00..4358.00 rows=100000 width=156)
-> Seq Scan on date_dim date_dim_1 (cost=0.00..2705.49 rows=73049 width=16)
-> HashAggregate (cost=69306.38..78293.88 rows=719000 width=181)
Group Key: customer_2.c_customer_id, customer_2.c_first_name, customer_2.c_last_name, customer_2.c_preferred_cust_flag, customer_2.c_birth_country, customer_2.c_login, customer_2.c_email_address, date_dim_2.d_year
-> Custom Scan (GpuJoin) (cost=14702.18..45938.88 rows=719000 width=181)
Bulkload: On (density: 100.00%)
Depth 1: Logic: GpuHashJoin, HashKeys: (ws_bill_customer_sk), JoinQual: (c_customer_sk = ws_bill_customer_sk), nrows_ratio: 0.99973309
Depth 2: Logic: GpuHashJoin, HashKeys: (ws_sold_date_sk), JoinQual: (ws_sold_date_sk = d_date_sk), nrows_ratio: 0.99946618
-> Custom Scan (BulkScan) on web_sales (cost=0.00..32877.84 rows=719384 width=41)
-> Seq Scan on customer customer_2 (cost=0.00..4358.00 rows=100000 width=156)
-> Seq Scan on date_dim date_dim_2 (cost=0.00..2705.49 rows=73049 width=16)
-> Sort (cost=752732.27..752732.27 rows=1 width=132)
Sort Key: t_s_secyear.customer_id, t_s_secyear.customer_first_name, t_s_secyear.customer_last_name, t_s_secyear.customer_email_address
-> Nested Loop (cost=0.00..752732.26 rows=1 width=132)
Join Filter: ((t_s_secyear.customer_id = t_w_secyear.customer_id) AND (CASE WHEN (t_c_firstyear.year_total > '0'::numeric) THEN (t_c_secyear.year_total / t_c_firstyear.year_total) ELSE NULL::numeric END > CASE WHEN (t_w_firstyear.year_total > '0'::numeric) THEN (t_w_secyear.year_total / t_w_firstyear.year_total) ELSE NULL::numeric END))
-> Nested Loop (cost=0.00..633256.31 rows=1 width=308)
Join Filter: ((t_s_secyear.customer_id = t_c_secyear.customer_id) AND (CASE WHEN (t_c_firstyear.year_total > '0'::numeric) THEN (t_c_secyear.year_total / t_c_firstyear.year_total) ELSE NULL::numeric END > CASE WHEN (t_s_firstyear.year_total > '0'::numeric) THEN (t_s_secyear.year_total / t_s_firstyear.year_total) ELSE NULL::numeric END))
-> Nested Loop (cost=0.00..513780.36 rows=1 width=320)
Join Filter: (t_s_firstyear.customer_id = t_s_secyear.customer_id)
-> Nested Loop (cost=0.00..394303.22 rows=2 width=156)
Join Filter: (t_s_firstyear.customer_id = t_w_firstyear.customer_id)
-> Nested Loop (cost=0.00..262876.15 rows=8 width=104)
Join Filter: (t_s_firstyear.customer_id = t_c_firstyear.customer_id)
-> CTE Scan on year_total t_s_firstyear (cost=0.00..131420.27 rows=40 width=52)
Filter: ((year_total > '0'::numeric) AND (sale_type = 's'::text) AND (dyear = 2001))
-> CTE Scan on year_total t_c_firstyear (cost=0.00..131420.27 rows=40 width=52)
Filter: ((year_total > '0'::numeric) AND (sale_type = 'c'::text) AND (dyear = 2001))
-> CTE Scan on year_total t_w_firstyear (cost=0.00..131420.27 rows=40 width=52)
Filter: ((year_total > '0'::numeric) AND (sale_type = 'w'::text) AND (dyear = 2001))
-> CTE Scan on year_total t_s_secyear (cost=0.00..119472.98 rows=119 width=164)
Filter: ((sale_type = 's'::text) AND (dyear = 2002))
-> CTE Scan on year_total t_c_secyear (cost=0.00..119472.98 rows=119 width=52)
Filter: ((sale_type = 'c'::text) AND (dyear = 2002))
-> CTE Scan on year_total t_w_secyear (cost=0.00..119472.98 rows=119 width=52)
Filter: ((sale_type = 'w'::text) AND (dyear = 2002))
(54 rows)
----------------------------------------------------------------------

On the other hand, once I disabled nested loops using SET enable_nestloop=off,
EXPLAIN displayed the following output, and the query completed in 11.2sec.
This plan replaces the NestLoops over CTE Scans with HashJoins, so the join
logic does not need to full-scan the CTE multiple times.

-----------------------------------------------------------------------
Limit (cost=1248761.93..1248761.93 rows=1 width=132)
CTE year_total
-> Append (cost=193769.66..496076.44 rows=4778919 width=220)
-> HashAggregate (cost=193769.66..226692.26 rows=2633808 width=178)
Group Key: customer.c_customer_id, customer.c_first_name, customer.c_last_name, customer.c_preferred_cust_flag, customer.c_birth_country, customer.c_login, customer.c_email_address, date_dim.d_year
-> Custom Scan (GpuJoin) (cost=14554.84..108170.90 rows=2633808 width=178)
Bulkload: On (density: 100.00%)
Depth 1: Logic: GpuHashJoin, HashKeys: (ss_sold_date_sk), JoinQual: (ss_sold_date_sk = d_date_sk), nrows_ratio: 0.95623338
Depth 2: Logic: GpuHashJoin, HashKeys: (ss_customer_sk), JoinQual: (ss_customer_sk = c_customer_sk), nrows_ratio: 0.91441411
-> Custom Scan (BulkScan) on store_sales (cost=0.00..96501.23 rows=2880323 width=38)
-> Seq Scan on date_dim (cost=0.00..2705.49 rows=73049 width=16)
-> Seq Scan on customer (cost=0.00..4358.00 rows=100000 width=156)
-> HashAggregate (cost=125474.72..143301.10 rows=1426111 width=181)
Group Key: customer_1.c_customer_id, customer_1.c_first_name, customer_1.c_last_name, customer_1.c_preferred_cust_flag, customer_1.c_birth_country, customer_1.c_login, customer_1.c_email_address, date_dim_1.d_year
-> Custom Scan (GpuJoin) (cost=14610.07..79126.11 rows=1426111 width=181)
Bulkload: On (density: 100.00%)
Depth 1: Logic: GpuHashJoin, HashKeys: (cs_bill_customer_sk), JoinQual: (c_customer_sk = cs_bill_customer_sk), nrows_ratio: 0.99446636
Depth 2: Logic: GpuHashJoin, HashKeys: (cs_sold_date_sk), JoinQual: (cs_sold_date_sk = d_date_sk), nrows_ratio: 0.98929483
-> Custom Scan (BulkScan) on catalog_sales (cost=0.00..65628.43 rows=1441543 width=41)
-> Seq Scan on customer customer_1 (cost=0.00..4358.00 rows=100000 width=156)
-> Seq Scan on date_dim date_dim_1 (cost=0.00..2705.49 rows=73049 width=16)
-> HashAggregate (cost=69306.38..78293.88 rows=719000 width=181)
Group Key: customer_2.c_customer_id, customer_2.c_first_name, customer_2.c_last_name, customer_2.c_preferred_cust_flag, customer_2.c_birth_country, customer_2.c_login, customer_2.c_email_address, date_dim_2.d_year
-> Custom Scan (GpuJoin) (cost=14702.18..45938.88 rows=719000 width=181)
Bulkload: On (density: 100.00%)
Depth 1: Logic: GpuHashJoin, HashKeys: (ws_bill_customer_sk), JoinQual: (c_customer_sk = ws_bill_customer_sk), nrows_ratio: 0.99973309
Depth 2: Logic: GpuHashJoin, HashKeys: (ws_sold_date_sk), JoinQual: (ws_sold_date_sk = d_date_sk), nrows_ratio: 0.99946618
-> Custom Scan (BulkScan) on web_sales (cost=0.00..32877.84 rows=719384 width=41)
-> Seq Scan on customer customer_2 (cost=0.00..4358.00 rows=100000 width=156)
-> Seq Scan on date_dim date_dim_2 (cost=0.00..2705.49 rows=73049 width=16)
-> Sort (cost=752685.49..752685.50 rows=1 width=132)
Sort Key: t_s_secyear.customer_id, t_s_secyear.customer_first_name, t_s_secyear.customer_last_name, t_s_secyear.customer_email_address
-> Hash Join (cost=621264.91..752685.48 rows=1 width=132)
Hash Cond: (t_s_secyear.customer_id = t_w_secyear.customer_id)
Join Filter: (CASE WHEN (t_c_firstyear.year_total > '0'::numeric) THEN (t_c_secyear.year_total / t_c_firstyear.year_total) ELSE NULL::numeric END > CASE WHEN (t_w_firstyear.year_total > '0'::numeric) THEN (t_w_secyear.year_total / t_w_firstyear.year_total) ELSE NULL::numeric END)
-> Hash Join (cost=501790.45..633210.98 rows=1 width=308)
Hash Cond: (t_s_secyear.customer_id = t_c_secyear.customer_id)
Join Filter: (CASE WHEN (t_c_firstyear.year_total > '0'::numeric) THEN (t_c_secyear.year_total / t_c_firstyear.year_total) ELSE NULL::numeric END > CASE WHEN (t_s_firstyear.year_total > '0'::numeric) THEN (t_s_secyear.year_total / t_s_firstyear.year_total) ELSE NULL::numeric END)
-> Hash Join (cost=382315.99..513736.47 rows=1 width=320)
Hash Cond: (t_s_firstyear.customer_id = t_s_secyear.customer_id)
-> Hash Join (cost=262841.53..394261.97 rows=2 width=156)
Hash Cond: (t_w_firstyear.customer_id = t_s_firstyear.customer_id)
-> CTE Scan on year_total t_w_firstyear (cost=0.00..131420.27 rows=40 width=52)
Filter: ((year_total > '0'::numeric) AND (sale_type = 'w'::text) AND (dyear = 2001))
-> Hash (cost=262841.43..262841.43 rows=8 width=104)
-> Hash Join (cost=131420.77..262841.43 rows=8 width=104)
Hash Cond: (t_s_firstyear.customer_id = t_c_firstyear.customer_id)
-> CTE Scan on year_total t_s_firstyear (cost=0.00..131420.27 rows=40 width=52)
Filter: ((year_total > '0'::numeric) AND (sale_type = 's'::text) AND (dyear = 2001))
-> Hash (cost=131420.27..131420.27 rows=40 width=52)
-> CTE Scan on year_total t_c_firstyear (cost=0.00..131420.27 rows=40 width=52)
Filter: ((year_total > '0'::numeric) AND (sale_type = 'c'::text) AND (dyear = 2001))
-> Hash (cost=119472.98..119472.98 rows=119 width=164)
-> CTE Scan on year_total t_s_secyear (cost=0.00..119472.98 rows=119 width=164)
Filter: ((sale_type = 's'::text) AND (dyear = 2002))
-> Hash (cost=119472.98..119472.98 rows=119 width=52)
-> CTE Scan on year_total t_c_secyear (cost=0.00..119472.98 rows=119 width=52)
Filter: ((sale_type = 'c'::text) AND (dyear = 2002))
-> Hash (cost=119472.98..119472.98 rows=119 width=52)
-> CTE Scan on year_total t_w_secyear (cost=0.00..119472.98 rows=119 width=52)
Filter: ((sale_type = 'w'::text) AND (dyear = 2002))
(61 rows)
------------------------------------------------------------------

SET enable_nestloop=off performed fine in this case. However, this
restriction seems too heavy-handed to me.

In fact, the cost of the HashJoin underlying the Sort node is:
-> Hash Join (cost=621264.91..752685.48 rows=1 width=132)

On the other hand, the NestedLoop in the same position is:
-> Nested Loop (cost=0.00..752732.26 rows=1 width=132)

Probably, a small GUC adjustment could make the optimizer prefer HashJoin for
this kind of query.
Do you have any good ideas?

====== QUERY: No.04 ========================

EXPLAIN ANALYZE
with year_total as (
select c_customer_id customer_id
,c_first_name customer_first_name
,c_last_name customer_last_name
,c_preferred_cust_flag customer_preferred_cust_flag
,c_birth_country customer_birth_country
,c_login customer_login
,c_email_address customer_email_address
,d_year dyear
,sum(((ss_ext_list_price-ss_ext_wholesale_cost-ss_ext_discount_amt)+ss_ext_sales_price)/2) year_total
,'s' sale_type
from customer
,store_sales
,date_dim
where c_customer_sk = ss_customer_sk
and ss_sold_date_sk = d_date_sk
group by c_customer_id
,c_first_name
,c_last_name
,c_preferred_cust_flag
,c_birth_country
,c_login
,c_email_address
,d_year
union all
select c_customer_id customer_id
,c_first_name customer_first_name
,c_last_name customer_last_name
,c_preferred_cust_flag customer_preferred_cust_flag
,c_birth_country customer_birth_country
,c_login customer_login
,c_email_address customer_email_address
,d_year dyear
,sum((((cs_ext_list_price-cs_ext_wholesale_cost-cs_ext_discount_amt)+cs_ext_sales_price)/2) ) year_total
,'c' sale_type
from customer
,catalog_sales
,date_dim
where c_customer_sk = cs_bill_customer_sk
and cs_sold_date_sk = d_date_sk
group by c_customer_id
,c_first_name
,c_last_name
,c_preferred_cust_flag
,c_birth_country
,c_login
,c_email_address
,d_year
union all
select c_customer_id customer_id
,c_first_name customer_first_name
,c_last_name customer_last_name
,c_preferred_cust_flag customer_preferred_cust_flag
,c_birth_country customer_birth_country
,c_login customer_login
,c_email_address customer_email_address
,d_year dyear
,sum((((ws_ext_list_price-ws_ext_wholesale_cost-ws_ext_discount_amt)+ws_ext_sales_price)/2) ) year_total
,'w' sale_type
from customer
,web_sales
,date_dim
where c_customer_sk = ws_bill_customer_sk
and ws_sold_date_sk = d_date_sk
group by c_customer_id
,c_first_name
,c_last_name
,c_preferred_cust_flag
,c_birth_country
,c_login
,c_email_address
,d_year
)
select
t_s_secyear.customer_id
,t_s_secyear.customer_first_name
,t_s_secyear.customer_last_name
,t_s_secyear.customer_email_address
from year_total t_s_firstyear
,year_total t_s_secyear
,year_total t_c_firstyear
,year_total t_c_secyear
,year_total t_w_firstyear
,year_total t_w_secyear
where t_s_secyear.customer_id = t_s_firstyear.customer_id
and t_s_firstyear.customer_id = t_c_secyear.customer_id
and t_s_firstyear.customer_id = t_c_firstyear.customer_id
and t_s_firstyear.customer_id = t_w_firstyear.customer_id
and t_s_firstyear.customer_id = t_w_secyear.customer_id
and t_s_firstyear.sale_type = 's'
and t_c_firstyear.sale_type = 'c'
and t_w_firstyear.sale_type = 'w'
and t_s_secyear.sale_type = 's'
and t_c_secyear.sale_type = 'c'
and t_w_secyear.sale_type = 'w'
and t_s_firstyear.dyear = 2001
and t_s_secyear.dyear = 2001+1
and t_c_firstyear.dyear = 2001
and t_c_secyear.dyear = 2001+1
and t_w_firstyear.dyear = 2001
and t_w_secyear.dyear = 2001+1
and t_s_firstyear.year_total > 0
and t_c_firstyear.year_total > 0
and t_w_firstyear.year_total > 0
and case when t_c_firstyear.year_total > 0 then t_c_secyear.year_total / t_c_firstyear.year_total else null end
      > case when t_s_firstyear.year_total > 0 then t_s_secyear.year_total / t_s_firstyear.year_total else null end
and case when t_c_firstyear.year_total > 0 then t_c_secyear.year_total / t_c_firstyear.year_total else null end
      > case when t_w_firstyear.year_total > 0 then t_w_secyear.year_total / t_w_firstyear.year_total else null end

order by t_s_secyear.customer_id
,t_s_secyear.customer_first_name
,t_s_secyear.customer_last_name
,t_s_secyear.customer_email_address
limit 100;

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

#2 Greg Stark
stark@mit.edu
In reply to: Kouhei Kaigai (#1)
Re: Our trial to TPC-DS but optimizer made unreasonable plan

On Thu, Aug 13, 2015 at 2:49 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

In fact, the cost of the HashJoin underlying the Sort node is:
-> Hash Join (cost=621264.91..752685.48 rows=1 width=132)

On the other hand, the NestedLoop in the same position is:
-> Nested Loop (cost=0.00..752732.26 rows=1 width=132)

Probably, a small GUC adjustment could make the optimizer prefer HashJoin for
this kind of query.

With that kind of discrepancy I doubt adjusting GUCs will be sufficient.

Do you have any good ideas?

Do you have EXPLAIN ANALYZE from the plan that finishes? Are there any
row estimates that are way off?

--
greg

#3 Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Greg Stark (#2)
1 attachment(s)
Re: Our trial to TPC-DS but optimizer made unreasonable plan

On Thu, Aug 13, 2015 at 2:49 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

In fact, the cost of the HashJoin underlying the Sort node is:
-> Hash Join (cost=621264.91..752685.48 rows=1 width=132)

On the other hand, the NestedLoop in the same position is:
-> Nested Loop (cost=0.00..752732.26 rows=1 width=132)

Probably, a small GUC adjustment could make the optimizer prefer HashJoin for
this kind of query.

With that kind of discrepancy I doubt adjusting GUCs will be sufficient.

Do you have any good ideas?

Do you have EXPLAIN ANALYZE from the plan that finishes? Are there any
row estimates that are way off?

Yes, the EXPLAIN ANALYZE output is attached.

According to it, the CTE year_total generates 384,208 rows. That is much
smaller than the estimate (4.78M rows); however, the selectivity of the
filters on the CTE Scans was not as high as expected.
For example, the deepest CTE Scans return 37923 rows and 26314 rows, even
though 40 rows were estimated for each. At the next level up, the join
combines 11324 rows with 9952 rows, against an estimate of 40 rows x 8 rows.
If a NestLoop were placed there instead of a HashJoin, the number of inner
loops would explode.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

Attachments:

TPCDS_Q4_NLJ_disabled.txt (text/plain)
#4 Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Kouhei Kaigai (#3)
1 attachment(s)
Re: Our trial to TPC-DS but optimizer made unreasonable plan

Here is another thing I learned from the TPC-DS benchmark.

The attached query is again Q4 of TPC-DS, this time run at SF=100.
It took a long time to complete (about 30min); please see the attached
EXPLAIN ANALYZE output.

Its core workload sits in the CTE year_total. Its Append node has three
underlying HashAggregate nodes, each with a table join beneath it.

Below is the first HashAggregate node. It consumes 268M rows, then
generates 9M rows. The underlying GpuJoin takes 74sec to produce the 268M
rows, so we can infer the HashAggregate itself consumed about 400sec.

HashAggregate (cost=18194948.40..21477516.00 rows=262605408 width=178) (actual time=480652.856..488055.918 rows=9142442 loops=1)
Group Key: customer.c_customer_id, customer.c_first_name, customer.c_last_name, customer.c_preferred_cust_flag, customer.c_birth_country, customer.c_login, customer.c_email_address, date_dim.d_year
-> Custom Scan (GpuJoin) (cost=101342.54..9660272.64 rows=262605408 width=178) (actual time=2472.174..73908.894 rows=268562375 loops=1)
Bulkload: On (density: 100.00%)
Depth 1: Logic: GpuHashJoin, HashKeys: (ss_sold_date_sk), JoinQual: (ss_sold_date_sk = d_date_sk), nrows (287997024 -> 275041999, 95.50% expected 95.47%)
Depth 2: Logic: GpuHashJoin, HashKeys: (ss_customer_sk), JoinQual: (ss_customer_sk = c_customer_sk), nrows (275041999 -> 268562375, 93.25% expected 91.18%)
-> Custom Scan (BulkScan) on store_sales (cost=0.00..9649559.60 rows=287996960 width=38) (actual time=18.372..54522.803 rows=287997024 loops=1)
-> Seq Scan on date_dim (cost=0.00..2705.49 rows=73049 width=16) (actual time=0.015..15.533 rows=73049 loops=1)
-> Seq Scan on customer (cost=0.00..87141.74 rows=2000074 width=156) (actual time=0.018..582.373 rows=2000000 loops=1)

The other two HashAggregate nodes behave similarly. The second one
consumed 281sec, including 30sec in the underlying GpuJoin. The third one
consumed 138sec, including 25sec in the underlying GpuJoin.

My custom join implementation aside, it seems to me the HashAggregate
node consumes much more time than expected.

One characteristic of this workload is that the aggregation uses eight
grouping keys. I suspect the cost of the function invocations for hash
values and equality checks may be the culprit.

Let's dive into nodeAgg.c.
ExecAgg() calls agg_fill_hash_table() to fill the hash table with rows
fetched from the underlying plan. While constructing the hash table, it
repeatedly calls hash functions (in TupleHashTableHash) and equality-check
functions (in execTuplesMatch).
Both use the FunctionCallX() interface, which passes arguments through an
FmgrInfo structure; that is usually not the best approach from a performance
perspective. In particular, this workload has 268M input rows and 8 grouping
keys, so 268M (rows) x 8 (grouping keys) x 2 (hash + equality) = roughly
4.3 billion function calls through FmgrInfo.
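
To make that pattern concrete, here is a minimal sketch of the hashing half,
modeled on what TupleHashTableHash does (the function name and
simplifications below are mine; this is not the actual nodeAgg.c code):

#include "postgres.h"
#include "fmgr.h"

/*
 * Sketch: hash all grouping keys of one input row.  Each non-null key
 * costs one FunctionCall1(), i.e. one indirect call through an FmgrInfo.
 * The equality side (execTuplesMatch) pays the same again per key, which
 * is where the factor of 2 above comes from.
 */
static uint32
hash_grouping_keys(FmgrInfo *hashfunctions, Datum *values, bool *isnull,
                   int numkeys)
{
    uint32      hashkey = 0;
    int         i;

    for (i = 0; i < numkeys; i++)
    {
        /* rotate hashkey left 1 bit before mixing in the next column */
        hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);

        if (!isnull[i])
            hashkey ^= DatumGetUInt32(FunctionCall1(&hashfunctions[i],
                                                    values[i]));
    }
    return hashkey;
}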

I think the SortSupport logic provides a reasonable way to solve this
kind of problem. For example, btint4sortsupport() supplies a function
pointer to a fast version of the comparator (btint4fastcmp), which takes
two Datum arguments without any indirect memory references.
The same mechanism would also make sense for the HashAggregate logic, to
reduce the cost of function invocations.
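
For reference, the int4 pair looks roughly like this (reproduced from memory
of nbtcompare.c, so treat it as a sketch): the comparator is a bare C
function that callers invoke directly, with no fmgr round trip.

#include "postgres.h"
#include "utils/sortsupport.h"

static int
btint4fastcmp(Datum x, Datum y, SortSupport ssup)
{
    int32       a = DatumGetInt32(x);
    int32       b = DatumGetInt32(y);

    if (a > b)
        return 1;
    else if (a == b)
        return 0;
    else
        return -1;
}

Datum
btint4sortsupport(PG_FUNCTION_ARGS)
{
    SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);

    /* hand back a raw function pointer; no FmgrInfo per comparison */
    ssup->comparator = btint4fastcmp;
    PG_RETURN_VOID();
}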

Please comment on the idea I noticed here.

And, as an aside, if HashAggregate could pick up a hash value of the
grouping keys from the target list of the underlying plan node (GpuJoin in
this case), it might be possible to off-load the hash-value calculation to
the GPU device. :-)

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

Attachments:

TPCDS_Q4_SF100_NLJ_Disabled.txt (text/plain)
#5 Robert Haas
robertmhaas@gmail.com
In reply to: Kouhei Kaigai (#4)
Re: Our trial to TPC-DS but optimizer made unreasonable plan

On Mon, Aug 17, 2015 at 9:40 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

I think the SortSupport logic provides a reasonable way to solve this
kind of problem. For example, btint4sortsupport() supplies a function
pointer to a fast version of the comparator (btint4fastcmp), which takes
two Datum arguments without any indirect memory references.
The same mechanism would also make sense for the HashAggregate logic, to
reduce the cost of function invocations.

Please comment on the idea I noticed here.

It's possible that this can work, but it might be a good idea to run
'perf' on this query and find out where the CPU time is actually
going. That might give you a clearer picture of why the HashAggregate
is slow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6 Qingqing Zhou
zhouqq.postgres@gmail.com
In reply to: Kouhei Kaigai (#4)
2 attachment(s)
Re: Our trial to TPC-DS but optimizer made unreasonable plan

On Mon, Aug 17, 2015 at 6:40 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

Here is another thing I learned from the TPC-DS benchmark.

The attached query is again Q4 of TPC-DS, this time run at SF=100.
It took a long time to complete (about 30min); please see the attached
EXPLAIN ANALYZE output.

Look at this:

-> CTE Scan on year_total t_s_firstyear (cost=0.00..13120715.27
rows=3976 width=52) (actual time=0.020..5425.980 rows=1816438 loops=1)
Filter: ((year_total > '0'::numeric) AND
(sale_type = 's'::text) AND (dyear = 2001))
Rows Removed by Filter: 19879897
-> CTE Scan on year_total t_s_secyear (cost=0.00..11927922.98
rows=11928 width=164) (actual time=0.007..45.249 rows=46636 loops=1)
Filter: ((sale_type = 's'::text) AND (dyear = 2002))
Rows Removed by Filter: 185596

CTE expansion should help here, as we can push the filter down. I did a
quick patch to demonstrate the idea, following Tom's proposal
(38448.1430519406@sss.pgh.pa.us). I see an obvious performance boost:

With NLJ turned off:
original: Planning time: 4.391 ms
Execution time: 77113.721 ms
patched: Planning time: 8.429 ms
Execution time: 18572.663 ms

Plus work_mem raised to 1GB:
original: Planning time: 4.487 ms
Execution time: 29249.466 ms
patched: Planning time: 11.148 ms
Execution time: 7309.586 ms

Attached please find the WIP patch and the ANALYZE results.
Note: the patch may not apply directly to head, as a network issue
here prevents my Linux box from reaching the git server.

Regards,
Qingqing

Attachments:

ctes.patch (application/octet-stream)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 7069f60..7b49816
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool		enable_nestloop = true;
*** 118,123 ****
--- 118,124 ----
  bool		enable_material = true;
  bool		enable_mergejoin = true;
  bool		enable_hashjoin = true;
+ bool		enable_cte_subquery = true;
  
  typedef struct
  {
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index fedbfab..00334c7
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** subquery_planner(PlannerGlobal *glob, Qu
*** 359,364 ****
--- 359,369 ----
  		pull_up_sublinks(root);
  
  	/*
+ 	 * Try to substitute CTEs with subqueries.
+ 	 */
+ 	substitute_ctes_with_subqueries(root);
+ 
+ 	/*
  	 * Scan the rangetable for set-returning functions, and inline them if
  	 * possible (producing subqueries that might get pulled up next).
  	 * Recursion issues here are handled in the same way as for SubLinks.
diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c
new file mode 100644
index 9bf1c66..e9242a4
*** a/src/backend/optimizer/prep/prepjointree.c
--- b/src/backend/optimizer/prep/prepjointree.c
***************
*** 31,41 ****
  #include "optimizer/prep.h"
  #include "optimizer/subselect.h"
  #include "optimizer/tlist.h"
  #include "parser/parse_relation.h"
  #include "parser/parsetree.h"
  #include "rewrite/rewriteManip.h"
  
- 
  typedef struct pullup_replace_vars_context
  {
  	PlannerInfo *root;
--- 31,41 ----
  #include "optimizer/prep.h"
  #include "optimizer/subselect.h"
  #include "optimizer/tlist.h"
+ #include "optimizer/cost.h"
  #include "parser/parse_relation.h"
  #include "parser/parsetree.h"
  #include "rewrite/rewriteManip.h"
  
  typedef struct pullup_replace_vars_context
  {
  	PlannerInfo *root;
*************** pull_up_sublinks_qual_recurse(PlannerInf
*** 551,556 ****
--- 551,590 ----
  }
  
  /*
+  * substitute_ctes_with_subqueries
+  *		Attempt to substitute CTEs with subqueries.
+  *
+  */
+ void
+ substitute_ctes_with_subqueries(PlannerInfo *root)
+ {
+ 	ListCell   *rt;
+ 
+ 	if (!enable_cte_subquery)
+ 		return;
+ 	
+ 	foreach(rt, root->parse->rtable)
+ 	{
+ 		RangeTblEntry *rte = (RangeTblEntry *) lfirst(rt);
+ 
+ 		if (rte->rtekind == RTE_CTE)
+ 		{
+ 			Query	   *funcquery;
+ 
+ 			/* Check safety of expansion, and expand if possible */
+ 			funcquery = substitute_cte_with_subquery(root, rte);
+ 			if (funcquery)
+ 			{
+ 				/* Successful expansion, replace the rtable entry */
+ 				rte->rtekind = RTE_SUBQUERY;
+ 				rte->subquery = funcquery;
+ 				rte->functions = NIL;
+ 			}
+ 		}
+ 	}
+ }
+ 
+ /*
   * inline_set_returning_functions
   *		Attempt to "inline" set-returning functions in the FROM clause.
   *
*************** pull_up_simple_subquery(PlannerInfo *roo
*** 921,926 ****
--- 955,965 ----
  		pull_up_sublinks(subroot);
  
  	/*
+ 	 * Try to substitute CTEs with subqueries.
+ 	 */
+ 	substitute_ctes_with_subqueries(subroot);
+ 
+ 	/*
  	 * Similarly, inline any set-returning functions in its rangetable.
  	 */
  	inline_set_returning_functions(subroot);
diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
new file mode 100644
index c72dbef..db9ec7b
*** a/src/backend/optimizer/util/clauses.c
--- b/src/backend/optimizer/util/clauses.c
*************** evaluate_expr(Expr *expr, Oid result_typ
*** 4677,4682 ****
--- 4677,4710 ----
  							  resultTypByVal);
  }
  
+ /*
+  * substitute_cte_with_subquery
+  *		Attempt to substitute a CTE with a subquery.
+  *
+  */
+ Query *
+ substitute_cte_with_subquery(PlannerInfo *root, RangeTblEntry *rte)
+ {
+ 	ListCell   *lc;
+  
+ 	Assert(rte->rtekind == RTE_CTE);
+ 
+ 	/*
+ 	 * Lookup current CTE by name
+ 	 */
+ 	foreach(lc, root->parse->cteList)
+ 	{
+ 		CommonTableExpr *cte = (CommonTableExpr *) lfirst(lc);
+ 
+ 		if (strcmp(cte->ctename, rte->ctename) == 0)
+ 		{
+ 			/* TODO: check validity of this CTE */
+ 			return (Query *)cte->ctequery;
+ 		}
+ 	}
+ 
+ 	return NULL;
+ }
  
  /*
   * inline_set_returning_function
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index 7a55a49..380d2e4
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 892,897 ****
--- 892,906 ----
  		NULL, NULL, NULL
  	},
  	{
+ 		{"enable_cte_subquery", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("Enables the planner to substitute CTEs with subqueries."),
+ 			NULL
+ 		},
+ 		&enable_cte_subquery,
+ 		true,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
  			gettext_noop("Enables genetic query optimization."),
  			gettext_noop("This algorithm attempts to do planning without "
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
new file mode 100644
index 3d04ac2..27a7227
*** a/src/include/optimizer/clauses.h
--- b/src/include/optimizer/clauses.h
*************** extern Node *estimate_expression_value(P
*** 84,88 ****
--- 84,89 ----
  
  extern Query *inline_set_returning_function(PlannerInfo *root,
  							  RangeTblEntry *rte);
+ extern Query *substitute_cte_with_subquery(PlannerInfo *root, RangeTblEntry *rte);
  
  #endif   /* CLAUSES_H */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index dd43e45..58a146b
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_nestloop;
*** 61,66 ****
--- 61,67 ----
  extern bool enable_material;
  extern bool enable_mergejoin;
  extern bool enable_hashjoin;
+ extern bool enable_cte_subquery;
  extern int	constraint_exclusion;
  
  extern double clamp_row_est(double nrows);
diff --git a/src/include/optimizer/prep.h b/src/include/optimizer/prep.h
new file mode 100644
index 7b8c0a9..ed43dd0
*** a/src/include/optimizer/prep.h
--- b/src/include/optimizer/prep.h
***************
*** 23,28 ****
--- 23,29 ----
   */
  extern void pull_up_sublinks(PlannerInfo *root);
  extern void inline_set_returning_functions(PlannerInfo *root);
+ extern void substitute_ctes_with_subqueries(PlannerInfo *root);
  extern void pull_up_subqueries(PlannerInfo *root);
  extern void flatten_simple_union_all(PlannerInfo *root);
  extern void reduce_outer_joins(PlannerInfo *root);
result.out (application/octet-stream)
#7 Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Qingqing Zhou (#6)
Re: Our trial to TPC-DS but optimizer made unreasonable plan

On Mon, Aug 17, 2015 at 6:40 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

Here is another thing I learned from the TPC-DS benchmark.

The attached query is again Q4 of TPC-DS, this time run at SF=100.
It took a long time to complete (about 30min); please see the attached
EXPLAIN ANALYZE output.

Look at this:

-> CTE Scan on year_total t_s_firstyear (cost=0.00..13120715.27
rows=3976 width=52) (actual time=0.020..5425.980 rows=1816438 loops=1)
Filter: ((year_total > '0'::numeric) AND
(sale_type = 's'::text) AND (dyear = 2001))
Rows Removed by Filter: 19879897
-> CTE Scan on year_total t_s_secyear (cost=0.00..11927922.98
rows=11928 width=164) (actual time=0.007..45.249 rows=46636 loops=1)
Filter: ((sale_type = 's'::text) AND (dyear = 2002))
Rows Removed by Filter: 185596

CTE expansion should help here, as we can push the filter down. I did a
quick patch to demonstrate the idea, following Tom's proposal
(38448.1430519406@sss.pgh.pa.us). I see an obvious performance boost:

With NLJ turned off:
original: Planning time: 4.391 ms
Execution time: 77113.721 ms
patched: Planning time: 8.429 ms
Execution time: 18572.663 ms

Plus work_mem raised to 1GB:
original: Planning time: 4.487 ms
Execution time: 29249.466 ms
patched: Planning time: 11.148 ms
Execution time: 7309.586 ms

Attached please find the WIP patch and the ANALYZE results.
Note: the patch may not apply directly to head, as a network issue
here prevents my Linux box from reaching the git server.

Thanks for your patch.
Let me test it and report the results in my environment.

BTW, did you register the patch in the upcoming commit-fest?
I think it could be a helpful feature if we can add an alternative
subquery path alongside the CTE-scan path in set_cte_pathlist() and choose
between them according to the cost estimates.

Best regards,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

#8 Qingqing Zhou
zhouqq.postgres@gmail.com
In reply to: Kouhei Kaigai (#7)
Re: Our trial to TPC-DS but optimizer made unreasonable plan

On Tue, Aug 18, 2015 at 5:59 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

BTW, did you register the patch in the upcoming commit-fest?

Not yet, it is in WIP status.

I think it could be a helpful feature if we can add an alternative
subquery path alongside the CTE-scan path in set_cte_pathlist() and choose
between them according to the cost estimates.

Are you suggesting that we keep both the subquery path (whenever possible)
and the CTE path, so that the optimizer can choose between them?

I can imagine that if we supported "materialize the CTE once and use it
multiple times" execution, we would benefit from keeping the CTE path.
But it seems we don't support that execution mode yet.

Regards,
Qingqing

#9 Josh Berkus
josh@agliodbs.com
In reply to: Kouhei Kaigai (#1)
Re: Our trial to TPC-DS but optimizer made unreasonable plan

On 08/18/2015 04:40 PM, Qingqing Zhou wrote:

Attached please find the WIP patch and the ANALYZE results.
Note: the patch may not apply directly to head, as a network issue
here prevents my Linux box from reaching the git server.

So, one of the things we previously mentioned is that many users
currently use CTEs deliberately as an optimization barrier to constrain
the planner. Given that, we need some kind of option to force the old
behavior, either SQL syntax or a GUC option. Otherwise this will cause
a bunch of backwards-compatibility breakage.

Ideas?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#9)
Re: Our trial to TPC-DS but optimizer made unreasonable plan

Josh Berkus <josh@agliodbs.com> writes:

On 08/18/2015 04:40 PM, Qingqing Zhou wrote:

Attached please find the WIP patch and the ANALYZE results.
Note: the patch may not apply directly to head, as a network issue
here prevents my Linux box from reaching the git server.

So, one of the things we previously mentioned is that many users
currently use CTEs deliberately as an optimization barrier to constrain
the planner. Given that, we need some kind of option to force the old
behavior, either SQL syntax or a GUC option.

I think we already agreed what the syntax would be: ye good olde OFFSET 0
in the subquery.

We could have a GUC option too if people are sufficiently worried about
it, but I think that the need for one hasn't really been proven.

regards, tom lane

#11 Josh Berkus
josh@agliodbs.com
In reply to: Kouhei Kaigai (#1)
Re: Our trial to TPC-DS but optimizer made unreasonable plan

On 08/19/2015 01:32 PM, Tom Lane wrote:

Josh Berkus <josh@agliodbs.com> writes:

On 08/18/2015 04:40 PM, Qingqing Zhou wrote:

Attached please find the WIP patch and the ANALYZE results.
Note: the patch may not apply directly to head, as a network issue
here prevents my Linux box from reaching the git server.

So, one of the things we previously mentioned is that many users
currently use CTEs deliberately as an optimization barrier to constrain
the planner. Given that, we need some kind of option to force the old
behavior, either SQL syntax or a GUC option.

I think we already agreed what the syntax would be: ye good olde OFFSET 0
in the subquery.

We could have a GUC option too if people are sufficiently worried about
it, but I think that the need for one hasn't really been proven.

Asking users to refactor their applications to add OFFSET 0 is a bit
painful when we could instead take care of it via a backwards-compatibility
GUC. We have many users who specifically use the CTE optimization barrier
to work around planner failures.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#12 Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Josh Berkus (#11)
Re: Our trial to TPC-DS but optimizer made unreasonable plan

On Mon, Aug 17, 2015 at 9:40 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

I think the SortSupport logic provides a reasonable way to solve this
kind of problem. For example, btint4sortsupport() supplies a function
pointer to a fast version of the comparator (btint4fastcmp), which takes
two Datum arguments without any indirect memory references.
The same mechanism would also make sense for the HashAggregate logic, to
reduce the cost of function invocations.

Please comment on the idea I noticed here.

It's possible that this can work, but it might be a good idea to run
'perf' on this query and find out where the CPU time is actually
going. That might give you a clearer picture of why the HashAggregate
is slow.

I ran one of the CTE portions with perf enabled.

The HashAggregate still takes 490sec, versus 70sec for the underlying join.

tpcds100=# explain analyze select c_customer_id customer_id
,c_first_name customer_first_name
,c_last_name customer_last_name
,c_preferred_cust_flag customer_preferred_cust_flag
,c_birth_country customer_birth_country
,c_login customer_login
,c_email_address customer_email_address
,d_year dyear
,sum(((ss_ext_list_price-ss_ext_wholesale_cost-ss_ext_discount_amt)+ss_ext_sales_price)/2) year_total
,'s' sale_type
from customer
,store_sales
,date_dim
where c_customer_sk = ss_customer_sk
and ss_sold_date_sk = d_date_sk
group by c_customer_id
,c_first_name
,c_last_name
,c_preferred_cust_flag
,c_birth_country
,c_login
,c_email_address
,d_year
;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=18194948.40..21477516.00 rows=262605408 width=178)
(actual time=483480.161..490763.640 rows=9142442 loops=1)
Group Key: customer.c_customer_id, customer.c_first_name, customer.c_last_name, customer.c_preferred_cust_flag,
customer.c_birth_country, customer.c_login, customer.c_email_address, date_dim.d_year
-> Custom Scan (GpuJoin) (cost=101342.54..9660272.64 rows=262605408 width=178)
(actual time=2430.787..73116.553 rows=268562375 loops=1)
Bulkload: On (density: 100.00%)
Depth 1: Logic: GpuHashJoin, HashKeys: (ss_sold_date_sk), JoinQual: (ss_sold_date_sk = d_date_sk),
nrows (287997024 -> 275041999, 95.50% expected 95.47%)
Depth 2: Logic: GpuHashJoin, HashKeys: (ss_customer_sk), JoinQual: (ss_customer_sk = c_customer_sk),
nrows (275041999 -> 268562375, 93.25% expected 91.18%)
-> Custom Scan (BulkScan) on store_sales (cost=0.00..9649559.60 rows=287996960 width=38)
(actual time=17.141..52757.354 rows=287997024 loops=1)
-> Seq Scan on date_dim (cost=0.00..2705.49 rows=73049 width=16)
(actual time=0.030..20.597 rows=73049 loops=1)
-> Seq Scan on customer (cost=0.00..87141.74 rows=2000074 width=156)
(actual time=0.010..585.861 rows=2000000 loops=1)
Planning time: 1.558 ms
Execution time: 492113.558 ms
(11 rows)

The perf output is below. Contrary to my expectation, the largest portion
is consumed by bpchareq (6.76%) plus bcTruelen (8.23%). One other big
cluster is, probably, TupleHashTableHash (1.11%) -> slot_getattr (4.29%)
-> slot_deform_tuple (3.58%).

# ========
# captured on: Thu Aug 20 09:52:24 2015
# hostname : ayu.kaigai.gr.jp
# os release : 2.6.32-504.23.4.el6.x86_64
# perf version : 2.6.32-504.23.4.el6.x86_64.debug
# arch : x86_64
# nrcpus online : 48
# nrcpus avail : 48
# cpudesc : Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
# cpuid : GenuineIntel,6,63,2
# total memory : 396795400 kB
# cmdline : /usr/bin/perf record -a -e cycles
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0, attr_mmap2 = 0, attr_mmap = 1, attr_mmap_data = 0
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, tracepoint = 2, software = 1
# ========
#
# Samples: 2M of event 'cycles'
# Event count (approx.): 1558291468259
#
# Overhead Command Shared Object Symbol
# ........ ............... .......................... .....................................................
#
8.23% postgres postgres [.] bcTruelen
6.76% postgres postgres [.] bpchareq
4.92% postgres postgres [.] pg_detoast_datum
4.29% postgres postgres [.] slot_getattr
4.07% postgres postgres [.] AllocSetAlloc
3.58% postgres postgres [.] slot_deform_tuple
3.39% postgres postgres [.] div_var
3.35% postgres postgres [.] hash_search_with_hash_value
3.11% postgres postgres [.] hash_any
2.62% postgres postgres [.] make_result
2.50% postgres postgres [.] add_abs
2.24% postgres postgres [.] ExecAgg
2.23% postgres postgres [.] init_var_from_num
2.09% postgres postgres [.] pg_detoast_datum_packed
2.07% postgres postgres [.] ExecMakeFunctionResultNoSets
1.95% postgres [vsyscall] [.] 0x000000000000014c
1.88% postgres libc-2.12.so [.] memcpy
1.83% postgres postgres [.] execTuplesMatch
1.71% postgres postgres [.] sub_abs
1.70% postgres [kernel.kallsyms] [k] copy_user_generic_string
1.48% postgres pg_strom.so [.] pgstrom_data_store_insert_block
1.41% postgres postgres [.] palloc
1.38% postgres postgres [.] texteq
1.29% postgres [vdso] [.] 0x0000000000000890
1.11% postgres postgres [.] TupleHashTableHash
:
(only larger than 1.0%)

Indeed, 6 of the 8 grouping keys in this query use the bpchar data type, so
it is natural that the comparison functions consume a large portion of the
CPU cycles. Do we have any ideas on how the backend could assist these
queries?

tpcds100=# \d customer
Table "public.customer"
Column | Type | Modifiers
------------------------+-----------------------+-----------
c_customer_sk | bigint | not null
c_customer_id | character(16) | not null
c_current_cdemo_sk | bigint |
c_current_hdemo_sk | bigint |
c_current_addr_sk | bigint |
c_first_shipto_date_sk | bigint |
c_first_sales_date_sk | bigint |
c_salutation | character(10) |
c_first_name | character(20) |
c_last_name | character(30) |
c_preferred_cust_flag | character(1) |
c_birth_day | bigint |
c_birth_month | bigint |
c_birth_year | bigint |
c_birth_country | character varying(20) |
c_login | character(13) |
c_email_address | character(50) |
c_last_review_date_sk | bigint |

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

#13 Peter Geoghegan
pg@heroku.com
In reply to: Kouhei Kaigai (#12)
Re: Our trial to TPC-DS but optimizer made unreasonable plan

On Wed, Aug 19, 2015 at 6:08 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

Indeed, 6 of the 8 grouping keys in this query use the bpchar data type, so
it is natural that the comparison functions consume a large portion of the
CPU cycles. Do we have any ideas on how the backend could assist these
queries?

With abbreviated keys, char(n) is very significantly slower than
varchar(n) (or text). I've been meaning to revisit my doc patch to warn
users of this (we already specifically advise against using char(n),
more or less). Abbreviation and a few other tricks could easily make
an enormous difference.
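
To illustrate the abbreviation idea with a hand-written sketch (this is not
the backend's actual implementation): pack the leading bytes of each key
into a single 64-bit integer once, so that most comparisons become one CPU
compare, with ties falling back to the authoritative comparator.

/*
 * Pack up to the first 8 bytes of a string into a uint64 whose unsigned
 * comparison matches memcmp() order on those prefixes.  Equal abbreviated
 * keys prove nothing (the strings may differ further on), so the caller
 * must fall back to the full comparator on ties.  For char(n), the
 * trailing blanks could be stripped once here instead of per comparison.
 */
static uint64
abbreviate_prefix(const char *str, int len)
{
    uint64      abbrev = 0;
    int         n = (len < 8) ? len : 8;
    int         i;

    for (i = 0; i < n; i++)
        abbrev = (abbrev << 8) | (unsigned char) str[i];

    /* left-justify short strings so "ab" still sorts before "abc" */
    abbrev <<= 8 * (8 - n);
    return abbrev;
}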

There is no very good reason why the same optimizations that I applied
to text could not also be applied to bpchar(), I think. Abbreviation
could remove much of the special char(n) overhead as well; someone simply
needs to do the work. I'm actually more concerned about the problems that
this causes for third-party benchmarks than I am about the problems for
real users.

--
Peter Geoghegan

#14 Qingqing Zhou
zhouqq.postgres@gmail.com
In reply to: Qingqing Zhou (#8)
1 attachment(s)
Re: Our trial to TPC-DS but optimizer made unreasonable plan

On Wed, Aug 19, 2015 at 10:32 AM, Qingqing Zhou
<zhouqq.postgres@gmail.com> wrote:

On Tue, Aug 18, 2015 at 5:59 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

BTW, did you register the patch in the upcoming commit-fest?

Not yet, it is in WIP status.

While working on the patch, I ran into some issues and would like to ask
for help here. Patch attached.

Here is an example:

postgres=# explain WITH q AS (
WITH p AS (SELECT * from a) SELECT p.* FROM p JOIN p p1 on p.i>=p1.i)
SELECT * FROM q WHERE i <= 5;
QUERY PLAN
----------------------------------------------------------------------------------
Nested Loop (cost=0.58..5980.16 rows=133333 width=8)
-> Index Scan using ai on a (cost=0.29..8.36 rows=4 width=8)
Index Cond: (i <= 5)
-> Index Only Scan using ai on a a_1 (cost=0.29..1159.62
rows=33333 width=4)
Index Cond: (i <= a.i)
(5 rows)

So far so good. But if we add further references to the CTE q (m1 -> m,
m -> q), we still get some extra CTE scans:

postgres=# explain WITH q AS (
WITH p AS (SELECT * from a) SELECT p.* FROM p JOIN p p1 on
p.i>=p1.i), m as (select * from q), m1 as (select * from m)
SELECT * FROM m1 WHERE i <= 5;
QUERY PLAN
-----------------------------------------------------------------------------------------
CTE Scan on m (cost=158365985.66..233365985.65 rows=1111111111 width=8)
Filter: (i <= 5)
CTE q
-> Nested Loop (cost=0.29..91699319.00 rows=3333333333 width=8)
-> Seq Scan on a (cost=0.00..1443.00 rows=100000 width=8)
-> Index Only Scan using ai on a a_1 (cost=0.29..583.65
rows=33333 width=4)
Index Cond: (i <= a.i)
CTE m
-> CTE Scan on q (cost=0.00..66666666.66 rows=3333333333 width=8)
(9 rows)

The above two queries are essentially the same, but the second one gets a
non-optimal plan. The reason lies in how my patch works: it puts the
substitution in front of SS_process_ctes():

/*
* If there is a WITH list, process each WITH query and build an initplan
! * SubPlan structure for it. Before we process CTEs, try to substitute them
! * with subqueries to benefit from global optimization.
*/
if (parse->cteList)
+ {
+ substitute_ctes_with_subqueries(root);
SS_process_ctes(root);
+ }

AFAICS, the substitution only handles the cteList within a query block, so
it does not cross subquery boundaries. I can see this is an issue but
can't see a nice way to fix it. Does anybody have a recipe?

Regards,
Qingqing

Attachments:

ctes.patch (application/octet-stream)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 7069f60..7b49816
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool		enable_nestloop = true;
*** 118,123 ****
--- 118,124 ----
  bool		enable_material = true;
  bool		enable_mergejoin = true;
  bool		enable_hashjoin = true;
+ bool		enable_cte_subquery = true;
  
  typedef struct
  {
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100755
index d598c1b..32fe18b
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** subquery_planner(PlannerGlobal *glob, Qu
*** 365,374 ****
  
  	/*
  	 * If there is a WITH list, process each WITH query and build an initplan
! 	 * SubPlan structure for it.
  	 */
  	if (parse->cteList)
  		SS_process_ctes(root);
  
  	/*
  	 * Look for ANY and EXISTS SubLinks in WHERE and JOIN/ON clauses, and try
--- 365,378 ----
  
  	/*
  	 * If there is a WITH list, process each WITH query and build an initplan
! 	 * SubPlan structure for it. Before we process CTEs, try to substitute them
! 	 * with subqueries to benefit from global optimization.
  	 */
  	if (parse->cteList)
+ 	{
+ 		substitute_ctes_with_subqueries(root);
  		SS_process_ctes(root);
+ 	}
  
  	/*
  	 * Look for ANY and EXISTS SubLinks in WHERE and JOIN/ON clauses, and try
diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c
new file mode 100755
index 401ba5b..e8c8ee2
*** a/src/backend/optimizer/prep/prepjointree.c
--- b/src/backend/optimizer/prep/prepjointree.c
***************
*** 31,36 ****
--- 31,37 ----
  #include "optimizer/prep.h"
  #include "optimizer/subselect.h"
  #include "optimizer/tlist.h"
+ #include "optimizer/cost.h"
  #include "parser/parse_relation.h"
  #include "parser/parsetree.h"
  #include "rewrite/rewriteManip.h"
*************** static void fix_append_rel_relids(List *
*** 116,122 ****
  					  Relids subrelids);
  static Node *find_jointree_node_for_rel(Node *jtnode, int relid);
  
- 
  /*
   * pull_up_sublinks
   *		Attempt to pull up ANY and EXISTS SubLinks to be treated as
--- 117,122 ----
*************** pull_up_sublinks_qual_recurse(PlannerInf
*** 551,556 ****
--- 551,595 ----
  }
  
  /*
+  * substitute_ctes_with_subqueries
+  *		Attempt to substitute CTEs with subqueries in the FROM clause.
+  *
+  *  The spec does not specify the optimization boundary of a CTE, so we are
+  *  free to expand it as a subquery to enable it to participate in the global 
+  *  optimization.
+  * 
+  *  There are some cases in which we can't do that; see the processing
+  *  function for details.
+  */
+ void
+ substitute_ctes_with_subqueries(PlannerInfo *root)
+ {
+ 	ListCell   *rt;
+ 
+ 	if (!enable_cte_subquery)
+ 		return;
+ 
+ 	foreach(rt, root->parse->rtable)
+ 	{
+ 		RangeTblEntry *rte = (RangeTblEntry *) lfirst(rt);
+ 
+ 		if (rte->rtekind == RTE_CTE)
+ 		{
+ 			Query	   *subquery;
+ 
+ 			/* Check safety of expansion, and expand if possible */
+ 			subquery = substitute_cte_with_subquery(root, rte);
+ 			if (subquery)
+ 			{
+ 				/* Successful expansion, replace the rtable entry */
+ 				rte->rtekind = RTE_SUBQUERY;
+ 				rte->subquery = subquery;
+ 			}
+ 		}
+ 	}
+ }
+ 
+ /*
   * inline_set_returning_functions
   *		Attempt to "inline" set-returning functions in the FROM clause.
   *
diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
new file mode 100755
index c72dbef..4c8b2af
*** a/src/backend/optimizer/util/clauses.c
--- b/src/backend/optimizer/util/clauses.c
*************** evaluate_expr(Expr *expr, Oid result_typ
*** 4677,4682 ****
--- 4677,4731 ----
  							  resultTypByVal);
  }
  
+ /*
+  * substitute_cte_with_subquery
+  *		Attempt to substitute a CTE with a subquery.
+  *
+  */
+ Query *
+ substitute_cte_with_subquery(PlannerInfo *root, RangeTblEntry *rte)
+ {
+ 	List		*cteList;
+ 	ListCell	*lc;
+ 	int levelsup;
+  
+ 	Assert(rte->rtekind == RTE_CTE);
+ 
+ 	/*
+ 	 * Lookup matching RTE by name in current level's CTE list
+ 	 */
+ 	cteList = root->parse->cteList;
+ 	foreach(lc, cteList)
+ 	{
+ 		CommonTableExpr *cte = (CommonTableExpr *) lfirst(lc);
+ 		Query *query = (Query *)cte->ctequery;
+ 
+ 		/* Not interested in unused CTE */
+ 		if (!cte->cterefcount)
+ 			continue;
+ 
+ 		/* Do not convert non-select, with-recursive CTE */
+ 		if (query->commandType != CMD_SELECT || cte->cterecursive)
+ 			continue;
+ 
+ 		if (!strcmp(cte->ctename, rte->ctename))
+ 		{
+ 			/*
+ 			 * Expanding a volatile function could cause multiple evaluations of the
+ 			 * function, so we can't do it. Also, the volatility check is expensive,
+ 			 * so arrange it after the name check.
+ 			 */
+ 			if (contain_volatile_functions(query))
+ 				continue;
+ 
+ 			/* decrease the refcount as current RTE does not need it */
+ 			cte->cterefcount --;
+ 			return query;
+ 		}
+ 	}
+ 
+ 	return NULL;
+ }
  
  /*
   * inline_set_returning_function
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100755
index b3dac51..7f4e2bd
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 873,878 ****
--- 873,887 ----
  		NULL, NULL, NULL
  	},
  	{
+ 		{"enable_cte_subquery", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("Enables the planner to substitute CTEs with subqueries."),
+ 			NULL
+ 		},
+ 		&enable_cte_subquery,
+ 		true,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
  			gettext_noop("Enables genetic query optimization."),
  			gettext_noop("This algorithm attempts to do planning without "
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
new file mode 100644
index 3d04ac2..27a7227
*** a/src/include/optimizer/clauses.h
--- b/src/include/optimizer/clauses.h
*************** extern Node *estimate_expression_value(P
*** 84,88 ****
--- 84,89 ----
  
  extern Query *inline_set_returning_function(PlannerInfo *root,
  							  RangeTblEntry *rte);
+ extern Query *substitute_cte_with_subquery(PlannerInfo *root, RangeTblEntry *rte);
  
  #endif   /* CLAUSES_H */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index dd43e45..58a146b
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_nestloop;
*** 61,66 ****
--- 61,67 ----
  extern bool enable_material;
  extern bool enable_mergejoin;
  extern bool enable_hashjoin;
+ extern bool enable_cte_subquery;
  extern int	constraint_exclusion;
  
  extern double clamp_row_est(double nrows);
diff --git a/src/include/optimizer/prep.h b/src/include/optimizer/prep.h
new file mode 100644
index 7b8c0a9..ed43dd0
*** a/src/include/optimizer/prep.h
--- b/src/include/optimizer/prep.h
***************
*** 23,28 ****
--- 23,29 ----
   */
  extern void pull_up_sublinks(PlannerInfo *root);
  extern void inline_set_returning_functions(PlannerInfo *root);
+ extern void substitute_ctes_with_subqueries(PlannerInfo *root);
  extern void pull_up_subqueries(PlannerInfo *root);
  extern void flatten_simple_union_all(PlannerInfo *root);
  extern void reduce_outer_joins(PlannerInfo *root);
diff --git a/src/test/regress/expected/rangefuncs.out b/src/test/regress/expected/rangefuncs.out
new file mode 100644
index 6dabe50..7c926da
*** a/src/test/regress/expected/rangefuncs.out
--- b/src/test/regress/expected/rangefuncs.out
*************** SELECT name, setting FROM pg_settings WH
*** 2,7 ****
--- 2,8 ----
           name         | setting 
  ----------------------+---------
   enable_bitmapscan    | on
+  enable_cte_subquery  | on
   enable_hashagg       | on
   enable_hashjoin      | on
   enable_indexonlyscan | on
*************** SELECT name, setting FROM pg_settings WH
*** 12,18 ****
   enable_seqscan       | on
   enable_sort          | on
   enable_tidscan       | on
! (11 rows)
  
  CREATE TABLE foo2(fooid int, f2 int);
  INSERT INTO foo2 VALUES(1, 11);
--- 13,19 ----
   enable_seqscan       | on
   enable_sort          | on
   enable_tidscan       | on
! (12 rows)
  
  CREATE TABLE foo2(fooid int, f2 int);
  INSERT INTO foo2 VALUES(1, 11);
#15Peter Geoghegan
pg@heroku.com
In reply to: Kouhei Kaigai (#4)
Re: Our trial to TPC-DS but optimizer made unreasonable plan

On Mon, Aug 17, 2015 at 6:40 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

I think the SortSupport logic provides a reasonable way to solve this
kind of problem. For example, btint4sortsupport() supplies a function
pointer to the fast version of the comparator (btint4fastcmp), which
takes two Datum arguments without indirect memory references.
This mechanism would also make sense for the HashAggregate logic, to
reduce the cost of function invocations.

Please comment on this idea.

Is this a 9.5-based system? If so, then you'd benefit from the
memcmp() pre-check within varstr_cmp() by being on 9.5, since the
pre-check is not limited to cases that use text/varchar SortSupport --
this could make a big difference here. If not, then it might be
somewhat helpful to add a pre-check that considers total binary
equality only before bcTruelen() is ever called. Not so sure about the
latter idea, though.

I'm not sure if it would help with hash aggregates to use something
like SortSupport to avoid fmgr overhead. It might make enough of a
difference to matter, but maybe the easier win would come from
considering simple binary equality first, and only then using an
equality operator (think HOT style checks). That would have the
advantage of requiring no per-type/operator class support at all,
since it's safe to assume that binary equality is a proxy for
"equivalence" of sort order (or whatever we call the case where
5.00::numeric and 5.000::numeric are considered equal).
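
A minimal illustration of that asymmetry (a sketch, assuming standard
numeric behavior): equal values need not be binary-identical, but
binary-identical values of the same type are always equal, which is
what makes such a pre-check safe:

  SELECT 5.00::numeric = 5.000::numeric AS equal_values,   -- t
         5.00::numeric::text  AS lhs,                      -- '5.00'
         5.000::numeric::text AS rhs;                      -- '5.000', a different datum

So binary equality can confirm a match cheaply, but a binary mismatch
still has to fall through to the type's equality operator.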

--
Peter Geoghegan


#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Qingqing Zhou (#14)
Re: Our trial to TPC-DS but optimizer made unreasonable plan

Qingqing Zhou <zhouqq.postgres@gmail.com> writes:

The above two queries are essentially the same, but the second one gets
a non-optimal plan. The reason is how my patch works: it puts a
substitution in front of SS_process_ctes():

/*
* If there is a WITH list, process each WITH query and build an initplan
! * SubPlan structure for it. Before we process CTEs, try to substitute
! * them with subqueries to benefit from global optimization.
*/
if (parse->cteList)
+ {
+ substitute_ctes_with_subqueries(root);
SS_process_ctes(root);
+ }

AFAICS, the substitution only handles cteList within a query block, so
it does not go across the subquery boundary. I can see this is an
issue but can't see a nice way to fix it. Does anybody have a recipe?

It seems like you're doing this in fundamentally the wrong place.

What I had in mind in <38448.1430519406@sss.pgh.pa.us> was to convert CTEs
into plain subqueries during the prepjointree phase, either just before
or as part of the pull_up_subqueries pass (since you'd want the converted
subquery to be flattened if possible). If you do it later than that,
then you'll have to reinvent a whole bunch of wheels to provide behavior
similar to regular subquery optimization.

regards, tom lane


#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#16)
Re: Our trial to TPC-DS but optimizer made unreasonable plan

I wrote:

What I had in mind in <38448.1430519406@sss.pgh.pa.us> was to convert CTEs
into plain subqueries during the prepjointree phase, either just before
or as part of the pull_up_subqueries pass (since you'd want the converted
subquery to be flattened if possible).

After looking at the code a bit, IMO the most reasonable thing to do is to
include this transformation in inline_set_returning_functions(), perhaps
renaming it to something like inline_srfs_and_ctes(). You could invent
a separate function instead, but that would require an extra pass over
the rangetable, for no particular benefit that I can see; the separate
function would have to be called from *exactly* the same set of places
as inline_set_returning_functions(), anyway, or it would not work right
for multiple levels of query optimization.

regards, tom lane


#18Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Peter Geoghegan (#15)
Re: Our trial to TPC-DS but optimizer made unreasonable plan


On Mon, Aug 17, 2015 at 6:40 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

I think the SortSupport logic provides a reasonable way to solve this
kind of problem. For example, btint4sortsupport() supplies a function
pointer to the fast version of the comparator (btint4fastcmp), which
takes two Datum arguments without indirect memory references.
This mechanism would also make sense for the HashAggregate logic, to
reduce the cost of function invocations.

Please comment on this idea.

Is this a 9.5-based system? If so, then you'd benefit from the
memcmp() pre-check within varstr_cmp() by being on 9.5, since the
pre-check is not limited to cases that use text/varchar SortSupport --
this could make a big difference here. If not, then it might be
somewhat helpful to add a pre-check that considers total binary
equality only before bcTruelen() is ever called. Not so sure about the
latter idea, though.

My measurements were done on a v9.5-based system. So it also seems to
me that replacing CHAR(n) with VARCHAR(n) would make sense.

I'm not sure if it would help with hash aggregates to use something
like SortSupport to avoid fmgr overhead. It might make enough of a
difference to matter, but maybe the easier win would come from
considering simple binary equality first, and only then using an
equality operator (think HOT style checks). That would have the
advantage of requiring no per-type/operator class support at all,
since it's safe to assume that binary equality is a proxy for
"equivalence" of sort order (or whatever we call the case where
5.00::numeric and 5.000::numeric are considered equal).

My presumption was wrong, at least for the major portion, according to
the perf result. So I don't think eliminating fmgr overhead is the
first priority. However, a shortcut path for equality checks seems to
me a great leap, because it avoids the strict equality checks
implemented per data type, which often involve complicated logic.
It is probably smarter to apply this binary-equality proxy only to
problematic data types, like bpchar(n); it would be less effective on
simple data types, like int4.
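
For instance (a sketch of the bpchar case): two bpchar values can
compare equal even though their stored representations differ, because
bpchareq() strips trailing blanks via bcTruelen() before comparing:

  SELECT 'abc'::char(3) = 'abc'::char(5) AS equal;   -- t, though the
                                                     -- stored lengths differ

So the binary proxy can only take the fast path on a hit; a mismatch
must still fall back to the full per-type check.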

On the other hand, another big portion of HashAggregate is the
calculation of the hash value over all the grouping keys.
It may be beneficial to have an option to reference a result
attribute of the underlying plan; that would potentially allow a
co-processor to compute the hash value instead of the CPU.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#19Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Tom Lane (#16)
Re: Our trial to TPC-DS but optimizer made unreasonable plan


Qingqing Zhou <zhouqq.postgres@gmail.com> writes:

The above two queries are essentially the same, but the second one gets
a non-optimal plan. The reason is how my patch works: it puts a
substitution in front of SS_process_ctes():

/*
* If there is a WITH list, process each WITH query and build an initplan
! * SubPlan structure for it. Before we process CTEs, try to substitute
! * them with subqueries to benefit from global optimization.
*/
if (parse->cteList)
+ {
+ substitute_ctes_with_subqueries(root);
SS_process_ctes(root);
+ }

AFAICS, the substitution only handles cteList within a query block, so
it does not go across the subquery boundary. I can see this is an
issue but can't see a nice way to fix it. Does anybody have a recipe?

It seems like you're doing this in fundamentally the wrong place.

What I had in mind in <38448.1430519406@sss.pgh.pa.us> was to convert CTEs
into plain subqueries during the prepjointree phase, either just before
or as part of the pull_up_subqueries pass (since you'd want the converted
subquery to be flattened if possible). If you do it later than that,
then you'll have to reinvent a whole bunch of wheels to provide behavior
similar to regular subquery optimization.

Hmm... My suggestion might not be reasonable. Sorry.

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#20Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Kouhei Kaigai (#18)
Re: Our trial to TPC-DS but optimizer made unreasonable plan

On 27/08/15 13:36, Kouhei Kaigai wrote:
[...]

My measurements were done on a v9.5-based system. So it also seems to
me that replacing CHAR(n) with VARCHAR(n) would make sense.

Is there any reason to not simply use text instead of CHAR(n) or VARCHAR(n)?

[...]

-Gavin


#21Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Gavin Flower (#20)
Re: Our trial to TPC-DS but optimizer made unreasonable plan

On 27/08/15 13:36, Kouhei Kaigai wrote:
[...]

My measurements were done on a v9.5-based system. So it also seems to
me that replacing CHAR(n) with VARCHAR(n) would make sense.

Is there any reason to not simply use text instead of CHAR(n) or VARCHAR(n)?

Text is also welcome, of course.
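
For example, the conversion could presumably be done in place (a sketch
against the TPC-DS customer table; note that casting bpchar to text
drops the trailing blank padding):

  ALTER TABLE customer
      ALTER COLUMN c_login TYPE text,
      ALTER COLUMN c_email_address TYPE text;
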
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#22Qingqing Zhou
zhouqq.postgres@gmail.com
In reply to: Tom Lane (#17)
1 attachment(s)
Re: Our trial to TPC-DS but optimizer made unreasonable plan

On Wed, Aug 26, 2015 at 5:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

After looking at the code a bit, IMO the most reasonable thing to do is to
include this transformation in inline_set_returning_functions(), perhaps
renaming it to something like inline_srfs_and_ctes().

This is essentially the same as my current implementation (revised
patch attached):
1. There are two call sites of inline_set_returning_functions(), and
one of them is guarded with Assert(subquery->cteList == NIL). This
means only the transformation in subquery_planner() is effective.
2. A problem with the revised patch is that we can't get rid of unused
CTEs showing up in EXPLAIN.

IMHO, the problem here is not "multiple levels" but "multiple
references". "Levels" are handled well by recursion, but references are
not: set-returning functions do not seem to have this issue because you
don't define a function alongside the query.

Regards,
Qingqing

---

Two testing queries results with revised patch:
1. Extra CTEs q and p print in EXPLAIN:
postgres=# explain WITH q AS (
postgres(# WITH p AS (SELECT * from a) SELECT p.* FROM p JOIN p p1
on p.i>=p1.i)
postgres-# SELECT * FROM q WHERE i <= 5;
QUERY PLAN
-----------------------------------------------------------------------------------------
Nested Loop (cost=1443.59..7423.16 rows=133333 width=8)
CTE q
-> Nested Loop (cost=1443.29..91700762.00 rows=3333333333 width=8)
CTE p
-> Seq Scan on a a_2 (cost=0.00..1443.00 rows=100000 width=8)
-> Seq Scan on a a_3 (cost=0.00..1443.00 rows=100000 width=8)
-> Index Only Scan using ai on a a_4 (cost=0.29..583.65 rows=33333 width=4)
Index Cond: (i <= a_3.i)
CTE p
-> Seq Scan on a a_5 (cost=0.00..1443.00 rows=100000 width=8)
-> Index Scan using ai on a (cost=0.29..8.36 rows=4 width=8)
Index Cond: (i <= 5)
-> Index Only Scan using ai on a a_1 (cost=0.29..1159.62 rows=33333 width=4)
Index Cond: (i <= a.i)
(14 rows)

2. Extra m1 shows up, and the same problem is still there:
postgres=# explain WITH q AS (
postgres(# WITH p AS (SELECT * from a) SELECT p.* FROM p JOIN p p1 on
postgres(# p.i>=p1.i), m as (select * from q), m1 as (select * from m)
postgres-# SELECT * FROM m1 WHERE i <= 5;
QUERY PLAN
-----------------------------------------------------------------------------------------
CTE Scan on m (cost=225034095.32..300034095.31 rows=1111111111 width=8)
Filter: (i <= 5)
CTE q
-> Nested Loop (cost=1443.29..91700762.00 rows=3333333333 width=8)
CTE p
-> Seq Scan on a (cost=0.00..1443.00 rows=100000 width=8)
-> Seq Scan on a a_1 (cost=0.00..1443.00 rows=100000 width=8)
-> Index Only Scan using ai on a a_2 (cost=0.29..583.65 rows=33333 width=4)
Index Cond: (i <= a_1.i)
CTE m
-> CTE Scan on q (cost=0.00..66666666.66 rows=3333333333 width=8)
CTE m1
-> CTE Scan on m m_1 (cost=0.00..66666666.66 rows=3333333333 width=8)
(13 rows)

Attachments:

ctes2.diff (text/plain; charset=US-ASCII)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 7069f60..7b49816
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool		enable_nestloop = true;
*** 118,123 ****
--- 118,124 ----
  bool		enable_material = true;
  bool		enable_mergejoin = true;
  bool		enable_hashjoin = true;
+ bool		enable_cte_subquery = true;
  
  typedef struct
  {
diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c
new file mode 100644
index 401ba5b..58698ab
*** a/src/backend/optimizer/prep/prepjointree.c
--- b/src/backend/optimizer/prep/prepjointree.c
***************
*** 31,36 ****
--- 31,37 ----
  #include "optimizer/prep.h"
  #include "optimizer/subselect.h"
  #include "optimizer/tlist.h"
+ #include "optimizer/cost.h"
  #include "parser/parse_relation.h"
  #include "parser/parsetree.h"
  #include "rewrite/rewriteManip.h"
*************** static void fix_append_rel_relids(List *
*** 116,122 ****
  					  Relids subrelids);
  static Node *find_jointree_node_for_rel(Node *jtnode, int relid);
  
- 
  /*
   * pull_up_sublinks
   *		Attempt to pull up ANY and EXISTS SubLinks to be treated as
--- 117,122 ----
*************** inline_set_returning_functions(PlannerIn
*** 577,583 ****
  	{
  		RangeTblEntry *rte = (RangeTblEntry *) lfirst(rt);
  
! 		if (rte->rtekind == RTE_FUNCTION)
  		{
  			Query	   *funcquery;
  
--- 577,599 ----
  	{
  		RangeTblEntry *rte = (RangeTblEntry *) lfirst(rt);
  
! 		if (rte->rtekind == RTE_CTE)
! 		{
! 			Query	   *subquery;
! 
! 			if (!enable_cte_subquery)
! 				continue;
! 			
! 			/* Check safety of expansion, and expand if possible */
! 			subquery = substitute_cte_with_subquery(root, rte);
! 			if (subquery)
! 			{
! 				/* Successful expansion, replace the rtable entry */
! 				rte->rtekind = RTE_SUBQUERY;
! 				rte->subquery = subquery;
! 			}
! 		}
! 		else if (rte->rtekind == RTE_FUNCTION)
  		{
  			Query	   *funcquery;
  
diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
new file mode 100644
index c72dbef..274c0bb
*** a/src/backend/optimizer/util/clauses.c
--- b/src/backend/optimizer/util/clauses.c
*************** evaluate_expr(Expr *expr, Oid result_typ
*** 4677,4682 ****
--- 4677,4728 ----
  							  resultTypByVal);
  }
  
+ /*
+  * substitute_cte_with_subquery
+  *		Attempt to substitute a CTE with a subquery.
+  *
+  */
+ Query *
+ substitute_cte_with_subquery(PlannerInfo *root, RangeTblEntry *rte)
+ {
+ 	List		*cteList;
+ 	ListCell	*lc;
+  
+ 	Assert(rte->rtekind == RTE_CTE);
+ 
+ 	/*
+ 	 * Lookup matching RTE by name in current level's CTE list.
+ 	 */
+ 	cteList = root->parse->cteList;
+ 	foreach(lc, cteList)
+ 	{
+ 		CommonTableExpr *cte = (CommonTableExpr *) lfirst(lc);
+ 		Query *query = (Query *)cte->ctequery;
+ 
+ 		/* Not interested in unused CTE */
+ 		if (!cte->cterefcount)
+ 			continue;
+ 
+ 		/* Do not convert non-SELECT or recursive CTEs */
+ 		if (query->commandType != CMD_SELECT || cte->cterecursive)
+ 			continue;
+ 
+ 		if (!strcmp(cte->ctename, rte->ctename))
+ 		{
+ 			/*
+ 			 * Expanding a CTE that contains volatile functions could cause them
+ 			 * to be evaluated multiple times, so we can't do it. Also, the
+ 			 * volatility check is expensive, so perform it after the name check.
+ 			 */
+ 			if (contain_volatile_functions((Node *)query))
+ 				continue;
+ 
+ 			return query;
+ 		}
+ 	}
+ 
+ 	return NULL;
+ }
  
  /*
   * inline_set_returning_function
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index b3dac51..7f4e2bd
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 873,878 ****
--- 873,887 ----
  		NULL, NULL, NULL
  	},
  	{
+ 		{"enable_cte_subquery", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("Enables the planner to substitute CTEs with subqueries."),
+ 			NULL
+ 		},
+ 		&enable_cte_subquery,
+ 		true,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
  			gettext_noop("Enables genetic query optimization."),
  			gettext_noop("This algorithm attempts to do planning without "
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
new file mode 100644
index 3d04ac2..27a7227
*** a/src/include/optimizer/clauses.h
--- b/src/include/optimizer/clauses.h
*************** extern Node *estimate_expression_value(P
*** 84,88 ****
--- 84,89 ----
  
  extern Query *inline_set_returning_function(PlannerInfo *root,
  							  RangeTblEntry *rte);
+ extern Query *substitute_cte_with_subquery(PlannerInfo *root, RangeTblEntry *rte);
  
  #endif   /* CLAUSES_H */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index dd43e45..58a146b
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_nestloop;
*** 61,66 ****
--- 61,67 ----
  extern bool enable_material;
  extern bool enable_mergejoin;
  extern bool enable_hashjoin;
+ extern bool enable_cte_subquery;
  extern int	constraint_exclusion;
  
  extern double clamp_row_est(double nrows);
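
With a patch along these lines applied, the pre-patch fence behavior
could presumably still be restored per session through the new GUC:

  SET enable_cte_subquery = off;   -- keep CTEs as optimization barriers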
#23Qingqing Zhou
zhouqq.postgres@gmail.com
In reply to: Qingqing Zhou (#22)
Re: Our trial to TPC-DS but optimizer made unreasonable plan

On Thu, Aug 27, 2015 at 1:01 PM, Qingqing Zhou
<zhouqq.postgres@gmail.com> wrote:

On Wed, Aug 26, 2015 at 5:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

After looking at the code a bit, IMO the most reasonable thing to do is to
include this transformation in inline_set_returning_functions(), perhaps
renaming it to something like inline_srfs_and_ctes().

This is essentially the same as my current implementation (revised
patch attached):

I tried the method Tom suggested (attached in the previous email) but
still got the same issue - does anybody see what I did wrong here? :-(

Thanks,
Qingqing


#24Andres Freund
andres@anarazel.de
In reply to: Josh Berkus (#11)
Re: Our trial to TPC-DS but optimizer made unreasonable plan

On 2015-08-19 15:14:03 -0700, Josh Berkus wrote:

Asking users to refactor their applications to add OFFSET 0 is a bit
painful; it would be better if we could take care of it via a
backwards-compatibility GUC.
We have many users who are specifically using the CTE optimization
barrier to work around planner failures.

Agreed. I think we'll cause a lot of problems in migrations if we do
this unconditionally. I also think CTEs are a much cleaner optimization
barrier than OFFSET 0.
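
For readers unfamiliar with the idiom: OFFSET 0 is a no-op clause that
nonetheless keeps the planner from flattening the subquery or pushing
quals down into it, much as a CTE does today. A sketch against the
TPC-DS tables:

  SELECT *
    FROM (SELECT * FROM store_sales WHERE ss_quantity > 10 OFFSET 0) ss
   WHERE ss.ss_list_price < 100;   -- the outer qual stays above OFFSET 0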

Some are probably going to hate me for this, but I think it'd be better
to change the grammar to something like
name opt_name_list AS '(' PreparableStmt ')' OPTIONS '(' cte_option_list ')'

and allow specifying 'inline' 'off'/'on'. The GUC would simply change
the default value.
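
Under that proposal the syntax would presumably look something like this
(hypothetical, not implemented; the option name is only illustrative):

  WITH q AS (SELECT * FROM a) OPTIONS (inline on)
  SELECT * FROM q WHERE i <= 5;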

Greetings,

Andres Freund
