please help on query
I can't improve performance on this query:
SELECT
supplier.name,
supplier.address
FROM
supplier,
nation
WHERE
supplier.suppkey IN(
SELECT
partsupp.suppkey
FROM
partsupp
WHERE
partsupp.partkey IN(
SELECT
part.partkey
FROM
part
WHERE
part.name like 'forest%'
)
AND partsupp.availqty>(
SELECT
0.5*(sum(lineitem.quantity)::FLOAT)
FROM
lineitem
WHERE
lineitem.partkey=partsupp.partkey
AND lineitem.suppkey=partsupp.partkey
AND lineitem.shipdate>=('1994-01-01')::DATE
AND lineitem.shipdate<(('1994-01-01')::DATE+('1 year')::INTERVAL)::DATE
)
)
AND supplier.nationkey=nation.nationkey
AND nation.name='CANADA'
ORDER BY
supplier.name;
EXPLAIN results:
NOTICE: QUERY PLAN:
Sort (cost=2777810917708.17..2777810917708.17 rows=200 width=81)
-> Nested Loop (cost=0.00..2777810917700.53 rows=200 width=81)
-> Seq Scan on nation (cost=0.00..1.31 rows=1 width=4)
-> Index Scan using snation_index on supplier (cost=0.00..2777810917696.72 rows=200 width=77)
SubPlan
-> Materialize (cost=6944527291.72..6944527291.72 rows=133333 width=4)
-> Seq Scan on partsupp (cost=0.00..6944527291.72 rows=133333 width=4)
SubPlan
-> Materialize (cost=8561.00..8561.00 rows=1 width=4)
-> Seq Scan on part (cost=0.00..8561.00 rows=1 width=4)
-> Aggregate (cost=119.61..119.61 rows=1 width=4)
-> Index Scan using lineitem_index on lineitem (cost=0.00..119.61 rows=1 width=4)
partsupp: 800000 tuples
Table "partsupp"
Column | Type | Modifiers
------------+----------------+-----------
partkey | integer | not null
suppkey | integer | not null
availqty | integer |
supplycost | numeric(10,2) |
comment | character(199) |
Primary key: partsupp_pkey
Triggers: RI_ConstraintTrigger_16597,
RI_ConstraintTrigger_16603
tpch=# select attname,n_distinct,correlation from pg_stats where tablename='partsupp';
attname | n_distinct | correlation
------------+------------+-------------
partkey | -0.195588 | 1
suppkey | 9910 | 0.00868363
availqty | 9435 | -0.00788662
supplycost | -0.127722 | -0.0116864
comment | -1 | 0.0170702
I'm open to query changes, reordering, index ideas and horizontal partitioning.
Thanks in advance.
Regards
[moving to pgsql-sql]
On Thu, 11 Jul 2002 17:22:14 +0200, "Luis Alberto Amigo Navarro"
<lamigo@atc.unican.es> wrote:
I can't improve performance on this query:
SELECT
supplier.name,
supplier.address
FROM
supplier,
nation
WHERE
supplier.suppkey IN(
SELECT
partsupp.suppkey
FROM
partsupp
WHERE
partsupp.partkey IN(
SELECT
part.partkey
FROM
part
WHERE
part.name like 'forest%'
)
AND partsupp.availqty>(
SELECT
0.5*(sum(lineitem.quantity)::FLOAT)
FROM
lineitem
WHERE
lineitem.partkey=partsupp.partkey
AND lineitem.suppkey=partsupp.partkey
^^^^^^^
suppkey ???
AND lineitem.shipdate>=('1994-01-01')::DATE
AND lineitem.shipdate<(('1994-01-01')::DATE+('1 year')::INTERVAL)::DATE
)
)
AND supplier.nationkey=nation.nationkey
AND nation.name='CANADA'
ORDER BY
supplier.name;
Luis,
rules of thumb: "Avoid subselects; use joins!" and "If you have to use
subselects, avoid IN, use EXISTS!"
Let's try. If partkey is unique in part, then
| FROM partsupp
| WHERE partsupp.partkey IN (SELECT part.partkey
can be replaced by
FROM partsupp ps, part p
WHERE ps.partkey = p.partkey
or
partsupp ps INNER JOIN part p
ON (ps.partkey = p.partkey AND p.name LIKE '...')
When we ignore "part" for now, your subselect boils down to
| SELECT partsupp.suppkey
| FROM partsupp
| WHERE partsupp.availqty > (
| SELECT 0.5*(sum(lineitem.quantity)::FLOAT)
| FROM lineitem
| WHERE lineitem.partkey=partsupp.partkey
| AND lineitem.suppkey=partsupp.suppkey
| AND lineitem.shipdate BETWEEN ... AND ...
| )
which can be rewritten to (untested)
SELECT ps.suppkey
FROM partsupp ps, lineitem li
WHERE li.partkey=ps.partkey
AND li.suppkey=ps.suppkey
AND li.shipdate BETWEEN ... AND ...
GROUP BY ps.partkey, ps.suppkey
HAVING min(ps.availqty) > 0.5*(sum(li.quantity)::FLOAT)
^^^
As all ps.availqty are equal in one group, you can as well
use max() or avg().
Now we have left only one IN:
| WHERE supplier.suppkey IN (
| SELECT partsupp.suppkey FROM partsupp WHERE <condition> )
Being too lazy to find out whether this can be rewritten as a join, let's
apply rule 2 here:
WHERE EXISTS (
SELECT ... FROM partsupp ps
WHERE supplier.suppkey = ps.suppkey
AND <condition> )
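Putting the pieces together (my aliases, untested, use with a grain of salt), the whole query might look like:

```sql
-- Sketch only: joins replace the IN subselects, EXISTS replaces the outer IN.
SELECT s.name, s.address
FROM supplier s, nation n
WHERE EXISTS (
    SELECT 1
    FROM partsupp ps, part p, lineitem li
    WHERE ps.suppkey = s.suppkey          -- correlation to the outer supplier
      AND p.partkey  = ps.partkey
      AND p.name LIKE 'forest%'
      AND li.partkey = ps.partkey
      AND li.suppkey = ps.suppkey
      AND li.shipdate >= ('1994-01-01')::DATE
      AND li.shipdate <  ('1995-01-01')::DATE
    GROUP BY ps.partkey, ps.suppkey
    HAVING min(ps.availqty) > 0.5*(sum(li.quantity)::FLOAT)
  )
  AND s.nationkey = n.nationkey
  AND n.name = 'CANADA'
ORDER BY s.name;
```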
HTH, but use with a grain of salt ...
Sort (cost=2777810917708.17..2777810917708.17 rows=200 width=81)
^^^^^^^^^^^^^^^^
BTW, how many years are these? :-)
Servus
Manfred
I've tried
SELECT
supplier.name,
supplier.address
FROM
supplier,
nation,
lineitem
WHERE
EXISTS(
SELECT
partsupp.suppkey
FROM
partsupp,lineitem
WHERE
lineitem.partkey=partsupp.partkey
AND lineitem.suppkey=partsupp.partkey
AND lineitem.shipdate>=('1994-01-01')::DATE
AND lineitem.shipdate<(('1994-01-01')::DATE+('1 year')::INTERVAL)::DATE
AND EXISTS(
SELECT
part.partkey
FROM
part
WHERE
part.name like 'forest%'
)
GROUP BY partsupp.partkey,partsupp.suppkey
HAVING min(availqty)>(0.5*(sum(lineitem.quantity)::FLOAT))
)
AND supplier.nationkey=nation.nationkey
AND nation.name='CANADA'
ORDER BY
supplier.name;
as you said, and something is wrong:
Sort (cost=1141741215.35..1141741215.35 rows=2400490000 width=81)
  InitPlan
    -> Aggregate (cost=0.00..921773.85 rows=48 width=24)
      InitPlan
        -> Seq Scan on part (cost=0.00..8561.00 rows=1 width=4)
      -> Group (cost=0.00..921771.44 rows=481 width=24)
        -> Result (cost=0.00..921769.04 rows=481 width=24)
          -> Merge Join (cost=0.00..921769.04 rows=481 width=24)
            -> Index Scan using partsupp_pkey on partsupp (cost=0.00..98522.75 rows=800000 width=12)
            -> Index Scan using lsupp_index on lineitem (cost=0.00..821239.91 rows=145 width=12)
  -> Result (cost=1.31..112888690.31 rows=2400490000 width=81)
    -> Nested Loop (cost=1.31..112888690.31 rows=2400490000 width=81)
      -> Hash Join (cost=1.31..490.31 rows=400 width=81)
        -> Seq Scan on supplier (cost=0.00..434.00 rows=10000 width=77)
        -> Hash (cost=1.31..1.31 rows=1 width=4)
          -> Seq Scan on nation (cost=0.00..1.31 rows=1 width=4)
      -> Seq Scan on lineitem (cost=0.00..222208.25 rows=6001225 width=0)
Where might my mistake be?
Thanks and regards
----- Original Message -----
From: "Manfred Koizar" <mkoi-pg@aon.at>
To: "Luis Alberto Amigo Navarro" <lamigo@atc.unican.es>
Cc: <pgsql-sql@postgresql.org>
Sent: Thursday, July 11, 2002 6:47 PM
Subject: Re: [HACKERS] please help on query
On Thu, 2002-07-11 at 11:22, Luis Alberto Amigo Navarro wrote:
I can't improve performance on this query:
Blame Canada!
--
J. R. Nield
jrnield@usol.com
On Thursday 11 July 2002 12:06, J. R. Nield wrote:
On Thu, 2002-07-11 at 11:22, Luis Alberto Amigo Navarro wrote:
I can't improve performance on this query:
Blame Canada!
Whatever ...
How's that silver medal down there in the states?
;-)
On Thu, 11 Jul 2002 19:40:46 +0200, "Luis Alberto Amigo Navarro"
<lamigo@atc.unican.es> wrote:
I've tried
[reformatted to fit on one page]
| SELECT supplier.name, supplier.address
| FROM supplier, nation, lineitem
You already found out that you do not need lineitem here.
| WHERE EXISTS(
| SELECT partsupp.suppkey
| FROM partsupp,lineitem
| WHERE
| lineitem.partkey=partsupp.partkey
| AND lineitem.suppkey=partsupp.partkey
I still don't believe this suppkey=partkey
| AND lineitem.shipdate [...]
| AND EXISTS( SELECT part.partkey
| FROM part WHERE part.name like 'forest%')
This subselect gives either true or false, but in any case always the
same result. You might want to add a condition
AND part.partkey=partsupp.partkey
Are you sure partkey is not unique? If it is unique you can replace
this subselect by a join.
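In other words, the corrected inner subselect would look something like (untested):

```sql
-- Correlate the part lookup to the current partsupp row
AND EXISTS (
    SELECT 1
    FROM part
    WHERE part.name LIKE 'forest%'
      AND part.partkey = partsupp.partkey
)
```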
| GROUP BY partsupp.partkey,partsupp.suppkey
| HAVING min(availqty)>(0.5*(sum(lineitem.quantity)::FLOAT))
| )
| AND supplier.nationkey=nation.nationkey
| AND nation.name='CANADA'
| ORDER BY supplier.name;
as you said and something is wrong
Sort (cost=1141741215.35..1141741215.35 rows=2400490000 width=81)
The cost is now only 1141741215.35 compared to 2777810917708.17
before; this is an improvement factor of more than 2000. So what's
your problem? ;-)
Servus
Manfred
On Thu, 2002-07-11 at 17:22, Luis Alberto Amigo Navarro wrote:
I can't improve performance on this query:
You could try rewriting the IN's into = joins, or even use explicit INNER
JOIN syntax to force certain plans.
With one subselect nested inside another, and its result depending on the
value of partsupp.partkey, it is really hard for the optimiser to do
anything other than perform the inner query once for each row.
But it may help to rewrite
SELECT
partsupp.suppkey
FROM
partsupp
WHERE
partsupp.partkey IN (
SELECT
part.partkey
FROM
part
WHERE
part.name like 'forest%'
)
AND partsupp.availqty>(
SELECT
0.5*(sum(lineitem.quantity)::FLOAT)
FROM
lineitem
WHERE
lineitem.partkey=partsupp.partkey
AND lineitem.suppkey=partsupp.partkey
AND lineitem.shipdate>=('1994-01-01')::DATE
AND lineitem.shipdate<(('1994-01-01')::DATE+('1 year')::INTERVAL)::DATE
)
into
SELECT
partsupp.suppkey
FROM
partsupp,
  (SELECT part.partkey as partkey
   FROM part
   WHERE part.name like 'forest%'
  ) fp,
  (SELECT lineitem.partkey,
          lineitem.suppkey,
          0.5*(sum(lineitem.quantity)::FLOAT) as halfsum
   FROM lineitem
   WHERE lineitem.shipdate>=('1994-01-01')::DATE
     AND lineitem.shipdate<(('1994-01-01')::DATE+('1 year')::INTERVAL)::DATE
   GROUP BY lineitem.partkey, lineitem.suppkey
  ) li  -- grouped inside, since a derived table cannot reference partsupp
WHERE partsupp.partkey = fp.partkey
  AND partsupp.partkey = li.partkey
  AND partsupp.suppkey = li.suppkey
  AND partsupp.availqty > li.halfsum
if "lineitem" is significantly smaller than "partsupp"
But you really should tell us more, like how many lines does lineitem
and other tables have,
----------
Hannu
On Thu, 2002-07-11 at 17:22, Luis Alberto Amigo Navarro wrote:
I can't improve performance on this query:
You may also want to rewrite
lineitem.shipdate<(('1994-01-01')::DATE+('1 year')::INTERVAL)::DATE
into
lineitem.shipdate<('1995-01-01')::DATE
if you can, as the optimiser probably will not otherwise recognize it as a
constant and won't use an index on lineitem.shipdate.
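With constant bounds, a shipdate index becomes usable; a sketch (the index name is made up, and the index may already exist):

```sql
-- Hypothetical index; both bounds are now plain constants the planner can use.
CREATE INDEX lineitem_shipdate_idx ON lineitem (shipdate);

SELECT sum(lineitem.quantity)
FROM lineitem
WHERE lineitem.shipdate >= ('1994-01-01')::DATE
  AND lineitem.shipdate <  ('1995-01-01')::DATE;
```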
----------------
Hannu
On Thu, 2002-07-11 at 17:22, Luis Alberto Amigo Navarro wrote:
I can't improve performance on this query:
This _may_ work.
SELECT
    supplier.name,
    supplier.address
FROM
    supplier,
    nation
WHERE supplier.suppkey IN (
    SELECT partsupp.suppkey
    FROM part
    INNER JOIN partsupp ON part.partkey=partsupp.partkey
    INNER JOIN (
        SELECT lineitem.partkey,
               lineitem.suppkey,
               0.5*(sum(lineitem.quantity)::FLOAT) as halfsum
        FROM lineitem
        WHERE shipdate >= '1994-01-01'
          AND shipdate < '1995-01-01'
        GROUP BY lineitem.partkey, lineitem.suppkey
    ) li ON partsupp.partkey=li.partkey
        AND partsupp.suppkey=li.suppkey
        AND partsupp.availqty > li.halfsum
    WHERE part.name like 'forest%'
)
AND supplier.nationkey=nation.nationkey
AND nation.name='CANADA'
ORDER BY supplier.name;
---------------
Hannu
The cost is now only 1141741215.35 compared to 2777810917708.17
before; this is an improvement factor of more than 2000. So what's
your problem? ;-)
Servus
Manfred

In fact the planner is estimating incredibly badly: the query took only
833 msec. It now runs perfectly.
I'm going to keep on asking about another query:
SELECT
customer.name,
customer.custkey,
orders.orderkey,
orders.orderdate,
orders.totalprice,
sum(lineitem.quantity)
FROM
customer,
orders,
lineitem
WHERE
exists(
SELECT
lineitem.orderkey
FROM
lineitem
WHERE
lineitem.orderkey=orders.orderkey
GROUP BY
lineitem.orderkey HAVING
sum(lineitem.quantity)>300
)
AND customer.custkey=orders.custkey
AND orders.orderkey=lineitem.orderkey
GROUP BY
customer.name,
customer.custkey,
orders.orderkey,
orders.orderdate,
orders.totalprice
ORDER BY
orders.totalprice DESC,
orders.orderdate;
NOTICE: QUERY PLAN:
Sort (cost=26923941.97..26923941.97 rows=300061 width=66)
  -> Aggregate (cost=26851634.86..26896644.05 rows=300061 width=66)
    -> Group (cost=26851634.86..26889142.52 rows=3000612 width=66)
      -> Sort (cost=26851634.86..26851634.86 rows=3000612 width=66)
        -> Hash Join (cost=26107574.81..26457309.10 rows=3000612 width=66)
          -> Seq Scan on lineitem (cost=0.00..222208.25 rows=6001225 width=8)
          -> Hash (cost=26105699.81..26105699.81 rows=750000 width=58)
            -> Hash Join (cost=7431.00..26105699.81 rows=750000 width=58)
              -> Seq Scan on orders (cost=0.00..26083268.81 rows=750000 width=25)
                SubPlan
                  -> Aggregate (cost=0.00..17.35 rows=1 width=8)
                    -> Group (cost=0.00..17.34 rows=5 width=8)
                      -> Index Scan using lineitem_pkey on lineitem (cost=0.00..17.33 rows=5 width=8)
              -> Hash (cost=7056.00..7056.00 rows=150000 width=33)
                -> Seq Scan on customer (cost=0.00..7056.00 rows=150000 width=33)
again:
orders: 1500000 tuples
lineitem: 6000000 tuples (there are 1 to 7 lineitems per orderkey)
customer: 150000 tuples
select attname,n_distinct,correlation from pg_stats where tablename='lineitem';
attname | n_distinct | correlation
---------------+------------+-------------
orderkey | -0.199847 | 1
partkey | 196448 | 0.0223377
suppkey | 9658 | -0.00822751
linenumber | 7 | 0.17274
quantity | 50 | 0.0150153
extendedprice | 25651 | -0.00790245
discount | 11 | 0.103761
tax | 9 | 0.0993771
returnflag | 3 | 0.391434
linestatus | 2 | 0.509791
shipdate | 2440 | 0.0072777
commitdate | 2497 | 0.00698162
receiptdate | 2416 | 0.00726686
shipinstruct | 4 | 0.241511
shipmode | 7 | 0.138432
comment | 275488 | 0.0188006
(16 rows)
select attname,n_distinct,correlation from pg_stats where tablename='orders';
attname | n_distinct | correlation
---------------+------------+-------------
orderkey | -1 | -0.999925
custkey | 76309 | 0.00590596
orderstatus | 3 | 0.451991
totalprice | -1 | -0.00768806
orderdate | 2431 | -0.0211354
orderpriority | 5 | 0.182489
clerk | 1009 | 0.00546939
shippriority | 1 | 1
comment | -0.750125 | -0.0123887
Customer
attname | n_distinct | correlation
------------+------------+-------------
custkey | -1 | 1
name | -1 | 1
address | -1 | -0.00510274
nationkey | 25 | 0.0170533
phone | -1 | -0.0227816
acctbal | -0.83444 | -0.00220958
mktsegment | 5 | 0.205013
comment | -1 | 0.0327827
This query takes 12 minutes to run and returns about 50 customers.
lineitem.quantity takes values from 1 to 50, so 300 per orderkey is very
restrictive.
Can someone help improve performance?
Again, thanks in advance.
Regards
hi,
avoid the subselect: create a temp table and use a join...
CREATE TEMP TABLE tmp AS
SELECT
lineitem.orderkey
FROM
lineitem
GROUP BY
lineitem.orderkey
HAVING
sum(lineitem.quantity)>300;
CREATE INDEX tmp_idx ON tmp (orderkey);
SELECT
customer.name,
customer.custkey,
orders.orderkey,
orders.orderdate,
orders.totalprice,
sum(lineitem.quantity)
FROM
customer,
orders,
lineitem,
tmp
WHERE
orders.orderkey=tmp.orderkey
AND customer.custkey=orders.custkey
AND orders.orderkey=lineitem.orderkey
GROUP BY
customer.name,
customer.custkey,
orders.orderkey,
orders.orderdate,
orders.totalprice
ORDER BY
orders.totalprice DESC,
orders.orderdate;
maybe the index is not necessary...
kuba
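If the temp table is inconvenient (for instance because lineitem changes while the query runs), the same grouping can be inlined as a derived table; an untested sketch ("big" is my alias):

```sql
-- The derived table is uncorrelated, so the planner computes it once.
SELECT customer.name, customer.custkey,
       orders.orderkey, orders.orderdate, orders.totalprice,
       sum(lineitem.quantity)
FROM customer, orders, lineitem,
     (SELECT lineitem.orderkey
      FROM lineitem
      GROUP BY lineitem.orderkey
      HAVING sum(lineitem.quantity) > 300
     ) big
WHERE orders.orderkey = big.orderkey
  AND customer.custkey = orders.custkey
  AND orders.orderkey = lineitem.orderkey
GROUP BY customer.name, customer.custkey,
         orders.orderkey, orders.orderdate, orders.totalprice
ORDER BY orders.totalprice DESC, orders.orderdate;
```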
Lineitem is being modified at run time, so creating a temp table doesn't
solve my problem.
The time to create this table is the same as performing the subselect (or
so I think). It could be done by creating a new table and a new trigger,
but there are already triggers to calculate
lineitem.extendedprice=part.retailprice*lineitem.quantity*(1+taxes)*(1-discount),
to set orders.orderstatus according to linestatus, and to calculate
orders.totalprice as sum(extendedprice) where lineitem.orderkey=new.orderkey.
A new trigger to record the orderkey when sum(quantity) where
orderkey=new.orderkey exceeds 300 might be excessive.
Any other idea?
Thanks and regards
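For reference, the trigger being called excessive might look roughly like this (untested sketch in 7.2-era plpgsql; the table, function and trigger names are made up, and DELETEs on lineitem would need the same handling):

```sql
-- Maintain a side table of orderkeys whose total quantity exceeds 300,
-- so the main query can join against it instead of re-grouping lineitem.
CREATE TABLE big_orders (orderkey integer PRIMARY KEY);

CREATE FUNCTION big_orders_upd() RETURNS opaque AS '
BEGIN
    DELETE FROM big_orders WHERE orderkey = NEW.orderkey;
    INSERT INTO big_orders
        SELECT NEW.orderkey
        WHERE (SELECT sum(quantity) FROM lineitem
               WHERE orderkey = NEW.orderkey) > 300;
    RETURN NEW;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER big_orders_trig AFTER INSERT OR UPDATE ON lineitem
    FOR EACH ROW EXECUTE PROCEDURE big_orders_upd();
```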
----- Original Message -----
From: "Jakub Ouhrabka" <jakub.ouhrabka@comgate.cz>
To: "Luis Alberto Amigo Navarro" <lamigo@atc.unican.es>
Cc: "Manfred Koizar" <mkoi-pg@aon.at>; <pgsql-sql@postgresql.org>
Sent: Friday, July 12, 2002 1:50 PM
Subject: Re: [SQL] [HACKERS] please help on query
On Fri, 12 Jul 2002 17:32:50 +0200
"Luis Alberto Amigo Navarro" <lamigo@atc.unican.es> wrote:
Lineitem is being modified at run time, so creating a temp table doesn't
solve my problem.
Hi,
I'm not sure whether its performance can be improved or not, but I feel
there is a slight chance to reduce the total number of tuples the planner
must consider.
BTW, how much time does the following query take in your situation,
and how many rows does it retrieve ?
EXPLAIN ANALYZE
SELECT
lineitem.orderkey
FROM
lineitem
GROUP BY
lineitem.orderkey
HAVING
SUM(lineitem.quantity) > 300;
Regards,
Masaru Sugawara
----- Original Message -----
From: "Masaru Sugawara" <rk73@sea.plala.or.jp>
To: "Luis Alberto Amigo Navarro" <lamigo@atc.unican.es>
Cc: <pgsql-sql@postgresql.org>
Sent: Sunday, July 14, 2002 2:23 PM
Subject: Re: [SQL] [HACKERS] please help on query
This is the output:
Aggregate (cost=0.00..647161.10 rows=600122 width=8) (actual time=4959.19..347328.83 rows=62 loops=1)
  -> Group (cost=0.00..632158.04 rows=6001225 width=8) (actual time=10.79..274259.16 rows=6001225 loops=1)
    -> Index Scan using lineitem_pkey on lineitem (cost=0.00..617154.97 rows=6001225 width=8) (actual time=10.77..162439.11 rows=6001225 loops=1)
Total runtime: 347330.28 msec
It is returning all rows in lineitem. Why is it using the index?
Thanks and regards
On Mon, 15 Jul 2002 09:45:36 +0200
"Luis Alberto Amigo Navarro" <lamigo@atc.unican.es> wrote:
This is the output:
Aggregate (cost=0.00..647161.10 rows=600122 width=8) (actual
time=4959.19..347328.83 rows=62 loops=1)
-> Group (cost=0.00..632158.04 rows=6001225 width=8) (actual
time=10.79..274259.16 rows=6001225 loops=1)
-> Index Scan using lineitem_pkey on lineitem
(cost=0.00..617154.97 rows=6001225 width=8) (actual time=10.77..162439.11
rows=6001225 loops=1)
Total runtime: 347330.28 msec
It is returning all rows in lineitem. Why is it using the index?
Sorry, I don't know the reason.
I need more info. Can you show me the outputs of EXPLAIN ANALYZE ?
EXPLAIN ANALYZE
SELECT
orders.orderkey
FROM
lineitem LEFT OUTER JOIN
orders USING(orderkey)
WHERE
orders.orderkey IS NOT NULL
GROUP BY
orders.orderkey
HAVING
SUM(lineitem.quantity) > 300;
EXPLAIN ANALYZE
SELECT
t2.*
FROM (SELECT
orders.orderkey
FROM
lineitem LEFT OUTER JOIN
orders USING(orderkey)
WHERE
orders.orderkey IS NOT NULL
GROUP BY
orders.orderkey
HAVING
SUM(lineitem.quantity) > 300
) AS t1 LEFT OUTER JOIN
orders AS t2 USING(orderkey)
ORDER BY t2.custkey;
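It might also be worth checking whether the plan for the first probe changes when index scans are discouraged (just a guess, not a fix):

```sql
-- Temporarily discourage index scans to see if seq scan + sort is cheaper
SET enable_indexscan = off;
EXPLAIN ANALYZE
SELECT lineitem.orderkey
FROM lineitem
GROUP BY lineitem.orderkey
HAVING SUM(lineitem.quantity) > 300;
SET enable_indexscan = on;  -- restore the default afterwards
```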
Regards,
Masaru Sugawara
----- Original Message -----
From: "Masaru Sugawara" <rk73@sea.plala.or.jp>
To: "Luis Alberto Amigo Navarro" <lamigo@atc.unican.es>
Cc: <pgsql-sql@postgresql.org>
Sent: Monday, July 15, 2002 6:15 PM
Subject: Re: [SQL] [HACKERS] please help on query
Sorry, I don't know the reason.
I need more info. Can you show me the outputs of EXPLAIN ANALYZE ?
Here it is:
EXPLAIN ANALYZE
SELECT
orders.orderkey
FROM
lineitem LEFT OUTER JOIN
orders USING(orderkey)
WHERE
orders.orderkey IS NOT NULL
GROUP BY
orders.orderkey
HAVING
SUM(lineitem.quantity) > 300;
Aggregate (cost=1257368.92..1287375.04 rows=600122 width=12) (actual time=1236941.71..1454824.56 rows=62 loops=1)
  -> Group (cost=1257368.92..1272371.98 rows=6001225 width=12) (actual time=1233968.87..1385034.91 rows=6001225 loops=1)
    -> Sort (cost=1257368.92..1257368.92 rows=6001225 width=12) (actual time=1233968.82..1276147.37 rows=6001225 loops=1)
      -> Hash Join (cost=166395.00..520604.08 rows=6001225 width=12) (actual time=59061.21..773997.08 rows=6001225 loops=1)
        -> Seq Scan on lineitem (cost=0.00..195405.25 rows=6001225 width=8) (actual time=20.66..115511.34 rows=6001225 loops=1)
        -> Hash (cost=162645.00..162645.00 rows=1500000 width=4) (actual time=59032.16..59032.16 rows=0 loops=1)
          -> Seq Scan on orders (cost=0.00..162645.00 rows=1500000 width=4) (actual time=17.33..44420.10 rows=1500000 loops=1)
Total runtime: 1454929.11 msec
EXPLAIN ANALYZE
SELECT
t2.*
FROM (SELECT
orders.orderkey
FROM
lineitem LEFT OUTER JOIN
orders USING(orderkey)
WHERE
orders.orderkey IS NOT NULL
GROUP BY
orders.orderkey
HAVING
SUM(lineitem.quantity) > 300
) AS t1 LEFT OUTER JOIN
orders AS t2 USING(orderkey)
ORDER BY t2.custkey;
Sort (cost=1739666.43..1739666.43 rows=600122 width=119) (actual time=1538897.23..1538897.47 rows=62 loops=1)
  -> Merge Join (cost=1344971.49..1682069.98 rows=600122 width=119) (actual time=1440886.58..1538886.03 rows=62 loops=1)
    -> Index Scan using orders_pkey on orders t2 (cost=0.00..324346.65 rows=1500000 width=115) (actual time=32.80..87906.98 rows=1455276 loops=1)
    -> Sort (cost=1344971.49..1344971.49 rows=600122 width=12) (actual time=1439550.31..1439550.73 rows=62 loops=1)
      -> Subquery Scan t1 (cost=1257368.92..1287375.04 rows=600122 width=12) (actual time=1222560.86..1439549.36 rows=62 loops=1)
        -> Aggregate (cost=1257368.92..1287375.04 rows=600122 width=12) (actual time=1222560.84..1439548.42 rows=62 loops=1)
          -> Group (cost=1257368.92..1272371.98 rows=6001225 width=12) (actual time=1219607.04..1369327.42 rows=6001225 loops=1)
            -> Sort (cost=1257368.92..1257368.92 rows=6001225 width=12) (actual time=1219607.00..1261208.08 rows=6001225 loops=1)
              -> Hash Join (cost=166395.00..520604.08 rows=6001225 width=12) (actual time=65973.31..769253.41 rows=6001225 loops=1)
                -> Seq Scan on lineitem (cost=0.00..195405.25 rows=6001225 width=8) (actual time=20.07..115247.61 rows=6001225 loops=1)
                -> Hash (cost=162645.00..162645.00 rows=1500000 width=4) (actual time=65943.80..65943.80 rows=0 loops=1)
                  -> Seq Scan on orders (cost=0.00..162645.00 rows=1500000 width=4) (actual time=39.04..52049.90 rows=1500000 loops=1)
Total runtime: 1539010.00 msec
Thanks and regards
On Tue, 16 Jul 2002 10:51:03 +0200
"Luis Alberto Amigo Navarro" <lamigo@atc.unican.es> wrote:
Hmm, does each of the three tables have indices like the following?
If not, could you run EXPLAIN ANALYZE again after creating them?
create index idx_lineitem_orderkey on lineitem(orderkey);
create index idx_orders_orderkey on orders(orderkey);
create index idx_orders_custkey on orders(custkey);
create index idx_customer_custkey on customer(custkey);
Regards,
Masaru Sugawara