please help on query
I can't improve performance on this query:
SELECT
supplier.name,
supplier.address
FROM
supplier,
nation
WHERE
supplier.suppkey IN(
SELECT
partsupp.suppkey
FROM
partsupp
WHERE
partsupp.partkey IN(
SELECT
part.partkey
FROM
part
WHERE
part.name like 'forest%'
)
AND partsupp.availqty>(
SELECT
0.5*(sum(lineitem.quantity)::FLOAT)
FROM
lineitem
WHERE
lineitem.partkey=partsupp.partkey
AND lineitem.suppkey=partsupp.partkey
AND lineitem.shipdate>=('1994-01-01')::DATE
AND lineitem.shipdate<(('1994-01-01')::DATE+('1 year')::INTERVAL)::DATE
)
)
AND supplier.nationkey=nation.nationkey
AND nation.name='CANADA'
ORDER BY
supplier.name;
EXPLAIN results:
NOTICE: QUERY PLAN:
Sort (cost=2777810917708.17..2777810917708.17 rows=200 width=81)
-> Nested Loop (cost=0.00..2777810917700.53 rows=200 width=81)
-> Seq Scan on nation (cost=0.00..1.31 rows=1 width=4)
-> Index Scan using snation_index on supplier (cost=0.00..2777810917696.72 rows=200 width=77)
SubPlan
-> Materialize (cost=6944527291.72..6944527291.72 rows=133333 width=4)
-> Seq Scan on partsupp (cost=0.00..6944527291.72 rows=133333 width=4)
SubPlan
-> Materialize (cost=8561.00..8561.00 rows=1 width=4)
-> Seq Scan on part (cost=0.00..8561.00 rows=1 width=4)
-> Aggregate (cost=119.61..119.61 rows=1 width=4)
-> Index Scan using lineitem_index on lineitem (cost=0.00..119.61 rows=1 width=4)
partsupp: 800000 tuples
Table "partsupp"
Column | Type | Modifiers
------------+----------------+-----------
partkey | integer | not null
suppkey | integer | not null
availqty | integer |
supplycost | numeric(10,2) |
comment | character(199) |
Primary key: partsupp_pkey
Triggers: RI_ConstraintTrigger_16597,
RI_ConstraintTrigger_16603
tpch=# select attname,n_distinct,correlation from pg_stats where tablename='partsupp';
attname | n_distinct | correlation
------------+------------+-------------
partkey | -0.195588 | 1
suppkey | 9910 | 0.00868363
availqty | 9435 | -0.00788662
supplycost | -0.127722 | -0.0116864
comment | -1 | 0.0170702
I'm open to query changes, reordering, index ideas and horizontal partitioning.
Thanks in advance.
Regards
[moving to pgsql-sql]
On Thu, 11 Jul 2002 17:22:14 +0200, "Luis Alberto Amigo Navarro"
<lamigo@atc.unican.es> wrote:
I can't improve performance on this query:
SELECT
supplier.name,
supplier.address
FROM
supplier,
nation
WHERE
supplier.suppkey IN(
SELECT
partsupp.suppkey
FROM
partsupp
WHERE
partsupp.partkey IN(
SELECT
part.partkey
FROM
part
WHERE
part.name like 'forest%'
)
AND partsupp.availqty>(
SELECT
0.5*(sum(lineitem.quantity)::FLOAT)
FROM
lineitem
WHERE
lineitem.partkey=partsupp.partkey
AND lineitem.suppkey=partsupp.partkey
^^^^^^^
suppkey ???
AND lineitem.shipdate>=('1994-01-01')::DATE
AND lineitem.shipdate<(('1994-01-01')::DATE+('1 year')::INTERVAL)::DATE
)
)
AND supplier.nationkey=nation.nationkey
AND nation.name='CANADA'
ORDER BY
supplier.name;
Luis,
rules of thumb: "Avoid subselects; use joins!" and "If you have to use
subselects, avoid IN, use EXISTS!"
Let's try. If partkey is unique in part, then
| FROM partsupp
| WHERE partsupp.partkey IN (SELECT part.partkey
can be replaced by
FROM partsupp ps, part p
WHERE ps.partkey = p.partkey
or
partsupp ps INNER JOIN part p
ON (ps.partkey = p.partkey AND p.name LIKE '...')
When we ignore "part" for now, your subselect boils down to
| SELECT partsupp.suppkey
| FROM partsupp
| WHERE partsupp.availqty > (
| SELECT 0.5*(sum(lineitem.quantity)::FLOAT)
| FROM lineitem
| WHERE lineitem.partkey=partsupp.partkey
| AND lineitem.suppkey=partsupp.suppkey
| AND lineitem.shipdate BETWEEN ... AND ...
| )
which can be rewritten to (untested)
SELECT ps.suppkey
FROM partsupp ps, lineitem li
WHERE li.partkey=ps.partkey
AND li.suppkey=ps.suppkey
AND li.shipdate BETWEEN ... AND ...
GROUP BY ps.partkey, ps.suppkey
HAVING min(ps.availqty) > 0.5*(sum(li.quantity)::FLOAT)
^^^
As all ps.availqty are equal in one group, you can as well
use max() or avg().
Now we have left only one IN:
| WHERE supplier.suppkey IN (
| SELECT partsupp.suppkey FROM partsupp WHERE <condition> )
Being too lazy to find out whether this can be rewritten as a join, let's
apply rule 2 here:
WHERE EXISTS (
SELECT ... FROM partsupp ps
WHERE supplier.suppkey = ps.suppkey
AND <condition> )
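Putting the pieces together (my aliases, untested, use with a grain of salt), the whole query might look like:

```sql
-- Sketch only: joins replace the IN subselects, EXISTS replaces the outer IN.
SELECT s.name, s.address
FROM supplier s, nation n
WHERE EXISTS (
    SELECT 1
    FROM partsupp ps, part p, lineitem li
    WHERE ps.suppkey = s.suppkey          -- correlation to the outer supplier
      AND p.partkey  = ps.partkey
      AND p.name LIKE 'forest%'
      AND li.partkey = ps.partkey
      AND li.suppkey = ps.suppkey
      AND li.shipdate >= ('1994-01-01')::DATE
      AND li.shipdate <  ('1995-01-01')::DATE
    GROUP BY ps.partkey, ps.suppkey
    HAVING min(ps.availqty) > 0.5*(sum(li.quantity)::FLOAT)
  )
  AND s.nationkey = n.nationkey
  AND n.name = 'CANADA'
ORDER BY s.name;
```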
HTH, but use with a grain of salt ...
Sort (cost=2777810917708.17..2777810917708.17 rows=200 width=81)
^^^^^^^^^^^^^^^^
BTW, how many years are these? :-)
Servus
Manfred
I've tried
SELECT
supplier.name,
supplier.address
FROM
supplier,
nation,
lineitem
WHERE
EXISTS(
SELECT
partsupp.suppkey
FROM
partsupp,lineitem
WHERE
lineitem.partkey=partsupp.partkey
AND lineitem.suppkey=partsupp.partkey
AND lineitem.shipdate>=('1994-01-01')::DATE
AND lineitem.shipdate<(('1994-01-01')::DATE+('1 year')::INTERVAL)::DATE
AND EXISTS(
SELECT
part.partkey
FROM
part
WHERE
part.name like 'forest%'
)
GROUP BY partsupp.partkey,partsupp.suppkey
HAVING min(availqty)>(0.5*(sum(lineitem.quantity)::FLOAT))
)
AND supplier.nationkey=nation.nationkey
AND nation.name='CANADA'
ORDER BY
supplier.name;
as you said, and something is wrong:
Sort (cost=1141741215.35..1141741215.35 rows=2400490000 width=81)
  InitPlan
    -> Aggregate (cost=0.00..921773.85 rows=48 width=24)
      InitPlan
        -> Seq Scan on part (cost=0.00..8561.00 rows=1 width=4)
      -> Group (cost=0.00..921771.44 rows=481 width=24)
        -> Result (cost=0.00..921769.04 rows=481 width=24)
          -> Merge Join (cost=0.00..921769.04 rows=481 width=24)
            -> Index Scan using partsupp_pkey on partsupp (cost=0.00..98522.75 rows=800000 width=12)
            -> Index Scan using lsupp_index on lineitem (cost=0.00..821239.91 rows=145 width=12)
  -> Result (cost=1.31..112888690.31 rows=2400490000 width=81)
    -> Nested Loop (cost=1.31..112888690.31 rows=2400490000 width=81)
      -> Hash Join (cost=1.31..490.31 rows=400 width=81)
        -> Seq Scan on supplier (cost=0.00..434.00 rows=10000 width=77)
        -> Hash (cost=1.31..1.31 rows=1 width=4)
          -> Seq Scan on nation (cost=0.00..1.31 rows=1 width=4)
      -> Seq Scan on lineitem (cost=0.00..222208.25 rows=6001225 width=0)
Where might my mistake be?
Thanks and regards
----- Original Message -----
From: "Manfred Koizar" <mkoi-pg@aon.at>
To: "Luis Alberto Amigo Navarro" <lamigo@atc.unican.es>
Cc: <pgsql-sql@postgresql.org>
Sent: Thursday, July 11, 2002 6:47 PM
Subject: Re: [HACKERS] please help on query
On Thu, 2002-07-11 at 11:22, Luis Alberto Amigo Navarro wrote:
I can't improve performance on this query:
Blame Canada!
--
J. R. Nield
jrnield@usol.com
On Thursday 11 July 2002 12:06, J. R. Nield wrote:
On Thu, 2002-07-11 at 11:22, Luis Alberto Amigo Navarro wrote:
I can't improve performance on this query:
Blame Canada!
Whatever ...
How's that silver medal down there in the states?
;-)
On Thu, 11 Jul 2002 19:40:46 +0200, "Luis Alberto Amigo Navarro"
<lamigo@atc.unican.es> wrote:
I've tried
[reformatted to fit on one page]
| SELECT supplier.name, supplier.address
| FROM supplier, nation, lineitem
You already found out that you do not need lineitem here.
| WHERE EXISTS(
| SELECT partsupp.suppkey
| FROM partsupp,lineitem
| WHERE
| lineitem.partkey=partsupp.partkey
| AND lineitem.suppkey=partsupp.partkey
I still don't believe this suppkey=partkey
| AND lineitem.shipdate [...]
| AND EXISTS( SELECT part.partkey
| FROM part WHERE part.name like 'forest%')
This subselect gives either true or false, but in any case always the
same result. You might want to add a condition
AND part.partkey=partsupp.partkey
Are you sure partkey is not unique? If it is unique you can replace
this subselect by a join.
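In other words, the corrected inner subselect would look something like (untested):

```sql
-- Correlate the part lookup to the current partsupp row
AND EXISTS (
    SELECT 1
    FROM part
    WHERE part.name LIKE 'forest%'
      AND part.partkey = partsupp.partkey
)
```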
| GROUP BY partsupp.partkey,partsupp.suppkey
| HAVING min(availqty)>(0.5*(sum(lineitem.quantity)::FLOAT))
| )
| AND supplier.nationkey=nation.nationkey
| AND nation.name='CANADA'
| ORDER BY supplier.name;
as you said and something is wrong
Sort (cost=1141741215.35..1141741215.35 rows=2400490000 width=81)
The cost is now only 1141741215.35 compared to 2777810917708.17
before; this is an improvement factor of more than 2000. So what's
your problem? ;-)
Servus
Manfred
On Thu, 2002-07-11 at 17:22, Luis Alberto Amigo Navarro wrote:
I can't improve performance on this query:
You could try rewriting the IN's into = joins, or even use explicit INNER
JOIN syntax to force certain plans.
With one subselect nested inside another, and its result depending on the
value of partsupp.partkey, it is really hard for the optimiser to do
anything other than perform the inner query once for each row.
But it may help to rewrite
SELECT
partsupp.suppkey
FROM
partsupp
WHERE
partsupp.partkey IN (
SELECT
part.partkey
FROM
part
WHERE
part.name like 'forest%'
)
AND partsupp.availqty>(
SELECT
0.5*(sum(lineitem.quantity)::FLOAT)
FROM
lineitem
WHERE
lineitem.partkey=partsupp.partkey
AND lineitem.suppkey=partsupp.partkey
AND lineitem.shipdate>=('1994-01-01')::DATE
AND lineitem.shipdate<(('1994-01-01')::DATE+('1 year')::INTERVAL)::DATE
)
into
SELECT
partsupp.suppkey
FROM
partsupp,
  (SELECT part.partkey as partkey
   FROM part
   WHERE part.name like 'forest%'
  ) fp,
  (SELECT lineitem.partkey,
          lineitem.suppkey,
          0.5*(sum(lineitem.quantity)::FLOAT) as halfsum
   FROM lineitem
   WHERE lineitem.shipdate>=('1994-01-01')::DATE
     AND lineitem.shipdate<(('1994-01-01')::DATE+('1 year')::INTERVAL)::DATE
   GROUP BY lineitem.partkey, lineitem.suppkey
  ) li  -- grouped inside, since a derived table cannot reference partsupp
WHERE partsupp.partkey = fp.partkey
  AND partsupp.partkey = li.partkey
  AND partsupp.suppkey = li.suppkey
  AND partsupp.availqty > li.halfsum
if "lineitem" is significantly smaller than "partsupp"
But you really should tell us more, like how many lines does lineitem
and other tables have,
----------
Hannu
On Thu, 2002-07-11 at 17:22, Luis Alberto Amigo Navarro wrote:
I can't improve performance on this query:
You may also want to rewrite
lineitem.shipdate<(('1994-01-01')::DATE+('1 year')::INTERVAL)::DATE
into
lineitem.shipdate<('1995-01-01')::DATE
if you can, as the optimiser probably will not otherwise recognize it as a
constant and won't use an index on lineitem.shipdate.
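With constant bounds, a shipdate index becomes usable; a sketch (the index name is made up, and the index may already exist):

```sql
-- Hypothetical index; both bounds are now plain constants the planner can use.
CREATE INDEX lineitem_shipdate_idx ON lineitem (shipdate);

SELECT sum(lineitem.quantity)
FROM lineitem
WHERE lineitem.shipdate >= ('1994-01-01')::DATE
  AND lineitem.shipdate <  ('1995-01-01')::DATE;
```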
----------------
Hannu
On Thu, 2002-07-11 at 17:22, Luis Alberto Amigo Navarro wrote:
I can't improve performance on this query:
This _may_ work.
SELECT
    supplier.name,
    supplier.address
FROM
    supplier,
    nation
WHERE supplier.suppkey IN (
    SELECT partsupp.suppkey
    FROM part
    INNER JOIN partsupp ON part.partkey=partsupp.partkey
    INNER JOIN (
        SELECT lineitem.partkey,
               lineitem.suppkey,
               0.5*(sum(lineitem.quantity)::FLOAT) as halfsum
        FROM lineitem
        WHERE shipdate >= '1994-01-01'
          AND shipdate < '1995-01-01'
        GROUP BY lineitem.partkey, lineitem.suppkey
    ) li ON partsupp.partkey=li.partkey
        AND partsupp.suppkey=li.suppkey
        AND partsupp.availqty > li.halfsum
    WHERE part.name like 'forest%'
)
AND supplier.nationkey=nation.nationkey
AND nation.name='CANADA'
ORDER BY supplier.name;
---------------
Hannu
The cost is now only 1141741215.35 compared to 2777810917708.17
before; this is an improvement factor of more than 2000. So what's
your problem? ;-)
Servus
Manfred

In fact the planner is estimating incredibly badly: the query took only
833 msec. It now runs perfectly.
I'm going to keep on asking about another query:
SELECT
customer.name,
customer.custkey,
orders.orderkey,
orders.orderdate,
orders.totalprice,
sum(lineitem.quantity)
FROM
customer,
orders,
lineitem
WHERE
exists(
SELECT
lineitem.orderkey
FROM
lineitem
WHERE
lineitem.orderkey=orders.orderkey
GROUP BY
lineitem.orderkey HAVING
sum(lineitem.quantity)>300
)
AND customer.custkey=orders.custkey
AND orders.orderkey=lineitem.orderkey
GROUP BY
customer.name,
customer.custkey,
orders.orderkey,
orders.orderdate,
orders.totalprice
ORDER BY
orders.totalprice DESC,
orders.orderdate;
NOTICE: QUERY PLAN:
Sort (cost=26923941.97..26923941.97 rows=300061 width=66)
  -> Aggregate (cost=26851634.86..26896644.05 rows=300061 width=66)
    -> Group (cost=26851634.86..26889142.52 rows=3000612 width=66)
      -> Sort (cost=26851634.86..26851634.86 rows=3000612 width=66)
        -> Hash Join (cost=26107574.81..26457309.10 rows=3000612 width=66)
          -> Seq Scan on lineitem (cost=0.00..222208.25 rows=6001225 width=8)
          -> Hash (cost=26105699.81..26105699.81 rows=750000 width=58)
            -> Hash Join (cost=7431.00..26105699.81 rows=750000 width=58)
              -> Seq Scan on orders (cost=0.00..26083268.81 rows=750000 width=25)
                SubPlan
                  -> Aggregate (cost=0.00..17.35 rows=1 width=8)
                    -> Group (cost=0.00..17.34 rows=5 width=8)
                      -> Index Scan using lineitem_pkey on lineitem (cost=0.00..17.33 rows=5 width=8)
              -> Hash (cost=7056.00..7056.00 rows=150000 width=33)
                -> Seq Scan on customer (cost=0.00..7056.00 rows=150000 width=33)
again:
orders: 1500000 tuples
lineitem: 6000000 tuples (there are 1 to 7 lineitems per orderkey)
customer: 150000 tuples
select attname,n_distinct,correlation from pg_stats where tablename='lineitem';
attname | n_distinct | correlation
---------------+------------+-------------
orderkey | -0.199847 | 1
partkey | 196448 | 0.0223377
suppkey | 9658 | -0.00822751
linenumber | 7 | 0.17274
quantity | 50 | 0.0150153
extendedprice | 25651 | -0.00790245
discount | 11 | 0.103761
tax | 9 | 0.0993771
returnflag | 3 | 0.391434
linestatus | 2 | 0.509791
shipdate | 2440 | 0.0072777
commitdate | 2497 | 0.00698162
receiptdate | 2416 | 0.00726686
shipinstruct | 4 | 0.241511
shipmode | 7 | 0.138432
comment | 275488 | 0.0188006
(16 rows)
select attname,n_distinct,correlation from pg_stats where tablename='orders';
attname | n_distinct | correlation
---------------+------------+-------------
orderkey | -1 | -0.999925
custkey | 76309 | 0.00590596
orderstatus | 3 | 0.451991
totalprice | -1 | -0.00768806
orderdate | 2431 | -0.0211354
orderpriority | 5 | 0.182489
clerk | 1009 | 0.00546939
shippriority | 1 | 1
comment | -0.750125 | -0.0123887
Customer
attname | n_distinct | correlation
------------+------------+-------------
custkey | -1 | 1
name | -1 | 1
address | -1 | -0.00510274
nationkey | 25 | 0.0170533
phone | -1 | -0.0227816
acctbal | -0.83444 | -0.00220958
mktsegment | 5 | 0.205013
comment | -1 | 0.0327827
This query takes 12 minutes to run and returns about 50 customers.
lineitem.quantity takes values from 1 to 50, so 300 per orderkey is very
restrictive.
Can someone help improve performance?
Again, thanks in advance.
Regards
hi,
avoid the subselect: create a temp table and use a join...
CREATE TEMP TABLE tmp AS
SELECT
lineitem.orderkey
FROM
lineitem
GROUP BY
lineitem.orderkey
HAVING
sum(lineitem.quantity)>300;
CREATE INDEX tmp_idx ON tmp (orderkey);
SELECT
customer.name,
customer.custkey,
orders.orderkey,
orders.orderdate,
orders.totalprice,
sum(lineitem.quantity)
FROM
customer,
orders,
lineitem,
tmp
WHERE
orders.orderkey=tmp.orderkey
AND customer.custkey=orders.custkey
AND orders.orderkey=lineitem.orderkey
GROUP BY
customer.name,
customer.custkey,
orders.orderkey,
orders.orderdate,
orders.totalprice
ORDER BY
orders.totalprice DESC,
orders.orderdate;
maybe the index is not necessary...
kuba
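If the temp table is inconvenient (for instance because lineitem changes while the query runs), the same grouping can be inlined as a derived table; an untested sketch ("big" is my alias):

```sql
-- The derived table is uncorrelated, so the planner computes it once.
SELECT customer.name, customer.custkey,
       orders.orderkey, orders.orderdate, orders.totalprice,
       sum(lineitem.quantity)
FROM customer, orders, lineitem,
     (SELECT lineitem.orderkey
      FROM lineitem
      GROUP BY lineitem.orderkey
      HAVING sum(lineitem.quantity) > 300
     ) big
WHERE orders.orderkey = big.orderkey
  AND customer.custkey = orders.custkey
  AND orders.orderkey = lineitem.orderkey
GROUP BY customer.name, customer.custkey,
         orders.orderkey, orders.orderdate, orders.totalprice
ORDER BY orders.totalprice DESC, orders.orderdate;
```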
Lineitem is being modified at run time, so creating a temp table doesn't
solve my problem.
The time to create this table is the same as performing the subselect (or
so I think). It could be done by creating a new table and a new trigger,
but there are already triggers to calculate
lineitem.extendedprice=part.retailprice*lineitem.quantity*(1+taxes)*(1-discount),
to set orders.orderstatus according to linestatus, and to calculate
orders.totalprice as sum(extendedprice) where lineitem.orderkey=new.orderkey.
A new trigger to record the orderkey when sum(quantity) where
orderkey=new.orderkey exceeds 300 might be excessive.
Any other idea?
Thanks and regards
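For reference, the trigger being called excessive might look roughly like this (untested sketch in 7.2-era plpgsql; the table, function and trigger names are made up, and DELETEs on lineitem would need the same handling):

```sql
-- Maintain a side table of orderkeys whose total quantity exceeds 300,
-- so the main query can join against it instead of re-grouping lineitem.
CREATE TABLE big_orders (orderkey integer PRIMARY KEY);

CREATE FUNCTION big_orders_upd() RETURNS opaque AS '
BEGIN
    DELETE FROM big_orders WHERE orderkey = NEW.orderkey;
    INSERT INTO big_orders
        SELECT NEW.orderkey
        WHERE (SELECT sum(quantity) FROM lineitem
               WHERE orderkey = NEW.orderkey) > 300;
    RETURN NEW;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER big_orders_trig AFTER INSERT OR UPDATE ON lineitem
    FOR EACH ROW EXECUTE PROCEDURE big_orders_upd();
```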
----- Original Message -----
From: "Jakub Ouhrabka" <jakub.ouhrabka@comgate.cz>
To: "Luis Alberto Amigo Navarro" <lamigo@atc.unican.es>
Cc: "Manfred Koizar" <mkoi-pg@aon.at>; <pgsql-sql@postgresql.org>
Sent: Friday, July 12, 2002 1:50 PM
Subject: Re: [SQL] [HACKERS] please help on query
On Fri, 12 Jul 2002 17:32:50 +0200
"Luis Alberto Amigo Navarro" <lamigo@atc.unican.es> wrote:
Lineitem is being modified at run time, so creating a temp table doesn't
solve my problem.
Hi,
I'm not sure whether its performance can be improved or not, but I feel
there is a slight chance to reduce the total number of tuples the planner
must consider.
BTW, how much time does the following query take in your situation,
and how many rows does it retrieve ?
EXPLAIN ANALYZE
SELECT
lineitem.orderkey
FROM
lineitem
GROUP BY
lineitem.orderkey
HAVING
SUM(lineitem.quantity) > 300;
Regards,
Masaru Sugawara
----- Original Message -----
From: "Masaru Sugawara" <rk73@sea.plala.or.jp>
To: "Luis Alberto Amigo Navarro" <lamigo@atc.unican.es>
Cc: <pgsql-sql@postgresql.org>
Sent: Sunday, July 14, 2002 2:23 PM
Subject: Re: [SQL] [HACKERS] please help on query
This is the output:
Aggregate (cost=0.00..647161.10 rows=600122 width=8) (actual time=4959.19..347328.83 rows=62 loops=1)
  -> Group (cost=0.00..632158.04 rows=6001225 width=8) (actual time=10.79..274259.16 rows=6001225 loops=1)
    -> Index Scan using lineitem_pkey on lineitem (cost=0.00..617154.97 rows=6001225 width=8) (actual time=10.77..162439.11 rows=6001225 loops=1)
Total runtime: 347330.28 msec
It is returning all rows in lineitem. Why is it using the index?
Thanks and regards
On Mon, 15 Jul 2002 09:45:36 +0200
"Luis Alberto Amigo Navarro" <lamigo@atc.unican.es> wrote:
This is the output:
Aggregate (cost=0.00..647161.10 rows=600122 width=8) (actual
time=4959.19..347328.83 rows=62 loops=1)
-> Group (cost=0.00..632158.04 rows=6001225 width=8) (actual
time=10.79..274259.16 rows=6001225 loops=1)
-> Index Scan using lineitem_pkey on lineitem
(cost=0.00..617154.97 rows=6001225 width=8) (actual time=10.77..162439.11
rows=6001225 loops=1)
Total runtime: 347330.28 msec
It is returning all rows in lineitem. Why is it using the index?
Sorry, I don't know the reason.
I need more info. Can you show me the outputs of EXPLAIN ANALYZE ?
EXPLAIN ANALYZE
SELECT
orders.orderkey
FROM
lineitem LEFT OUTER JOIN
orders USING(orderkey)
WHERE
orders.orderkey IS NOT NULL
GROUP BY
orders.orderkey
HAVING
SUM(lineitem.quantity) > 300;
EXPLAIN ANALYZE
SELECT
t2.*
FROM (SELECT
orders.orderkey
FROM
lineitem LEFT OUTER JOIN
orders USING(orderkey)
WHERE
orders.orderkey IS NOT NULL
GROUP BY
orders.orderkey
HAVING
SUM(lineitem.quantity) > 300
) AS t1 LEFT OUTER JOIN
orders AS t2 USING(orderkey)
ORDER BY t2.custkey;
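It might also be worth checking whether the plan for the first probe changes when index scans are discouraged (just a guess, not a fix):

```sql
-- Temporarily discourage index scans to see if seq scan + sort is cheaper
SET enable_indexscan = off;
EXPLAIN ANALYZE
SELECT lineitem.orderkey
FROM lineitem
GROUP BY lineitem.orderkey
HAVING SUM(lineitem.quantity) > 300;
SET enable_indexscan = on;  -- restore the default afterwards
```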
Regards,
Masaru Sugawara
----- Original Message -----
From: "Masaru Sugawara" <rk73@sea.plala.or.jp>
To: "Luis Alberto Amigo Navarro" <lamigo@atc.unican.es>
Cc: <pgsql-sql@postgresql.org>
Sent: Monday, July 15, 2002 6:15 PM
Subject: Re: [SQL] [HACKERS] please help on query
Sorry, I don't know the reason.
I need more info. Can you show me the outputs of EXPLAIN ANALYZE ?
Here it is:
EXPLAIN ANALYZE
SELECT
orders.orderkey
FROM
lineitem LEFT OUTER JOIN
orders USING(orderkey)
WHERE
orders.orderkey IS NOT NULL
GROUP BY
orders.orderkey
HAVING
SUM(lineitem.quantity) > 300;
Aggregate (cost=1257368.92..1287375.04 rows=600122 width=12) (actual time=1236941.71..1454824.56 rows=62 loops=1)
  -> Group (cost=1257368.92..1272371.98 rows=6001225 width=12) (actual time=1233968.87..1385034.91 rows=6001225 loops=1)
    -> Sort (cost=1257368.92..1257368.92 rows=6001225 width=12) (actual time=1233968.82..1276147.37 rows=6001225 loops=1)
      -> Hash Join (cost=166395.00..520604.08 rows=6001225 width=12) (actual time=59061.21..773997.08 rows=6001225 loops=1)
        -> Seq Scan on lineitem (cost=0.00..195405.25 rows=6001225 width=8) (actual time=20.66..115511.34 rows=6001225 loops=1)
        -> Hash (cost=162645.00..162645.00 rows=1500000 width=4) (actual time=59032.16..59032.16 rows=0 loops=1)
          -> Seq Scan on orders (cost=0.00..162645.00 rows=1500000 width=4) (actual time=17.33..44420.10 rows=1500000 loops=1)
Total runtime: 1454929.11 msec
EXPLAIN ANALYZE
SELECT
t2.*
FROM (SELECT
orders.orderkey
FROM
lineitem LEFT OUTER JOIN
orders USING(orderkey)
WHERE
orders.orderkey IS NOT NULL
GROUP BY
orders.orderkey
HAVING
SUM(lineitem.quantity) > 300
) AS t1 LEFT OUTER JOIN
orders AS t2 USING(orderkey)
ORDER BY t2.custkey;
Sort (cost=1739666.43..1739666.43 rows=600122 width=119) (actual time=1538897.23..1538897.47 rows=62 loops=1)
  -> Merge Join (cost=1344971.49..1682069.98 rows=600122 width=119) (actual time=1440886.58..1538886.03 rows=62 loops=1)
    -> Index Scan using orders_pkey on orders t2 (cost=0.00..324346.65 rows=1500000 width=115) (actual time=32.80..87906.98 rows=1455276 loops=1)
    -> Sort (cost=1344971.49..1344971.49 rows=600122 width=12) (actual time=1439550.31..1439550.73 rows=62 loops=1)
      -> Subquery Scan t1 (cost=1257368.92..1287375.04 rows=600122 width=12) (actual time=1222560.86..1439549.36 rows=62 loops=1)
        -> Aggregate (cost=1257368.92..1287375.04 rows=600122 width=12) (actual time=1222560.84..1439548.42 rows=62 loops=1)
          -> Group (cost=1257368.92..1272371.98 rows=6001225 width=12) (actual time=1219607.04..1369327.42 rows=6001225 loops=1)
            -> Sort (cost=1257368.92..1257368.92 rows=6001225 width=12) (actual time=1219607.00..1261208.08 rows=6001225 loops=1)
              -> Hash Join (cost=166395.00..520604.08 rows=6001225 width=12) (actual time=65973.31..769253.41 rows=6001225 loops=1)
                -> Seq Scan on lineitem (cost=0.00..195405.25 rows=6001225 width=8) (actual time=20.07..115247.61 rows=6001225 loops=1)
                -> Hash (cost=162645.00..162645.00 rows=1500000 width=4) (actual time=65943.80..65943.80 rows=0 loops=1)
                  -> Seq Scan on orders (cost=0.00..162645.00 rows=1500000 width=4) (actual time=39.04..52049.90 rows=1500000 loops=1)
Total runtime: 1539010.00 msec
Thanks and regards
On Tue, 16 Jul 2002 10:51:03 +0200
"Luis Alberto Amigo Navarro" <lamigo@atc.unican.es> wrote:
Hmm, does each of the three tables have indices like the following?
If not, could you run EXPLAIN ANALYZE again after creating them?
create index idx_lineitem_orderkey on lineitem(orderkey);
create index idx_orders_orderkey on orders(orderkey);
create index idx_orders_custkey on orders(custkey);
create index idx_customer_custkey on customer(custkey);
Regards,
Masaru Sugawara