Place of subselect

Started by Guillaume Bogover 17 years ago5 messagesgeneral

Jump to latest

Guillaume Bog

guibog@gmail.com

over 17 years ago

Hi dear Postgres users.

I have performance issues if I do the following pseudo-query:

SELECT a, b, (SELECT count(*) FROM t2 WHERE something) AS c
FROM t1 ORDER BY a LIMIT 10;

After some tests, it seems to me that the subquery on t2 is computed for all
rows of t1. As I don't "ORDER BY c", there is no need to compute c for every
row. I know I can (or should ?) work with joins or with a subquery in the
from clause, but I'd like to make sure there is no other way before changing
my sqls.

A subjective reason for me to prefer subqueries in fields instead of joins
of sub tables is that, when it only relates to the text displayed, it is
easyer to read and to change, and I mess less with agregate functions.

Thanks.

A. Kretschmer

andreas.kretschmer@schollglas.com

over 17 years ago

In reply to: Guillaume Bog (#1)

Re: Place of subselect

am Tue, dem 25.11.2008, um 15:34:57 +0800 mailte Guillaume Bog folgendes:

Hi dear Postgres users.

I have performance issues if I do the following pseudo-query:

SELECT a, b, (SELECT count(*) FROM t2 WHERE something) AS c
FROM t1 ORDER BY a LIMIT 10;

After some tests, it seems to me that the subquery on t2 is computed for all
rows of t1. As I don't "ORDER BY c", there is no need to compute c for every
row. I know I can (or should ?) work with joins or with a subquery in the from
clause, but I'd like to make sure there is no other way before changing my
sqls.

Please check your presumption with explain analyse <your query>.

For example:

test=*# explain analyse select t1.*, (select count(1) from t2) from t1 order by 1 limit 5;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Limit (cost=186.54..186.55 rows=5 width=4) (actual time=0.087..0.104 rows=3 loops=1)
InitPlan
-> Aggregate (cost=36.75..36.76 rows=1 width=0) (actual time=0.022..0.024 rows=1 loops=1)
-> Seq Scan on t2 (cost=0.00..31.40 rows=2140 width=0) (actual time=0.004..0.008 rows=1 loops=1)
-> Sort (cost=149.78..155.13 rows=2140 width=4) (actual time=0.082..0.088 rows=3 loops=1)
Sort Key: i
-> Seq Scan on t1 (cost=0.00..31.40 rows=2140 width=4) (actual time=0.046..0.056 rows=3 loops=1)
Total runtime: 0.197 ms
(8 rows)

Both tables executes only one scan.

Andreas
--
Andreas Kretschmer
Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (mehr: -> Header)
GnuPG-ID: 0x3FFF606C, privat 0x7F4584DA http://wwwkeys.de.pgp.net

Guillaume Bog

guibog@gmail.com

over 17 years ago

In reply to: A. Kretschmer (#2)

Re: Place of subselect

On Tue, Nov 25, 2008 at 15:56, A. Kretschmer <
andreas.kretschmer@schollglas.com> wrote:

am Tue, dem 25.11.2008, um 15:34:57 +0800 mailte Guillaume Bog folgendes:

Hi dear Postgres users.

I have performance issues if I do the following pseudo-query:

SELECT a, b, (SELECT count(*) FROM t2 WHERE something) AS c
FROM t1 ORDER BY a LIMIT 10;

After some tests, it seems to me that the subquery on t2 is computed for

all

rows of t1. As I don't "ORDER BY c", there is no need to compute c for

every

row. I know I can (or should ?) work with joins or with a subquery in the

from

clause, but I'd like to make sure there is no other way before changing

my

sqls.

Please check your presumption with explain analyse <your query>.

For example:

test=*# explain analyse select t1.*, (select count(1) from t2) from t1
order by 1 limit 5;
QUERY PLAN

--------------------------------------------------------------------------------------------------------------
Limit (cost=186.54..186.55 rows=5 width=4) (actual time=0.087..0.104
rows=3 loops=1)
InitPlan
-> Aggregate (cost=36.75..36.76 rows=1 width=0) (actual
time=0.022..0.024 rows=1 loops=1)
-> Seq Scan on t2 (cost=0.00..31.40 rows=2140 width=0) (actual
time=0.004..0.008 rows=1 loops=1)
-> Sort (cost=149.78..155.13 rows=2140 width=4) (actual
time=0.082..0.088 rows=3 loops=1)
Sort Key: i
-> Seq Scan on t1 (cost=0.00..31.40 rows=2140 width=4) (actual
time=0.046..0.056 rows=3 loops=1)
Total runtime: 0.197 ms
(8 rows)

Both tables executes only one scan.

It seems that you are right. By further testing I found that a WHERE
condition in the subquery was making the query hundred times slower. As I'm
not very familiar with explain analyze, I paste them below. Why do I have
"merge join" and "merge cond" in one case and "subplan" in the other case?
Note that "u_xref_ug_id" is a reference and therefore b-tree indexed.

vf_cn2fr=# EXPLAIN ANALYZE SELECT ug_id AS id, ug_en AS name, ug_type AS
type,
(SELECT count(*) FROM forms_groups JOIN users ON fg_xref_u_id = u_id WHERE
u_xref_ug_id = ug_id) as groupes
FROM users_groups ORDER BY "ug_type","ug_en" LIMIT 5;

QUERY
PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=513022.74..513022.76 rows=5 width=26) (actual
time=31172.258..31172.271 rows=5 loops=1)
-> Sort (cost=513022.74..513023.74 rows=397 width=26) (actual
time=31172.255..31172.259 rows=5 loops=1)
Sort Key: ug_type, ug_en
-> Seq Scan on users_groups (cost=0.00..513005.61 rows=397
width=26) (actual time=83.273..31167.266 rows=397 loops=1)
SubPlan
-> Aggregate (cost=1292.18..1292.19 rows=1 width=0)
(actual time=78.498..78.499 rows=1 loops=397)
-> Hash Join (cost=146.18..1290.52 rows=663
width=0) (actual time=30.023..78.389 rows=102 loops=397)
Hash Cond: (forms_groups.fg_xref_u_id =
users.u_id)
-> Seq Scan on forms_groups
(cost=0.00..985.88 rows=40488 width=4) (actual time=0.005..42.046 rows=40490
loops=372)
-> Hash (cost=137.10..137.10 rows=726
width=4) (actual time=0.306..0.306 rows=112 loops=397)
-> Index Scan using
users_u_xref_ug_id_idx on users (cost=0.00..137.10 rows=726 width=4)
(actual time=0.013..0.169 rows=112 loops=397)
Index Cond: (u_xref_ug_id = $0)
Total runtime: 31172.363 ms
(13 rows)

vf_cn2fr=# EXPLAIN ANALYZE SELECT ug_id AS id, ug_en AS name, ug_type AS
type,
(SELECT count(*) FROM forms_groups JOIN users ON fg_xref_u_id = u_id) as
groupes
FROM users_groups ORDER BY "ug_type","ug_en" LIMIT 5;

QUERY
PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=4530.86..4530.87 rows=5 width=26) (actual
time=325.353..325.365 rows=5 loops=1)
InitPlan
-> Aggregate (cost=4505.74..4505.75 rows=1 width=0) (actual
time=320.673..320.674 rows=1 loops=1)
-> Merge Join (cost=0.00..4404.52 rows=40488 width=0) (actual
time=0.176..281.602 rows=40490 loops=1)
Merge Cond: (forms_groups.fg_xref_u_id = users.u_id)
-> Index Scan using forms_groups_fg_xref_u_id_idx on
forms_groups (cost=0.00..1576.38 rows=40488 width=4) (actual
time=0.020..64.556 rows=40490 loops=1)
-> Index Scan using users_pkey on users
(cost=0.00..2212.00 rows=44313 width=4) (actual time=0.015..73.373
rows=47689 loops=1)
-> Sort (cost=25.11..26.10 rows=397 width=26) (actual
time=325.350..325.355 rows=5 loops=1)
Sort Key: ug_type, ug_en
-> Seq Scan on users_groups (cost=0.00..7.97 rows=397 width=26)
(actual time=320.693..321.192 rows=397 loops=1)
Total runtime: 325.457 ms
(11 rows)

Show quoted text

Andreas
--
Andreas Kretschmer
Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (mehr: -> Header)
GnuPG-ID: 0x3FFF606C, privat 0x7F4584DA http://wwwkeys.de.pgp.net

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

A. Kretschmer

andreas.kretschmer@schollglas.com

over 17 years ago

In reply to: Guillaume Bog (#3)

Re: Place of subselect

am Tue, dem 25.11.2008, um 16:44:34 +0800 mailte Guillaume Bog folgendes:

It seems that you are right. By further testing I found that a WHERE condition
in the subquery was making the query hundred times slower. As I'm not very
familiar with explain analyze, I paste them below. Why do I have "merge join"
and "merge cond" in one case and "subplan" in the other case? Note that
"u_xref_ug_id" is a reference and therefore b-tree indexed.

vf_cn2fr=# EXPLAIN ANALYZE SELECT ug_id AS id, ug_en AS name, ug_type AS type,
(SELECT count(*) FROM forms_groups JOIN users ON fg_xref_u_id = u_id WHERE
u_xref_ug_id = ug_id) as groupes
FROM users_groups ORDER BY "ug_type","ug_en" LIMIT 5;

The were-condition contains both inner and outer tables, because of that
the subselect performs for every record of the outer table.

Maybe someone else can tell you an advice how to rewrite the query for
better performance.

Regards, Andreas
--
Andreas Kretschmer
Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (mehr: -> Header)
GnuPG-ID: 0x3FFF606C, privat 0x7F4584DA http://wwwkeys.de.pgp.net

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: Guillaume Bog (#1)

Re: Place of subselect

"Guillaume Bog" <guibog@gmail.com> writes:

I have performance issues if I do the following pseudo-query:

SELECT a, b, (SELECT count(*) FROM t2 WHERE something) AS c
FROM t1 ORDER BY a LIMIT 10;

After some tests, it seems to me that the subquery on t2 is computed for all
rows of t1.

Yeah. The SQL specification says that ORDER BY happens after computing
the SELECT output-list. In some cases that'll get optimized but you
can't count on it.

You can probably improve matters by using a sub-select:

SELECT a, b, (SELECT count(*) FROM t2 WHERE something) AS c
FROM ( SELECT a, b, ... FROM t1 ORDER BY a LIMIT 10 ) ss;

regards, tom lane