Query optimization

Started by Jorge Arevalo, over 11 years ago · 11 messages · general
#1Jorge Arevalo
jorgearevalo@libregis.org

Hello,

I'm trying to optimize a query that takes too much time. This is what I have:

table1(field1, field2... field14): contains about 8.5 million rows
table2(f1, f2, f3): contains about 2.5 million rows
table3: is empty, and must be filled with data coming from table1 and table2

To fill table3, I'm using a query that looks like this:

WITH records_to_insert(field1, field2, field3, field4, field5,
                       field6_hstore, field7, field8_array) AS
(
    SELECT value1, value2, value3, value4, value5,
           hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13', 'field14'],
                  ARRAY[field9, field10, field11, field12, field13, field14) as metadata,
           value7,
           (select array((select row(f1, f2) from table2 p where p.field7 = field7))) as values_array
    FROM table1
)
SELECT fill_table3_function(field1, field2, field3, field4, field5,
                            field6_hstore, field7, field8_array)
FROM records_to_insert

So, I first generate a common table expression, records_to_insert, using data
from table1 and table2, and then call a function, fill_table3_function, to
insert the values into table3 (I do more things apart from the insert; that's
the reason I call a function instead of just issuing an INSERT query). There
are indexes on all the columns that need them.

I'm having problems because the query takes a lot of time, and the server
returns a timeout error.

I think the bottleneck is the array built for the last argument of my common
table expression, and maybe the hstore too (I'm not sure whether it's a heavy
operation or not). The first thing I tried was to limit the query that builds
the common table expression, using LIMIT .. OFFSET after 'FROM table1', and to
loop over table1 (the big one, with 8.5 million rows) in chunks. But I still
get timeout errors.

The problem is I don't have access to the Postgres configuration, so I can't
increase the timeout for user queries. And anyway, I don't think increasing
the timeout is a real solution (it would just make the server suffer for
longer).

So, is there anything obviously wrong with my query? Any changes to make it
faster?

Many thanks in advance, and best regards,

--
Jorge Arevalo
Freelance developer

http://about.me/jorgeas80

#2David G. Johnston
david.g.johnston@gmail.com
In reply to: Jorge Arevalo (#1)
Re: Query optimization

Jorge Arévalo-2 wrote

So, I first generate a common table records_to_insert, using data from
table1 and table2, and then call a function fill_table3_function, in order
to insert the values into table3 (I do more things apart from insert,
that's reason to call a function instead of just raising an insert query).
There are indexes created on all the columns that need them.

I would suggest you figure out how to do what you need without pushing the
insert into the function.
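For illustration, one way to follow this advice might be to keep the CTE but drive a plain set-based INSERT ... SELECT with it. This is only a sketch using the placeholder names from the original message; "SELECT ..." stands for the same SELECT over table1/table2 shown there:

```sql
-- Sketch only: a set-based insert instead of a per-row function call.
WITH records_to_insert(field1, field2, field3, field4, field5,
                       field6_hstore, field7, field8_array) AS (
    SELECT ...  -- same SELECT over table1/table2 as in the original query
)
INSERT INTO table3 (field1, field2, field3, field4, field5,
                    field6_hstore, field7, field8_array)
SELECT field1, field2, field3, field4, field5,
       field6_hstore, field7, field8_array
FROM records_to_insert;
```

The extra per-row work the function performs could then move into follow-up set-based statements or statement-level triggers on table3, rather than one function call per row.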

So, is there anything obviously wrong with my query? Any changes to make
it
faster?

You shouldn't expect much useful help when the important part of your query
is not provided.

Creating arrays and hstores is expensive but whether that is the biggest
factor is impossible to tell.

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Query-optimization-tp5824739p5824744.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3David G. Johnston
david.g.johnston@gmail.com
In reply to: Jorge Arevalo (#1)
Re: Query optimization

Jorge Arévalo-2 wrote

(SELECT value1,value2,value3,value4,value5, hstore(ARRAY['field9',
'field10', 'field11', 'field12', 'field13', 'field14'], ARRAY[field9,
field10, field11, field12, field13, field14) as metadata, value7, (select
array((select row(f1, f2) from table2 p where p.field7 = field7))) as
values_array FROM table1)

You might try seeing whether:

FROM table1 JOIN (
SELECT field7, array_agg(row(f1, f2)) AS values_array FROM table2 GROUP BY
field7
) tbl2_agg USING (field7)

helps...
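Plugged into the original query, the suggested rewrite might look roughly like this. It is only a sketch with the thread's placeholder names, and it assumes (as the fragment above does) that field7 is the join column in both tables:

```sql
SELECT value1, value2, value3, value4, value5,
       hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13', 'field14'],
              ARRAY[field9, field10, field11, field12, field13, field14]) AS metadata,
       value7,
       tbl2_agg.values_array
FROM table1
JOIN (
    -- Aggregate table2 once, instead of scanning it per table1 row.
    SELECT field7, array_agg(ROW(f1, f2)) AS values_array
    FROM table2
    GROUP BY field7
) tbl2_agg USING (field7);
```

One caveat: an inner JOIN drops table1 rows with no match in table2, whereas the original subquery would have produced an empty array for them; a LEFT JOIN (with COALESCE on values_array) would preserve those rows.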

I'm also dubious (though this isn't necessarily a performance issue) of:

array[...] AS metadata

Without context I would say this would be better as a composite type instead
of an array. You may find it useful to use named composite types elsewhere
too...
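As a hypothetical illustration of the composite-type suggestion (the actual field types are not given in this thread, so text is assumed throughout):

```sql
-- Hypothetical sketch: a named composite type in place of the hstore.
CREATE TYPE table1_metadata AS (
    field9  text,
    field10 text,
    field11 text,
    field12 text,
    field13 text,
    field14 text
);

-- Then, in the SELECT list, instead of building an hstore:
--   ROW(field9, field10, field11, field12, field13, field14)::table1_metadata AS metadata
```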

David J.


#4Jorge Arevalo
jorgearevalo@libregis.org
In reply to: David G. Johnston (#3)
Re: Query optimization

Hello David, many thanks for your responses,

Sorry for not providing the body of fill_table3_function, but it just executes
3 INSERT queries into 3 different tables. And I've checked that the
time-consuming operation is this query (by the way, there was a small mistake
in the field names of the inner select; I've corrected it):

SELECT value1, value2, value3, value4, value5,
       hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13', 'field14'],
              ARRAY[field9, field10, field11, field12, field13, field14) as metadata,
       value7,
       (select array((select row(f1, f2) from table2 p where p.f3 = field7))) as values_array
FROM table1

This is the result of EXPLAIN ANALYZE:

                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using table1_pkey on table1  (cost=67846.38..395773.45 rows=8419127 width=88) (actual time=7122.704..22670.680 rows=8419127 loops=1)
   InitPlan 2 (returns $1)
     ->  Result  (cost=67846.29..67846.29 rows=1 width=0) (actual time=7009.063..7009.065 rows=1 loops=1)
           InitPlan 1 (returns $0)
             ->  Seq Scan on table2 p  (cost=0.00..67846.29 rows=12689 width=20) (actual time=14.971..5069.840 rows=2537787 loops=1)
                   Filter: (f3 = field7)

So, there's a sequential scan over table2 (in the query that builds
values_array), instead of an index scan. Could it be because the SELECT
returns more than approximately 5-10% of all rows in the table? (I've heard
that, under those conditions, a sequential scan is faster than an index scan,
because of the amount of I/O required per row.)

Anyway, if I understood correctly, I should try:

- Avoiding that inner query by using a JOIN instead
- Return a composite type instead of an array

Am I right? What kind of additional context information would you need?

Many thanks!!

--

Jorge

On Wed, Oct 29, 2014 at 2:54 AM, David G Johnston <
david.g.johnston@gmail.com> wrote:

Jorge Arévalo-2 wrote

(SELECT value1,value2,value3,value4,value5, hstore(ARRAY['field9',
'field10', 'field11', 'field12', 'field13', 'field14'], ARRAY[field9,
field10, field11, field12, field13, field14) as metadata, value7, (select
array((select row(f1, f2) from table2 p where p.field7 = field7))) as
values_array FROM table1)

You might try seeing whether:

FROM table1 JOIN (
SELECT field7, array_agg(row(f1, f2)) AS values_array FROM table2 GROUP BY
field7
) tbl2_agg USING (field7)

helps...

I'm also dubious (though this isn't necessarily a performance issue) of:

array[...] AS metadata

Without context I would say this would be better as a composite type
instead
of an array. You may find it useful to use named composite types elsewhere
too...

David J.


--
Jorge Arevalo
Freelance developer

http://about.me/jorgeas80

#5David G. Johnston
david.g.johnston@gmail.com
In reply to: Jorge Arevalo (#4)
Re: Query optimization

List preference is to inline post or, at worst, bottom post. Please do not
top post.

On Wed, Oct 29, 2014 at 11:06 AM, Jorge Arevalo <jorgearevalo@libregis.org>
wrote:

Hello David, many thanks for your responses,

Sorry for not providing the content of the fill_table3_function, but it
just executes 3 insert queries in 3 different tables. And I've checked the
time consuming operation is in this query (by the way, there was a little
mistake in the name of the fields of the inner select, I've corrected it)

SELECT value1,value2,value3,value4,
value5, hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13',
'field14'], ARRAY[field9, field10, field11, field12, field13, field14) as
metadata, value7, (select array((select row(f1, f2) from table2 p where
p.f3 = field7))) as values_array FROM table1

This is the result of EXPLAIN ANALYZE:

                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using table1_pkey on table1  (cost=67846.38..395773.45 rows=8419127 width=88) (actual time=7122.704..22670.680 rows=8419127 loops=1)
   InitPlan 2 (returns $1)
     ->  Result  (cost=67846.29..67846.29 rows=1 width=0) (actual time=7009.063..7009.065 rows=1 loops=1)
           InitPlan 1 (returns $0)
             ->  Seq Scan on table2 p  (cost=0.00..67846.29 rows=12689 width=20) (actual time=14.971..5069.840 rows=2537787 loops=1)
                   Filter: (f3 = field7)

So, there's a sequential scan over table2 (in the query to get
values_array), instead of a index scan. Could it be because the SELECT
returns more than approximately 5-10% of all rows in the table? (I've heard
that, under those conditions, a sequential scan is faster than index scan,
because the amount of I/O operations required for each row)

What concerns me here is that the rows estimate on that table2 scan is
~13k while the actual count is ~2.5M; you need to run ANALYZE on both
tables and see if you get similar results. Though given the need for a
sequential scan regardless (see next comment), the estimate miss likely
doesn't affect actual performance or the plan that is chosen. But it is
still worth looking into.
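Concretely, refreshing the statistics and re-checking the estimates might look like this (a sketch with the thread's placeholder names):

```sql
-- Refresh planner statistics so row estimates match reality.
ANALYZE table1;
ANALYZE table2;
-- Then re-run EXPLAIN ANALYZE on the query and compare the estimated
-- rows (rows=12689 above) with the actual rows (rows=2537787).
```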

Anyway, if I understood well, I should try:

- Avoiding that inner query by using a JOIN instead

I don't know... looking at your explain (and some reasoning) it looks as if
it is already doing that for you, since there is only a single loop for
InitPlan 1. This is a little beyond my comfort zone, but you've now
provided a decent amount of information for others to speculate... though it
would help to enable various timings as well, and to run the full query
(with the function) in a development environment so that the entire routine
can be evaluated.

- Return a composite type instead of an array

Worth looking into, but impossible to recommend without knowing what your
make-believe fields are and are used for. More style than performance,
since I do not know the relative costs of building up an array versus
creating a composite.

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jorge Arevalo (#4)
Re: Query optimization

Jorge Arevalo <jorgearevalo@libregis.org> writes:

SELECT value1,value2,value3,value4,
value5, hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13',
'field14'], ARRAY[field9, field10, field11, field12, field13, field14) as
metadata, value7, (select array((select row(f1, f2) from table2 p where
p.f3 = field7))) as values_array FROM table1

The parentheses/brackets don't seem to match up real well here ...

This is the result of EXPLAIN ANALYZE:

                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using table1_pkey on table1  (cost=67846.38..395773.45 rows=8419127 width=88) (actual time=7122.704..22670.680 rows=8419127 loops=1)
   InitPlan 2 (returns $1)
     ->  Result  (cost=67846.29..67846.29 rows=1 width=0) (actual time=7009.063..7009.065 rows=1 loops=1)
           InitPlan 1 (returns $0)
             ->  Seq Scan on table2 p  (cost=0.00..67846.29 rows=12689 width=20) (actual time=14.971..5069.840 rows=2537787 loops=1)
                   Filter: (f3 = field7)

Hm. If I'm reading that right, you're building an array containing
2537787 entries, each of which is a composite datum containing two
columns of unmentioned datatypes. I suspect a big chunk of your
runtime is going into manipulating that array -- PG is not terribly
efficient with big arrays containing variable-width values.

I'm also a bit confused as to why the planner is saying that the (SELECT
ARRAY(...)) bit is an InitPlan and not a SubPlan. That implies that
"field7" in the innermost WHERE clause is not a reference to table1 but a
reference to table2. Is that really what you meant? IOW, are you sure
this query is performing the right calculation in the first place?

regards, tom lane


#7David G. Johnston
david.g.johnston@gmail.com
In reply to: Tom Lane (#6)
Re: Query optimization

On Wed, Oct 29, 2014 at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jorge Arevalo <jorgearevalo@libregis.org> writes:

This is the result of EXPLAIN ANALYZE:

                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using table1_pkey on table1  (cost=67846.38..395773.45 rows=8419127 width=88) (actual time=7122.704..22670.680 rows=8419127 loops=1)
   InitPlan 2 (returns $1)
     ->  Result  (cost=67846.29..67846.29 rows=1 width=0) (actual time=7009.063..7009.065 rows=1 loops=1)
           InitPlan 1 (returns $0)
             ->  Seq Scan on table2 p  (cost=0.00..67846.29 rows=12689 width=20) (actual time=14.971..5069.840 rows=2537787 loops=1)
                   Filter: (f3 = field7)

Hm. If I'm reading that right, you're building an array containing
2537787 entries, each of which is a composite datum containing two
columns of unmentioned datatypes. I suspect a big chunk of your
runtime is going into manipulating that array -- PG is not terribly
efficient with big arrays containing variable-width values.

I'm also a bit confused as to why the planner is saying that the (SELECT
ARRAY(...)) bit is an InitPlan and not a SubPlan. That implies that
"field7" in the innermost WHERE clause is not a reference to table1 but a
reference to table2. Is that really what you meant? IOW, are you sure
this query is performing the right calculation in the first place?

I thought the InitPlan was in place because the planner chose to execute
the correlated subquery as a standalone query, since it realizes that it is
going to have to end up processing the entire table anyway due to the lack
of a filter on the outer query. In effect it is executing "table1 JOIN (table2
subquery) ON (f3 = field7)".

David J.

#8Jorge Arevalo
jorgearevalo@libregis.org
In reply to: David G. Johnston (#7)
Re: Query optimization

On Wed, Oct 29, 2014 at 8:05 PM, David Johnston <david.g.johnston@gmail.com>
wrote:

On Wed, Oct 29, 2014 at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jorge Arevalo <jorgearevalo@libregis.org> writes:

This is the result of EXPLAIN ANALYZE:

                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using table1_pkey on table1  (cost=67846.38..395773.45 rows=8419127 width=88) (actual time=7122.704..22670.680 rows=8419127 loops=1)
   InitPlan 2 (returns $1)
     ->  Result  (cost=67846.29..67846.29 rows=1 width=0) (actual time=7009.063..7009.065 rows=1 loops=1)
           InitPlan 1 (returns $0)
             ->  Seq Scan on table2 p  (cost=0.00..67846.29 rows=12689 width=20) (actual time=14.971..5069.840 rows=2537787 loops=1)
                   Filter: (f3 = field7)

Hm. If I'm reading that right, you're building an array containing
2537787 entries, each of which is a composite datum containing two
columns of unmentioned datatypes. I suspect a big chunk of your
runtime is going into manipulating that array -- PG is not terribly
efficient with big arrays containing variable-width values.

The seq scan over table2 is for finding entries in table2 (which contains
2537787 rows) that match a condition using a column from table1 (entries
where table1.field7 = table2.f3). But the array isn't going to contain all
the entries, just a few of them.

I think the time is being spent scanning table2 for every row of table1
(more than 8 million rows).

I'm also a bit confused as to why the planner is saying that the (SELECT
ARRAY(...)) bit is an InitPlan and not a SubPlan. That implies that
"field7" in the innermost WHERE clause is not a reference to table1 but a
reference to table2. Is that really what you meant? IOW, are you sure
this query is performing the right calculation in the first place?

I thought the InitPlan was in place because the planner chose to execute
the correlated subquery as a standalone query, since it realizes that it is
going to have to end up processing the entire table anyway due to the lack
of a filter on the outer query. In effect it is executing "table1 JOIN (table2
subquery) ON (f3 = field7)".

David J.

Yes, for each row of table1, table2 is being scanned to find all the
entries that satisfy table1.field7 = table2.f3. That sounds like a really
heavy task. I guess I should avoid it, right?

BTW, Tom, this is the query with all the parentheses/brackets:

SELECT value1, value2, value3, value4, value5,
       hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13', 'field14'],
              ARRAY[field9, field10, field11, field12, field13, field14]) as metadata,
       value7,
       (select array((select row(f1, f2) from table2 p where p.f3 = field7))) as values_array
FROM table1

Oh, and sorry for the top posting!

--
Jorge Arevalo
Freelance developer

http://about.me/jorgeas80

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jorge Arevalo (#8)
Re: Query optimization

Jorge Arevalo <jorgearevalo@libregis.org> writes:

On Wed, Oct 29, 2014 at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm also a bit confused as to why the planner is saying that the (SELECT
ARRAY(...)) bit is an InitPlan and not a SubPlan. That implies that
"field7" in the innermost WHERE clause is not a reference to table1 but a
reference to table2. Is that really what you meant? IOW, are you sure
this query is performing the right calculation in the first place?

Yes, for each row of table1, table2 is being scanned, to find all the
entries that satisfy table1.field7 = table2.f3.

You might think that's what it's doing, but the plan shape says otherwise.
An array of 2537787 entries is being formed *once* and then referenced at
each row of table1. I'd take another look and see if "field7" isn't a
column name that exists in both tables.

regards, tom lane


#10David G. Johnston
david.g.johnston@gmail.com
In reply to: Jorge Arevalo (#8)
Re: Query optimization

On Wed, Oct 29, 2014 at 12:14 PM, Jorge Arevalo <jorgearevalo@libregis.org>
wrote:

SELECT value1,value2,value3,value4, value5, hstore(ARRAY['field9',
'field10', 'field11', 'field12', 'field13', 'field14'], ARRAY[field9,
field10, field11, field12, field13, field14]) as metadata, value7, (select
array((select row(f1, f2) from table2 p where p.f3 = field7))) as
values_array FROM table1

More generally, you really should table-prefix all columns in correlated
subqueries.

[...] from table2 p where p.f3 = table1.field7 [...]

I guess the InitPlan 1 you showed simply scanned table2 and applied the
filter, which then was fed to InitPlan 2 where the array is built; that
array then is inserted into the outer query ~8M times...
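Spelled out on the full query, the qualified version might look like this (a sketch using the thread's placeholder names):

```sql
-- Sketch only: every column reference qualified, so the subquery is
-- explicitly correlated with table1 rather than silently binding to table2.
SELECT t1.value1, t1.value2, t1.value3, t1.value4, t1.value5,
       hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13', 'field14'],
              ARRAY[t1.field9, t1.field10, t1.field11, t1.field12, t1.field13, t1.field14]) AS metadata,
       t1.value7,
       (SELECT array(SELECT ROW(p.f1, p.f2)
                     FROM table2 p
                     WHERE p.f3 = t1.field7)) AS values_array
FROM table1 t1;
```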

David J.

#11Jorge Arevalo
jorgearevalo@libregis.org
In reply to: David G. Johnston (#10)
Re: Query optimization

On Wed, Oct 29, 2014 at 8:47 PM, David Johnston <david.g.johnston@gmail.com>
wrote:

On Wed, Oct 29, 2014 at 12:14 PM, Jorge Arevalo <jorgearevalo@libregis.org

wrote:

SELECT value1,value2,value3,value4, value5, hstore(ARRAY['field9',
'field10', 'field11', 'field12', 'field13', 'field14'], ARRAY[field9,
field10, field11, field12, field13, field14]) as metadata, value7, (select
array((select row(f1, f2) from table2 p where p.f3 = field7))) as
values_array FROM table1

More generally, you really should table-prefix all columns in correlated
subqueries.

[...] from table2 p where p.f3 = table1.field7 [...]

I guess the InitPlan 1 you showed simply scanned table2 and applied the
filter, which then was fed to InitPlan 2 where the array is built; that
array then is inserted into the outer query ~8M times...

David J.

Wow, you were right! There was a field with the same name in both tables, and
that caused the problem. I've just prefixed each field with its table
identifier, and now it runs really fast.

Many thanks, guys!

--
Jorge Arevalo
Freelance developer

http://about.me/jorgeas80