Optimizing queries that use multiple tables and many order by columns
Hi Group,
I've never really learned how to optimize queries that join several tables
and have order by clauses that specify columns from each table. Is there
documentation that could help me optimize and have the proper indexes in
place? I've read through the PG Docs Chapter 11 on Indexes yet still lack
the needed understanding.
Here's my latest culprit:
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn, JOB.CompanyCode,
Anl.SampleName
from analysis anl join job on anl.job = job.job
order by job.companycode, anl.job, anl.lab
limit 10;
Here's the query plan using PG 8.4.4:
Limit (cost=21990.24..21990.27 rows=10 width=32)
-> Sort (cost=21990.24..22437.69 rows=178979 width=32)
Sort Key: job.companycode, anl.job, anl.lab
-> Hash Join (cost=451.20..18122.57 rows=178979 width=32)
Hash Cond: (anl.job = job.job)
-> Seq Scan on analysis anl (cost=0.00..14091.79 rows=178979
width=23)
-> Hash (cost=287.20..287.20 rows=13120 width=17)
-> Seq Scan on job (cost=0.00..287.20 rows=13120
width=17)
If I change the above query to only order by one of the tables, I get better
results:
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn, JOB.CompanyCode,
Anl.SampleName
from analysis anl join job on anl.job = job.job
order by job.companycode --, anl.job, anl.lab
limit 10;
Limit (cost=0.00..3.65 rows=10 width=32)
-> Nested Loop (cost=0.00..65269.13 rows=178979 width=32)
-> Index Scan using job_companycode on job (cost=0.00..972.67
rows=13120 width=17)
-> Index Scan using analysis_job_lab on analysis anl
(cost=0.00..4.63 rows=22 width=23)
Index Cond: (anl.job = job.job)
Any idea on how I can improve this? In the past I would tend to create a
cached copy of the query as a table that would be utilized, but I suspect
that there is a better way to go about this. I'm using a system (Clarion)
which heavily uses cursors via the ODBC driver (I use the psqlODBC latest
version) to get a handful of records at a time, so no actual LIMITs would be
used in the production queries; I've added the LIMITs here to try to
simulate the performance differences that I find when browsing the data
while ordering by the above columns.
Here are the relevant tables and indexes:
CREATE TABLE job
(
job bigint NOT NULL, -- Job #
companycode character(4), -- Company Code
recdbycode character(3), -- Initials of who checked in sample(s)
datein date, -- Date sample was received
project character varying, -- Project or Site name
remarks text, -- Remarks
--[CONSTRAINTs etc]
)
CREATE INDEX job_companycode
ON job
USING btree
(companycode);
CREATE INDEX job_companycode_job
ON samples.job
USING btree
(companycode, job);
CREATE TABLE analysis
(
lab bigint NOT NULL, -- Lab number
job bigint, -- Job number
sampletype character varying(5), -- General class of sample
priority character(1), -- Priority level
samplename character varying, -- Sample or Well name
CONSTRAINT rel_joblabk_to_jobk FOREIGN KEY (job)
REFERENCES job (job) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE RESTRICT,
--[CONSTRAINTs etc]
)
CREATE INDEX analysis_companycode_job_lab
ON analysis
USING btree
(companycode, job, lab);
CREATE INDEX analysis_job_lab
ON analysis
USING btree
(job, lab);
Thanks for any insights and tips you can provide!
Kind Regards,
-Joshua Berry
On 2010-08-25, Joshua Berry wrote:
Hi Group,
I've never really learned how to optimize queries that join
several tables and have order by clauses that specify columns
from each table. Is there documentation that could help me
optimize and have the proper indexes in place? I've read
through the PG Docs Chapter 11 on Indexes yet still lack the
needed understanding.
Here's my latest culprit:
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by job.companycode, anl.job, anl.lab
limit 10;
Could you try removing the LIMIT clause? I have seen several times
that it can slow down a query. Although I haven't tested it, an
OFFSET 0 clause can also improve the situation, iirc.
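If I remember correctly, that trick is usually written as a subquery fence, roughly like this (untested sketch, column list copied from your query):

```sql
-- Sketch: OFFSET 0 in a subquery acts as an optimization fence,
-- so the inner join is planned without regard to the outer LIMIT.
SELECT *
FROM (SELECT Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
             JOB.CompanyCode, Anl.SampleName
      FROM analysis anl JOIN job ON anl.job = job.job
      OFFSET 0) AS sub
ORDER BY sub.CompanyCode, sub.Job, sub.Lab
LIMIT 10;
```

Whether that helps or hurts here would have to be measured.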
From an algebraic point of view, I cannot see obvious inefficiencies.
Others, who know the internals of pg better, might see more.
Here's the query plan using PG 8.4.4:
Limit (cost=21990.24..21990.27 rows=10 width=32)
  -> Sort (cost=21990.24..22437.69 rows=178979 width=32)
       Sort Key: job.companycode, anl.job, anl.lab
       -> Hash Join (cost=451.20..18122.57 rows=178979 width=32)
            Hash Cond: (anl.job = job.job)
            -> Seq Scan on analysis anl (cost=0.00..14091.79 rows=178979 width=23)
            -> Hash (cost=287.20..287.20 rows=13120 width=17)
                 -> Seq Scan on job (cost=0.00..287.20 rows=13120 width=17)
If I change the above query to only order by one of the tables, I get
better results:
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by job.companycode --, anl.job, anl.lab
limit 10;
Limit (cost=0.00..3.65 rows=10 width=32)
  -> Nested Loop (cost=0.00..65269.13 rows=178979 width=32)
       -> Index Scan using job_companycode on job (cost=0.00..972.67 rows=13120 width=17)
       -> Index Scan using analysis_job_lab on analysis anl (cost=0.00..4.63 rows=22 width=23)
            Index Cond: (anl.job = job.job)
Those are estimated query plans; what does EXPLAIN ANALYZE say? The
query plans above do not execute the query, they just make a rough
guess about the costs. Reality might be different. Also, you may want
to run VACUUM ANALYZE before.
Any idea on how I can improve this? In the past I would tend
to create a cached copy of the query as a table that would be
utilized, but I suspect that there is a better way to go
about this. I'm using a system (Clarion) which heavily uses
cursors via the ODBC driver (I use the psqlODBC latest
version) to get a handful of records at a time, so no actual
LIMITs would be used in the production queries; I've added
the LIMITs here to try to simulate the performance
differences that I find when browsing the data while ordering
by the above columns.
Here are the relevant tables and indexes:
CREATE TABLE job
(
job bigint NOT NULL, -- Job #
companycode character(4), -- Company Code
recdbycode character(3), -- Initials of who checked in sample(s)
datein date, -- Date sample was received
project character varying, -- Project or Site name
remarks text, -- Remarks
--[CONSTRAINTs etc]
)
CREATE INDEX job_companycode
ON job
USING btree
(companycode);
CREATE INDEX job_companycode_job
ON samples.job
USING btree
(companycode, job);
Index job_companycode is not used in the plans. Additionally, it can be
constructed from the second index, as companycode is the primary sort
key.
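If the single-column index really is redundant, it could simply be dropped (a sketch; check first that no other query depends on it, since one of your plans does scan it, and the combined index would have to take over):

```sql
-- Sketch: (companycode) is a leading prefix of (companycode, job),
-- so job_companycode_job can serve scans on companycode alone.
DROP INDEX job_companycode;
```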
CREATE TABLE analysis
(
lab bigint NOT NULL, -- Lab number
job bigint, -- Job number
sampletype character varying(5), -- General class of sample
priority character(1), -- Priority level
samplename character varying, -- Sample or Well name
CONSTRAINT rel_joblabk_to_jobk FOREIGN KEY (job)
REFERENCES job (job) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE RESTRICT,
--[CONSTRAINTs etc]
)
CREATE INDEX analysis_companycode_job_lab
ON analysis
USING btree
(companycode, job, lab);
CREATE INDEX analysis_job_lab
ON analysis
USING btree
(job, lab);
Maybe the planner would decide for a sort-merge join if there are
sorted indexes for anl.job and job.job. But the speed-up may vary
depending on the data.
Thanks for any insights and tips you can provide!
Kind Regards,
-Joshua Berry
HTH
--
Robert...
On Wed, Aug 25, 2010 at 10:40 AM, Wappler, Robert <rwappler@ophardt.com>wrote:
On 2010-08-25, Joshua Berry wrote:
Here's my latest culprit:
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by job.companycode, anl.job, anl.lab
limit 10;
Could you try removing the LIMIT clause? I have seen several times
that it can slow down a query. Although I haven't tested it, an
OFFSET 0 clause can also improve the situation, iirc.
The actual query uses a cursor, which seems to run the query and only
allows fetching once the entire result set is ready. So I'm not using
a LIMIT in the actual queries, but something like this:
declare "SQL_CUR0453D910" cursor with hold for
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by job.companycode, anl.job, anl.lab;
fetch 10 in "SQL_CUR0453D910";
close "SQL_CUR0453D910";
From an algebraic point of view, I cannot see obvious inefficiencies.
Others, who know the internals of pg better, might see more.
Those are estimated query plans; what does EXPLAIN ANALYZE say? The
query plans above do not execute the query, they just make a rough
guess about the costs. Reality might be different. Also, you may want
to run VACUUM ANALYZE before.
The database is vacuum analyze'd and the stat target is the default of 100.
I'm also using PG 8.4.4 running on CentOS 5.5 x86_64.
--Here's what explain analyze says for the query
explain analyze
declare "SQL_CUR0453D910" cursor with hold for
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by job.companycode, anl.job, anl.lab;
Sort (cost=38047.92..38495.65 rows=179095 width=32) (actual
time=1890.796..2271.248 rows=178979 loops=1)
Sort Key: job.companycode, anl.job, anl.lab
Sort Method: external merge Disk: 8416kB
-> Hash Join (cost=451.20..18134.05 rows=179095 width=32) (actual
time=8.239..260.848 rows=178979 loops=1)
Hash Cond: (anl.job = job.job)
-> Seq Scan on analysis anl (cost=0.00..14100.95 rows=179095
width=23) (actual time=0.026..91.602 rows=178979 loops=1)
-> Hash (cost=287.20..287.20 rows=13120 width=17) (actual
time=8.197..8.197 rows=13120 loops=1)
-> Seq Scan on job (cost=0.00..287.20 rows=13120 width=17)
(actual time=0.007..4.166 rows=13120 loops=1)
Total runtime: 2286.224 ms
Maybe, the planner decides for a Sort Join, if there are sorted indexes
for anl.job and job.job. But the speed-up may vary depending on the
data.
It seems to be reading the entire dataset, then sorting, right? There's not
much more that could be done to improve such queries, aside from increasing
memory and IO bandwidth.
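One knob that might help the sort itself: the plan above shows "Sort Method: external merge Disk: 8416kB", i.e. the sort spills to disk. Raising work_mem past the spill size might keep it in memory, something like (the value is a guess; tune it to your workload and RAM):

```sql
-- Sketch: let this session's sorts use more memory before
-- falling back to an on-disk external merge.
SET work_mem = '32MB';
```

That wouldn't remove the need to read and sort the whole result, but it could shave off part of the two-second sort step.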
But now that I've said that, there's the following query that deals with
exactly the same set of data, but the ordering involves only one of the two
joined tables.
explain analyze
declare "SQL_CUR0453D910" cursor with hold for
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by job.companycode --, anl.job, anl.lab; --Only order by indexed
columns from job.
Nested Loop (cost=0.00..65305.66 rows=179095 width=32) (actual
time=0.084..288.976 rows=178979 loops=1)
-> Index Scan using job_companycode on job (cost=0.00..972.67 rows=13120
width=17) (actual time=0.045..7.328 rows=13120 loops=1)
-> Index Scan using analysis_job_lab on analysis anl (cost=0.00..4.63
rows=22 width=23) (actual time=0.006..0.015 rows=14 loops=13120)
Index Cond: (anl.job = job.job)
Total runtime: 303.230 ms
If I order only by columns from the other table, analysis, I get the
following query and results:
explain analyze
declare "SQL_CUR0453D910" cursor with hold for
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by --job.companycode,
anl.job, anl.lab; --Only order by indexed columns from analysis.
Merge Join (cost=0.56..44872.45 rows=179095 width=32) (actual
time=0.078..368.620 rows=178979 loops=1)
Merge Cond: (anl.job = job.job)
-> Index Scan using analysis_job_lab on analysis anl
(cost=0.00..35245.47 rows=179095 width=23) (actual time=0.035..128.460
rows=178979 loops=1)
-> Index Scan using job_job_pk on job (cost=0.00..508.53 rows=13120
width=17) (actual time=0.039..53.733 rows=179005 loops=1)
Total runtime: 388.884 ms
Notice that in these cases the query completes in <400 ms and the other
query that involves ordering on columns from both of the joined tables
completes in >2300ms.
In the application here, these queries are used by a client application to
fill a window's listbox that can be scrolled up or down. If the user changes
direction of the scroll, it initiates a new cursor and query to fetch a page
of results. If the scrolling motion is in the same direction, it simply
continues to fetch more results from the cursor. But each time the direction
of movement changes, there can be a significant lag.
Any suggestions would be helpful! I'll assume for now that the indexes and
queries can't be improved, but rather that I should tweak more of the
postmaster settings. Please correct me if you know better and have time to
reply.
Thanks,
-Joshua
P.S. Is it possible to have indexes that involve several columns from
different but related tables? If so, where can I learn about them?
On 2010-08-25, Joshua Berry wrote:
--Here's what explain analyze says for the query
explain analyze
declare "SQL_CUR0453D910" cursor with hold for
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by job.companycode, anl.job, anl.lab;
Sort (cost=38047.92..38495.65 rows=179095 width=32) (actual time=1890.796..2271.248 rows=178979 loops=1)
  Sort Key: job.companycode, anl.job, anl.lab
  Sort Method: external merge Disk: 8416kB
  -> Hash Join (cost=451.20..18134.05 rows=179095 width=32) (actual time=8.239..260.848 rows=178979 loops=1)
       Hash Cond: (anl.job = job.job)
       -> Seq Scan on analysis anl (cost=0.00..14100.95 rows=179095 width=23) (actual time=0.026..91.602 rows=178979 loops=1)
       -> Hash (cost=287.20..287.20 rows=13120 width=17) (actual time=8.197..8.197 rows=13120 loops=1)
            -> Seq Scan on job (cost=0.00..287.20 rows=13120 width=17) (actual time=0.007..4.166 rows=13120 loops=1)
Total runtime: 2286.224 ms
Maybe the planner would decide for a sort-merge join if there are
sorted indexes for anl.job and job.job. But the speed-up may vary
depending on the data.
It seems to be reading the entire dataset, then sorting, right?
There's not much more that could be done to improve such queries,
aside from increasing memory and IO bandwidth.
It has to, because it has to start with the smallest row in the
result set, which may be the last one read if it cannot read the
tuples in sorted order.
But now that I've said that, there's the following query that deals
with exactly the same set of data, but the ordering involves only one
of the two joined tables.
explain analyze
declare "SQL_CUR0453D910" cursor with hold for
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by job.companycode --, anl.job, anl.lab; --Only order by indexed columns from job.
Nested Loop (cost=0.00..65305.66 rows=179095 width=32) (actual time=0.084..288.976 rows=178979 loops=1)
  -> Index Scan using job_companycode on job (cost=0.00..972.67 rows=13120 width=17) (actual time=0.045..7.328 rows=13120 loops=1)
  -> Index Scan using analysis_job_lab on analysis anl (cost=0.00..4.63 rows=22 width=23) (actual time=0.006..0.015 rows=14 loops=13120)
       Index Cond: (anl.job = job.job)
Total runtime: 303.230 ms
If I order only by columns from the other table, analysis, I get the
following query and results:
explain analyze
declare "SQL_CUR0453D910" cursor with hold for
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by --job.companycode,
anl.job, anl.lab; --Only order by indexed columns from analysis.
Merge Join (cost=0.56..44872.45 rows=179095 width=32) (actual time=0.078..368.620 rows=178979 loops=1)
  Merge Cond: (anl.job = job.job)
  -> Index Scan using analysis_job_lab on analysis anl (cost=0.00..35245.47 rows=179095 width=23) (actual time=0.035..128.460 rows=178979 loops=1)
  -> Index Scan using job_job_pk on job (cost=0.00..508.53 rows=13120 width=17) (actual time=0.039..53.733 rows=179005 loops=1)
Total runtime: 388.884 ms
Notice that in these cases the query completes in <400 ms and the
other query that involves ordering on columns from both of the joined
tables completes in >2300 ms.
Because these queries don't need to sort the result; they can read it
in order.
What I don't really get is that you compare queries with different
sort orders, especially with a different number of sort keys. Of
course, that has a big influence on the time needed for sorting.
While your first query, which sorts on three keys, really does sort
the result, the latter two don't need to, because they can read the
tuples in the correct order from the indexes. If I read the first
plan correctly, that sort costs you about 2 seconds of query
execution time, because the Hash Join is done after 260 ms.
Do you really have the requirement to sort anything? Or let me ask it
the other way round: assuming you have too much data to sort it on
the application side, which user can read all this from one single
table in the user interface?
In the application here, these queries are used by a client
application to fill a window's listbox that can be scrolled
up or down. If the user changes direction of the scroll, it
initiates a new cursor and query to fetch a page of results.
If the scrolling motion is in the same direction, it simply
continues to fetch more results from the cursor. But each
time the direction of movement changes, there can be a
significant lag.
Then, obviously you shouldn't create a new cursor. You can create
backwards scrollable cursors. See the SCROLL option of the DECLARE
statement.
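A rough sketch, reusing your query (the cursor name is made up; without WITH HOLD the cursor must live inside a transaction):

```sql
-- Sketch: a scrollable cursor lets the client page in both
-- directions without declaring a new cursor on each turn.
BEGIN;
DECLARE listbox_cur SCROLL CURSOR FOR
    SELECT Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
           JOB.CompanyCode, Anl.SampleName
    FROM analysis anl JOIN job ON anl.job = job.job
    ORDER BY job.companycode, anl.job, anl.lab;
FETCH FORWARD 10 FROM listbox_cur;   -- user scrolls down a page
FETCH BACKWARD 10 FROM listbox_cur;  -- user scrolls back up
CLOSE listbox_cur;
COMMIT;
```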
Any suggestions would be helpful! I'll assume for now that
the indexes and queries can't be improved, but rather that I
should tweak more of the postmaster settings. Please correct
me if you know better and have time to reply.
These options heavily depend on the environment and the data set; I
always see them as a last resort, because they might slow down other
queries if tweaked too much towards a specific thing. I have not yet
played around with this a lot; things simply work fast enough here.
Others can give you better hints on this.
Thanks,
-Joshua
P.S. Is it possible to have indexes that involve several columns from
different but related tables? If so, where can I learn about them?
Nope. An index is tied to one table only. But another option is to
precalculate the join. Depending on your needs (especially
INSERT/UPDATE performance), you could use triggers and/or a regular
batch job which writes the joined results to another table. There you
can index these columns accordingly. In general this is ugly and
leads to redundancy, but it can give a big performance boost and is
sometimes the only option.
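A sketch of that approach, with made-up table and index names:

```sql
-- Sketch: a denormalized copy of the join, so a single index
-- can cover the full three-key sort across both base tables.
CREATE TABLE analysis_joined AS
    SELECT anl.priority, anl.lab, anl.job, job.datein,
           job.companycode, anl.samplename
    FROM analysis anl JOIN job ON anl.job = job.job;

CREATE INDEX analysis_joined_sort
    ON analysis_joined (companycode, job, lab);
```

Triggers on analysis and job (or a nightly batch job) would then have to keep analysis_joined in sync with INSERTs and UPDATEs.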
--
Robert...
On Thu, Aug 26, 2010 at 2:51 AM, Wappler, Robert <rwappler@ophardt.com>wrote:
Do you really have the requirement to sort anything? Or let me ask it
the other way round: Assuming you have too much data, to sort it on the
application side, which user can read all this from one single table in
the user interface?
The tool that I'm using to pull this information together is really
easy to use and maintain when you use its database drivers to
generate the queries. The extra sort here is so that I could order
the dataset by company, then by job number, then by the specific lab
number, where jobs are assigned to a single company and labs are
assigned to a given job. The idea is for the application to be a
substitute for bringing the dataset into a spreadsheet and perusing
it there. I could sort by company alone, or by job and lab alone, but
when sorting only by company, each company's job numbers would not
necessarily be in order, and likewise the labs within the jobs. This
could be smoothed over by cutting the dataset down to a subset based
on a few criteria, which is the next approach to take.
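That next approach might look something like this (the company code is made up):

```sql
-- Sketch: restricting to one company keeps the result small,
-- so the three-key ORDER BY becomes cheap to produce.
SELECT Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
       JOB.CompanyCode, Anl.SampleName
FROM analysis anl JOIN job ON anl.job = job.job
WHERE job.companycode = 'ACME'
ORDER BY job.companycode, anl.job, anl.lab;
```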
In the application here, these queries are used by a client
application to fill a window's listbox that can be scrolled
up or down. If the user changes direction of the scroll, it
initiates a new cursor and query to fetch a page of results.
If the scrolling motion is in the same direction, it simply
continues to fetch more results from the cursor. But each
time the direction of movement changes, there can be a
significant lag.
Then, obviously you shouldn't create a new cursor. You can create
backwards scrollable cursors. See the SCROLL option of the DECLARE
statement.
These queries are generated by the database driver and are not easily
tweakable. Generally they use a subset of whatever is available via
the ODBC interface. So, although not optimal, it's not something that
I can improve in the short term.
Any suggestions would be helpful! I'll assume for now that
the indexes and queries can't be improved, but rather that I
should tweak more of the postmaster settings. Please correct
me if you know better and have time to reply.
These options heavily depend on the environment and the data set; I
always see them as a last resort, because they might slow down other
queries if tweaked too much towards a specific thing. I have not yet
played around with this a lot; things simply work fast enough here.
Others can give you better hints on this.
Thanks for your tips and insight. I'll work on getting this portion
of the system "good enough" and look to refactor later when needed.
P.S. Is it possible to have indexes that involve several
columns from different but related tables? If so, where can I
learn about them?
Nope. An index is tied to one table only. But another option is to
precalculate the join. Depending on your needs (especially
INSERT/UPDATE performance), you could use triggers and/or a regular
batch job which writes the joined results to another table. There you
can index these columns accordingly. In general this is ugly and
leads to redundancy, but it can give a big performance boost and is
sometimes the only option.
That's an option. I do use triggers now to log user changes to the
tables, so this wouldn't be too hard to do, though perhaps a bit hard
to maintain down the road. It's great to have a backup plan in case I
have a backlog of support requests about the UI being too laggy.
Kind Regards,
-Joshua
--
Robert...