Performance of outer joins?

Started by Benjamin Smithover 19 years ago3 messagesgeneral
Jump to latest
#1Benjamin Smith
ben@schoolpathways.com

I have a situation that can be summarized to the following:

-- day in 20061215 format
Create table calendar (
day integer unique not null
);

Create table customers (
id serial unique not null,
name varchar,
address varchar,
);

Create table deliveries (
customers_id integer not null references customers(id),
calendar_day integer not null references calendar(day),
delivered bool not null default false,
unique(customers_id, calendar_id)
);

Imagine tens of thousands of customers, a few million deliveries. A query
that's structurally similar to the following query is rather slow. It's
taking over 30 seconds, climbing fast on reasonable HW. (SMP Opteron, 10k
SCSI, 4 GB RAM) If I remove the outer join, performance is < 1 second.

SELECT customers.id as customers_id,
customers.name AS customers_name,
calendar.day AS calendar_day,
CASE WHEN (deliveries.delivered IS NULL) THEN 'n/a'
WHEN (deliveries.delivered=TRUE) THEN 'yes'
ELSE 'no' END AS delivered
FROM customers
JOIN calendars ON
(
-- GIVE A CALENDAR OF POSSIBLE DAYS FOR DELIVERIES
calendar.day < 20061201
AND calendar.day >= 20060101
)
LEFT OUTER JOIN deliveries ON
(
customers.id=deliveries.customers_id
AND deliveries.calendar_day=calendar.day
)
;

What can I do to improve the performance of this oft-used query? Is there a
better way to do this, or am I doomed to looping thru results and parsing the
results in code?

Thanks,

-Ben

--
"I kept looking around for somebody to solve the problem.
Then I realized I am somebody"
-Anonymous

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Benjamin Smith (#1)
Re: Performance of outer joins?

Benjamin Smith <ben@schoolpathways.com> writes:

FROM customers
JOIN calendars ON
(
-- GIVE A CALENDAR OF POSSIBLE DAYS FOR DELIVERIES
calendar.day < 20061201
AND calendar.day >= 20060101
)

Haven't you written a cross-product join here? Seems like a bad
idea if you'd like it to be fast. Especially since you then left-joined
it to something else, meaning that every row in the cross product
must in fact generate output. Do you really want to be deluged with
a table showing all the customers you DIDN'T deliver to on each day,
for every day in the year?

regards, tom lane

#3macgillivary
macgillivary@gmail.com
In reply to: Benjamin Smith (#1)
Re: Performance of outer joins?

ben would something like this work in your situation?

SELECT customer.id, customer.name, deliveries.calendar_day,
deliveries.delivered
FROM ben_customers as customer, ben_deliveries as deliveries

WHERE customer.id = deliveries.customers_id
and deliveries.calendar_day in (Select day
from ben_calendar
where day < 20061201 and day >= 20060101)

I think that would cut down the deliveries table fairly quickly, thus
making the customer_id join nice and quick. You would also only be
returning data from the tables in your from clause. I suppose it
depends on what you are trying to display. If you want a list of all
the possible days, and deliveries on those days then I'd approach it a
bit differently.

On Dec 15, 7:59 pm, b...@schoolpathways.com (Benjamin Smith) wrote:

Show quoted text

What can I do to improve the performance of this oft-used query? Is there a
better way to do this, or am I doomed to looping thru results and parsing the
results in code?