Unable to get acceptable performance from EXCEPT

Started by Alfred Perlsteinover 25 years ago3 messages
#1Alfred Perlstein
bright@wintelcom.net

=# select count(*) from ref_old;
count
-------
10595
(1 row)

=# select count(*) from ref_new;
count
-------
22997
(1 row)

=# select ref_id from ref_old except select ref_id from ref_new;

Takes over 10 minutes, probably closer to half an hour.

I've also tried using 'NOT IN ( select ref_id from ref_new )'

ref_id is an int4, this is on Postgresql 7.0.

This confuses me because the way I'd plan to execute this query would
be something like this: (pseudo code)

result retval;
sort(ref_old);
sort(ref_new);
i = k = 0;
while (i < count(ref_old)) {
while(ref_old[i] > ref_new[k])
k++;
while(ref_old[i] == ref_new[k])
i++;
while(ref_old[i] < ref_new[k])
store(&retval, ref_old[i++]);
}
return (retval);

I can't imagine this algorithm would take over 10 minutes on my
hardware. Can anyone shed some light on what's going on here?

Is there a way to formulate my SQL to get Postgresql to follow
this algorithm?

thanks,
--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alfred Perlstein (#1)
Re: Unable to get acceptable performance from EXCEPT

Alfred Perlstein <bright@wintelcom.net> writes:

=# select ref_id from ref_old except select ref_id from ref_new;
Takes over 10 minutes, probably closer to half an hour.
I've also tried using 'NOT IN ( select ref_id from ref_new )'

Yup. EXCEPT is effectively translated to a NOT IN, if I recall
correctly, and neither IN ( sub-select ) nor NOT IN ( sub-select )
are implemented very efficiently. Basically you get O(N^2) behavior
because the inner select is rescanned for each outer tuple.

We have a TODO list item to try to be smarter about this...

Is there a way to formulate my SQL to get Postgresql to follow
this algorithm [ kind of like a mergejoin ]

No, but you could try

select ref_id from ref_old where not exists
(select ref_id from ref_new where ref_id = ref_old.ref_id);

which would at least be smart enough to consider using an index
on ref_new(ref_id) instead of a sequential scan.

regards, tom lane

#3Alfred Perlstein
bright@wintelcom.net
In reply to: Tom Lane (#2)
Re: Unable to get acceptable performance from EXCEPT

* Tom Lane <tgl@sss.pgh.pa.us> [000510 16:22] wrote:

Alfred Perlstein <bright@wintelcom.net> writes:

=# select ref_id from ref_old except select ref_id from ref_new;
Takes over 10 minutes, probably closer to half an hour.
I've also tried using 'NOT IN ( select ref_id from ref_new )'

Yup. EXCEPT is effectively translated to a NOT IN, if I recall
correctly, and neither IN ( sub-select ) nor NOT IN ( sub-select )
are implemented very efficiently. Basically you get O(N^2) behavior
because the inner select is rescanned for each outer tuple.

We have a TODO list item to try to be smarter about this...

Is there a way to formulate my SQL to get Postgresql to follow
this algorithm [ kind of like a mergejoin ]

No, but you could try

select ref_id from ref_old where not exists
(select ref_id from ref_new where ref_id = ref_old.ref_id);

which would at least be smart enough to consider using an index
on ref_new(ref_id) instead of a sequential scan.

Which cuts the query time down to less than a second!

thanks!

Ready for the evil magic?

select
distinct(o.ref_id)
from
ref_link o
where
o.stat_date < '2000-04-26 12:12:41-07'
AND not exists
(
select
n.ref_id
from
ref_link n
where
n.stat_date >= '2000-04-26 12:12:41-07'
AND n.ref_id = o.ref_id
)
;

Thanks a ton.

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."