best way to query

Started by Steve Clarkabout 18 years ago7 messagesgeneral
Jump to latest
#1Steve Clark
sclark@netwolves.com

Hello List,

this is a noobie question:

I have had to take over an existing system - it was supposed to have
some contraints that
prevented dangling references - but...

any way i have 2 table - A and B.
each table has a key field and if a row is in B it should have a
corresponding row in A - but theres
the problem it doesn't for all the rows in B.

So I want to do something like
delete from B where key not in (select key from A order by key);

The problem is there are about 1,000,000 rows in A and 300,000 rows in
B. I let the above run
all night and it was still running the next morning. Does anyone have
an idea of a better way.

B = t_event_ack_log
A = t_unit_event_log

explain shows:

Aggregate (cost=4712921585.30..4712921585.31 rows=1 width=0)
-> Seq Scan on t_event_ack_log a (cost=103170.29..4712920878.60
rows=282677 width=0)
Filter: (NOT (subplan))
SubPlan
-> Materialize (cost=103170.29..117301.92 rows=1016163
width=4)
-> Index Scan using pk_tuel_eln on t_unit_event_log
(cost=0.00..98184.12 rows=1016163 width=4)

OBTW: how do I interpret the cost - the manual says:
planner's guess at how long it will take to run the statement
(measured in units of disk page fetches)"

Not sure I understand (measured in units of disk page fetches)

Thanks,
Steve

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Steve Clark (#1)
Re: best way to query

Steve Clark <sclark@netwolves.com> writes:

explain shows:

Aggregate (cost=4712921585.30..4712921585.31 rows=1 width=0)
-> Seq Scan on t_event_ack_log a (cost=103170.29..4712920878.60
rows=282677 width=0)
Filter: (NOT (subplan))
SubPlan
-> Materialize (cost=103170.29..117301.92 rows=1016163
width=4)
-> Index Scan using pk_tuel_eln on t_unit_event_log
(cost=0.00..98184.12 rows=1016163 width=4)

Yeah, that's going to suck. A brute force solution is to see if you
can get it to switch to a "hashed subplan" by increasing work_mem.

Also, whatever is the ORDER BY for?

regards, tom lane

#3Steve Clark
sclark@netwolves.com
In reply to: Tom Lane (#2)
Re: best way to query

Tom Lane wrote:

Steve Clark <sclark@netwolves.com> writes:

explain shows:

Aggregate (cost=4712921585.30..4712921585.31 rows=1 width=0)
-> Seq Scan on t_event_ack_log a (cost=103170.29..4712920878.60
rows=282677 width=0)
Filter: (NOT (subplan))
SubPlan
-> Materialize (cost=103170.29..117301.92 rows=1016163
width=4)
-> Index Scan using pk_tuel_eln on t_unit_event_log
(cost=0.00..98184.12 rows=1016163 width=4)

Yeah, that's going to suck. A brute force solution is to see if you
can get it to switch to a "hashed subplan" by increasing work_mem.

Also, whatever is the ORDER BY for?

regards, tom lane

without the order by it wants to do a seq scan of t_unit_event_log.
see below:
explain select count(*) from t_event_ack_log where event_log_no not
in (select event_log_no from t_unit_event_log);
QUERY PLAN
----------------------------------------------------------------------------------------
Aggregate (cost=12144872193.82..12144872193.82 rows=1 width=0)
-> Seq Scan on t_event_ack_log (cost=0.00..12144871485.07
rows=283497 width=0)
Filter: (NOT (subplan))
SubPlan
-> Seq Scan on t_unit_event_log (cost=0.00..40286.56
rows=1021156 width=4)
(5 rows)

Will try increasing work_memory.

#4Daniel Verite
daniel@manitou-mail.org
In reply to: Steve Clark (#1)
Re: best way to query

Steve Clark wrote:

any way i have 2 table - A and B.
each table has a key field and if a row is in B it should have a
corresponding row in A - but theres
the problem it doesn't for all the rows in B.

So I want to do something like
delete from B where key not in (select key from A order by key);

The problem is there are about 1,000,000 rows in A and 300,000 rows

in

B. I let the above run
all night and it was still running the next morning. Does anyone have

an idea of a better way.

An outer join is sometimes spectacularly more efficient for this
particular kind of query.

I'd suggest you try:

delete from B where key in
(select B.key from B left outer join A on A.key=B.key
where A.key is null)

--
Daniel
PostgreSQL-powered mail user agent and storage:
http://www.manitou-mail.org

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Steve Clark (#3)
Re: best way to query

Steve Clark <sclark@netwolves.com> writes:

Tom Lane wrote:

Also, whatever is the ORDER BY for?

without the order by it wants to do a seq scan of t_unit_event_log.
see below:
explain select count(*) from t_event_ack_log where event_log_no not
in (select event_log_no from t_unit_event_log);
QUERY PLAN
----------------------------------------------------------------------------------------
Aggregate (cost=12144872193.82..12144872193.82 rows=1 width=0)
-> Seq Scan on t_event_ack_log (cost=0.00..12144871485.07
rows=283497 width=0)
Filter: (NOT (subplan))
SubPlan
-> Seq Scan on t_unit_event_log (cost=0.00..40286.56
rows=1021156 width=4)
(5 rows)

Hmm, the big problem with that isn't the seqscan but the lack of a
Materialize step to buffer it; which says to me that you're running a
pretty old Postgres version (8.0 or older). You should think about an
update if you're running into performance issues.

regards, tom lane

#6Steve Clark
sclark@netwolves.com
In reply to: Daniel Verite (#4)
Re: best way to query

Daniel Verite wrote:

Steve Clark wrote:

any way i have 2 table - A and B.
each table has a key field and if a row is in B it should have a
corresponding row in A - but theres
the problem it doesn't for all the rows in B.

So I want to do something like
delete from B where key not in (select key from A order by key);

The problem is there are about 1,000,000 rows in A and 300,000 rows

in

B. I let the above run
all night and it was still running the next morning. Does anyone have

an idea of a better way.

An outer join is sometimes spectacularly more efficient for this
particular kind of query.

I'd suggest you try:

delete from B where key in
(select B.key from B left outer join A on A.key=B.key
where A.key is null)

WOW!

this runs in about 10 seconds - thanks Daniel.

explain select count(*) from t_event_ack_log where event_log_no in
(select t_event_ack_log.event_log_no from t_event_ack_log left outer
join t_unit_event_log a on a.event_log_no=t_event_ack_log.event_log_no
where a.event_log_no is null);
QUERY PLAN

----------------------------------------------------------------------------------------------------------------
Aggregate (cost=128349.56..128349.57 rows=1 width=0)
-> Hash Join (cost=94512.91..126935.36 rows=565681 width=0)
Hash Cond: (public.t_event_ack_log.event_log_no =
public.t_event_ack_log.event_log_no)
-> Seq Scan on t_event_ack_log (cost=0.00..14759.85
rows=565685 width=4)
-> Hash (cost=92609.85..92609.85 rows=152245 width=4)
-> HashAggregate (cost=91087.40..92609.85
rows=152245 width=4)
-> Hash Left Join (cost=57337.95..90380.29
rows=282842 width=4)
Hash Cond:
(public.t_event_ack_log.event_log_no = a.event_log_no)
Filter: (a.event_log_no IS NULL)
-> Seq Scan on t_event_ack_log
(cost=0.00..14759.85 rows=565685 width=4)
-> Hash (cost=40696.09..40696.09
rows=1014309 width=4)
-> Seq Scan on t_unit_event_log a
(cost=0.00..40696.09 rows=1014309 width=4)

#7Steve Clark
sclark@netwolves.com
In reply to: Tom Lane (#5)
Re: best way to query

Tom Lane wrote:

Steve Clark <sclark@netwolves.com> writes:

Tom Lane wrote:

Also, whatever is the ORDER BY for?

without the order by it wants to do a seq scan of t_unit_event_log.
see below:
explain select count(*) from t_event_ack_log where event_log_no not
in (select event_log_no from t_unit_event_log);
QUERY PLAN
----------------------------------------------------------------------------------------
Aggregate (cost=12144872193.82..12144872193.82 rows=1 width=0)
-> Seq Scan on t_event_ack_log (cost=0.00..12144871485.07
rows=283497 width=0)
Filter: (NOT (subplan))
SubPlan
-> Seq Scan on t_unit_event_log (cost=0.00..40286.56
rows=1021156 width=4)
(5 rows)

Hmm, the big problem with that isn't the seqscan but the lack of a
Materialize step to buffer it; which says to me that you're running a
pretty old Postgres version (8.0 or older). You should think about an
update if you're running into performance issues.

regards, tom lane

Hi Tom,

Actually this is

show server_version;
server_version
----------------
8.2.5

On FreeBSD 6.2

And Daniel gave me a query that does the job in just a few seconds.

Thanks,
Steve